Learn by Directing AI
Unit 1

Meet Katrine and understand the turbine data

Step 1: Read Katrine's email

Katrine Moller runs performance analytics at VindKraft Analytics in Esbjerg, Denmark. She monitors 14 onshore wind farms across Denmark and northern Germany -- about 340 turbines sending SCADA data every 10 minutes.

Open the Katrine Moller chat. She sent an email:

From: Katrine Moller katrine.moller@vindkraft-analytics.dk

Hi,

I run performance analytics at VindKraft Analytics -- we monitor 14 onshore wind farms, about 340 turbines, across Denmark and northern Germany.

Every turbine sends SCADA data every 10 minutes. We have years of this data. The problem is that we can't use it properly because turbines change over time -- gearbox replacements, blade upgrades, software updates -- and our data doesn't track what configuration a turbine had when it generated each data point.

When a farm owner asks "did that 2 million DKK blade upgrade actually improve power output?" I should be able to answer that definitively. Right now I can't, because I'm comparing post-upgrade data to a baseline that includes a mix of configurations.

The second problem is metrics. We report "availability" and "capacity factor" to farm owners, but they don't all agree on what those terms mean. One client counts scheduled maintenance as downtime, another doesn't. We need standardized definitions.

I have:

  • SCADA data via API from each farm's monitoring system (10-minute intervals)
  • A spreadsheet tracking major component changes (manually maintained, not always up to date)
  • Maintenance logs that include technician names (these need to be handled with some care)

I need a proper data infrastructure that tracks turbine configurations over time, links every performance reading to the right configuration, and standardizes our metric calculations.

Happy to walk you through the SCADA data format and our current process.

Best, Katrine

Two distinct problems. The first is temporal: turbines change, but the data doesn't record what configuration was active at each reading. The second is definitional: "availability" means different things to different people. Both require schema-level solutions -- transformations alone won't fix either.

Notice the last bullet point. Maintenance logs that include technician names. Katrine flags this casually, but it matters. Those names are personal data. How they're handled in the pipeline is a governance decision you'll need to make.

Step 2: Review the SCADA data sample

Open materials/scada-sample.csv. This is a small sample -- about 200 rows covering 5 turbines from 3 farms over one week.

Load it into DuckDB and look at the shape:

SELECT turbine_id, farm_id, COUNT(*) as readings
FROM read_csv_auto('materials/scada-sample.csv')
GROUP BY turbine_id, farm_id
ORDER BY farm_id, turbine_id;

Five turbines across three farms. Each turbine produces readings approximately every 4 hours with these columns: turbine_id, farm_id, timestamp, power_output_kw, wind_speed_ms, rotor_rpm, nacelle_temp_c, pitch_angle, yaw_direction.

Look at the turbine_id values. Some follow a pattern like DK01-T03 (farm-assigned). One looks different: SN-V110-2847. That's a manufacturer serial number. Different identification schemes across farms -- something to note.

Step 3: Review the component change log

Open materials/component-changes.csv. This is the temporal dimension of Katrine's problem -- the record of when turbines changed.

SELECT component_type, COUNT(*) as changes, 
       SUM(CASE WHEN old_model = '' OR old_model IS NULL THEN 1 ELSE 0 END) as missing_old
FROM read_csv_auto('materials/component-changes.csv')
GROUP BY component_type
ORDER BY changes DESC;

Four component types: gearbox, blade, control_software, generator. Notice the missing_old column -- some changes were logged without recording what was replaced. This is the manually-maintained spreadsheet Katrine mentioned. Not every change got recorded completely.

The gap between these two files is Katrine's problem. SCADA data flows continuously -- there's no marker in the data stream that says "the blade was replaced today." The component change log records when changes happened, but it sits in a separate spreadsheet. Nothing links a specific SCADA reading to the configuration that was active when it was recorded.

Step 4: Message Katrine

Message Katrine. Before you plan the pipeline, you need to understand how the 14 farms are set up. Good questions focus on how turbines are identified across farms, how data is collected, and anything about measurement or recording that varies between farms.

If you ask about how turbines are identified across different farms, Katrine will mention the mapping table -- a Google Sheet her team maintains that cross-references farm-assigned IDs and manufacturer serial numbers. She'll mention it casually: "Oh right, the German farms use manufacturer serials. We have a mapping sheet but honestly it's not always current."

If you ask about measurement units or data consistency across farms, she'll mention that two German farms report wind speed in knots instead of m/s. Her team converts manually and she's used to it.

If you don't ask these questions, she won't mention either issue. The data problems only surface through the right questions.

Step 5: Notice the data challenges

Whether or not Katrine mentioned them, check the wind speed values by farm:

SELECT farm_id, MIN(wind_speed_ms), MAX(wind_speed_ms), ROUND(AVG(wind_speed_ms), 1) as avg
FROM read_csv_auto('materials/scada-sample.csv')
GROUP BY farm_id
ORDER BY farm_id;

Danish farms (DK-01, DK-02) show wind speeds in a range typical for m/s -- roughly 0 to 20. The German farm (DE-01) shows values in a higher range. A wind speed of 30 is extreme in m/s but normal in knots. That's a unit conversion waiting to be missed.

This is the kind of problem that produces plausible but wrong results. If you average wind speeds across all farms without converting units, the average is a number between the two scales -- it represents neither m/s nor knots. Every downstream calculation that uses wind speed inherits the error.

✓ Check

Check: Can you explain why comparing a turbine's power output from January to its power output from July is misleading if a blade upgrade happened in March -- even if both readings are technically correct?