Learn by Directing AI
Unit 3

Build staging models and resolve turbine identities

Step 1: Build the SCADA staging model

Time to build the first pipeline layer. The staging models source-conform the raw data -- no business logic, just clean inputs.

Direct Claude to build the staging model for SCADA data. Be specific about what you need in your context: the data dictionary, the naming convention (stg_ prefix), and the source-conform constraint. Feed Claude the relevant parts of your CLAUDE.md -- the SCADA columns, the unit conversion issue, and the turbine ID mapping problem.

The staging model needs to:

  • Convert wind speed from knots to m/s for the two German farms (DE-01, DE-02)
  • Join against the turbine mapping table to normalize all turbine IDs to a consistent format
  • Apply the stg_ naming prefix
  • Do no business logic -- source-conform only

The context you give Claude for this task matters. The data dictionary and the unit conversion issue are relevant. The SCD design decision is not -- that's for the dimension build later. Curating what Claude sees per task is a practice. Including everything dilutes attention. Including too little leaves Claude to fill gaps with defaults.

Step 2: Build the component change staging model

Direct Claude to build the staging model for component changes. This one has a data quality challenge: Katrine suspects 10-15% of changes were never logged, and some records have missing old_model or new_model values.

The staging model should clean the change log: standardize component type names, handle the missing values, and use the same turbine ID format as the SCADA staging model.

If the Jinja logic for turbine ID resolution appears in both staging models, consider whether it should be a macro. The decision of what to extract into a macro is a design decision about maintainability. If the same ID resolution logic is duplicated in two models, a change to the mapping rules needs to be applied in two places -- and the second one is easy to miss.

Step 3: Verify the staging output

Query the staging models. Run these checks against your staged data:

SELECT DISTINCT farm_id, 
       MIN(wind_speed_ms) as min_wind, 
       MAX(wind_speed_ms) as max_wind
FROM stg_scada
GROUP BY farm_id
ORDER BY farm_id;

Are all wind speeds now in m/s? The German farms should show values in the same range as the Danish farms -- roughly 0 to 25 m/s. If DE-01 or DE-02 still shows values above 30, the conversion didn't apply.

Check the turbine IDs:

SELECT DISTINCT turbine_id, farm_id
FROM stg_scada
ORDER BY farm_id, turbine_id;

Are all turbine IDs in the consistent format? If you chose to normalize to farm-assigned format, the German farm turbines should now have IDs like DE01-T01 instead of SN-V110-2847. If they still show manufacturer serials, the mapping join didn't work.

Step 4: Profile the staged data

Direct Claude to profile the staged data. The profile informs the SCD build in the next unit.

SELECT COUNT(DISTINCT turbine_id) as turbines,
       COUNT(DISTINCT farm_id) as farms,
       MIN(timestamp) as earliest,
       MAX(timestamp) as latest,
       COUNT(*) as total_readings
FROM stg_scada;

How many distinct turbines? How many farms? What's the date range? For the component changes:

SELECT turbine_id, COUNT(*) as changes
FROM stg_component_changes
GROUP BY turbine_id
ORDER BY changes DESC
LIMIT 10;

How many component changes per turbine on average? Turbines with multiple changes will have multiple SCD Type 2 records. Turbines with no changes will have exactly one.

Step 5: Check turbine ID resolution

Look for orphaned records -- SCADA data where the turbine_id doesn't match any entry in the component change log:

SELECT s.turbine_id, s.farm_id, COUNT(*) as orphaned_readings
FROM stg_scada s
LEFT JOIN stg_component_changes c ON s.turbine_id = c.turbine_id
WHERE c.turbine_id IS NULL
GROUP BY s.turbine_id, s.farm_id;

If any turbines show up here, they have SCADA data but no component change history. This doesn't mean the data is wrong -- it means the mapping is incomplete, which is what Katrine told you about the Google Sheet. These turbines had no recorded component changes in the period.

Decide how to handle them. Quarantine (exclude from analysis until mapping is fixed) or flag (include with a warning column that the component history may be incomplete). Either decision is defensible -- the important thing is making it deliberately rather than letting the data silently vanish.

✓ Check

Check: Query the staging model. Are all wind speeds in the same unit (m/s)? Are turbine IDs consistent across Danish and German farms?