VindKraft Analytics -- Turbine Performance Infrastructure
Project
Client: Katrine Moller, Head of Performance Analytics, VindKraft Analytics ApS
Location: Esbjerg, Denmark
What you're building: An SCD-tracked turbine configuration pipeline with a MetricFlow semantic layer and PII masking. The pipeline tracks turbine component changes over time, links every SCADA performance reading to the turbine's configuration at the time it was recorded, standardizes metric calculations across 14 farms, and masks technician PII across all output surfaces.
Tech stack
- dbt Core with DuckDB adapter -- transformation engine, MetricFlow semantic layer, model contracts
- Soda Core -- quality monitoring (batch-level checks)
- Dagster -- orchestration, freshness policies
- GitHub Actions -- CI/CD quality gates
- DuckDB -- local analytical warehouse
- Claude Code -- AI directing tool
- Git/GitHub -- version control
Data dictionary
SCADA data (scada-sample.csv, scada-data-full.csv)
| Column |
Type |
Notes |
| turbine_id |
string |
Farm-assigned format for Danish farms (e.g., DK01-T03), manufacturer serial for German farms (e.g., SN-V110-2847) |
| farm_id |
string |
Farm identifier (DK-01 through DK-04, DE-01, DE-02) |
| timestamp |
datetime |
ISO 8601, 4-hour intervals (sample) or hourly (full) |
| power_output_kw |
float |
Non-negative. 0 during downtime. Rated capacity 2000-3000 kW depending on model |
| wind_speed_ms |
float |
Danish farms: m/s (3-25 range). German farms DE-01 and DE-02: knots (6-48 range) -- needs conversion |
| rotor_rpm |
float |
0 when not generating, 8-16 RPM during generation |
| nacelle_temp_c |
float |
15-45C. ~2% nulls in full dataset (sensor dropout) |
| pitch_angle |
float |
0-25 degrees. ~1% nulls in full dataset |
| yaw_direction |
float |
0-359 degrees |
Component changes (component-changes.csv)
| Column |
Type |
Notes |
| turbine_id |
string |
Mixed format (farm-assigned and manufacturer serial) |
| change_date |
date |
ISO 8601. Range: 2023-06-01 through 2024-06-15 |
| component_type |
string |
One of: gearbox, blade, control_software, generator |
| old_model |
string (nullable) |
10-15% of rows have null (change not recorded) |
| new_model |
string (nullable) |
Some rows have null (partial record) |
Maintenance logs (maintenance-logs.csv)
| Column |
Type |
Notes |
| turbine_id |
string |
Mixed format |
| farm_id |
string |
Farm identifier |
| log_date |
date |
ISO 8601. Range: 2024-01-01 through 2024-06-30 |
| maintenance_type |
string |
"scheduled" (~70%) or "unscheduled" (~30%) |
| technician_name |
string |
PII -- must be masked with SHA-256 in staging |
| description |
string |
Short text (5-15 words) |
| duration_hours |
float |
Scheduled: 2-8 hours. Unscheduled: 1-48 hours |
Turbine mapping (turbine-mapping.csv)
| Column |
Type |
Notes |
| farm_id |
string |
Farm identifier |
| farm_assigned_id |
string (nullable) |
2-3 German farm rows have null (mapping out of date) |
| manufacturer_serial |
string |
Always populated |
| turbine_model |
string |
Vestas V110-2.0, Vestas V90-2.0, Siemens SWT-2.3-108, Siemens SWT-3.0-108 |
| farm_name |
string |
Nordsoe Vest, Ringkobing Fjord, Thy Nord, Vendsyssel, Husum Marsch, Schleswig Binnenland |
Design decisions
Fill these in as you make them:
- SCD strategy: Type 2 for turbine configuration components (gearbox, blade, control_software, generator). Type 1 for status and metadata that don't affect performance analysis.
- Identity resolution: Normalize all turbine IDs to a consistent format using
turbine-mapping.csv. Handle orphaned records (SCADA data for unmapped turbines).
- Metric definitions: "Availability" = (total hours - unscheduled downtime hours) / total hours. "Capacity factor" = actual energy output / maximum possible output at rated capacity. Define once in MetricFlow.
- PII masking: SHA-256 hash technician names in staging layer. Verify across all surfaces: marts, docs, logs, staging access.
Naming conventions
stg_ prefix for staging models
int_ prefix for intermediate models
dim_ prefix for dimension tables
fct_ prefix for fact tables
- Source-conform only in staging (no business logic)
- Conventional commits for version control
Key issues
- German farms use manufacturer serial numbers as turbine IDs; Danish farms use farm-assigned IDs
- Two German farms report wind speed in knots (not m/s)
- Component change log is manually maintained; 10-15% of changes never logged
- Turbine mapping table (Google Sheet) is sometimes out of date -- 2-3 entries have null farm_assigned_id
Verification targets
- All wind speeds in m/s after staging
- All turbine IDs in consistent format after staging
- No overlapping SCD effective date ranges for any turbine
- Every SCADA record joins to exactly one configuration (no orphans, no Cartesian products)
- MetricFlow availability excludes scheduled maintenance
- No PII in mart models, dbt docs output, debug logs, or staging table access
- dbt tests pass (SCD-specific + standard)
- Soda Core checks pass (trend analysis on configuration changes)
- Dagster freshness policies differentiated: tight for fact tables, loose for SCD dimensions
Commit convention
Commit after each significant milestone: staging models complete, SCD dimension built, fact table linked, MetricFlow defined, PII masking verified, quality tests passing.
Tickets / work breakdown
- Set up project and fill pipeline spec from Katrine's requirements
- Design SCD strategy (Type 1 vs Type 2 per dimension)
- Build staging models with unit conversion and identity resolution
- Build SCD Type 2 dimension for turbine configuration
- Build fact table with temporal joins to SCD dimension
- Define MetricFlow metrics (availability, capacity factor)
- Implement dbt model contracts
- Classify PII and build SHA-256 masking macro
- Apply masking in staging layer and verify across surfaces
- Add SCD-specific dbt tests and Soda Core checks
- Configure Dagster differential freshness policies
- Update CI/CD pipeline
- Prepare and present results to Katrine
- Evaluate scope creep (warranty tracking) and close