Learn by Directing AI
All materials

CLAUDE.md

VindKraft Analytics -- Turbine Performance Infrastructure

Project

Client: Katrine Moller, Head of Performance Analytics, VindKraft Analytics ApS Location: Esbjerg, Denmark What you're building: An SCD-tracked turbine configuration pipeline with a MetricFlow semantic layer and PII masking. The pipeline tracks turbine component changes over time, links every SCADA performance reading to the turbine's configuration at the time it was recorded, standardizes metric calculations across 14 farms, and masks technician PII across all output surfaces.

Tech stack

  • dbt Core with DuckDB adapter -- transformation engine, MetricFlow semantic layer, model contracts
  • Soda Core -- quality monitoring (batch-level checks)
  • Dagster -- orchestration, freshness policies
  • GitHub Actions -- CI/CD quality gates
  • DuckDB -- local analytical warehouse
  • Claude Code -- AI directing tool
  • Git/GitHub -- version control

Data dictionary

SCADA data (scada-sample.csv, scada-data-full.csv)

Column Type Notes
turbine_id string Farm-assigned format for Danish farms (e.g., DK01-T03), manufacturer serial for German farms (e.g., SN-V110-2847)
farm_id string Farm identifier (DK-01 through DK-04, DE-01, DE-02)
timestamp datetime ISO 8601, 4-hour intervals (sample) or hourly (full)
power_output_kw float Non-negative. 0 during downtime. Rated capacity 2000-3000 kW depending on model
wind_speed_ms float Danish farms: m/s (3-25 range). German farms DE-01 and DE-02: knots (6-48 range) -- needs conversion
rotor_rpm float 0 when not generating, 8-16 RPM during generation
nacelle_temp_c float 15-45C. ~2% nulls in full dataset (sensor dropout)
pitch_angle float 0-25 degrees. ~1% nulls in full dataset
yaw_direction float 0-359 degrees

Component changes (component-changes.csv)

Column Type Notes
turbine_id string Mixed format (farm-assigned and manufacturer serial)
change_date date ISO 8601. Range: 2023-06-01 through 2024-06-15
component_type string One of: gearbox, blade, control_software, generator
old_model string (nullable) 10-15% of rows have null (change not recorded)
new_model string (nullable) Some rows have null (partial record)

Maintenance logs (maintenance-logs.csv)

Column Type Notes
turbine_id string Mixed format
farm_id string Farm identifier
log_date date ISO 8601. Range: 2024-01-01 through 2024-06-30
maintenance_type string "scheduled" (~70%) or "unscheduled" (~30%)
technician_name string PII -- must be masked with SHA-256 in staging
description string Short text (5-15 words)
duration_hours float Scheduled: 2-8 hours. Unscheduled: 1-48 hours

Turbine mapping (turbine-mapping.csv)

Column Type Notes
farm_id string Farm identifier
farm_assigned_id string (nullable) 2-3 German farm rows have null (mapping out of date)
manufacturer_serial string Always populated
turbine_model string Vestas V110-2.0, Vestas V90-2.0, Siemens SWT-2.3-108, Siemens SWT-3.0-108
farm_name string Nordsoe Vest, Ringkobing Fjord, Thy Nord, Vendsyssel, Husum Marsch, Schleswig Binnenland

Design decisions

Fill these in as you make them:

  • SCD strategy: Type 2 for turbine configuration components (gearbox, blade, control_software, generator). Type 1 for status and metadata that don't affect performance analysis.
  • Identity resolution: Normalize all turbine IDs to a consistent format using turbine-mapping.csv. Handle orphaned records (SCADA data for unmapped turbines).
  • Metric definitions: "Availability" = (total hours - unscheduled downtime hours) / total hours. "Capacity factor" = actual energy output / maximum possible output at rated capacity. Define once in MetricFlow.
  • PII masking: SHA-256 hash technician names in staging layer. Verify across all surfaces: marts, docs, logs, staging access.

Naming conventions

  • stg_ prefix for staging models
  • int_ prefix for intermediate models
  • dim_ prefix for dimension tables
  • fct_ prefix for fact tables
  • Source-conform only in staging (no business logic)
  • Conventional commits for version control

Key issues

  • German farms use manufacturer serial numbers as turbine IDs; Danish farms use farm-assigned IDs
  • Two German farms report wind speed in knots (not m/s)
  • Component change log is manually maintained; 10-15% of changes never logged
  • Turbine mapping table (Google Sheet) is sometimes out of date -- 2-3 entries have null farm_assigned_id

Verification targets

  • All wind speeds in m/s after staging
  • All turbine IDs in consistent format after staging
  • No overlapping SCD effective date ranges for any turbine
  • Every SCADA record joins to exactly one configuration (no orphans, no Cartesian products)
  • MetricFlow availability excludes scheduled maintenance
  • No PII in mart models, dbt docs output, debug logs, or staging table access
  • dbt tests pass (SCD-specific + standard)
  • Soda Core checks pass (trend analysis on configuration changes)
  • Dagster freshness policies differentiated: tight for fact tables, loose for SCD dimensions

Commit convention

Commit after each significant milestone: staging models complete, SCD dimension built, fact table linked, MetricFlow defined, PII masking verified, quality tests passing.

Tickets / work breakdown

  1. Set up project and fill pipeline spec from Katrine's requirements
  2. Design SCD strategy (Type 1 vs Type 2 per dimension)
  3. Build staging models with unit conversion and identity resolution
  4. Build SCD Type 2 dimension for turbine configuration
  5. Build fact table with temporal joins to SCD dimension
  6. Define MetricFlow metrics (availability, capacity factor)
  7. Implement dbt model contracts
  8. Classify PII and build SHA-256 masking macro
  9. Apply masking in staging layer and verify across surfaces
  10. Add SCD-specific dbt tests and Soda Core checks
  11. Configure Dagster differential freshness policies
  12. Update CI/CD pipeline
  13. Prepare and present results to Katrine
  14. Evaluate scope creep (warranty tracking) and close