Learn by Directing AI
Unit 2

Plan the pipeline and design the SCD strategy

Step 1: Set up the project

Open a terminal and start Claude Code:

cd ~/dev
claude

Paste this setup prompt:

Create the folder ~/dev/data-engineering/p6. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p6/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude downloads the materials, extracts them, and reads the CLAUDE.md. After it finishes, check what you have:

ls materials/

You should see SCADA data files, component change logs, templates for the pipeline spec and SCD design, a CLAUDE.md template, and guides for MetricFlow and PII classification. These are your working materials for the project.

Step 2: Fill the pipeline spec

Open materials/pipeline-spec-template.md. This is the empty pipeline specification -- sections with guiding questions for you to fill from Katrine's email and your follow-up conversation.

Work through each section. The Data Sources section needs the SCADA data, the component change log, the maintenance logs, and the turbine mapping table. The Layer Architecture section needs your staging, intermediate, and mart conventions. The SCD Strategy section is the big one -- you'll fill that in the next step after reading the design template.

Ask Claude to help fill the template from the context you've gathered. Feed it the data dictionary from the CLAUDE.md and Katrine's requirements. Check what Claude produces against what Katrine actually said -- AI tends to add requirements the client didn't mention.

Step 3: Understand SCD strategies

Open materials/scd-design-template.md. This explains the design decision you're about to make.

Slowly-changing dimensions are how a data warehouse handles entities that change over time. A turbine is the same turbine after a blade upgrade, but its configuration is different. How you record that change determines what historical analysis is possible.

Type 1 overwrites. When something changes, the old value is gone. Simple. Every query returns the current state. But you lose all history -- you can never answer "what was the configuration before the upgrade?"

Type 2 tracks history. Every change creates a new row with effective_from and effective_to dates. Both the old and new configurations coexist in the table. You can answer "which configuration was active on March 10?" But every join to this table must include a date range condition. Miss that filter and you get a Cartesian product -- every reading matches every configuration.

graph LR
    subgraph "Type 1: Overwrite"
        A1[DK01-T02<br/>blade: LM 43.2<br/>No history]
    end
    subgraph "Type 2: History"
        B1[DK01-T02<br/>blade: LM 40.3<br/>2023-01 to 2024-03-15]
        B2[DK01-T02<br/>blade: LM 43.2<br/>2024-03-15 to current]
    end

Step 4: Make the SCD strategy decision

This is the most consequential schema decision in the project. For each dimension attribute, you decide: Type 1 or Type 2?

Turbine configuration components -- gearbox model, blade model, software version, generator model -- need Type 2. The entire point of Katrine's project is answering "did the blade upgrade improve performance?" That requires linking each SCADA reading to the configuration that was active when it was recorded. Type 1 would destroy the history she needs.

But not everything needs Type 2. What about turbine status (operational, maintenance, decommissioned)? What about metadata like the farm name or turbine model? AI will suggest Type 2 for everything -- more complete, more history. But completeness nobody needs is complexity everybody pays for.

Fill the SCD strategy table in the design template. For each attribute, write which type you chose and tie the rationale to Katrine's actual analytical needs. The test: would Katrine ever ask a question that requires the historical value of this attribute? If not, Type 1.

Step 5: Plan the pipeline work

Use plan mode to decompose the pipeline work. Ask Claude to create a plan for the full build:

  1. Resolve turbine ID mapping
  2. Build staging models with unit conversions
  3. Build SCD Type 2 dimension for turbine configuration
  4. Build fact table linking SCADA data to configuration via effective dates
  5. Define MetricFlow metrics
  6. Add PII masking
  7. Verify across surfaces
  8. Add quality tests

Review Claude's plan. Check that the sequencing makes sense -- staging before dimensions, dimensions before fact tables, masking before verification. If Claude suggests doing things in a different order, evaluate whether the dependencies hold.

Step 6: Create the project CLAUDE.md

Open materials/claudemd-template.md. This is the project governance file template. Fill it with what you've established so far:

  • The data dictionary (SCADA columns, component change columns, maintenance log columns)
  • The SCD design decision (which dimensions get Type 2, which get Type 1, and why)
  • The turbine ID mapping issue (German farms use manufacturer serials, Danish use farm-assigned)
  • The wind speed unit issue (two German farms report in knots)
  • Naming conventions (stg_, int_, dim_, fct_ prefixes)

This file shapes every AI interaction for the rest of the project. When you start a new Claude session, it reads this file and has the context for every prompt. The quality of what you put here determines the quality of what Claude produces downstream.

✓ Check

Check: Can you explain why you chose Type 1 for at least one dimension and Type 2 for another -- and what Katrine would lose if you had chosen differently?