Learn by Directing AI
Unit 4

Build staging models

Step 1: The source-conforming pattern

Before building any models, understand what a staging model is supposed to do -- and what it is not supposed to do.

Open materials/pipeline-spec.md and find the "Source-conforming pattern" section. The rule is straightforward: staging models clean and rename columns from raw sources, but they do not apply business logic to the data. Type casting, null handling, and column renaming happen in staging. Business logic -- joins, calculations, aggregations -- happens in the mart layer.

The stg_ prefix is a structural claim. When another engineer sees stg_production, they know immediately: this model reads from one raw source, cleans the column names, casts the types, and produces a table where every column follows the project's naming convention. It does not join other sources. It does not calculate anything. The staging layer is a contract -- "this data is clean and consistently named, ready for the mart to use."

The pipeline spec defines the naming convention: snake_case for all columns, stg_ prefix for staging models, fct_ prefix for marts. Check which columns each staging model needs to include. The profitability mart will need variety, kilos_milk_in, kilos_cheese_out, aging_start_date, and aging_end_date from production. It will need variety, quantity_sold_kg, and price_per_kg from sales. It will need shepherd_name, liters_received, and price_per_liter from purchases. If a column is missing from staging, the mart cannot use it.

Step 2: Build stg_production

Direct Claude to write the first staging model. This is the first dbt model you are building -- SQL inside a framework, not a standalone script.

Write a staging model at branzeria_carpati/models/staging/stg_production.sql. It should read from the raw production_log source using source('raw', 'production_log'). Rename and cast all columns following snake_case convention. Include: batch_number, variety, milk_type, kilos_milk_in (numeric), kilos_cheese_out (numeric), aging_start_date (date), aging_end_date (date, nullable), shepherd_name. When aging_end_date is null, leave it as null -- the mart will handle the calculation logic.

Review what Claude produces. The model should be a SELECT statement with source('raw', 'production_log') in the FROM clause. Every column should be explicitly named and cast.
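A source-conforming model has a recognizable shape. The sketch below shows that shape only -- the raw column names (BatchNumber, MilkType, and so on) are assumptions, so check them against the actual headers in the raw source before comparing:

```sql
-- models/staging/stg_production.sql
-- Sketch only: raw column names on the left are assumptions.
select
    BatchNumber                     as batch_number,
    Variety                         as variety,
    MilkType                        as milk_type,
    cast(KilosMilkIn as decimal)    as kilos_milk_in,
    cast(KilosCheeseOut as decimal) as kilos_cheese_out,
    cast(AgeStartDate as date)      as aging_start_date,
    cast(AgeEndDate as date)        as aging_end_date,  -- stays null when null
    ShepherdName                    as shepherd_name
from {{ source('raw', 'production_log') }}
```

Note what is absent: no WHERE, no JOIN, no calculated columns. Every line either renames or casts.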

AI commonly generates staging models that mix raw column names with renamed ones in the same model -- some columns cleaned up, others passed through unchanged. Check that every column follows the naming convention from materials/pipeline-spec.md.

The type casts matter too. AI tends to cast types without considering edge cases. Depending on the warehouse, a CAST(kilos_milk_in AS DECIMAL) on a column containing text values either fails the whole run or -- with a lenient cast like TRY_CAST -- produces silent NULLs: no error, just missing data. For this project, the CSV data is clean, but the habit of checking AI's casts against the actual source data is worth building now.
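One way to build that habit, assuming a warehouse that supports TRY_CAST (DuckDB, Snowflake, and SQL Server do; Postgres does not), is to query for raw values a lenient cast would lose. KilosMilkIn here is an assumed raw column name:

```sql
-- Sketch: find non-null raw values that a lenient cast would turn into NULL.
select KilosMilkIn
from {{ source('raw', 'production_log') }}
where KilosMilkIn is not null
  and try_cast(KilosMilkIn as decimal) is null
```

Zero rows means the cast is safe; any rows returned are exactly the values that would go missing silently in staging.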

Step 3: Run dbt run and verify the row count

Run the model from inside the branzeria_carpati/ directory:

dbt run

dbt reports which models ran, in what order, how long each took, and whether they succeeded. This is the first time you are seeing framework-provided pipeline status. In P1 and P2, you checked pipeline results by querying row counts manually. dbt gives you a build report automatically -- which models were built, whether each one passed or failed, and how long the run took overall.

Read the output. The summary line at the end tells you how many models passed, how many failed, and how many were skipped.

Now verify the row count. The staging model must contain exactly 240 rows -- the same number as the raw production log. Query the staging table:

SELECT COUNT(*) FROM stg_production;

Compare that to the raw source:

SELECT COUNT(*) FROM production_log;

If the counts do not match, something in the staging model is filtering or duplicating rows. A staging model that follows the source-conforming pattern never changes the row count. Every row in, every row out. This is the same verification from P1 and P2, but now you read dbt's run output as a first check before querying directly.
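The two counts can also be compared in a single query, which reads as a pass/fail check rather than two numbers to eyeball:

```sql
select
    (select count(*) from stg_production) as staging_count,
    (select count(*) from production_log) as raw_count,
    (select count(*) from stg_production) =
    (select count(*) from production_log) as counts_match
```

counts_match should come back true; both counts should be 240.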

Step 4: Review the staging model structure

Priya Venkatesh, one of the senior data engineers, flags a common pattern she sees in AI-generated staging models. Her question is direct -- did AI's column selections match what the pipeline spec requires? Open materials/pipeline-spec.md and compare the required columns against what Claude actually included in the model.

Check three things:

Column completeness. Every column the profitability mart needs must be present in the staging layer. If kilos_cheese_out is missing from stg_production, the yield calculation downstream cannot be built -- and if the mart silently skips that calculation, nothing will error to tell you. Cross-reference materials/verification-checklist.md for the required column list.

Naming consistency. All columns should follow snake_case. AI sometimes renames most columns but leaves one or two with their raw names -- aging_start_date cleaned up but AgeEndDate passed through unchanged. One inconsistency is a sign there may be others.

Type accuracy. Numeric columns should be cast to numeric types. Date columns should be cast to dates. Check that the casts match what the source columns actually contain.
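All three checks can be made mechanical rather than visual. Most warehouses expose column metadata through information_schema, so a query like this sketch lists every staging column and its type in one place for comparison against the spec:

```sql
-- Sketch: one row per column across the staging models.
select table_name, column_name, data_type
from information_schema.columns
where table_name in ('stg_production', 'stg_sales', 'stg_purchases')
order by table_name, ordinal_position
```

Scan the column_name values for anything that is not snake_case and the data_type values for dates or numerics still stored as text.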

Step 5: Build stg_sales and stg_purchases

The remaining two staging models follow the same pattern. Direct Claude to build both:

Write a staging model at branzeria_carpati/models/staging/stg_sales.sql. Read from source('raw', 'sales'). Include: customer_name, variety, quantity_sold_kg (numeric), price_per_kg (numeric), sale_date (date), customer_type. All columns snake_case.

Write a staging model at branzeria_carpati/models/staging/stg_purchases.sql. Read from source('raw', 'milk_purchases'). Include: shepherd_name, liters_received (numeric), fat_content_pct (numeric), price_per_liter (numeric), purchase_date (date), milk_type. All columns snake_case.

Run dbt run again after both models are written. This time, dbt builds all three staging models. The output shows the execution order -- since none of the staging models depend on each other (they all read from raw sources via source()), dbt is free to build them in any order, even in parallel; the order you see does not come from DAG dependencies. The dependencies become meaningful when the mart model arrives in the next unit.

Verify the row counts for both new models. stg_sales must match the raw sales count (380 rows). stg_purchases must match the raw purchases count (195 rows):

SELECT
  (SELECT COUNT(*) FROM stg_sales) AS stg_sales_count,
  (SELECT COUNT(*) FROM sales) AS raw_sales_count,
  (SELECT COUNT(*) FROM stg_purchases) AS stg_purchases_count,
  (SELECT COUNT(*) FROM milk_purchases) AS raw_purchases_count;

All staging counts must match their raw source counts exactly. If any count is off, review the staging model for WHERE clauses, JOINs, or DISTINCT operations that should not be there -- a source-conforming staging model does none of these.

Step 6: AI self-review

All three staging models are built and row counts are verified. Now direct Claude to review its own work against the pipeline spec:

Review all three staging models (stg_production, stg_sales, stg_purchases) against the pipeline spec in materials/pipeline-spec.md. List every gap you find in: naming convention compliance (snake_case, stg_ prefix), type casting accuracy, column completeness (are all columns needed for the profitability mart present?), and source-conforming pattern compliance (no joins, no business logic, no filtering). Be specific about what you find.

The quality of AI's self-review depends on how specific your prompt is. A vague "review the models" produces reassurance -- "everything looks good." A prompt that names what to check produces actionable findings. If Claude reports gaps, fix them and re-run dbt run to confirm the changes did not break anything.

This is the same self-review technique from P2, applied to new terrain. The specificity of the review prompt determines whether you get useful findings or a rubber stamp.


✓ Check

✓ Check: stg_production row count matches raw production log. stg_sales matches raw sales. stg_purchases matches raw purchases. Naming convention is consistent across all three models. No columns missing that the mart will need.