Learn by Directing AI
Unit 3

Build the staging layer

Step 1: Review what the staging layer does

Open materials/schema.md and read the staging layer section. The staging table — stg_harvests — sits between the raw data you loaded and the mart table Carlos will actually query. Its job is narrow: take what arrived and make it consistent. Cast dates to date types. Cast weights and prices to numbers. Rename columns to match the naming convention. Handle nulls.

The staging layer does not add business logic. It does not deduplicate. It does not aggregate. It conforms — every column has the right type, the right name, and no silent surprises hiding in the values. The reason this layer exists is simple: if you skip it and build the mart directly from raw data, every downstream query inherits whatever parsing decisions were made during CSV loading. Dates that are still strings. Prices that might be text. A weight column where some values are empty and others are the literal word "null."

Check the pipeline spec (materials/pipeline-spec.md) to confirm what the staging layer should produce. The column names, the expected types, the naming convention — stg_harvests makes a structural claim about the data. When someone sees that prefix, they know the data has been conformed but not transformed.

Step 2: Direct Claude Code to write the staging SQL

Casting means changing a value's stored type — so a date string like "2024-03-15" becomes a real date the database can sort and compare, and a weight stored as text becomes a number you can sum. Without casting, the database treats everything loaded from a CSV as raw text.
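In DuckDB, a cast is written inline in the SELECT list. CAST raises an error on a value it cannot convert, while TRY_CAST returns NULL instead. A minimal sketch, using column names from this unit's schema:

SELECT
    CAST(collection_date AS DATE) AS collection_date,
    TRY_CAST(weight_kg AS DECIMAL(10, 2)) AS weight_kg
FROM raw_harvests;

The choice between CAST and TRY_CAST is itself a decision: fail loudly on bad values, or quietly turn them into NULLs.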

Ask Claude Code to create the staging table:

Write a SQL script that creates a stg_harvests table from raw_harvests in DuckDB. It should: cast collection_date to DATE, cast weight_kg and price_paid_mzn to numeric types, rename columns to match the stg_ naming convention from materials/schema.md, and handle null values. Reference materials/schema.md for the exact column specifications.

Claude Code will read the schema documentation and produce a SQL script. Before you run it, look at what it wrote. The SQL contains a series of decisions — how to parse dates that might be in different formats, what to do with a weight value that's empty versus one that's the string "null," whether to coerce a quality grade to uppercase.

Every one of those decisions is an interpretation. AI makes them without stating them. A date string "05/06/2024" could be May 6th or June 5th depending on the format assumption. A weight field with a leading zero might get truncated. These are the kinds of silent choices that produce data that looks right but isn't, and no error message tells you it happened.

You don't need to catch every interpretation right now. You need to know they're happening.
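For reference, a staging script that states its decisions explicitly might look something like this. The exact column names, date format, and numeric precision must come from materials/schema.md; everything below is illustrative:

CREATE OR REPLACE TABLE stg_harvests AS
SELECT
    -- Decision: dates are assumed to arrive as YYYY-MM-DD; an explicit
    -- format avoids the 05/06/2024 ambiguity
    strptime(collection_date, '%Y-%m-%d')::DATE AS collection_date,
    -- Decision: empty strings and the literal word 'null' become SQL NULL
    TRY_CAST(NULLIF(NULLIF(TRIM(weight_kg), ''), 'null') AS DECIMAL(10, 2)) AS weight_kg,
    TRY_CAST(NULLIF(NULLIF(TRIM(price_paid_mzn), ''), 'null') AS DECIMAL(10, 2)) AS price_paid_mzn,
    -- Decision: quality grades are trimmed and uppercased for consistency
    UPPER(TRIM(quality_grade)) AS quality_grade
FROM raw_harvests;

Notice that each comment names an interpretation. A script Claude Code writes will make the same kinds of choices whether or not it writes them down.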

Step 3: Run the staging script

Run the script. Then query the staging table:

Run the staging SQL script, then show me the first 10 rows of stg_harvests and a total row count.

Look at the output. Dates should look like dates, not strings. Weights and prices should be numbers. Column names should follow the naming convention from the schema documentation. If something looks off — a column still named price_paid_mzn when the schema calls for something different, or dates showing as text — that's worth catching now.
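Eyeballing rows catches some of this, but DuckDB can report the column types directly:

DESCRIBE stg_harvests;

If collection_date shows as VARCHAR rather than DATE, the cast didn't happen, regardless of what the script's summary claims.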

AI may describe its own output as correct. It might say "all types have been successfully cast" in its summary. That description is not evidence. The query output is evidence. Read the actual values.

Step 4: Compare row counts

This is the verification that matters. Count the rows in both tables:

SELECT COUNT(*) AS raw_count FROM raw_harvests;
SELECT COUNT(*) AS staging_count FROM stg_harvests;

The two numbers should match exactly. The staging layer conforms data — it does not filter it, it does not deduplicate it, it does not drop rows. Every record that arrived in raw_harvests should appear in stg_harvests. If the staging count is lower, records were lost. That means something in the transformation silently excluded rows — maybe a type cast failed on certain values and AI's script skipped those rows instead of handling them. Maybe a WHERE clause filtered out nulls that should have been preserved.

A pipeline that runs without errors and silently drops records is a broken pipeline. The row count comparison is how you catch it.
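You can also put both counts in one result so the comparison can't be misread:

SELECT
    (SELECT COUNT(*) FROM raw_harvests) AS raw_count,
    (SELECT COUNT(*) FROM stg_harvests) AS staging_count,
    (SELECT COUNT(*) FROM raw_harvests)
        - (SELECT COUNT(*) FROM stg_harvests) AS missing;

A missing value of anything other than 0 means rows were dropped.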

If the counts don't match, ask Claude Code what happened. Direct it to find which records are in raw_harvests but missing from stg_harvests. Fix the staging script and re-run until the counts match.
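One way to find the missing records yourself is an anti-join, assuming the two tables share an identifying column. The harvest_id column here is hypothetical; use whatever key materials/schema.md actually defines:

SELECT r.*
FROM raw_harvests AS r
LEFT JOIN stg_harvests AS s ON r.harvest_id = s.harvest_id
WHERE s.harvest_id IS NULL;

Every row this returns arrived in raw_harvests but never made it to staging. Those rows usually point straight at the failing cast or filter.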

Step 5: Spot-check values against the source

Row counts tell you nothing was lost. They don't tell you the values are right. Pick three beekeeper records from the original CSV files and verify them in the staging table:

Show me the staging records for beekeeper [name] from [collection point]. I want to compare against the source CSV.

Open one of the source CSVs — say materials/harvest-data/gorongosa-north-q3.csv — and pick a record. Check the date, the weight, the price, the quality grade. Do the staging values match? A date that shifted by a month during parsing or a weight that lost a decimal place is wrong data that passed every automated check.

Do this for at least three records from different collection points. This is manual, and it's slow, and it's necessary. A revenue total that's close but not exact is wrong. Spot-checking is how you catch the errors that row counts can't.
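The underlying query for a spot-check is simple. The beekeeper_name and collection_point column names are guesses at the conformed names; substitute whatever materials/schema.md specifies, and fill in the bracketed values from the CSV record you picked:

SELECT collection_date, weight_kg, price_paid_mzn, quality_grade
FROM stg_harvests
WHERE beekeeper_name = '[name]'
  AND collection_point = '[collection point]';

Compare each returned value, character by character, against the source row.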


Check your understanding: Staging table row count must match raw table row count exactly. Spot-check 3 records against source values.