Learn by Directing AI
Unit 4

Build the mart table

Step 1: Review what the mart layer does

Open materials/pipeline-spec.md and read the mart layer section. Then open materials/schema.md and look at the fct_harvests table definition.

The staging table cleaned the data. It cast types, renamed columns, handled nulls. But it did not answer any of Carlos's questions. He needs production totals by collection point. He needs to look up what a specific beekeeper delivered and when. He needs traceability data for buyers in Maputo who want to know where the honey came from.

The mart layer is where those questions get answered. fct_harvests takes the clean staging data and transforms it into something a cooperative manager can actually query. That means deduplication — making sure the same harvest record does not appear twice — and structuring the data so that grouping by collection point, by grade, or by beekeeper produces meaningful totals.

The distinction matters: staging conforms data, the mart serves the consumer. If the staging layer is a promise that the data is clean, the mart layer is a promise that the numbers mean something.

Step 2: Direct Claude Code to write the mart SQL

A deduplication key is the combination of columns that uniquely identifies a record. Choosing the wrong key means keeping duplicates or losing valid records — if you deduplicate on beekeeper name alone, you collapse every delivery that beekeeper ever made into a single row.
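The collapse described above can be demonstrated with a few toy rows. This sketch uses SQLite instead of DuckDB and invented sample data; the column names (`beekeeper_name`, `collection_date`, `weight_kg`) are assumptions based on the lesson, not the actual schema in materials/schema.md.

```python
import sqlite3

# Toy data: one beekeeper with two deliveries on different dates,
# plus one true duplicate row. Names and values are invented.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE stg_harvests (beekeeper_name TEXT, collection_date TEXT, weight_kg REAL)"
)
con.executemany("INSERT INTO stg_harvests VALUES (?, ?, ?)", [
    ("Amelia Sitoe", "2024-03-01", 12.5),
    ("Amelia Sitoe", "2024-03-15", 9.0),   # second delivery: NOT a duplicate
    ("Amelia Sitoe", "2024-03-15", 9.0),   # true duplicate of the row above
])

# Wrong key: name alone collapses both deliveries into one row.
wrong = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT beekeeper_name FROM stg_harvests)"
).fetchone()[0]

# Better key: name + date keeps both deliveries, drops only the true duplicate.
right = con.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT beekeeper_name, collection_date FROM stg_harvests)"
).fetchone()[0]

print(wrong, right)  # 1 2
```

The name-only key reports one record where two real harvests exist: the exact failure mode the step above warns about.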

Ask Claude Code to create the mart table:

Write a SQL script that creates fct_harvests from stg_harvests in DuckDB. It should: deduplicate the harvest records, preserve all columns needed for production reporting by collection point and quality grade, and support beekeeper-level history queries. Reference materials/schema.md for the column specifications and materials/pipeline-spec.md for what the mart needs to support.

Claude Code will read the schema and pipeline spec and produce a SQL script. Before you run it, read what it wrote. The script contains a deduplication strategy — some decision about which column or columns identify a unique harvest record. That decision determines what the mart table contains. A different key produces different row counts, different totals, different answers to every query Carlos runs.
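When you read the generated script, the deduplication usually shows up as a window-function pattern. This sketch runs that pattern against SQLite with toy data (DuckDB also accepts a QUALIFY shorthand for the same idea); the column names are assumptions, and the key Claude Code actually picks may differ. Whatever columns appear in PARTITION BY are the deduplication key.

```python
import sqlite3

# Sketch of the ROW_NUMBER() dedup pattern to look for in the generated
# script, run here against SQLite with invented rows.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE stg_harvests (beekeeper_name TEXT, collection_date TEXT, weight_kg REAL)"
)
con.executemany("INSERT INTO stg_harvests VALUES (?, ?, ?)", [
    ("A", "2024-03-01", 12.5),
    ("A", "2024-03-01", 12.5),  # exact duplicate to be dropped
    ("B", "2024-03-02", 7.0),
])

# ROW_NUMBER() numbers rows within each partition; keeping rn = 1
# keeps exactly one row per deduplication key.
dedup_sql = """
SELECT beekeeper_name, collection_date, weight_kg
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY beekeeper_name, collection_date
        ORDER BY collection_date
    ) AS rn
    FROM stg_harvests
)
WHERE rn = 1
"""
deduped = con.execute(dedup_sql).fetchall()
print(len(deduped))  # 2
```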

AI commonly gets deduplication logic wrong. It tends to pick the simplest, most obvious column as the unique key — which may not actually identify unique records in data where the same entity appears multiple times for legitimate reasons. When that happens, separate records get silently merged. No error. The row count just comes out lower than it should, and the totals are wrong.

You do not need to catch the problem here. You need to know the check is coming.

Step 3: Run the mart script

Run the script and query the result:

Run the mart SQL script, then show me the first 10 rows of fct_harvests and a total row count.

Look at the output. The mart table should have the columns described in the schema documentation. The row count should be lower than the staging count, since deduplication removes true duplicates. But "lower" is not enough information. The question is whether the count matches the expected value.

AI may summarize its own output confidently. It might say "deduplication complete, all records processed correctly." That is a description of what it did, not evidence that the result is right. The query output is the evidence.

Step 4: Check the mart row count

Open materials/verification-checklist.md and find the expected mart row count. Then compare:

SELECT COUNT(*) AS mart_count FROM fct_harvests;

Does your count match the expected value from the checklist? If it does, move to Step 6. If it does not, something went wrong during deduplication — and the numbers in every downstream query are affected.

A mart row count that is too low means records that should be separate were merged. Harvests that actually happened — different dates, different weights, different prices — were collapsed into one because the deduplication treated them as duplicates. The data looks cleaner than it is. The totals are wrong.
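One way to quantify an over-merge is to compare the mart count against the number of distinct composite keys in staging: the gap is the number of valid records that were merged away. A sketch with invented rows and assumed column names, supposing the generated mart deduplicated on beekeeper name alone:

```python
import sqlite3

# Toy staging data: A delivered twice (different dates), B once.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_harvests (beekeeper_name TEXT, collection_date TEXT)")
con.executemany("INSERT INTO stg_harvests VALUES (?, ?)", [
    ("A", "2024-03-01"),
    ("A", "2024-03-15"),
    ("B", "2024-03-01"),
])

# Count the mart would have if it deduped on beekeeper_name alone:
mart_count = con.execute(
    "SELECT COUNT(DISTINCT beekeeper_name) FROM stg_harvests"
).fetchone()[0]

# Count it should have under the composite key:
expected_count = con.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT beekeeper_name, collection_date FROM stg_harvests)"
).fetchone()[0]

print(mart_count, expected_count)  # 2 3: one valid harvest merged away
```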

Step 5: Investigate the mismatch

If the count is wrong, find out why. Look at the deduplication logic in the SQL script Claude Code wrote. What column or columns did it use as the unique key?

Show me the deduplication logic in the mart SQL script. What columns does it use to identify unique records?

Think about what makes a harvest record unique. A beekeeper delivers honey on a specific date, at a specific collection point, with a specific weight and grade. The same beekeeper delivering on two different dates is two separate harvests — not a duplicate. If the deduplication key only uses the beekeeper name, every beekeeper's history gets collapsed to a single record.

Direct Claude Code to fix the deduplication key:

The deduplication key needs to be a composite of beekeeper name and collection date: deliveries from the same beekeeper on different dates are separate harvests, not duplicates. Fix the mart SQL and re-run it.

After the fix, check the row count again. It should now match the verification checklist value.

Step 6: Spot-check the revenue top-3

Row counts tell you the right number of records exist. They do not tell you the values are correct. Check the revenue totals by collection point:

SELECT collection_point, SUM(price_paid_mzn) AS total_revenue
FROM fct_harvests
GROUP BY collection_point
ORDER BY total_revenue DESC
LIMIT 3;

Compare the top three collection points and their revenue totals against the values in materials/verification-checklist.md. The names and the numbers should match. If the revenue totals are wrong — even if the row count is now correct — there is still a problem in the transformation logic. A total that is close but not exact is wrong.
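The comparison can be made mechanical rather than eyeballed. This sketch runs the same top-3 query against SQLite with invented rows and invented expected values; in practice the expected list would come from materials/verification-checklist.md, and an exact equality check enforces the "close but not exact is wrong" rule.

```python
import sqlite3

# Toy fct_harvests with invented collection points and prices.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_harvests (collection_point TEXT, price_paid_mzn REAL)")
con.executemany("INSERT INTO fct_harvests VALUES (?, ?)", [
    ("Chimoio", 5000.0), ("Chimoio", 3000.0),
    ("Gorongosa", 6500.0),
    ("Dondo", 1200.0),
])

# Same shape as the lesson's spot-check query.
top3 = con.execute("""
    SELECT collection_point, SUM(price_paid_mzn) AS total_revenue
    FROM fct_harvests
    GROUP BY collection_point
    ORDER BY total_revenue DESC
    LIMIT 3
""").fetchall()

# Stand-in for the checklist values; exact match required, no tolerance.
expected = [("Chimoio", 8000.0), ("Gorongosa", 6500.0), ("Dondo", 1200.0)]
print(top3 == expected)  # True
```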


Check your understanding: Mart row count matches expected value from verification checklist. Revenue top-3 collection points match expected values.