Learn by Directing AI
Unit 5

Verify and re-run

Step 1: Re-run the full pipeline

Your pipeline works. Row counts match, the top-3 revenue figures match, and the mart table answers Carlos's questions. But there is a question you have not asked yet: what happens if someone runs it again?

A pipeline that only works once is not finished. Next quarter, Carlos will have new spreadsheets from the same collection points. The pipeline needs to load, clean, and transform that data without duplicating what already exists. If re-running it doubles the rows, the production totals double too — and no error message tells you that happened.

Direct Claude Code to run the entire pipeline again — from CSV loading through staging to mart:

Run the full pipeline again — load the CSV files, rebuild the staging table, and rebuild the mart table. Same scripts, same data, same order as before.

Step 2: Check the row counts

After the second run completes, check the mart table:

SELECT COUNT(*) AS mart_count FROM fct_harvests;

Compare that number to the verified count from Unit 4. It should be identical — the same data processed through the same pipeline should produce the same result.

If the count doubled, the pipeline has a problem. It created new rows on top of the existing ones instead of replacing them. The scripts ran without errors. Claude Code may report that everything completed successfully. But the row count tells a different story.

A pipeline that cannot be safely re-run will silently corrupt its own output every time it executes. That is worse than a pipeline that fails loudly, because nobody knows the numbers are wrong until someone checks.

Step 3: Investigate the duplication

If the count is wrong, find out why. The issue is in how the SQL scripts write data to the tables. There are two common patterns: one that appends rows every time it runs, and one that replaces the table contents entirely.

Ask Claude Code to show you the pattern it used:

Show me how the staging and mart SQL scripts write data into the tables. Are they using INSERT, or are they replacing the table?

AI tends to default to INSERT — it adds new rows into an existing table. That works fine on the first run. On the second run, it adds the same rows again. On the third run, triple. The row count grows with every execution, and every aggregate — every total, every average, every count — is wrong by a factor of however many times the pipeline has run.
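To see the failure mode concretely, here is a minimal sketch in Python using the standard library's sqlite3 as a stand-in database. The table and column names are invented for illustration; the point is only how an append-style load behaves on a second run.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_demo (id INTEGER, revenue REAL)")

def load_with_insert():
    # Append-style load: adds the same source rows on every run.
    con.executemany(
        "INSERT INTO fct_demo VALUES (?, ?)",
        [(1, 100.0), (2, 250.0), (3, 75.0)],
    )

load_with_insert()                      # first run: 3 rows
first = con.execute("SELECT COUNT(*) FROM fct_demo").fetchone()[0]

load_with_insert()                      # second run: same rows appended again
second = con.execute("SELECT COUNT(*) FROM fct_demo").fetchone()[0]

print(first, second)                    # 3 6 -- the count doubled, no error raised
total = con.execute("SELECT SUM(revenue) FROM fct_demo").fetchone()[0]
print(total)                            # 850.0 -- every aggregate is now doubled
```

Both runs complete without any exception, which is exactly why this bug survives until someone compares row counts.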

The fix is to replace the table entirely on each run rather than appending to it. CREATE OR REPLACE TABLE rebuilds the table from scratch, so every run produces the same result no matter how many runs came before.
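The replace pattern can be sketched the same way. SQLite has no CREATE OR REPLACE TABLE, so this sketch uses DROP TABLE IF EXISTS followed by CREATE TABLE ... AS, which has the same rebuild-from-scratch effect; table names are again illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A stand-in source table the "pipeline" reads from.
con.execute("CREATE TABLE raw_demo (id INTEGER, revenue REAL)")
con.executemany("INSERT INTO raw_demo VALUES (?, ?)",
                [(1, 100.0), (2, 250.0), (3, 75.0)])

def rebuild_mart():
    # Replace-style load: drop and rebuild, so reruns cannot accumulate rows.
    con.execute("DROP TABLE IF EXISTS fct_demo")
    con.execute("CREATE TABLE fct_demo AS SELECT id, revenue FROM raw_demo")

counts = []
for _ in range(3):                      # run the "pipeline" three times
    rebuild_mart()
    counts.append(con.execute("SELECT COUNT(*) FROM fct_demo").fetchone()[0])

print(counts)                           # [3, 3, 3] -- same result every run
```

Running the rebuild once or ten times leaves the table in the same state, which is the idempotency property the next step verifies.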

Step 4: Fix and re-run

Direct Claude Code to fix the scripts:

Change the staging and mart SQL scripts to use CREATE OR REPLACE TABLE instead of INSERT. Then run the full pipeline again.

After the fix, check the row count one more time:

SELECT COUNT(*) AS mart_count FROM fct_harvests;

The count should now match the verified value from Unit 4 — exactly. Not close. Not "within a reasonable range." The same number, because the same data processed through the same logic should always produce the same result. This property is called idempotency — the operation produces the same result whether you run it once or ten times. That is what makes a pipeline trustworthy for repeated use.

Step 5: Verify the quality grade distribution

Row counts confirm the right number of records exist. They do not confirm the values inside those records are correct. One more check: the quality grade distribution.

The harvest data includes quality grades assigned at each collection point. Open materials/verification-checklist.md and find the expected grade distribution. Then query it:

The percentage calculation here uses a window function: SUM(COUNT(*)) OVER() computes each grade's share of the total. The OVER() tells the database to compute the sum across all rows, not just the current group, so each row can compare its count to the grand total.

SELECT quality_grade, COUNT(*) AS grade_count,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
FROM fct_harvests
GROUP BY quality_grade
ORDER BY quality_grade;
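If you want to see the window function in isolation, here is a small sketch against SQLite (window functions require SQLite 3.25 or later); the grade data is made up for illustration, not taken from the harvest dataset.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_demo (quality_grade TEXT)")
con.executemany("INSERT INTO fct_demo VALUES (?)",
                [("A",)] * 5 + [("B",)] * 3 + [("C",)] * 2)

# SUM(COUNT(*)) OVER() sums the per-grade counts across all groups,
# giving every row the grand total (10) to divide its own count by.
rows = con.execute("""
    SELECT quality_grade, COUNT(*) AS grade_count,
           ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
    FROM fct_demo
    GROUP BY quality_grade
    ORDER BY quality_grade
""").fetchall()

for grade, count, pct in rows:
    print(grade, count, pct)            # A 5 50.0 / B 3 30.0 / C 2 20.0
```

Without the OVER() clause, SUM(COUNT(*)) would be an illegal nested aggregate; the window makes the outer SUM run across the grouped rows instead.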

Compare the grades and percentages against the checklist values. The pipeline spec defines grades A, B, and C — but some collection points adopted an updated grading system that includes B+. If B+ appears in your results, that is correct. The checklist accounts for it.

If the distribution does not match, something in the transformation is altering grade values — possibly a cleaning step that strips the "+" from B+ or maps grades to a different set. That kind of silent change produces a pipeline that looks clean but misrepresents the data.


Check your understanding: Row count after second run matches first run exactly. Quality grade distribution matches verification checklist.