Learn by Directing AI

The Brief

Carlos Matsinhe runs Mel do Sofala, a honey cooperative in Sofala province, Mozambique. Around 350 beekeepers deliver harvests to 12 collection points across the province. Each collection point keeps its own spreadsheet — dates, weights, quality grades, prices paid.

Buyers in Maputo are asking for traceability. Carlos can't provide it from scattered files. He needs all the harvest data in one place: consolidated, cleaned, and queryable. He's sent over a sample CSV from one collection point and says the others have similar formats — "though not exactly the same."

Your Role

You're building Carlos's data pipeline. Take the CSV files, load them into DuckDB, clean and deduplicate the records in a staging layer, and transform them into a mart table the cooperative can actually use.
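The staging-then-mart flow described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline spec. The table names, column names, and sample rows are all invented, and it uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without extra installs. The SQL is deliberately plain, so the same shape should carry over to DuckDB, but the real layer names and cleaning rules come from the schema documentation.

```python
import sqlite3

# Hypothetical raw rows standing in for one collection point's CSV.
# Column names and values are invented for illustration only.
raw_rows = [
    ("2024-03-01", "CP01", "BK001", " 12.5", "A", "180"),
    ("2024-03-01", "CP01", "BK001", "12.5", "A", "180"),  # same delivery, stray space
    ("2024-03-02", "CP01", "BK002", "8.0", "b", "150"),
]

# sqlite3 is a stand-in here; the project itself uses DuckDB.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE raw_harvest ("
    "harvest_date TEXT, collection_point TEXT, beekeeper_id TEXT, "
    "weight_kg TEXT, grade TEXT, price_paid TEXT)"
)
con.executemany("INSERT INTO raw_harvest VALUES (?, ?, ?, ?, ?, ?)", raw_rows)

# Staging layer: trim whitespace, cast numbers, normalise grades,
# then drop rows that are identical after cleaning.
con.execute("""
    CREATE TABLE stg_harvest AS
    SELECT DISTINCT
        harvest_date,
        collection_point,
        beekeeper_id,
        CAST(TRIM(weight_kg) AS DOUBLE) AS weight_kg,
        UPPER(TRIM(grade)) AS grade,
        CAST(TRIM(price_paid) AS DOUBLE) AS price_paid
    FROM raw_harvest
""")

# Mart layer: one row per collection point, ready for reporting.
con.execute("""
    CREATE TABLE mart_harvest_by_point AS
    SELECT
        collection_point,
        COUNT(*) AS deliveries,
        SUM(weight_kg) AS total_weight_kg
    FROM stg_harvest
    GROUP BY collection_point
""")

staged_count = con.execute("SELECT COUNT(*) FROM stg_harvest").fetchone()[0]
mart_rows = con.execute(
    "SELECT collection_point, deliveries, total_weight_kg "
    "FROM mart_harvest_by_point ORDER BY collection_point"
).fetchall()
print(staged_count, mart_rows)
```

Note that the duplicate only disappears because cleaning happens before deduplication. Applying DISTINCT to the raw text would keep both rows, since " 12.5" and "12.5" are different strings.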

You'll direct Claude Code to do the implementation. Your job is to give it the right context, review what it produces, and verify the numbers. Everything you need to get started is provided: the pipeline specification, the schema documentation, and the verification checklist of expected output values. Read them before you direct anything.

What's New

This is your first real pipeline. The work follows a pattern you'll use on every project after this: read the spec, direct AI, check the result against known targets.

The hard part isn't getting the pipeline to run. Claude Code will produce something that executes without errors. The hard part is knowing whether the output is actually correct. A pipeline that completes successfully can still produce wrong numbers — and it won't tell you.
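One way to make that check concrete is to compare the pipeline's summary numbers against the known targets and list every mismatch. The sketch below is only an illustration. The function name `check_targets`, the metric names, and all the numbers are hypothetical; the real values to check come from the verification checklist.

```python
import math

def check_targets(actual, expected, rel_tol=1e-9):
    """Return (name, expected, actual) for every metric that misses its target."""
    mismatches = []
    for name, want in expected.items():
        got = actual.get(name)
        # A missing metric counts as a failure, not a silent pass.
        if got is None or not math.isclose(got, want, rel_tol=rel_tol):
            mismatches.append((name, want, got))
    return mismatches

# Hypothetical targets and results, invented for illustration.
expected = {"row_count": 2, "total_weight_kg": 20.5}
actual = {"row_count": 3, "total_weight_kg": 33.0}  # ran without errors, kept a duplicate

failures = check_targets(actual, expected)
for name, want, got in failures:
    print(f"MISMATCH {name}: expected {want}, got {got}")
```

The pipeline that produced `actual` here would have exited cleanly. Only the comparison against targets shows that a duplicate row inflated both the count and the total weight.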

Tools

Python — via your Miniconda "de" environment
DuckDB — lightweight database for loading and querying the harvest data
SQL — for transformations inside DuckDB
Claude Code — your AI agent, doing the implementation work
Git / GitHub — version control from the start
VS Code — your editor

Materials

You'll receive:

Pipeline specification — what to build and what "correct" means
Schema documentation — the staging and mart layer structure, naming conventions
Verification checklist — specific values to check your output against
Sample data — CSV files from Carlos's collection points