The Brief
Carlos Matsinhe runs Mel do Sofala, a honey cooperative in Sofala province, Mozambique. Around 350 beekeepers deliver harvests to 12 collection points across the province. Each collection point keeps its own spreadsheet — dates, weights, quality grades, prices paid.
Buyers in Maputo are asking for traceability. Carlos can't provide it from scattered files. He needs all the harvest data in one place: consolidated, cleaned, and queryable. He's sent over a sample CSV from one collection point and says the others have similar formats — "though not exactly the same."
Your Role
You're building Carlos's data pipeline. Take the CSV files, load them into DuckDB, clean and deduplicate the records in a staging layer, and transform them into a mart table the cooperative can actually use.
You'll direct Claude Code to do the implementation. Your job is to give it the right context, review what it produces, and verify the numbers. Everything you need to get started is provided — the pipeline spec, the schema, the expected output targets. Read them before you direct anything.
What's New
This is your first real pipeline. The work follows a pattern you'll use on every project after this: read the spec, direct AI, check the result against known targets.
The hard part isn't getting the pipeline to run. Claude Code will produce something that executes without errors. The hard part is knowing whether the output is actually correct. A pipeline that completes successfully can still produce wrong numbers — and it won't tell you.
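One way to make that check explicit is to compare the pipeline's numbers against the known targets in code, so a mismatch is reported instead of passing silently. The sketch below is a minimal illustration: the target names and values are placeholders, not the real figures from the verification checklist, and `check_targets` is a hypothetical helper.

```python
def check_targets(actual, expected, tolerance=0.01):
    """Compare computed values against expected targets.

    Returns a list of human-readable problems; an empty list means
    every target was met within the tolerance.
    """
    problems = []
    for name, want in expected.items():
        if name not in actual:
            problems.append(f"{name}: missing from output")
            continue
        got = actual[name]
        if abs(got - want) > tolerance:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


# Placeholder targets for illustration only.
expected = {"staging_rows": 100, "total_kg": 250.0}
# Values a finished pipeline might report after running without errors.
actual = {"staging_rows": 98, "total_kg": 250.0}

for problem in check_targets(actual, expected):
    print(problem)  # prints "staging_rows: expected 100, got 98"
```

The pipeline in this example ran to completion, yet the row count is off by two. Only the comparison against a known target exposes that.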
Tools
- Python — via your Miniconda de environment
- DuckDB — lightweight database for loading and querying the harvest data
- SQL — for transformations inside DuckDB
- Claude Code — your AI agent, doing the implementation work
- Git / GitHub — version control from the start
- VS Code — your editor
Materials
You'll receive:
- Pipeline specification — what to build and what "correct" means
- Schema documentation — the staging and mart layer structure, naming conventions
- Verification checklist — specific values to check your output against
- Sample data — CSV files from Carlos's collection points