P1: CSV Pipeline — Mel do Sofala
You are building a data pipeline for Carlos Matsinhe at Mel do Sofala, a honey cooperative in Beira, Mozambique. The pipeline consolidates harvest data from 12 CSV files across 6 collection points into a queryable DuckDB database with staging and mart layers.
Client
Carlos Matsinhe — Cooperative Manager, Mel do Sofala. Non-technical. Cares about consolidated data, traceability for buyers, and production totals by region and quality grade. Communicates via email, clear but slightly formal English (second language after Portuguese).
Tech stack
- Python (Miniconda `de` environment)
- DuckDB (analytical database via the `duckdb` Python package)
- SQL (via DuckDB — all transformations are SQL scripts)
- Git / GitHub (version control)
No dbt, no Dagster, no Docker for this project. Raw SQL scripts only.
File structure
materials/                      ← Provided inputs (do not modify)
  harvest-data/                 ← 12 CSV files from collection points
  pipeline-spec.md              ← What to build and what "correct" means
  schema.md                     ← Target table structure and naming conventions
  verification-checklist.md    ← Expected values to check against
pipeline/                       ← Your SQL scripts (you create this)
  load_raw.sql                  ← CSV loading
  staging.sql                   ← stg_harvests table
  mart.sql                      ← fct_harvests table
mel_do_sofala.duckdb            ← Database file (created by pipeline)
Key material references
- materials/pipeline-spec.md — what to build, verification targets
- materials/schema.md — staging and mart table structure, naming conventions
- materials/verification-checklist.md — exact expected values for every check
- materials/harvest-data/ — 12 source CSV files from 6 collection points
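As a rough sketch of the loading step (T1/T2): DuckDB can ingest all 12 files in one statement using a filename glob. The table name `raw_harvests` is an assumption here; the real names come from materials/schema.md.

```sql
-- load_raw.sql — sketch only; table name is an assumption.
-- DROP + CREATE keeps repeated runs idempotent (relevant for T8).
DROP TABLE IF EXISTS raw_harvests;

CREATE TABLE raw_harvests AS
SELECT *
FROM read_csv_auto('materials/harvest-data/*.csv', header = true);

-- T1/T2 check: expect 4,809 rows across all 12 files.
SELECT count(*) AS source_rows FROM raw_harvests;
```

The glob assumes all 12 CSVs share one header layout; if the collection points export slightly different columns, the files would need to be loaded and unioned individually instead.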
Ticket list
- T1: Load sample CSV into DuckDB and verify row count
- T2: Load all 12 CSV files into raw table(s)
- T3: Create staging table with type casting, column renaming, null handling
- T4: Verify staging row count matches raw row count
- T5: Create mart table with deduplication and aggregations
- T6: Verify mart row count and revenue top-3 against checklist
- T7: Fix deduplication key if row count is wrong
- T8: Re-run pipeline and verify idempotency
- T9: Verify quality grade distribution
- T10: Add average price per kilo by region
- T11: Commit and push to GitHub
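The staging step (T3/T4) might be sketched as below. Every column name and type here is an assumption — the authoritative definitions live in materials/schema.md — but it illustrates the cast/rename/null-handling pattern the ticket asks for.

```sql
-- staging.sql — sketch only; column names and types are assumptions,
-- see materials/schema.md for the real ones.
DROP TABLE IF EXISTS stg_harvests;

CREATE TABLE stg_harvests AS
SELECT
    CAST(harvest_date AS DATE)          AS harvest_date,
    trim(beekeeper_name)                AS beekeeper_name,
    trim(collection_point)              AS collection_point,
    CAST(weight_kg AS DOUBLE)           AS weight_kg,
    coalesce(quality_grade, 'ungraded') AS quality_grade,
    CAST(price_per_kg AS DOUBLE)        AS price_per_kg
FROM raw_harvests;

-- T4 check: staging must match the raw row count exactly (4,809) —
-- staging reshapes rows, it must not drop them.
SELECT
    (SELECT count(*) FROM raw_harvests) AS raw_rows,
    (SELECT count(*) FROM stg_harvests) AS staging_rows;
```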
Verification targets
- Source row count: 4,809 total rows across all 12 CSV files
- Staging row count: Must match source exactly (4,809)
- Mart row count: 4,650 after correct deduplication (beekeeper + date + collection point)
- Revenue top-3: Buzi, Marromeu, Dondo (see `verification-checklist.md` for exact values)
- Grade distribution: A ~30%, B+ ~5%, B ~30%, C ~34% (see checklist for exact counts)
- Idempotency: Row count after second run must match first run
Reference materials/verification-checklist.md for all exact values.
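For T5–T7, one way to deduplicate on the stated key (beekeeper + date + collection point) is a window function. As above, the column names are assumptions, and "region" is assumed to map to the collection point.

```sql
-- mart.sql — sketch of the deduplication step (T5/T7).
-- Dedup key per the verification targets: beekeeper + date + collection point.
DROP TABLE IF EXISTS fct_harvests;

CREATE TABLE fct_harvests AS
SELECT * EXCLUDE (rn)
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY beekeeper_name, harvest_date, collection_point
           ) AS rn
    FROM stg_harvests
)
WHERE rn = 1;

-- T6 check: expect 4,650 rows after deduplication.
SELECT count(*) AS mart_rows FROM fct_harvests;

-- T10 sketch: average price per kilo by region (assuming region =
-- collection point; confirm against materials/schema.md).
SELECT collection_point,
       avg(price_per_kg) AS avg_price_per_kg
FROM fct_harvests
GROUP BY collection_point
ORDER BY avg_price_per_kg DESC;
```

If T6 reports the wrong count, the usual culprit is a dedup key that is too narrow or too wide — T7 exists precisely to adjust the `PARTITION BY` columns until the count matches the checklist.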
Commit convention
One commit per ticket. Message format: T[N]: [what was done and why]
Example: T3: Create staging table — cast types, rename columns to stg_ convention, handle nulls