P1: CSV Pipeline — Mel do Sofala
You are building a data pipeline for Carlos Matsinhe at Mel do Sofala, a honey cooperative in Beira, Mozambique. The pipeline consolidates harvest data from 12 CSV files across 6 collection points into a queryable DuckDB database with staging and mart layers.
Client
Carlos Matsinhe — Cooperative Manager, Mel do Sofala. Non-technical. Cares about consolidated data, traceability for buyers, and production totals by region and quality grade. Communicates via email, clear but slightly formal English (second language after Portuguese).
Tech stack
- Python (Miniconda `de` environment)
- DuckDB (analytical database via the `duckdb` Python package)
- SQL (via DuckDB — all transformations are SQL scripts)
- Git / GitHub (version control)
No dbt, no Dagster, no Docker for this project. Raw SQL scripts only.
File structure
materials/                      ← Provided inputs (do not modify)
  harvest-data/                 ← 12 CSV files from collection points
  pipeline-spec.md              ← What to build and what "correct" means
  schema.md                     ← Target table structure and naming conventions
  verification-checklist.md    ← Expected values to check against
pipeline/                       ← Your SQL scripts (you create this)
  load_raw.sql                  ← CSV loading
  staging.sql                   ← stg_harvests table
  mart.sql                      ← fct_harvests table
mel_do_sofala.duckdb            ← Database file (created by pipeline)
Key material references
- materials/pipeline-spec.md — what to build, verification targets
- materials/schema.md — staging and mart table structure, naming conventions
- materials/verification-checklist.md — exact expected values for every check
- materials/harvest-data/ — 12 source CSV files from 6 collection points
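As a rough sketch of the loading step (T1/T2): DuckDB can ingest all 12 files in one statement using a filename glob. The table name `raw_harvests` is an assumption here; the real names come from materials/schema.md.

```sql
-- load_raw.sql — sketch only; table name is an assumption.
-- DROP + CREATE keeps repeated runs idempotent (relevant for T8).
DROP TABLE IF EXISTS raw_harvests;

CREATE TABLE raw_harvests AS
SELECT *
FROM read_csv_auto('materials/harvest-data/*.csv', header = true);

-- T1/T2 check: expect 4,809 rows across all 12 files.
SELECT count(*) AS source_rows FROM raw_harvests;
```

The glob assumes all 12 CSVs share one header layout; if the collection points export slightly different columns, the files would need to be loaded and unioned individually instead.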
Ticket list
- T1: Load sample CSV into DuckDB and verify row count
- T2: Load all 12 CSV files into raw table(s)
- T3: Create staging table with type casting, column renaming, null handling
- T4: Verify staging row count matches raw row count
- T5: Create mart table with deduplication and aggregations
- T6: Verify mart row count and revenue top-3 against checklist
- T7: Fix deduplication key if row count is wrong
- T8: Re-run pipeline and verify idempotency
- T9: Verify quality grade distribution
- T10: Add average price per kilo by region
- T11: Commit and push to GitHub
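The staging step (T3/T4) might be sketched as below. Every column name and type here is an assumption — the authoritative definitions live in materials/schema.md — but it illustrates the cast/rename/null-handling pattern the ticket asks for.

```sql
-- staging.sql — sketch only; column names and types are assumptions,
-- see materials/schema.md for the real ones.
DROP TABLE IF EXISTS stg_harvests;

CREATE TABLE stg_harvests AS
SELECT
    CAST(harvest_date AS DATE)          AS harvest_date,
    trim(beekeeper_name)                AS beekeeper_name,
    trim(collection_point)              AS collection_point,
    CAST(weight_kg AS DOUBLE)           AS weight_kg,
    coalesce(quality_grade, 'ungraded') AS quality_grade,
    CAST(price_per_kg AS DOUBLE)        AS price_per_kg
FROM raw_harvests;

-- T4 check: staging must match the raw row count exactly (4,809) —
-- staging reshapes rows, it must not drop them.
SELECT
    (SELECT count(*) FROM raw_harvests) AS raw_rows,
    (SELECT count(*) FROM stg_harvests) AS staging_rows;
```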
Verification targets
- Source row count: 4,809 total rows across all 12 CSV files
- Staging row count: Must match source exactly (4,809)
- Mart row count: 4,650 after correct deduplication (beekeeper + date + collection point)
- Revenue top-3: Buzi, Marromeu, Dondo (see `verification-checklist.md` for exact values)
- Grade distribution: A ~30%, B+ ~5%, B ~30%, C ~34% (see checklist for exact counts)
- Idempotency: Row count after second run must match first run
Reference materials/verification-checklist.md for all exact values.
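For T5–T7, one way to deduplicate on the stated key (beekeeper + date + collection point) is a window function. As above, the column names are assumptions, and "region" is assumed to map to the collection point.

```sql
-- mart.sql — sketch of the deduplication step (T5/T7).
-- Dedup key per the verification targets: beekeeper + date + collection point.
DROP TABLE IF EXISTS fct_harvests;

CREATE TABLE fct_harvests AS
SELECT * EXCLUDE (rn)
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY beekeeper_name, harvest_date, collection_point
           ) AS rn
    FROM stg_harvests
)
WHERE rn = 1;

-- T6 check: expect 4,650 rows after deduplication.
SELECT count(*) AS mart_rows FROM fct_harvests;

-- T10 sketch: average price per kilo by region (assuming region =
-- collection point; confirm against materials/schema.md).
SELECT collection_point,
       avg(price_per_kg) AS avg_price_per_kg
FROM fct_harvests
GROUP BY collection_point
ORDER BY avg_price_per_kg DESC;
```

If T6 reports the wrong count, the usual culprit is a dedup key that is too narrow or too wide — T7 exists precisely to adjust the `PARTITION BY` columns until the count matches the checklist.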
Commit convention
One commit per ticket. Message format: T[N]: [what was done and why]
Example: T3: Create staging table — cast types, rename columns to stg_ convention, handle nulls