
CLAUDE.md

P1: CSV Pipeline — Mel do Sofala

You are building a data pipeline for Carlos Matsinhe at Mel do Sofala, a honey cooperative in Beira, Mozambique. The pipeline consolidates harvest data from 12 CSV files across 6 collection points into a queryable DuckDB database with staging and mart layers.

Client

Carlos Matsinhe — Cooperative Manager, Mel do Sofala. Non-technical. Cares about consolidated data, traceability for buyers, and production totals by region and quality grade. Communicates via email in clear but slightly formal English (his second language after Portuguese).

Tech stack

  • Python (Miniconda environment)
  • DuckDB (analytical database via duckdb Python package)
  • SQL (via DuckDB — all transformations are SQL scripts)
  • Git / GitHub (version control)

No dbt, no Dagster, no Docker for this project. Raw SQL scripts only.

File structure

materials/              ← Provided inputs (do not modify)
  harvest-data/         ← 12 CSV files from collection points
  pipeline-spec.md      ← What to build and what "correct" means
  schema.md             ← Target table structure and naming conventions
  verification-checklist.md ← Expected values to check against

pipeline/               ← Your SQL scripts (you create this)
  load_raw.sql          ← CSV loading
  staging.sql           ← stg_harvests table
  mart.sql              ← fct_harvests table

mel_do_sofala.duckdb    ← Database file (created by pipeline)

Key material references

  • materials/pipeline-spec.md: what to build and what "correct" means
  • materials/schema.md: target table structure and naming conventions
  • materials/verification-checklist.md: expected values to check against

Ticket list

  • T1: Load sample CSV into DuckDB and verify row count
  • T2: Load all 12 CSV files into raw table(s)
  • T3: Create staging table with type casting, column renaming, null handling
  • T4: Verify staging row count matches raw row count
  • T5: Create mart table with deduplication and aggregations
  • T6: Verify mart row count and revenue top-3 against checklist
  • T7: Fix deduplication key if row count is wrong
  • T8: Re-run pipeline and verify idempotency
  • T9: Verify quality grade distribution
  • T10: Add average price per kilo by region
  • T11: Commit and push to GitHub

Verification targets

  • Source row count: 4,809 total rows across all 12 CSV files
  • Staging row count: Must match source exactly (4,809)
  • Mart row count: 4,650 after correct deduplication (beekeeper + date + collection point)
  • Revenue top-3: Buzi, Marromeu, Dondo (see verification-checklist.md for exact values)
  • Grade distribution: A ~30%, B+ ~5%, B ~30%, C ~34% (see checklist for exact counts)
  • Idempotency: Row count after second run must match first run

Reference materials/verification-checklist.md for all exact values.

Commit convention

One commit per ticket. Message format: T[N]: [what was done and why]

Example: T3: Create staging table — cast types, rename columns to stg_ convention, handle nulls