
Pipeline Specification — Mel do Sofala Harvest Data

Overview

This pipeline consolidates honey harvest data from Mel do Sofala, a cooperative in Sofala province, Mozambique. The cooperative collects data from 6 collection points across the province. Each collection point submits a CSV file per quarter — 12 files total covering roughly 20 months of harvest activity. The pipeline loads these files into a DuckDB database, cleans and standardizes the data in a staging layer, then transforms it into a mart table that Carlos can query for production totals, traceability, and beekeeper history.

Source data description

The source is 12 CSV files in materials/harvest-data/. Each file represents one collection point for one quarter. The data covers beekeeper deliveries: who delivered, when, how much, what quality grade, and what price was paid.

Columns vary across files. Not all collection points use the same column names. You will encounter at least four different naming patterns — for example, one file might call the weight column weight_kg while another calls it kg or weight_lbs. Your pipeline must map all variations to a consistent schema.
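One way to express that mapping is a plain dictionary from observed header variants to standard names. A minimal sketch, assuming the standard names are `beekeeper_name` and `weight_kg` (check schema.md for the real ones); the variants `apicultor` and `peso_kg` are invented placeholders for whatever actually appears in the file headers:

```python
# Hypothetical variant-to-standard mapping. The real variant set comes
# from inspecting the headers of all 12 files; the standard names come
# from schema.md.
COLUMN_MAP = {
    "weight_kg": "weight_kg",
    "kg": "weight_kg",
    "weight_lbs": "weight_kg",   # values still need lb -> kg conversion
    "peso_kg": "weight_kg",
    "beekeeper": "beekeeper_name",
    "beekeeper_name": "beekeeper_name",
    "apicultor": "beekeeper_name",
}

def normalize_header(columns):
    """Rename known variants; raise on anything unmapped so a new
    naming pattern fails loudly instead of slipping through."""
    out = []
    for col in columns:
        key = col.strip().lower()
        if key not in COLUMN_MAP:
            raise KeyError(f"unmapped column: {col!r}")
        out.append(COLUMN_MAP[key])
    return out
```

Failing loudly on unknown headers matters here: with at least four naming patterns in the wild, a silent pass-through would let a fifth pattern corrupt the staging schema unnoticed.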

Date formats vary. Some files use YYYY-MM-DD, others use DD/MM/YYYY, and at least one uses DD-MM-YYYY. All dates must be parsed correctly into a standard DATE type.
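A simple approach is trial parsing against the three formats named above, treating the slashed and dashed variants as day-first. A stdlib sketch (the same logic can be expressed in DuckDB SQL with strptime during staging):

```python
from datetime import date, datetime

# The three formats the spec names, tried in order. DD/MM/YYYY and
# DD-MM-YYYY are assumed day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")

def parse_harvest_date(raw: str) -> date:
    """Parse a delivery date in any of the known formats into a DATE."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Note the trial order is safe: a day-first string like 12-05-2023 cannot accidentally match %Y-%m-%d, because "2023" is rejected as a day value.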

Weight units differ. Most collection points record weight in kilograms (range roughly 2-15 kg per delivery). However, 3 of the collection points record weight in pounds instead. Pound values will be noticeably higher (roughly 4-33 lb per delivery). All weights must be standardized to kilograms in the staging layer. Conversion: 1 lb = 0.453592 kg.
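Note that the unit must be determined per file (from the column name, e.g. weight_lbs, or from knowing which collection points report in pounds), not guessed per row from the value: the kg range (2-15) and lb range (4-33) overlap. The conversion itself is a one-liner; a small sketch:

```python
LB_TO_KG = 0.453592  # conversion factor given in the spec

def to_kg(value: float, unit: str) -> float:
    """Standardize a delivery weight to kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return round(value * LB_TO_KG, 3)
    raise ValueError(f"unknown weight unit: {unit!r}")
```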

Quality grades. The cooperative uses grades A, B, B+, and C based on moisture content and color. The B+ grade was introduced about 8 months ago — older data files contain only A, B, and C.

Pipeline stages

Stage 1: Extract

Load all 12 CSV files from materials/harvest-data/ into raw DuckDB table(s). Every row from every file must land in the database. Track which file each row came from — you will need this for traceability.
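In DuckDB itself, read_csv's filename option can attach the source file to each row directly. The same idea in a plain-Python sketch (a stand-in for the real DuckDB load, not the implementation):

```python
import csv
from pathlib import Path

def extract_raw(data_dir: str = "materials/harvest-data"):
    """Stage 1 sketch: load every row from every CSV, tagging each
    with its source filename for traceability. The real pipeline would
    land these rows in a raw DuckDB table rather than a Python list."""
    rows = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            for record in csv.DictReader(fh):
                record["_source_file"] = path.name  # traceability column
                rows.append(record)
    return rows
```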

Stage 2: Stage

Create a staging table stg_harvests that conforms the raw data:

  • Rename columns to the standard naming convention (see schema.md)
  • Cast types — dates as DATE, weights as DECIMAL, prices as DECIMAL
  • Standardize weights — convert any pound values to kilograms
  • Handle nulls — some weight and grade values are missing. Preserve them as NULL, do not drop the rows
  • Track source — include the original filename for traceability

The staging layer cleans and conforms. It does not deduplicate or apply business logic.

Stage 3: Mart

Create a mart table fct_harvests from the staging data:

  • Deduplicate on the composite key: beekeeper name + collection date + collection point. Some rows are exact duplicates (data entry errors) — these should collapse to one record
  • Compute total_value — weight in kg multiplied by price per kg
  • Preserve all base columns from staging plus the computed column

The mart serves the consumer. Carlos queries this table for production totals, beekeeper history, and traceability data.
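The dedup-and-compute step above can be sketched as follows. Column names (`beekeeper_name`, `collection_date`, `collection_point`, `weight_kg`, `price_per_kg`) are assumed from the staging conventions; schema.md is authoritative. Since duplicates are exact copies, keeping the first row seen per key is sufficient:

```python
def build_mart(staged_rows):
    """Stage 3 sketch: deduplicate on the composite key and compute
    total_value = weight_kg * price_per_kg. NULL weights or prices
    yield a NULL total_value rather than dropping the row."""
    seen = {}
    for row in staged_rows:
        key = (row["beekeeper_name"], row["collection_date"],
               row["collection_point"])
        if key in seen:
            continue  # exact duplicate: collapse to one record
        out = dict(row)
        if row["weight_kg"] is not None and row["price_per_kg"] is not None:
            out["total_value"] = row["weight_kg"] * row["price_per_kg"]
        else:
            out["total_value"] = None
        seen[key] = out
    return list(seen.values())
```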

Target schema summary

Two target tables:

  • stg_harvests — cleaned, typed, consistently named. One row per source record. See schema.md for full column definitions.
  • fct_harvests — deduplicated, with computed columns. One row per unique (beekeeper, date, collection point) combination. See schema.md for full column definitions.

Verification targets

These are the numbers you check your pipeline against. If your numbers do not match, something in the pipeline is wrong.

  • Total source row count: 4,809 rows across all 12 files
  • Staging row count: 4,809 (must match source exactly — no rows lost, no rows added)
  • Mart row count after correct deduplication: 4,650 (deduplicated on beekeeper name + collection date + collection point)
  • Mart row count if deduplication key is wrong: If you deduplicate on beekeeper name alone, you will get only 170 rows — this means legitimate separate deliveries were silently merged. The correct composite key produces 4,650.
  • Revenue top-3 collection points (total value = weight_kg * price_per_kg, after correct dedup):
    1. Buzi: ~2,497,224 MZN
    2. Marromeu: ~1,904,096 MZN
    3. Dondo: ~1,588,900 MZN
  • Quality grade distribution (after correct dedup):
    • A: ~1,411 (30.3%)
    • B+: ~238 (5.1%)
    • B: ~1,408 (30.3%)
    • C: ~1,593 (34.3%)
  • Average price per kilo by collection point (for Carlos's later request):
    • Buzi: ~256.09 MZN/kg
    • Chibabava: ~253.06 MZN/kg
    • Dondo: ~251.21 MZN/kg
    • Gorongosa North: ~246.70 MZN/kg
    • Gorongosa South: ~251.02 MZN/kg
    • Marromeu: ~251.10 MZN/kg
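The three row-count targets above can be wired into a small assertion harness so a run fails loudly when a count drifts. A sketch (the dict keys are hypothetical names for whatever counts the pipeline reports):

```python
# Row-count targets from this spec.
EXPECTED = {"source_rows": 4809, "staging_rows": 4809, "mart_rows": 4650}

def verify(counts: dict) -> None:
    """Raise AssertionError naming the first target that does not match."""
    for name, want in EXPECTED.items():
        got = counts.get(name)
        assert got == want, f"{name}: expected {want}, got {got}"
```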

What "done" looks like

A pipeline that loads all 12 CSV files, stages them with correct types and standardized weights, transforms them into a deduplicated mart table, and produces output matching all verification targets above. The pipeline must be idempotent — running it a second time produces the same row counts, not doubled rows.
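A common way to get that idempotency is to rebuild each table from scratch on every run rather than appending. The sketch below uses stdlib sqlite3 as a stand-in; in DuckDB, CREATE OR REPLACE TABLE gives the same drop-and-rebuild semantics in one statement:

```python
import sqlite3

def load_mart(con: sqlite3.Connection, staged_rows) -> None:
    """Idempotent load sketch: drop and rebuild the mart each run, so
    a second run yields the same row count instead of doubled rows.
    (sqlite3 stands in for DuckDB here; column set is illustrative.)"""
    con.execute("DROP TABLE IF EXISTS fct_harvests")
    con.execute("""CREATE TABLE fct_harvests (
        beekeeper_name TEXT, collection_date TEXT,
        collection_point TEXT, weight_kg REAL,
        price_per_kg REAL, total_value REAL)""")
    con.executemany("INSERT INTO fct_harvests VALUES (?,?,?,?,?,?)",
                    staged_rows)
    con.commit()
```

The key design choice: state lives only in the source files, so every run derives the tables fresh and re-running can never double-count.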