Learn by Directing AI
All materials

pipeline-spec.md

Pipeline Specification -- Branzeria Carpati Profitability

Project overview

Mihai Popescu runs a small artisan cheese operation in Sibiu, Romania, producing six traditional varieties: telemea, cascaval, branza de burduf, urda, cas, and nasal. He has eight years of data across three spreadsheets maintained by different people -- production batches, sales transactions, and milk purchases -- but cannot calculate profitability per variety. The goal: a dbt pipeline that connects milk purchases to production batches to sales, producing profitability by cheese variety that accounts for milk cost, yield, and aging time.

Data sources

Production log (production-log.csv, 240 rows): Batch-level records of cheese production. Columns: batch_number (sequential B001-B240), variety, milk_type (sheep/cow/mixed), kilos_milk_in, kilos_cheese_out, aging_start_date, aging_end_date (sometimes null -- ~20% of records), shepherd_name.

Sales (sales.csv, 380 rows): Customer-level sales transactions. Columns: customer_name, variety, quantity_sold_kg, price_per_kg (in RON), sale_date, customer_type (restaurant/distributor/market). No batch_number column -- sales connect to production only through variety and date approximation.

Milk purchases (milk-purchases.csv, 195 rows): Purchase records from local shepherds. Columns: shepherd_name, liters_received, fat_content_pct, price_per_liter (in RON), purchase_date, milk_type (sheep/cow/mixed).

Target outputs

The pipeline produces three outputs for Mihai:

  1. Profitability by cheese variety -- revenue minus milk cost for each of the six varieties, accounting for yield (cheese kilos out per milk kilos in) and aging time. This is the core deliverable.
  2. Yield per variety -- cheese kilos out divided by milk kilos in, aggregated by variety. Shows production efficiency.
  3. Quarterly summary -- profitability and yield aggregated by quarter for Mihai's accountant.

dbt project expectations

Naming conventions: Staging models use the stg_ prefix (e.g., stg_production, stg_sales, stg_purchases). Fact/mart tables use the fct_ prefix (e.g., fct_variety_profitability). All column names use snake_case.

Source-conforming pattern: Staging models clean and rename columns from raw sources but do not transform the data. Type casting, null handling, and column renaming happen in staging. Business logic (joins, calculations, aggregations) happens in the mart layer.

Dependencies: Mart models reference staging models via ref(). Staging models reference raw tables via source(). All raw tables must have source definitions in schema.yml.

dbt test requirements

Required tests in schema.yml:

  • unique on batch_number in stg_production
  • not_null on key columns: batch_number, variety, shepherd_name (production); customer_name, variety (sales); shepherd_name (purchases)
  • accepted_values on variety across all models: telemea, cascaval, branza_de_burduf, urda, cas, nasal
  • relationships between stg_production and stg_purchases on shepherd_name -- every shepherd in production must exist in purchases

Verification targets

  • Row counts: Staging models must match raw source counts exactly (production: 240, sales: 380, purchases: 195)
  • Profitability: Cross-reference verification-checklist.md for expected margin ranges per variety. Check at least two varieties against the expected values.
  • Idempotency: Running dbt run twice must produce identical output -- same row counts, same profitability values, no duplicates