
CLAUDE.md

Golden Ayeyarwady Rice Mill -- Data Pipeline

Project

Building a daily operational reporting pipeline for U Kyaw Zin Oo, Managing Director of Golden Ayeyarwady Rice Mill in Pathein, Myanmar. The operation runs two mills with a combined paddy processing capacity of 200 metric tons per day. The pipeline loads each day's data from both mills incrementally, handles corrections without creating duplicates, and produces morning operational reports.

Client

U Kyaw Zin Oo needs accurate morning numbers by mill: paddy received, rice produced, and shipments dispatched. Corrections happen frequently: mill supervisors re-export the full day's data when they find errors. He also wants mill-to-mill comparisons and farmer payment tracking over time.

Tech stack

  • dbt Core with DuckDB adapter -- transformation engine
  • DuckDB -- local analytical database
  • Dagster -- orchestration and scheduling
  • Soda Core -- data quality monitoring
  • GitHub Actions -- CI/CD
  • Python -- generation scripts and utilities
  • Git/GitHub -- version control

Data dictionary

Mill 1 (CSV export)

| Field | Type | Description |
|---|---|---|
| record_id | integer | Sequential record identifier |
| farmer_name | string | Name of farmer delivering paddy |
| paddy_weight_kg | float | Weight of paddy delivered in kilograms |
| moisture_pct | float | Moisture content percentage (target: 14-16%) |
| grade | string | Rice grade: premium, standard, or low |
| price_mmk | integer | Price paid in Myanmar Kyat |
| mill_date | date | Date of milling operation |
| intake_time | timestamp | Exact time of paddy intake |

Mill 2 (JSON export)

| Field | Type | Description |
|---|---|---|
| id | integer | Sequential record identifier |
| supplier_name | string | Name of supplier (same as farmer_name in Mill 1) |
| weight_kg | float | Weight of paddy in kilograms (null for advance payments) |
| moisture_percent | float | Moisture content percentage (null for advance payments) |
| harvest_quality | string | Quality grade: A (premium), B (standard), C (low). Null for advance payments |
| payment_amount | integer | Payment in Myanmar Kyat |
| processing_date | date | Date of processing (ISO format) |
| timestamp | timestamp | Exact time (ISO format) |

Field mapping (Mill 1 -> Mill 2 -> Unified)

| Mill 1 | Mill 2 | Unified name |
|---|---|---|
| farmer_name | supplier_name | farmer_name |
| paddy_weight_kg | weight_kg | paddy_weight_kg |
| moisture_pct | moisture_percent | moisture_pct |
| grade | harvest_quality | grade (map A->premium, B->standard, C->low) |
| price_mmk | payment_amount | price_mmk |
| mill_date | processing_date | mill_date |
| intake_time | timestamp | intake_time |
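The mapping above can be sketched as a small normalization step. This is a minimal illustration, not the pipeline's actual staging code: the function name, the dict-shaped records, and the decision to drop unmapped fields (such as `id`) are assumptions for the sketch.

```python
# Map Mill 2's letter grades onto Mill 1's text grades (A->premium, B->standard, C->low).
GRADE_MAP = {"A": "premium", "B": "standard", "C": "low"}

# Mill 2 JSON field -> unified field name, per the mapping table above.
MILL2_FIELDS = {
    "supplier_name": "farmer_name",
    "weight_kg": "paddy_weight_kg",
    "moisture_percent": "moisture_pct",
    "harvest_quality": "grade",
    "payment_amount": "price_mmk",
    "processing_date": "mill_date",
    "timestamp": "intake_time",
}

def normalize_mill2(record: dict) -> dict:
    """Rename Mill 2 fields to the unified schema and decode the grade.

    Advance-payment records keep their null weight/moisture/grade values;
    only fields listed in MILL2_FIELDS are carried over in this sketch.
    """
    out = {MILL2_FIELDS[k]: v for k, v in record.items() if k in MILL2_FIELDS}
    if out.get("grade") is not None:
        out["grade"] = GRADE_MAP[out["grade"]]
    return out
```

In the real pipeline this renaming belongs in `stg_mill2_daily`, which per the conventions below stays source-conform apart from the field and grade mapping.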

Known quality issues

  • Corrections as re-exports: When Mill 1 finds an error, the supervisor re-exports the entire day's file. The corrected file contains all records for that day -- both correct and corrected ones. There is no flag distinguishing corrections from originals. Loading without MERGE will create duplicates.
  • Advance payments: Mill 2 includes advance payment records where a farmer was paid for future paddy delivery. These have payment_amount but null weight_kg and harvest_quality. They are normal business operations, not errors.
  • Format divergence: Mill 1 exports CSV, Mill 2 exports JSON. Field names differ (see mapping above). Grade encoding differs (text vs letter code).
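The correction re-export pattern means a load must replace, not append, the full day's records. A minimal in-memory sketch of that delete-then-insert semantics, assuming a dict keyed by (mill, date) as a stand-in for the DuckDB table the dbt incremental MERGE actually targets:

```python
from collections import defaultdict

def load_daily_file(store: dict, mill: str, records: list[dict]) -> None:
    """Replace each (mill, mill_date) partition present in the incoming file.

    A corrected re-export carries all records for the day, so dropping the
    old partition first guarantees no duplicates survive the re-load.
    """
    dates_in_file = {r["mill_date"] for r in records}
    for d in dates_in_file:
        store.pop((mill, d), None)          # delete the stale day, if any
    by_date = defaultdict(list)
    for r in records:
        by_date[r["mill_date"]].append(r)
    for d, rows in by_date.items():
        store[(mill, d)] = rows             # insert the corrected day
```

Loading the same day twice leaves exactly one copy of each record, with the later file's values winning, which is the duplicate-free behavior the MERGE must reproduce.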

dbt naming conventions

  • Staging models: stg_mill1_daily, stg_mill2_daily (source-conform, no business logic)
  • Intermediate models: int_daily_operations (unified view from both mills)
  • Fact models: fct_daily_operations (daily aggregated report)
  • Dimension models: dim_farmers (farmer tracking over time)

Verification targets

  • No duplicate records in staging after correction re-load (MERGE working)
  • Mill 1 + Mill 2 combined totals match int_daily_operations
  • fct_daily_operations totals match staging totals (no join inflation)
  • Watermark advances on each incremental run
  • Pre-commit hook blocks commits with failing dbt tests
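The first four targets reduce to reconciliation assertions over query results. A hedged sketch, assuming rows are fetched as dicts; the function names and the (mill, record_id, mill_date) uniqueness key are illustrative, not defined by the project:

```python
def check_no_duplicates(rows, key=("mill", "record_id", "mill_date")):
    """True if no two rows share the same composite key (MERGE worked)."""
    keys = [tuple(r[k] for k in key) for r in rows]
    return len(keys) == len(set(keys))

def check_totals_match(staging_rows, fact_rows, col="paddy_weight_kg", tol=1e-6):
    """True if fact totals reconcile with staging totals (no join inflation).

    Nulls (advance payments) count as zero weight.
    """
    staging_total = sum(r[col] or 0 for r in staging_rows)
    fact_total = sum(r[col] or 0 for r in fact_rows)
    return abs(staging_total - fact_total) <= tol

def check_watermark_advanced(old_wm, new_wm):
    """True if the incremental watermark strictly advanced on this run."""
    return new_wm > old_wm
```

In practice these checks live in Soda Core scans or dbt tests; the sketch only shows the arithmetic each one performs.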

Commit convention

Commit after each meaningful unit of work: project setup, each dbt model, each test suite addition, each infrastructure component (hooks, monitoring). Messages describe what changed and why.