
CLAUDE.md

Golden Ayeyarwady Rice Mill -- Data Pipeline

Project

Building a daily operational reporting pipeline for U Kyaw Zin Oo, Managing Director of Golden Ayeyarwady Rice Mill in Pathein, Myanmar. The operation runs two mills with a combined paddy processing capacity of 200 metric tons per day. The pipeline loads each day's data from both mills incrementally, handles corrections without creating duplicates, and produces morning operational reports.

Client

U Kyaw Zin Oo needs accurate morning numbers by mill: paddy received, rice produced, and shipments dispatched. Corrections happen frequently: mill supervisors re-export the full day's data when they find errors. He also wants mill-to-mill comparisons and farmer payment tracking over time.

Tech stack

  • dbt Core with DuckDB adapter -- transformation engine
  • DuckDB -- local analytical database
  • Dagster -- orchestration and scheduling
  • Soda Core -- data quality monitoring
  • GitHub Actions -- CI/CD
  • Python -- generation scripts and utilities
  • Git/GitHub -- version control

Data dictionary

Mill 1 (CSV export)

| Field | Type | Description |
|---|---|---|
| record_id | integer | Sequential record identifier |
| farmer_name | string | Name of farmer delivering paddy |
| paddy_weight_kg | float | Weight of paddy delivered in kilograms |
| moisture_pct | float | Moisture content percentage (target: 14-16%) |
| grade | string | Rice grade: premium, standard, or low |
| price_mmk | integer | Price paid in Myanmar Kyat |
| mill_date | date | Date of milling operation |
| intake_time | timestamp | Exact time of paddy intake |

Mill 2 (JSON export)

| Field | Type | Description |
|---|---|---|
| id | integer | Sequential record identifier |
| supplier_name | string | Name of supplier (same as farmer_name in Mill 1) |
| weight_kg | float | Weight of paddy in kilograms (null for advance payments) |
| moisture_percent | float | Moisture content percentage (null for advance payments) |
| harvest_quality | string | Quality grade: A (premium), B (standard), C (low). Null for advance payments |
| payment_amount | integer | Payment in Myanmar Kyat |
| processing_date | date | Date of processing (ISO format) |
| timestamp | timestamp | Exact time (ISO format) |

Field mapping (Mill 1 -> Mill 2 -> Unified)

| Mill 1 | Mill 2 | Unified name |
|---|---|---|
| farmer_name | supplier_name | farmer_name |
| paddy_weight_kg | weight_kg | paddy_weight_kg |
| moisture_pct | moisture_percent | moisture_pct |
| grade | harvest_quality | grade (map A->premium, B->standard, C->low) |
| price_mmk | payment_amount | price_mmk |
| mill_date | processing_date | mill_date |
| intake_time | timestamp | intake_time |
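The mapping above can be sketched as a small normalization step. This is a minimal illustration, not the pipeline's actual staging code: the function name, the dict-shaped records, and the decision to drop unmapped fields (such as `id`) are assumptions for the sketch.

```python
# Map Mill 2's letter grades onto Mill 1's text grades (A->premium, B->standard, C->low).
GRADE_MAP = {"A": "premium", "B": "standard", "C": "low"}

# Mill 2 JSON field -> unified field name, per the mapping table above.
MILL2_FIELDS = {
    "supplier_name": "farmer_name",
    "weight_kg": "paddy_weight_kg",
    "moisture_percent": "moisture_pct",
    "harvest_quality": "grade",
    "payment_amount": "price_mmk",
    "processing_date": "mill_date",
    "timestamp": "intake_time",
}

def normalize_mill2(record: dict) -> dict:
    """Rename Mill 2 fields to the unified schema and decode the grade.

    Advance-payment records keep their null weight/moisture/grade values;
    only fields listed in MILL2_FIELDS are carried over in this sketch.
    """
    out = {MILL2_FIELDS[k]: v for k, v in record.items() if k in MILL2_FIELDS}
    if out.get("grade") is not None:
        out["grade"] = GRADE_MAP[out["grade"]]
    return out
```

In the real pipeline this renaming belongs in `stg_mill2_daily`, which per the conventions below stays source-conform apart from the field and grade mapping.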

Known quality issues

  • Corrections as re-exports: When Mill 1 finds an error, the supervisor re-exports the entire day's file. The corrected file contains all records for that day -- both correct and corrected ones. There is no flag distinguishing corrections from originals. Loading without MERGE will create duplicates.
  • Advance payments: Mill 2 includes advance payment records where a farmer was paid for future paddy delivery. These have payment_amount but null weight_kg and harvest_quality. They are normal business operations, not errors.
  • Format divergence: Mill 1 exports CSV, Mill 2 exports JSON. Field names differ (see mapping above). Grade encoding differs (text vs letter code).
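The correction re-export pattern means a load must replace, not append, the full day's records. A minimal in-memory sketch of that delete-then-insert semantics, assuming a dict keyed by (mill, date) as a stand-in for the DuckDB table the dbt incremental MERGE actually targets:

```python
from collections import defaultdict

def load_daily_file(store: dict, mill: str, records: list[dict]) -> None:
    """Replace each (mill, mill_date) partition present in the incoming file.

    A corrected re-export carries all records for the day, so dropping the
    old partition first guarantees no duplicates survive the re-load.
    """
    dates_in_file = {r["mill_date"] for r in records}
    for d in dates_in_file:
        store.pop((mill, d), None)          # delete the stale day, if any
    by_date = defaultdict(list)
    for r in records:
        by_date[r["mill_date"]].append(r)
    for d, rows in by_date.items():
        store[(mill, d)] = rows             # insert the corrected day
```

Loading the same day twice leaves exactly one copy of each record, with the later file's values winning, which is the duplicate-free behavior the MERGE must reproduce.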

dbt naming conventions

  • Staging models: stg_mill1_daily, stg_mill2_daily (source-conform, no business logic)
  • Intermediate models: int_daily_operations (unified view from both mills)
  • Fact models: fct_daily_operations (daily aggregated report)
  • Dimension models: dim_farmers (farmer tracking over time)

Verification targets

  • No duplicate records in staging after correction re-load (MERGE working)
  • Mill 1 + Mill 2 combined totals match int_daily_operations
  • fct_daily_operations totals match staging totals (no join inflation)
  • Watermark advances on each incremental run
  • Pre-commit hook blocks commits with failing dbt tests
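The first four targets reduce to reconciliation assertions over query results. A hedged sketch, assuming rows are fetched as dicts; the function names and the (mill, record_id, mill_date) uniqueness key are illustrative, not defined by the project:

```python
def check_no_duplicates(rows, key=("mill", "record_id", "mill_date")):
    """True if no two rows share the same composite key (MERGE worked)."""
    keys = [tuple(r[k] for k in key) for r in rows]
    return len(keys) == len(set(keys))

def check_totals_match(staging_rows, fact_rows, col="paddy_weight_kg", tol=1e-6):
    """True if fact totals reconcile with staging totals (no join inflation).

    Nulls (advance payments) count as zero weight.
    """
    staging_total = sum(r[col] or 0 for r in staging_rows)
    fact_total = sum(r[col] or 0 for r in fact_rows)
    return abs(staging_total - fact_total) <= tol

def check_watermark_advanced(old_wm, new_wm):
    """True if the incremental watermark strictly advanced on this run."""
    return new_wm > old_wm
```

In practice these checks live in Soda Core scans or dbt tests; the sketch only shows the arithmetic each one performs.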

Commit convention

Commit after each meaningful unit of work: project setup, each dbt model, each test suite addition, each infrastructure component (hooks, monitoring). Messages describe what changed and why.