Step 1: Set up the project
Open a terminal and navigate to your workspace:
cd ~/dev
Start Claude Code:
claude
Paste this setup prompt:
Create the folder ~/dev/data-engineering/p7. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p7/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.
Claude creates the project directory, downloads the materials, and reads the project context. Once it finishes, you have the data samples, field mapping, templates, and guides ready to go.
Step 2: Read Kyaw Zin Oo's voicemail
Open the U Kyaw Zin Oo chat. He left a voicemail:
Voicemail transcript -- U Kyaw Zin Oo, received 6:47 AM (slight background noise of machinery)
Hello, yes, this is Kyaw Zin Oo, I am the managing director at Golden Ayeyarwady Rice Mill in Pathein. My colleague at the Myanmar Rice Federation gave me your contact.
We have... ah... a data problem at our mills. We process about 200 tonnes of paddy every day, two mills, and every morning I need to know the numbers from yesterday. What came in, from which farmers, what moisture, what grade of rice we produced, where it shipped. But the reports are always late, or they have yesterday's numbers that need correcting, and then the corrections cause more confusion.
I would like to discuss building something that updates automatically each day. The data exists -- both mills have systems that log everything. But getting it all in one place, keeping it accurate when corrections happen... this is where I need help.
Please call me back or send an email. I am usually available before 7 in the morning or after 6 in the evening -- during the day I am at the mills.
Thank you very much.
Two mills, 200 tonnes daily, morning numbers needed, corrections causing confusion. That last part matters -- corrections that cause confusion usually mean data is getting duplicated or overwritten somewhere. Worth asking about.
Step 3: Review the Mill 1 data
Open materials/mill1-daily-export.csv. This is a sample of Mill 1's daily export -- CSV format.
Load it into DuckDB and look at the shape:
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT farmer_name) as unique_farmers,
COUNT(DISTINCT mill_date) as days,
MIN(paddy_weight_kg) as min_weight,
MAX(paddy_weight_kg) as max_weight
FROM read_csv_auto('materials/mill1-daily-export.csv');
About 155 rows across 5 days. Twelve unique farmers. Columns: record_id, farmer_name, paddy_weight_kg, moisture_pct, grade, price_mmk, mill_date, intake_time.
Note the column names. They matter later.
Step 4: Review the Mill 2 data
Open materials/mill2-daily-export.json. This is Mill 2's daily export -- JSON format, not CSV.
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT supplier_name) as unique_suppliers,
COUNT(DISTINCT processing_date) as days
FROM read_json_auto('materials/mill2-daily-export.json');
About 128 rows. Ten unique suppliers. Same 5-day date range as Mill 1.
Look at the field names: supplier_name (not farmer_name), weight_kg (not paddy_weight_kg), moisture_percent (not moisture_pct), harvest_quality (not grade), payment_amount (not price_mmk), processing_date (not mill_date).
Same business data. Different field names. Different format. This is what happens when two systems evolve independently.
Open materials/field-mapping.md to see the full mapping between the two systems.
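The renames can be sketched as a simple normalization step. Here is a minimal Python sketch using the field-name pairs observed above (field-mapping.md remains the authority for the full mapping); the example record and its values are made up for illustration:

```python
# Rename Mill 2's fields onto Mill 1's schema so both feeds can
# land in one table. Pairs taken from the field names seen above.
MILL2_TO_MILL1 = {
    "supplier_name": "farmer_name",
    "weight_kg": "paddy_weight_kg",
    "moisture_percent": "moisture_pct",
    "harvest_quality": "grade",
    "payment_amount": "price_mmk",
    "processing_date": "mill_date",
}

def normalize_mill2(record: dict) -> dict:
    """Rename Mill 2 fields to Mill 1 names; pass unknown fields through."""
    return {MILL2_TO_MILL1.get(key, key): value for key, value in record.items()}

# Hypothetical Mill 2 record, just to show the rename in action.
example = {"supplier_name": "Example Farmer", "weight_kg": 1850, "processing_date": "2025-01-06"}
print(normalize_mill2(example))
```

Keeping the mapping in one dictionary means a schema change on either side is a one-line edit rather than a hunt through transformation code.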
Check for something else in the Mill 2 data:
SELECT * FROM read_json_auto('materials/mill2-daily-export.json')
WHERE weight_kg IS NULL;
Records with null weight and quality. These are advance payments -- Kyaw Zin Oo paid a farmer for paddy that hasn't been delivered yet. Normal business, but something the pipeline needs to handle.
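One way to keep those rows from corrupting weight and grade aggregates is to split them out before loading. A minimal Python sketch, under the assumption that a null weight_kg marks an advance payment (an interpretation to confirm with Kyaw Zin Oo, not a given):

```python
def split_advances(records):
    """Separate delivered-paddy rows from advance-payment rows.
    Assumes a null weight_kg marks an advance payment -- confirm
    this interpretation with the client before relying on it.
    """
    deliveries, advances = [], []
    for rec in records:
        (advances if rec.get("weight_kg") is None else deliveries).append(rec)
    return deliveries, advances

# Hypothetical records illustrating the two cases.
sample = [
    {"supplier_name": "A", "weight_kg": 2000, "payment_amount": 900000},
    {"supplier_name": "B", "weight_kg": None, "payment_amount": 500000},
]
deliveries, advances = split_advances(sample)
print(len(deliveries), len(advances))
```

Advance payments still matter financially, so they should land in their own table rather than be dropped.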
Step 5: Message Kyaw Zin Oo
This is a primary client. The voicemail gave you the basics, but the details that matter for pipeline design -- how corrections work, how the two mills' data relates, what happens with advance payments -- are things you need to ask about.
Message Kyaw Zin Oo. Ask about the mills, the data, and especially how corrections work. He's polite and indirect, and he'll go on tangents about the rice business. The information you get depends on the questions you ask.
If you ask about how the two mills send their data, he'll mention that both have systems that send daily files. He may not mention that they use different formats -- to him, "they both send the daily data."
If you ask about what happens when there's an error in the numbers, he'll explain the correction process. That's a piece of information worth having before you design the extraction strategy.
Check: Can you describe what would happen if you loaded Tuesday's corrected data alongside Wednesday's actual data without any deduplication logic?
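To reason about that check, here's a minimal last-write-wins sketch in Python. It rests on two assumptions worth verifying with the client: a corrected row reuses the original record_id, and each re-send carries a newer load timestamp (loaded_at is a name introduced here, not a field in the exports):

```python
def dedupe_latest(rows):
    """Keep only the newest row per record_id (last-write-wins).
    Assumes corrections reuse the original record_id and that
    loaded_at increases with each re-send -- both to confirm.
    """
    latest = {}
    for row in rows:
        key = row["record_id"]
        if key not in latest or row["loaded_at"] > latest[key]["loaded_at"]:
            latest[key] = row
    return list(latest.values())

# Tuesday's corrected row arrives alongside Wednesday's data.
# Without dedup, R-101 appears twice and totals double-count it.
rows = [
    {"record_id": "R-101", "paddy_weight_kg": 1800, "loaded_at": "2025-01-07T06:00"},
    {"record_id": "R-101", "paddy_weight_kg": 1750, "loaded_at": "2025-01-08T06:00"},  # correction
    {"record_id": "R-102", "paddy_weight_kg": 2100, "loaded_at": "2025-01-08T06:00"},
]
print(len(dedupe_latest(rows)))  # 2 rows: corrected R-101 plus R-102
```

Without this step, every correction inflates the totals Kyaw Zin Oo reads each morning, which is exactly the "corrections cause more confusion" problem from the voicemail.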