Learn by Directing AI
Unit 1

Meet Assel and understand the problem

Step 1: Read the brief

Assel Nurzhanova runs two grain elevators in Kazakhstan's Kostanay region. Combined capacity: 120,000 tonnes of wheat, barley, and flax stored for export. Her storage management system logs everything -- bin occupancy, grain type, moisture readings, farmer accounts, arrival and dispatch dates. The system exports CSV files daily.

Open the email from Assel in the project brief. She explains the problem: 800 tonnes lost to spoilage over the past two years. She suspects temperature drops and humidity spikes are causing it, but right now the only weather data comes from someone writing numbers off a website into a notebook every morning. She needs the weather data pulled automatically and combined with her storage data so she can see patterns.

Pay attention to what Assel is actually asking for. She does not need a one-time report. She needs two data sources -- storage and weather -- combined in a way that shows correlations. And it has to work as a daily update.

Step 2: Message Assel

Open the chat with Assel. She is waiting to hear from you.

Pick one of the suggested messages to introduce yourself and confirm you understand the problem. Assel responds directly. She is glad someone is working on this. She confirms the two elevators and says the data exports are daily CSVs. She mentions the 800 tonnes of spoilage with the tone of someone who has been watching it happen and could not do anything about it.

That directness is how Assel operates. She says what she means, does not soften it, and expects the same from you.

Step 3: Review the pipeline specification

Open materials/pipeline-spec.md. This defines what you are building and what "correct" looks like.

The spec describes a pipeline with two data sources: storage CSVs from both elevators and weather data from the Open-Meteo API. The target is a combined fact table that joins storage readings with weather conditions by date -- so Assel can see what weather conditions were present when spoilage occurred.
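The date join the spec describes can be sketched in a few lines of pandas. This is a minimal illustration with tiny synthetic stand-ins, not the real pipeline: the storage columns come from the sample CSVs, but the weather column names (`temp_c`, `humidity_pct`) and the values are assumptions for demonstration.

```python
import io
import pandas as pd

# Tiny synthetic stand-ins for the two sources. Storage columns match the
# sample CSVs; the weather columns here are assumed for illustration.
storage = pd.read_csv(io.StringIO(
    "bin_id,grain_type,moisture_pct,reading_date,quality_status\n"
    "A-001,wheat,12.5,2024-03-01,good\n"
    "A-002,barley,14.1,2024-03-02,degraded\n"
))
weather = pd.read_csv(io.StringIO(
    "date,temp_c,humidity_pct\n"
    "2024-03-01,-4.2,81\n"
    "2024-03-02,1.3,64\n"
))

# Join each storage reading to the weather conditions on the same date.
# A left join keeps every storage reading even if a weather day is missing.
fact = storage.merge(weather, left_on="reading_date", right_on="date", how="left")
print(fact[["bin_id", "reading_date", "temp_c", "humidity_pct"]])
```

The left join is the key choice: it preserves one output row per storage reading, which is what makes the row count checkable later.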

Read the verification targets. These are the numbers you will check your work against. A pipeline that runs without errors but produces the wrong row count is a broken pipeline. You learned that with Carlos's honey data. The same principle applies here -- the terrain is different, the question is the same.
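One way to make that principle concrete is a check that compares the fact table's row count against the source row count after the join. The expected numbers below are placeholders -- the real targets come from the spec, not from this sketch.

```python
# With a left join from storage to weather, the fact table should have
# exactly one row per storage reading. A mismatch usually means the join
# key duplicated rows (repeated dates in weather) or dropped them.
def verify_row_count(fact_rows: int, expected_rows: int) -> None:
    if fact_rows != expected_rows:
        raise ValueError(
            f"fact table has {fact_rows} rows, expected {expected_rows}: "
            "check for duplicate or missing join keys"
        )

verify_row_count(120, 120)  # passes silently; a mismatch raises
```

A check like this is what separates "ran without errors" from "produced the right answer."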

Step 4: Explore the Elevator A data

Open materials/storage-data/elevator-a-sample.csv. This is one month of storage data from Elevator A.

Look at the columns: bin_id, grain_type, moisture_pct, farmer_id, reading_date, dispatch_date, quality_status. Each row is a daily reading for one bin -- what grain is in it, the moisture level, whose grain it is, and the quality status: good, degraded, or spoilage.

Look at the bin_id values. They follow a pattern: A-001, A-002, A-003. That "A" is Elevator A. Notice that the file itself does not say which elevator the data comes from -- you have to know from the filename.

Step 5: Explore the Elevator B data

Open materials/storage-data/elevator-b-sample.csv. This is one month from Elevator B.

Look at the bin_id column. Different format: B1-01, B1-02, B1-03. Same columns, same kind of data, different numbering scheme. The two elevators use different systems for identifying their bins.

When you combine data from both elevators, this matters. A query that groups by bin_id will treat A-001 and B1-01 as completely separate identifiers. That is correct -- they are different bins. But the data needs an elevator column so you can tell which facility each reading came from. The CSV files do not include one.
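Adding that column at load time is straightforward. A minimal sketch, assuming the elevator label is known from the filename you loaded (it could equally be derived from the bin_id prefix); the inline CSV text stands in for the real files:

```python
import io
import pandas as pd

# The CSVs do not record which elevator they came from, so tag each
# frame with an elevator column before combining them.
def load_elevator(csv_text: str, elevator: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(csv_text))
    df["elevator"] = elevator
    return df

a = load_elevator("bin_id,grain_type\nA-001,wheat\n", "A")
b = load_elevator("bin_id,grain_type\nB1-01,flax\n", "B")

# Stack the two frames; ignore_index renumbers rows 0..n-1.
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

With the elevator column in place, a group-by on (elevator, bin_id) is unambiguous even though the two facilities use different bin numbering schemes.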


Check your understanding: What two data sources need to be combined? How do the two elevators' bin numbering formats differ?