Learn by Directing AI
Unit 2

Set up the project and profile the sources

Step 1: Set up the project

Open a terminal and navigate to your workspace:

cd ~/dev

Start Claude Code:

claude

Paste this setup prompt:

Create the folder ~/dev/data-engineering/p2. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p2/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude creates the workspace, downloads the materials, and reads the project context. Once it finishes, you have everything you need in ~/dev/data-engineering/p2/materials/.

Step 2: Load the storage data

The materials/storage-data/ directory contains monthly CSV exports from both elevators -- six months from Elevator A, six from Elevator B.

Direct Claude Code to load all the storage CSV files into DuckDB. Be specific about what you want: load every file from materials/storage-data/ (excluding the sample files) into a raw table, and add an elevator column derived from the filename so you can tell which facility each reading came from.

This is the first task-sizing decision. You could ask Claude to load the data, profile it, and design the schema all at once. Break it up instead. Profile the source first. What the data looks like determines what the schema should be -- asking Claude to design both at the same time means it will make assumptions about the data instead of looking at it.

Step 3: Count the rows

Query the raw table and count the total rows across all loaded storage data. This is your source row count baseline -- the number you will check at every subsequent stage.

Also count per elevator. You now have two numbers to track: Elevator A rows and Elevator B rows. Check both against the verification checklist.

Step 4: Profile the storage data

Direct Claude Code to profile the storage data: column types, null percentages, value distributions for key columns. Ask specifically about:

  • grain_type: what values appear, and in what proportions?
  • quality_status: what values appear? How many spoilage events are there?
  • bin_id: what formats appear? How do Elevator A bin IDs differ from Elevator B's?
  • farmer_id: what do these values look like? How many unique farmers?
  • dispatch_date: what percentage of readings have a dispatch date?

The profile tells you what the data is before you decide what the schema should be. AI is good at profiling structure -- column types, null counts, cardinality. It is less good at interpreting what the values mean. A column with 15% nulls could be "missing data" or "legitimately empty." That distinction is your call, not AI's.

Step 5: Test the weather API

Now investigate the other data source. Direct Claude Code to make a test request to the Open-Meteo API: daily temperature (min, max, average), humidity, and precipitation for Kostanay, Kazakhstan (latitude 53.21, longitude 63.63), for one week.

Examine the response. What fields does it return? What format are the timestamps? What does the response structure look like -- is the data nested inside an object, or flat?
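A sketch of how that test request is built. The daily variable names and the forecast endpoint are assumptions to verify against the Open-Meteo docs -- historical date ranges use a separate archive endpoint, and daily humidity may need a different variable or the hourly series.

```python
from urllib.parse import urlencode

# Illustrative one-week window; swap in the dates you actually need.
params = {
    "latitude": 53.21,
    "longitude": 63.63,
    "daily": "temperature_2m_max,temperature_2m_min,temperature_2m_mean,precipitation_sum",
    "start_date": "2024-01-08",
    "end_date": "2024-01-14",
    "timezone": "UTC",
}
url = "https://api.open-meteo.com/v1/forecast?" + urlencode(params)
print(url)
# Fetch with, e.g., urllib.request.urlopen(url). The JSON response nests
# parallel lists under "daily": a "time" list of ISO dates plus one list
# per requested variable.
```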

This is a live API. Unlike the CSV files sitting in your materials/ directory, this source is external, rate-limited, and not under your control. When you build the full extraction, you will be making requests against someone else's infrastructure.

Step 6: Note the timezone

Look at the API response carefully. The timestamps are in UTC. The storage readings are logged in Kazakhstan time, which is UTC+6.

If you join storage readings from January 15 with weather data from January 15 without accounting for this, you might be pairing a morning storage reading with weather from what was still January 14 in UTC. At a daily granularity, this six-hour offset means some readings pair with the wrong day's weather.
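The offset is concrete with Python's datetime (the reading time here is illustrative):

```python
from datetime import datetime, timedelta, timezone

# A storage reading logged at 04:00 on January 15 in Kazakhstan time
# (UTC+6, per the source system) falls on January 14 in UTC -- the day
# the weather API keys its data to.
kz = timezone(timedelta(hours=6))
reading_local = datetime(2024, 1, 15, 4, 0, tzinfo=kz)
reading_utc = reading_local.astimezone(timezone.utc)
print(reading_local.date(), reading_utc.date())  # 2024-01-15 2024-01-14
```

Any local reading before 06:00 crosses the date boundary this way, so a date-only join silently mispairs those rows.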

This will need to be handled in the staging layer. Note it now; you will address it when you build the transformation.


Check your understanding: Is the storage data loaded with the correct row count, total and per elevator? Can you describe the API response fields and the timezone of its timestamps?