Step 1: Set up the project workspace
Open your terminal and start Claude Code:
cd ~/dev
claude
Paste this prompt:
Set up my project:
1. Create ~/dev/data-engineering/p1
2. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p1/materials.zip and extract them into that folder
3. Read CLAUDE.md — it's the project governance file
Claude Code creates the folder, downloads the materials, and reads the governance file. That last part matters — CLAUDE.md tells Claude Code what this project is about, what tools are available, and what conventions to follow. Without it, every prompt you write has to carry that context. With it, Claude Code already knows it's working on a honey cooperative pipeline with DuckDB and SQL.
Watch what Claude Code does with the prompt. It should create the directory, download the zip, extract the contents, and read the governance file. If it skips any of those steps or makes assumptions about what's inside the zip, that's worth noticing — what you include in a prompt shapes what AI does, and what you leave out, AI fills in from its defaults.
Step 2: Explore the data files
Look at what's in the materials folder. Ask Claude Code:
List the files in materials/harvest-data/
You'll see CSV files from multiple collection points across Sofala province — Gorongosa North, Gorongosa South, Buzi, Dondo, Chibabava, Marromeu — with data from different quarters. In Unit 1 you looked at one sample file from one collection point. Now you have the full set: twelve files representing nearly two years of harvest records from the entire cooperative.
Each file follows a similar structure — beekeeper names, collection dates, weights, quality grades, prices — but Carlos mentioned the formats aren't exactly the same. That will matter later. For now, note how many files there are and roughly how much data Carlos has been managing by hand.
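If you'd rather check the file list from inside the database later, DuckDB's glob() table function can do the same thing. A minimal sketch, assuming the directory layout described above and that you start the duckdb shell from the project root:

```sql
-- List the CSV files DuckDB can see from the project root
SELECT file FROM glob('materials/harvest-data/*.csv');
```

Twelve rows back from this query is a quick sanity check that nothing is hiding in subfolders or carrying an unexpected extension.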
Step 3: Test-load one CSV into DuckDB
DuckDB is an analytical database that runs in-process — no server to start, no connection to configure. It handles structured queries faster than working with raw CSV files and renders results as formatted tables in the terminal. That's why it's the right tool for this project: Carlos needs his data consolidated and queryable, not served through a web application.
Before loading everything, test with a single file. Direct Claude Code to load the Gorongosa North Q3 data into DuckDB:
Load materials/harvest-data/gorongosa-north-q3.csv into a DuckDB table called raw_harvests. Then query it — show me the first few rows and a row count.
Look at the output. The columns should match what you saw in the sample data: beekeeper_name, collection_point, collection_date, weight_kg, quality_grade, price_paid_mzn. The row count tells you how many records were in that one file.
This is the extraction boundary — the point where data moves from a file on disk into a database table. A CSV with 200 rows should produce exactly 200 rows in the table. If the numbers don't match, something was lost or duplicated during loading. Check the count now, while you're looking at one small file and can verify it by eye.
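Under the hood, the load Claude Code performs might look something like the sketch below. The exact statements it generates can differ; the filename comes from the prompt above, and read_csv_auto infers column names and types from the file itself.

```sql
-- One possible single-file load: read_csv_auto infers the schema
CREATE TABLE raw_harvests AS
    SELECT * FROM read_csv_auto('materials/harvest-data/gorongosa-north-q3.csv');

-- Eyeball the columns, then check the extraction boundary:
-- rows in the file should equal rows in the table
SELECT * FROM raw_harvests LIMIT 5;
SELECT COUNT(*) FROM raw_harvests;
```

If Claude Code loaded the file a different way, that's fine; what matters is that you can see the columns and the count.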
Step 4: Load all CSV files
Now load the full dataset. Direct Claude Code to load every CSV file from the harvest data directory:
Load all CSV files from materials/harvest-data/ into the raw_harvests table in DuckDB. Each file should add its rows to the same table.
This is the real extraction step — getting data from twelve scattered files into one place. Pay attention to what Claude Code does. Does it load the files one at a time or use a glob pattern? Does it report how many files it processed? Does it mention any issues with encoding or parsing?
When a prompt doesn't specify how to handle something, AI makes its own choices. If a CSV has a date that could be MM/DD or DD/MM, AI picks an interpretation and moves on. If a row has a character encoding issue, AI may silently skip it or silently coerce it. These decisions happen without warning. You won't see an error. You'll see a row count that's slightly off — or you won't notice at all, unless you check.
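One way to take those choices back from the defaults is to make the load explicit. A sketch of a glob load, which drops the test table from Step 3 first so that file isn't counted twice:

```sql
-- Rebuild the table from every CSV in one pass.
-- union_by_name matches columns by header name across files;
-- filename = true records which file each row came from, for auditing
DROP TABLE IF EXISTS raw_harvests;
CREATE TABLE raw_harvests AS
    SELECT * FROM read_csv_auto('materials/harvest-data/*.csv',
                                union_by_name = true,
                                filename = true);
```

The filename column costs nothing now and makes it trivial later to ask which file a suspicious row came from.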
Step 5: Count the total rows
This is the number that matters for the rest of the project. You'll get it with SQL, the language for querying databases: you write a structured question about the data in a table, and the database returns the answer. Query the raw table:
SELECT COUNT(*) FROM raw_harvests;
Write this number down. This is your source row count — the baseline. Every subsequent stage of the pipeline should account for every one of these rows. If the staging table has fewer rows, records were lost. If the mart table has more rows after a re-run, records were duplicated. This single number is how you'll know.
The expected count is approximately 4,800 rows across all collection points. If you're far off — thousands more or thousands fewer — something went wrong during loading. Go back and check whether all twelve files were loaded, whether any files were skipped, whether Claude Code silently filtered rows it couldn't parse.
A pipeline that runs without errors and produces the wrong row count is a broken pipeline. The count is the first thing you check, every time.
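If the total looks wrong, breaking it down localizes the problem faster than re-running the whole load. Assuming the collection_point column from Step 3, a grouped count shows which sites look thin:

```sql
-- Per-collection-point counts: a site with suspiciously few rows
-- points at a file that was skipped or only partially parsed
SELECT collection_point, COUNT(*) AS row_count
FROM raw_harvests
GROUP BY collection_point
ORDER BY collection_point;
```

With six collection points and two files each, every site should contribute a plausible share of the total.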
Check your understanding: Query the raw table. Total row count should be approximately 4,800.