Learn by Directing AI
Unit 1

Meet the client and understand the brief

Step 1: Read the brief

Carlos Matsinhe manages a honey cooperative in Sofala province, Mozambique. He has almost two years of harvest data scattered across spreadsheets from 12 collection points — beekeeper names, collection dates, weights, quality grades, prices paid. Buyers in Maputo are asking for traceability he can't provide from scattered files. He needs all of it in one place.

Open the email from Carlos in the project brief. He explains the situation: the cooperative tracks harvests across twelve collection points in Sofala — among them Gorongosa North, Gorongosa South, Buzi, Dondo, Chibabava, and Marromeu — and each point sends him a spreadsheet every quarter. The buyers want to know which region, which harvest, which quality batch. He can't answer those questions from that many scattered files.

Pay attention to what Carlos is actually asking for. He doesn't need a dashboard or a report. He needs the data consolidated — all collection points, all quarters, in one queryable place. That's the job.

Step 2: Message Carlos

Open the chat with Carlos. He's waiting to hear from you.

Pick one of the suggested messages to introduce yourself and confirm you understand what he needs. Carlos responds warmly — he's glad someone is helping. He mentions that the other collection points have "similar formats, though not exactly the same" and that the buyers need traceability. He says he's attached one sample file but has all the others ready to send.

That note about formats not being exactly the same is worth remembering. It won't matter yet, but it will.

Step 3: Review the pipeline specification

Open materials/pipeline-spec.md. This is the document that defines what "done" looks like — what the pipeline should produce, what the target structure is, and what numbers to check against.

The spec describes a pipeline that takes CSV files from all collection points, loads them into a database, cleans the data in a staging layer, and deduplicates and transforms it into a mart table the cooperative can query. A database provides what a folder of CSVs does not: the ability to query across all the data at once, enforce consistent types on every column, and run aggregations without opening each file individually. It includes verification targets: an expected row count around 4,800 across all collection points, revenue totals by collection point, and quality grade distributions.
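The first stage of that pipeline can be sketched in a few lines. This is a minimal illustration, not the spec's implementation: it loads every CSV into a single raw table in SQLite, keeping every value as text (the raw layer stores data unchanged) and recording which file each row came from. The column names follow the sample file described in the brief.

```python
import csv
import sqlite3

def load_raw(db_path, csv_paths):
    """Load every collection-point CSV into one raw table, values as-is."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_harvests (
            beekeeper_name   TEXT,
            collection_point TEXT,
            collection_date  TEXT,
            weight_kg        TEXT,
            quality_grade    TEXT,
            price_paid_mzn   TEXT,
            source_file      TEXT  -- provenance: which CSV this row came from
        )
    """)
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                conn.execute(
                    "INSERT INTO raw_harvests VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (row["beekeeper_name"], row["collection_point"],
                     row["collection_date"], row["weight_kg"],
                     row["quality_grade"], row["price_paid_mzn"], path),
                )
    conn.commit()
    return conn
```

Keeping everything as text at this stage is deliberate: type casting belongs to the staging layer, and the source_file column is what makes the traceability the buyers are asking for possible.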

Read the verification targets carefully. These are the numbers you'll check your work against later. A pipeline that runs without errors but produces the wrong row count is a broken pipeline — the verification targets are how you'll know.
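A verification target is just a comparison you can automate. The sketch below, assuming the mart table is named fct_harvests as in the schema, checks the row count against the spec's figure. The default tolerance is an assumption — the spec says "around 4,800", so allow some slack and tighten it once you know the exact number.

```python
import sqlite3

def check_row_count(conn, expected=4800, tolerance=50):
    """Compare the mart table's row count against the spec's target.

    Returns (ok, actual) so a failing check reports what it found.
    """
    actual = conn.execute("SELECT COUNT(*) FROM fct_harvests").fetchone()[0]
    ok = abs(actual - expected) <= tolerance
    return ok, actual
```

A pipeline that runs cleanly but returns ok=False here is exactly the "broken pipeline" described above — the check is what turns "it ran" into "it's right."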

Step 4: Explore the sample data

Open materials/harvest-data/gorongosa-north-q3.csv. This is one quarter of data from one collection point.

Look at the columns: beekeeper_name, collection_point, collection_date, weight_kg, quality_grade, price_paid_mzn. These are records of real transactions — individual beekeepers bringing honey to a collection point, where it's weighed, graded, and paid for in Mozambican metical.

Look at the values. The dates aren't all in the same format. The weight column says kilos, but are all the values actually in kilos? The quality grades are text strings — A, B, C. The beekeeper names are real names. The prices are in whole numbers.
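Questions like "how many date formats are in this file?" are easier to answer with a quick profiling pass than by scrolling. One way to sketch it: reduce every date to a digit-pattern "shape" (2024-07-15 becomes 9999-99-99, 15/07/2024 becomes 99/99/9999), so mixed formats show up as multiple shapes, and tally the quality grades while you're at it.

```python
import csv
from collections import Counter

def profile_csv(path):
    """Count distinct date-string shapes and quality grades in one CSV."""
    date_shapes, grades = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Replace digits with 9 so only the format survives.
            shape = "".join("9" if c.isdigit() else c
                            for c in row["collection_date"])
            date_shapes[shape] += 1
            grades[row["quality_grade"].strip()] += 1
    return date_shapes, grades
```

If profile_csv returns more than one date shape, you have confirmed what Carlos hinted at: the files are "similar, though not exactly the same."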

This is one file from one collection point for one quarter. Carlos has twelve collection points sending data over multiple quarters. The pipeline needs to handle all of it.

Step 5: Understand the schema

Open materials/schema.md. This document describes the structure your pipeline will produce — how the data moves from raw files through layers until it's trustworthy enough for Carlos to query.

The schema has layers. Raw data lands first — straight from the CSVs, unchanged. Then it moves to a staging layer (stg_harvests), where types are cast, columns are renamed to a consistent convention, and nulls are handled. The staging layer cleans the data without adding business logic. Finally, it moves to a mart layer (fct_harvests), where deduplication and aggregation produce the consumer-ready table.
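The staging step can be sketched as one SQL transformation over the raw table. This is an illustration of the idea, not the schema's exact definition: it assumes a raw_harvests table of all-text columns, and it assumes the mixed date formats are YYYY-MM-DD and DD/MM/YYYY — check the real files before trusting that list. Casting, trimming, and normalizing happen here; no deduplication, because that's business logic and belongs in the mart.

```python
import sqlite3
from datetime import datetime

def to_iso(date_str):
    """Normalize a date string to ISO form; None if no known format matches."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # assumed formats — verify against the files
        try:
            return datetime.strptime(date_str, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable dates surface as NULLs for explicit handling

def build_staging(conn):
    """Rebuild stg_harvests: cast types, normalize dates, clean strings."""
    conn.create_function("to_iso", 1, to_iso)
    conn.execute("DROP TABLE IF EXISTS stg_harvests")
    conn.execute("""
        CREATE TABLE stg_harvests AS
        SELECT
            TRIM(beekeeper_name)          AS beekeeper_name,
            TRIM(collection_point)        AS collection_point,
            to_iso(collection_date)       AS collection_date,
            CAST(weight_kg AS REAL)       AS weight_kg,
            UPPER(TRIM(quality_grade))    AS quality_grade,
            CAST(price_paid_mzn AS REAL)  AS price_paid_mzn
        FROM raw_harvests
    """)
    conn.commit()
```

Notice what the staging layer does and doesn't do: every row in raw_harvests still has exactly one row in stg_harvests — it's cleaner, not yet deduplicated.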

The naming convention encodes what each table is. stg_ means staging — source-conforming, cleaned, no business logic. fct_ means fact — a mart table with business logic applied, ready for queries. When you see stg_harvests in the pipeline later, you know what it claims to be. A table named just harvests makes no such claim.

The layers exist because raw data needs work before anyone can trust it. If Carlos queried the raw data directly, he'd get inconsistent date formats, possible duplicate records, and weight values that may not all be in the same unit. Every downstream question — "What's the total production by collection point?" — would require re-cleaning the data every time. The staging and mart layers do that work once.
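That "do the work once" payoff is easy to see in code. Assuming a finished fct_harvests table with the schema described above, the total-production question becomes a single query rather than a re-cleaning exercise:

```python
import sqlite3

def production_by_point(conn):
    """Total honey production per collection point, heaviest first."""
    return conn.execute("""
        SELECT collection_point, ROUND(SUM(weight_kg), 1) AS total_kg
        FROM fct_harvests
        GROUP BY collection_point
        ORDER BY total_kg DESC
    """).fetchall()
```

Every question the buyers or Carlos ask later — revenue by quarter, grade distribution by region — follows this same pattern against the mart, trusting the cleaning that already happened upstream.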


Check your understanding: How many collection points does the cooperative have? What columns are in the harvest data? What does the mart table need to contain?