Step 1: Set up the project
Open a terminal and navigate to your workspace:
cd ~/dev
Start Claude Code:
claude
Paste this setup prompt:
Create the folder ~/dev/data-engineering/p8. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p8/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.
Claude creates the project directory, downloads the materials, and reads the project context. Once it finishes, you have factory data exports, the material mapping table, and the pipeline spec template ready to go.
Step 2: Read Fatimah's email
Open the Fatimah Al-Rashidi chat. She sent an email about migrating to BigQuery for her construction materials company in Kuwait.
Four factories. A CFO who needs cost attribution reports. A board that won't wait a week. And a concern about cloud query costs being unpredictable.
Her email is concise -- she tells you what she needs but not everything you need to know. Ask her about the factory data: what formats, what systems, how the four factories' data relates to each other. Ask about the board's reporting needs. Ask about who sees the data and whether there are sensitivity concerns.
She won't volunteer everything. The material coding differences across factories, the billing method variation, and the PII in Factory 2's quality data are things she knows but won't mention unless you ask the right questions.
Step 3: Profile the Factory 1 data
Open materials/factory1-export.csv. This is Factory 1's delivery data -- CSV format.
Load it into DuckDB and look at the shape:
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT project_code) as projects,
COUNT(DISTINCT material_code) as materials,
COUNT(DISTINCT customer_name) as customers,
MIN(delivery_date) as earliest,
MAX(delivery_date) as latest
FROM read_csv_auto('materials/factory1-export.csv');
Note the column names (delivery_date, material_code, quantity_delivered, customer_name, billing_status, payment_kwd) and the material code format: SAP-style codes like MAT-BLK-001.
Check billing_status -- you should see both "delivered" and "milestone" values, plus some nulls. That billing variation matters for how revenue appears in reports.
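To see that split concretely, a quick aggregation over billing_status (column name taken from the profiling above) shows how the rows distribute, nulls included:

```sql
-- Distribution of billing_status values in Factory 1's export.
-- GROUP BY keeps NULL as its own bucket, so missing statuses show up too.
SELECT billing_status, COUNT(*) AS n
FROM read_csv_auto('materials/factory1-export.csv')
GROUP BY billing_status
ORDER BY n DESC;
```

If "milestone" rows dominate, revenue in the reports will lag deliveries; worth flagging in the context brief.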
Step 4: Profile the Factory 2 data
Open materials/factory2-export.csv. Factory 2 exports CSV too, but look at the columns:
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT project_ref) as projects,
COUNT(DISTINCT item_code) as items,
COUNT(DISTINCT buyer) as buyers,
COUNT(DISTINCT quality_inspector) as inspectors
FROM read_csv_auto('materials/factory2-export.csv');
Different column names for the same concepts. ship_date instead of delivery_date. item_code instead of material_code. qty instead of quantity_delivered. buyer instead of customer_name. And the material codes use an internal format: F2-101 instead of SAP numbers.
The quality_inspector column contains employees' full names -- personal data that not everyone in the company should see.
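One way to keep the column analytically useful without exposing names is to pseudonymize it at the staging layer. A minimal DuckDB sketch, assuming the column is called quality_inspector as profiled above:

```sql
-- Replace inspector names with a stable hash: analysts can still
-- group and count per inspector, but the personal data stays out of
-- the downstream tables.
SELECT * EXCLUDE (quality_inspector),
       md5(quality_inspector) AS inspector_hash
FROM read_csv_auto('materials/factory2-export.csv')
LIMIT 5;
```

Whether hashing is enough (versus dropping the column or restricting access) is a question for Fatimah, not a decision to make silently.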
Step 5: Profile the Factory 3 and 4 data
Open materials/factory3-export.json and materials/factory4-export.json. These are JSON, not CSV.
SELECT COUNT(*) as total_rows,
COUNT(DISTINCT project) as projects,
COUNT(DISTINCT material) as materials
FROM read_json_auto('materials/factory3-export.json');
Yet another set of field names: date instead of delivery_date or ship_date. material instead of material_code or item_code. amount instead of quantity_delivered or qty. And the material codes use a precast format: PC-A-001 for Factory 3, PC-B-001 for Factory 4.
Four factories. Four naming conventions. Four material coding schemes. Before any pipeline work starts, you need to reconcile these.
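The reconciliation can be sketched as a single staging view that maps each factory's column names onto one standard set. A hedged sketch using the column names observed in Steps 3-5 (the JSON exports showed no customer field in the profiling above, so it's left NULL for Factories 3 and 4; adjust to what the real exports contain):

```sql
-- Unified staging view: one row shape for all four factories.
CREATE OR REPLACE VIEW staging_deliveries AS
SELECT 1 AS factory_id, delivery_date,
       material_code AS factory_material_code,
       quantity_delivered AS quantity, customer_name
FROM read_csv_auto('materials/factory1-export.csv')
UNION ALL
SELECT 2, ship_date, item_code, qty, buyer
FROM read_csv_auto('materials/factory2-export.csv')
UNION ALL
SELECT 3, "date", material, amount, NULL
FROM read_json_auto('materials/factory3-export.json')
UNION ALL
SELECT 4, "date", material, amount, NULL
FROM read_json_auto('materials/factory4-export.json');
```

The view still carries factory-local material codes; translating them to standard codes is what the mapping table in the next step is for.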
Step 6: Examine the material mapping
Open materials/material-mapping.csv. This is the master mapping table that reconciles the four factories' material coding schemes.
SELECT factory_id, factory_material_code, standard_material_code, material_name, mapping_status
FROM read_csv_auto('materials/material-mapping.csv')
ORDER BY standard_material_code, factory_id
LIMIT 20;
Multiple factory codes map to the same standard code. MAT-BLK-001 (Factory 1) and F2-101 (Factory 2) both map to STD-BLK-001 -- the same product, different internal systems.
Check the mapping_status column. Most entries are "VERIFIED" but a few are "UNVERIFIED" -- the mapping is "mostly up to date" as Fatimah said it would be. Those unverified entries are a data quality concern you'll need to handle.
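Unmapped or unverified codes can be surfaced mechanically. A hedged sketch of an anti-join for Factory 1 (column names from the query above; it assumes factory_id is stored as the integer 1 in the mapping file -- check the actual encoding first):

```sql
-- Factory 1 material codes with no VERIFIED mapping entry,
-- and how many delivery rows each one affects.
SELECT f.material_code, COUNT(*) AS rows_affected
FROM read_csv_auto('materials/factory1-export.csv') f
LEFT JOIN read_csv_auto('materials/material-mapping.csv') m
  ON m.factory_material_code = f.material_code
 AND m.factory_id = 1
 AND m.mapping_status = 'VERIFIED'
WHERE m.standard_material_code IS NULL  -- no verified mapping found
GROUP BY f.material_code
ORDER BY rows_affected DESC;
```

Running the same check per factory tells you how much of the data the unverified mappings actually touch -- a number worth putting in the context brief.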
Step 7: Write a context brief
Before your next work session, write a context brief -- a structured description of what's relevant for the schema design work ahead.
In BigQuery's on-demand pricing model, every query is billed by the amount of data it scans -- every scan costs money. That changes how you think about everything downstream. The context brief should capture what the AI needs to know before you start designing:
- The four data sources with their formats, column names, row counts, and material coding schemes
- The material mapping table and its UNVERIFIED entries
- Fatimah's requirements: cost attribution by factory, product line, project for the CFO
- Any hidden constraints you discovered through conversation (billing methods, PII, coding differences)
- The cost context: this pipeline's output will be queried in BigQuery where design decisions have financial consequences
Write this brief and save it. You'll use it to start the next session with the AI already knowing the project state.
Check: the context brief should include at least a data source summary (4 factories, formats, row counts), the material coding differences, and the business requirement (cost attribution by factory, product line, and project).