Step 1: Set up the project
Open a terminal and start Claude Code.
cd ~/dev
claude
Paste this setup prompt:
Create the folder ~/dev/data-engineering/branzeria-carpati. Download the project materials from https://learnbydirectingai.dev/materials/data-engineering/branzeria-carpati/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.
Claude downloads and extracts everything: the three CSV data sources, the pipeline spec, the verification checklist, the dbt project template, and the CLAUDE.md governance file. Once it finishes, you have a working project directory with all the materials in place.
Step 2: Tour the workspace
Before you start loading data, get oriented. Ask Claude to show you the project structure.
The materials/ folder contains the three data files Mihai mentioned: production-log.csv (240 batch records), sales.csv (380 transactions), and milk-purchases.csv (195 purchase records). The sample files you reviewed in Unit 1 were 12-row previews -- these are the full datasets.
The materials/dbt-template/ directory has a pre-configured dbt project scaffold: dbt_project.yml, profiles.yml (already wired to DuckDB), and a models/ directory with empty staging/ and marts/ subdirectories plus a schema.yml file. You will copy this template into your working directory in a later unit. For now, leave it where it is.
The materials/pipeline-spec.md and materials/verification-checklist.md files define what "correct" looks like. You reviewed the pipeline spec in Unit 1. The verification checklist has the exact numbers you will check your work against at the end.
Step 3: Load the production log
Direct Claude to load the production log into DuckDB:
Load materials/production-log.csv into a DuckDB table called production_log. Count the rows.
The production log is the core of Mihai's operation -- every batch of cheese he has made over eight years. Each row records one production run: the batch number, which variety, what type of milk, how much milk went in, how much cheese came out, when aging started, when it ended, and which shepherd supplied the milk.
The row count is your first baseline. Write it down or keep it visible. When you build the staging model later, the staging table must have exactly this many rows. Any difference means the transformation dropped or duplicated records.
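Under the hood, the load and the baseline count come down to two statements. A minimal DuckDB SQL sketch (the file path and table name come from the prompt above; column names and types are left to DuckDB's CSV inference):

```sql
-- Load the raw CSV; read_csv_auto infers column names and types from the file.
CREATE OR REPLACE TABLE production_log AS
SELECT * FROM read_csv_auto('materials/production-log.csv');

-- Baseline row count: the staging model must later match this exactly.
SELECT COUNT(*) AS production_log_row_count FROM production_log;
```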
Step 4: Load the sales data and the milk purchases
Load the remaining two sources:
Load materials/sales.csv into a DuckDB table called sales. Count the rows.
Load materials/milk-purchases.csv into a DuckDB table called milk_purchases. Count the rows.
You now have three tables in DuckDB. Three row count baselines. Every number downstream depends on these counts being right -- if a staging model produces fewer rows than the raw source, something was silently filtered or dropped.
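The remaining two loads follow the same pattern. A sketch in DuckDB SQL (paths and table names from the prompts above), with all three baselines collected into one result so they stay visible:

```sql
CREATE OR REPLACE TABLE sales AS
SELECT * FROM read_csv_auto('materials/sales.csv');

CREATE OR REPLACE TABLE milk_purchases AS
SELECT * FROM read_csv_auto('materials/milk-purchases.csv');

-- One query, three baselines: keep this output handy for later comparison.
SELECT 'production_log' AS source, COUNT(*) AS row_count FROM production_log
UNION ALL
SELECT 'sales', COUNT(*) FROM sales
UNION ALL
SELECT 'milk_purchases', COUNT(*) FROM milk_purchases;
```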
Step 5: Profile all three sources
Profiling is how you understand what the data actually contains before you start transforming it. Direct Claude to profile each source, but be specific about what you need:
Profile all three tables. For each table, show: column names and types, null percentages per column, and row counts. Then cross-validate: how many unique varieties appear in production_log vs sales? Do the variety names match exactly (case-sensitive)? How many unique shepherd names appear in production_log vs milk_purchases? Do those names match exactly?
Packing the profiling into a single, structured request gives Claude enough context to produce a complete picture. Vague requests like "profile the data" tend to produce surface-level summaries that miss the cross-source comparisons you actually need.
The cross-source comparison is where the real findings surface. If variety names differ between production and sales -- "telemea" in one table, "Telemea" in the other -- that mismatch will silently break any join between the two sources. The staging models will need to normalize those names before any downstream aggregation works correctly.
Similarly, check whether shepherd names in the production log match the milk purchases ledger exactly. These are the same shepherds -- Ion Marginean, Florin Ciobanu, Gheorghe Tabacu -- recorded by different people in different spreadsheets. Spelling differences or extra whitespace will break the link between how much milk a shepherd sold and which batches used that milk.
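The cross-source checks in the prompt translate directly to set operations. A sketch in DuckDB SQL -- note that the column names `variety` and `shepherd` are assumptions about the CSV headers, not confirmed by the materials:

```sql
-- Varieties in production but not in sales, and vice versa (case-sensitive):
-- any rows in either output signal a naming mismatch to normalize in staging.
SELECT variety FROM production_log EXCEPT SELECT variety FROM sales;
SELECT variety FROM sales EXCEPT SELECT variety FROM production_log;

-- Same check for shepherd names across the two spreadsheets.
-- TRIM on the ledger side guards against the extra-whitespace case.
SELECT DISTINCT shepherd FROM production_log
EXCEPT
SELECT DISTINCT TRIM(shepherd) FROM milk_purchases;
```

An empty result from each query means the names line up exactly; any output is a finding to record before building the staging models.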
Step 6: Investigate the aging date nulls
The production log sample in Unit 1 showed blank aging end dates on some rows. Now profile the full dataset to see the scale of the problem:
What percentage of rows in production_log have a null aging_end_date? For the rows where it's null, is there a pattern -- specific varieties, time periods, or shepherds?
When Mihai moves cheese to the cold room before aging is complete, the end date never gets recorded. The actual aging duration for those batches has to be estimated -- the sale date minus the aging start date is one approximation, but it assumes the cheese was sold soon after leaving the cold room. That assumption is imperfect.
This matters for profitability. Aging time affects storage cost and yield. If you ignore the null end dates entirely, you lose roughly a fifth of the production data from the profitability calculation. If you estimate badly, the profitability numbers for varieties with long aging times -- cascaval, branza de burduf -- will be wrong. Either way, these nulls are not just missing data. They are a modeling decision that affects the final numbers Mihai will use to decide which varieties to produce next season.
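Quantifying the nulls and hunting for a pattern is a pair of aggregate queries. A sketch in DuckDB SQL (`aging_end_date` is named in the prompt above; `variety` is an assumed column name):

```sql
-- Scale: what share of batches never got an aging end date recorded?
SELECT
    ROUND(100.0 * COUNT(*) FILTER (WHERE aging_end_date IS NULL)
          / COUNT(*), 1) AS pct_null_end_dates
FROM production_log;

-- Pattern: is the gap concentrated in particular varieties?
SELECT
    variety,
    COUNT(*) FILTER (WHERE aging_end_date IS NULL) AS null_end_dates,
    COUNT(*) AS total_batches
FROM production_log
GROUP BY variety
ORDER BY null_end_dates DESC;
```

The same GROUP BY can be repeated over time periods or shepherds to test the other pattern hypotheses from the prompt.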
Check: Three sources loaded with correct row counts. The student can describe: how many varieties? How many shepherds? What percentage of aging end dates are null? How do production and sales connect without a shared batch number?