Step 1: Read Mihai's Slack message
Mihai Popescu runs Branzeria Carpati, a small artisan cheese operation in Sibiu, Romania. He makes six traditional varieties -- telemea, cascaval, branza de burduf, urda, cas, and nasal -- from sheep and cow milk sourced from local shepherds in the Transylvanian hills.
Open the Slack channel. Mihai's message is waiting.
He has a simple question and no way to answer it: which cheese variety actually makes money? His accountant asks every quarter. Mihai shrugs and says he thinks telemea does well because restaurants order a lot, but he has no numbers behind that guess. He has eight years of data spread across three spreadsheets kept by different people -- a production log, a sales record, and a milk purchase ledger. The data exists. The connection between them does not.
Three spreadsheets. Three different people maintaining them. One question nobody can answer.
Step 2: Message Mihai
Open the chat with Mihai. Pick one of the suggested messages to introduce yourself and confirm you understand what he needs.
Mihai responds with warmth. He talks about his grandmother's recipe for branza de burduf -- how she packed it into pine bark and let the resin flavor seep in over weeks. He mentions the quality of Transylvanian sheep milk, how the fat content varies by season and shepherd, how some shepherds bring richer milk from higher pastures. Then he catches himself: "But none of that helps me when my accountant asks which variety I should make more of. I just want to know which cheese makes money."
That mix of craft passion and business frustration is Mihai. He cares deeply about the cheese itself, but the business side of the operation runs on guesswork. The profitability question is not academic -- it determines which varieties he produces next season.
Step 3: Review the pipeline specification
Open materials/pipeline-spec.md. This defines what you are building and what "correct" looks like.
The target is profitability by cheese variety -- revenue minus milk cost for each of the six varieties, accounting for yield (cheese kilos out per milk kilos in) and aging time. Mihai also wants yield per variety and a quarterly summary for his accountant.
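To make the two quantities concrete, here is a minimal Python sketch of yield and of revenue minus milk cost. The function names and the numbers are illustrative assumptions, not part of the spec, and the sketch leaves aging time out of the calculation.

```python
def yield_ratio(kilos_cheese_out: float, kilos_milk_in: float) -> float:
    """Kilos of cheese produced per kilo of milk used."""
    if kilos_milk_in <= 0:
        raise ValueError("kilos_milk_in must be positive")
    return kilos_cheese_out / kilos_milk_in


def variety_profit(revenue: float, milk_cost: float) -> float:
    """Revenue minus milk cost for one variety (aging cost not modeled here)."""
    return revenue - milk_cost


# Illustrative numbers only, not taken from Mihai's data.
example_yield = yield_ratio(kilos_cheese_out=25.0, kilos_milk_in=100.0)
example_profit = variety_profit(revenue=5000.0, milk_cost=3200.0)
print(example_yield, example_profit)
```

In the actual project these calculations would live in the dbt models; the Python form is only a way to pin down the definitions before reading the rest of the spec.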
The spec describes three data sources: a production log (240 rows of batch-level records), a sales file (380 transactions), and a milk purchases ledger (195 records from local shepherds). The pipeline connects them through a dbt project with staging models, a profitability mart, and automated tests.
Read the verification targets at the bottom. These are the numbers you will check your work against when the pipeline is done. If the pipeline runs without errors but produces the wrong profitability numbers, Mihai will make the wrong decision about which varieties to produce next season. The numbers matter.
Step 4: Explore the production log
Open materials/production-log-sample.csv. This is a 12-row sample from the full production log.
Look at the columns: batch_number, variety, milk_type, kilos_milk_in, kilos_cheese_out, aging_start_date, aging_end_date, shepherd_name. Each row is one production batch -- one run of one cheese variety from one batch of milk.
Notice the batch_number column. Every batch has a unique identifier: B001, B002, B003. This is how Mihai tracks individual production runs.
Now look at aging_end_date. Some rows have dates. Some are blank. A missing aging end date means the cheese was moved to the cold room before aging finished, so the actual aging duration has to be estimated later. About 20% of production records have this gap. That detail will matter when you calculate profitability, because aging time affects cost.
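The two checks described so far, unique batch numbers and blank aging end dates, can be sketched in a few lines of Python. The column names come from the sample file; the three rows, their values, and the date format are made up for illustration and are not Mihai's real records.

```python
import csv
import io

# Hypothetical rows using the production log's column names.
SAMPLE = """batch_number,variety,milk_type,kilos_milk_in,kilos_cheese_out,aging_start_date,aging_end_date,shepherd_name
B001,telemea,sheep,100,25,2023-05-01,2023-05-21,Ion Marginean
B002,cascaval,cow,120,12,2023-05-03,,Florin Ciobanu
B003,urda,sheep,80,8,2023-05-04,2023-05-06,Gheorghe Tabacu
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Every batch_number should appear exactly once.
batch_ids = [row["batch_number"] for row in rows]
ids_unique = len(batch_ids) == len(set(batch_ids))

# Batches whose aging_end_date is blank need an estimated duration later.
missing_aging_end = [
    row["batch_number"] for row in rows if not row["aging_end_date"].strip()
]
missing_share = len(missing_aging_end) / len(rows)

print(ids_unique, missing_aging_end, round(missing_share, 2))
```

To try this on the real sample, you would replace the inline text with the contents of materials/production-log-sample.csv.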
Also notice shepherd_name. Each batch records which shepherd supplied the milk. Ion Marginean, Florin Ciobanu, Gheorghe Tabacu -- these are real people Mihai buys from. Their names appear in both the production log and the milk purchases ledger, linking the two sources.
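Because the spreadsheets are kept by different people, a name used as a link is worth checking before you rely on it. The sketch below assumes the ledger also stores the shepherd's name as free text and that spelling may vary in case or spacing; both are assumptions, and the ledger values shown are invented for illustration.

```python
def normalize_name(name: str) -> str:
    """Collapse whitespace and ignore case so small typing differences still match."""
    return " ".join(name.split()).casefold()


# Names as they might appear in each source (ledger values are hypothetical).
production_shepherds = ["Ion Marginean", "Florin Ciobanu ", "Gheorghe Tabacu"]
ledger_shepherds = ["ion marginean", "Florin Ciobanu"]

production_keys = {normalize_name(n) for n in production_shepherds}
ledger_keys = {normalize_name(n) for n in ledger_shepherds}

# Names present in both sources, and production names with no ledger match.
matched = sorted(production_keys & ledger_keys)
unmatched_in_production = sorted(production_keys - ledger_keys)

print(matched, unmatched_in_production)
```

Any name in the unmatched list is a batch whose milk cost could not be looked up, so it is worth resolving before building the profitability model.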
Step 5: Explore the sales data
Open materials/sales-sample.csv. This is a 12-row sample from the full sales file.
Look at the columns: customer_name, variety, quantity_sold_kg, price_per_kg, sale_date, customer_type. Mihai sells to restaurants in Sibiu and Bucharest, distributors across Transylvania, and directly at the Piata Mare market.
Now look for batch_number. It is not there.
The production log tracks every batch with a unique number, but the sales data never references those numbers. No field connects a specific sale to a specific production batch. The only overlap between production and sales is the variety column and approximate date ranges.
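You can confirm the gap by comparing the two column lists directly. This small Python sketch uses the column names from the two sample files; the variable names are illustrative.

```python
PRODUCTION_COLUMNS = {
    "batch_number", "variety", "milk_type", "kilos_milk_in",
    "kilos_cheese_out", "aging_start_date", "aging_end_date", "shepherd_name",
}
SALES_COLUMNS = {
    "customer_name", "variety", "quantity_sold_kg",
    "price_per_kg", "sale_date", "customer_type",
}

# Columns that exist in both files are the only candidates for a join.
shared_columns = PRODUCTION_COLUMNS & SALES_COLUMNS
sales_has_batch_key = "batch_number" in SALES_COLUMNS

print(shared_columns, sales_has_batch_key)
```

The only shared column is variety, which is why every join between production and sales in this project happens at the variety level.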
This is the core modeling constraint for the entire project. Per-batch profitability cannot be calculated because no key links a production batch to a sale. Profitability can only be calculated at the variety level -- total milk cost for all telemea batches versus total revenue from all telemea sales. That still answers Mihai's question, but at a coarser granularity than batch-level tracking.
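A variety-level calculation can be sketched as two separate aggregations that meet only on the variety name. The rows below are invented for illustration, and the milk_cost field is a hypothetical value assumed to have been attached to each batch from the milk purchases ledger beforehand.

```python
from collections import defaultdict

# Hypothetical batches with a precomputed milk_cost per batch.
production = [
    {"batch_number": "B001", "variety": "telemea", "milk_cost": 400.0},
    {"batch_number": "B002", "variety": "telemea", "milk_cost": 350.0},
    {"batch_number": "B003", "variety": "cascaval", "milk_cost": 500.0},
]
# Hypothetical sales rows; note there is no batch_number here.
sales = [
    {"variety": "telemea", "quantity_sold_kg": 20.0, "price_per_kg": 30.0},
    {"variety": "telemea", "quantity_sold_kg": 10.0, "price_per_kg": 32.0},
    {"variety": "cascaval", "quantity_sold_kg": 15.0, "price_per_kg": 40.0},
]


def profit_by_variety(production_rows, sales_rows):
    """Sum cost and revenue per variety, then subtract; batches are never matched to sales."""
    cost = defaultdict(float)
    revenue = defaultdict(float)
    for batch in production_rows:
        cost[batch["variety"]] += batch["milk_cost"]
    for sale in sales_rows:
        revenue[sale["variety"]] += sale["quantity_sold_kg"] * sale["price_per_kg"]
    varieties = sorted(set(cost) | set(revenue))
    return {v: revenue.get(v, 0.0) - cost.get(v, 0.0) for v in varieties}


profits = profit_by_variety(production, sales)
print(profits)
```

Each side is summed on its own before the subtraction, so the result never depends on knowing which batch a given sale came from.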
Check: The student can answer: What three data sources need to be connected? Why can't Mihai calculate profitability today? What field connects production to sales -- and what's the problem with that connection?