Learn by Directing AI
Unit 1

The voicemail and the data

Step 1: Project setup

Open a terminal and start Claude Code:

cd ~/dev
claude

Paste this prompt:

Set up my project:
1. Create ~/dev/data-science/p4
2. Download the project materials from https://learnbydirectingai.dev/materials/datascience/p4/materials.zip and extract them into that folder
3. Read CLAUDE.md -- it's the project governance file

Claude will create the folder, download and extract the materials, and read through CLAUDE.md. That file has the full project context: the client, the deliverable, the tech stack, the ticket list, and the verification guidance.

Once Claude confirms it has read CLAUDE.md, you are set up.

Step 2: Listen to Luciana's voicemail

Open materials/voicemail-transcript.md.

Luciana Moretti is the owner and winemaker at Bodega Moretti -- a small family winery in Mendoza, Argentina. Three vineyard plots at different altitudes, Malbec and Cabernet Sauvignon, about 15,000 bottles a year.

Her problem is specific: she tastes through hundreds of barrel samples each harvest to decide which ones become Reserve. The Reserve sells for four times the standard price. She knows she is inconsistent, she is getting busier, and she wants a model that can predict which barrels deserve Reserve based on the production data.

She also wants to know what production factors actually drive quality. And she needs something she can explain to her export partners.

Step 3: Reply to Luciana

Below the voicemail transcript, you will see reply options. Pick the one that fits -- something that confirms you will start by profiling the data.

Luciana responds within a day. She is relieved someone is looking at this, mentions the upcoming harvest, and adds: "I'd rather taste ten extra barrels than miss one great one." That sentence will matter more than it seems right now.

Step 4: Profile the dataset

Direct AI to load materials/barrel-data.csv. Ask for the shape, column names and types, null counts, and value ranges. Do this as a single focused request -- profile the dataset, nothing else.
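
If you want to sanity-check the numbers AI reports, a minimal sketch like the one below computes the same profile using only the Python standard library. It rests on several assumptions: you run it from the project folder so that materials/barrel-data.csv resolves, empty cells mark missing values, and the names profile_rows and profile_csv are illustrative rather than part of the project materials. AI will more likely reach for pandas; the sketch only shows what the profile contains.

```python
import csv
import math
from pathlib import Path


def profile_rows(rows, columns):
    """Summarize rows (dicts of strings): row count, null counts, numeric ranges."""
    nulls = {col: 0 for col in columns}
    ranges = {}           # column -> (min, max) for numeric columns
    text_columns = set()  # columns holding at least one non-numeric value
    for row in rows:
        for col in columns:
            value = (row.get(col) or "").strip()
            if value == "":
                nulls[col] += 1  # assumption: an empty cell means missing
                continue
            try:
                number = float(value)
            except ValueError:
                text_columns.add(col)
                continue
            if math.isnan(number):
                nulls[col] += 1  # a literal "NaN" also counts as missing
                continue
            low, high = ranges.get(col, (number, number))
            ranges[col] = (min(low, number), max(high, number))
    for col in text_columns:
        ranges.pop(col, None)  # mixed or text columns get no numeric range
    return {"n_rows": len(rows), "n_cols": len(columns),
            "nulls": nulls, "ranges": ranges}


def profile_csv(path):
    """Read a CSV with a header row and profile it."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return profile_rows(rows, reader.fieldnames or [])


csv_path = Path("materials/barrel-data.csv")  # relative to the project folder
if csv_path.exists():
    summary = profile_csv(csv_path)
    print(summary["n_rows"], "rows,", summary["n_cols"], "columns")
    for col, missing in summary["nulls"].items():
        print(col, "| nulls:", missing,
              "| range:", summary["ranges"].get(col, "non-numeric"))
```

Compare the row count and null counts against what AI reported. A mismatch is worth a follow-up question before you move on.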

Read the output. You have approximately 3,000 barrel samples across five vintages. Columns include production variables (fermentation temperature, altitude, rainfall, soil analysis, barrel aging, oak type) and an outcome variable (panel_score -- the blind tasting average on a 1-100 scale).

Notice what else is there. Some columns are straightforward production inputs. At least one column might not be what it seems. You will come back to this.

Step 5: Check the class distribution

This is the most important profiling step for this project. Direct AI to count how many barrels scored 90 or above (the Reserve threshold).

The answer will be around 8%: about 240 barrels out of roughly 3,000.

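That number -- 8% Reserve, 92% Standard -- determines everything about how you evaluate a model on this data. A model that labels every barrel Standard would be right about 92% of the time and would still find none of the Reserve barrels. A model that gets most predictions right is not the same as a model that finds the barrels Luciana needs.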
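
The arithmetic behind that claim can be sketched in a few lines. The counts 3,000 and 240 are the approximate figures from this step, not exact values from the file, and the function names are illustrative.

```python
RESERVE_THRESHOLD = 90  # Reserve threshold stated in this step


def reserve_share(scores, threshold=RESERVE_THRESHOLD):
    """Count scores at or above the threshold and return (count, share)."""
    reserve = sum(1 for score in scores if score >= threshold)
    return reserve, reserve / len(scores)


def always_standard_metrics(n_total, n_reserve):
    """Accuracy and Reserve recall for a baseline that labels every barrel Standard."""
    accuracy = (n_total - n_reserve) / n_total  # every Standard barrel is "right"
    recall = 0 / n_reserve                      # no Reserve barrel is ever found
    return accuracy, recall


# Approximate counts from the text: about 240 Reserve barrels out of 3,000.
accuracy, recall = always_standard_metrics(3000, 240)
print(f"accuracy={accuracy:.0%}, Reserve recall={recall:.0%}")
```

With these approximate counts the baseline scores 92% accuracy and 0% recall on Reserve, which is why accuracy alone cannot tell you whether a model is useful here.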

Step 6: Read the data dictionary

Open materials/data-dictionary.md. It explains what each column represents: which are production inputs, which are outcomes, what the measurement units are.

Pay attention to which variables are available before Luciana tastes the barrel (altitude, fermentation temperature, rainfall, soil analysis, barrel aging) and which describe the outcome (panel_score). That distinction will matter when you prepare the data for modeling.

Note any columns where the description suggests the value depends on the tasting outcome rather than predicting it.
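
One way to record that distinction is to keep an explicit list of columns to exclude from the model inputs. The sketch below is an illustration only: apart from panel_score, every column name in it is hypothetical, so substitute the real names from the data dictionary.

```python
OUTCOME_COLUMN = "panel_score"  # the outcome variable named in the dataset


def feature_columns(all_columns, outcome_dependent):
    """Return the columns known before tasting, in their original order."""
    excluded = {OUTCOME_COLUMN, *outcome_dependent}
    return [col for col in all_columns if col not in excluded]


# Hypothetical column names, for illustration only.
columns = [
    "altitude_m",
    "fermentation_temp_c",
    "hypothetical_post_tasting_note",
    "panel_score",
]
flagged = {"hypothetical_post_tasting_note"}  # columns you noted as outcome-dependent
print(feature_columns(columns, flagged))
```

Keeping the flagged set in one place makes it easy to revisit the decision if a column turns out to be safe, or unsafe, after all.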

✓ Check

Check: Dataset loaded. Class distribution: ~8% Reserve. The student can explain why this matters.