Step 1: Read Valentina's Email
Open materials/valentina-email.md. Valentina runs a specialty coffee export business in Huila, Colombia. Twelve farms, direct export to roasters in Europe, Japan, and the US. She commits to contracts six months in advance based on gut feeling and last year's numbers, and last year she was short 20-30% on an order to Copenhagen.
She has two years of daily sensor data from all twelve farms, along with harvest records. She wants predictions per farm for the upcoming harvest.
Step 2: Talk to Valentina
Open the chat with Valentina. Her email gives the broad picture, but there are details worth asking about. What does the data look like? How many harvests per year? Are there any changes to the farms or data quality issues she knows about?
When you ask about the data, she'll explain the sensor readings: daily temperature, rainfall, soil moisture, and humidity from each farm. There are two harvests per year in Huila -- one around April through June, another October through December. Weather during flowering and cherry development drives yield.
If you ask about data quality, she'll mention the sensor gaps: "Some sensors went offline during the heavy rains. Missing a week here and there. That's just how it is."
If you ask about changes to the farms, she'll tell you about the variety switch: "Two of my farms switched from Castillo to Gesha about eighteen months ago. La cosecha is completely different with Gesha. Higher quality, lower volume."
Both of these details matter for what comes next.
Step 3: Set Up the Project
Open your terminal and start Claude Code:
cd ~/dev
claude
Paste this prompt:
Create the folder ~/dev/ml/p4. Download the project materials from https://learnbydirectingai.dev/materials/ml/p4/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.
Claude will download the materials and set up the workspace. After it finishes, look at what's in materials/. You should see CLAUDE.md, valentina-email.md, tickets.md, eval-template.md, and the two data files: sensor-data.csv and harvest-records.csv.
Step 4: Profile the Sensor Data
Direct Claude to load and profile materials/sensor-data.csv. Ask it to show you the structure: what columns exist, what the date range is, how many farms, and what the value ranges look like.
The dataset has one row per farm per day. Twelve farms, about two years of daily readings. Each row records temperature, rainfall, soil moisture, humidity, altitude (fixed per farm), and the coffee variety.
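If you want to sanity-check what Claude reports, a profile along these lines is roughly what to expect. This is a minimal sketch, not the project's actual code: the column names `date` and `farm_id` are assumptions, so check them against the real header of sensor-data.csv first.

```python
import pandas as pd


def profile_sensor_data(df: pd.DataFrame) -> dict:
    """Summarize structure: columns, row count, farm count, date range, value ranges.

    The column names `date` and `farm_id` are assumptions; adjust them to
    match the real file.
    """
    dates = pd.to_datetime(df["date"])
    numeric = df.select_dtypes(include="number")
    return {
        "columns": list(df.columns),
        "n_rows": len(df),
        "n_farms": df["farm_id"].nunique(),
        "date_min": dates.min(),
        "date_max": dates.max(),
        # One row per numeric column, with its min and max.
        "ranges": numeric.agg(["min", "max"]).T,
    }


# Example usage (path taken from the project layout):
# df = pd.read_csv("materials/sensor-data.csv")
# for key, value in profile_sensor_data(df).items():
#     print(key, value, sep="\n", end="\n\n")
```

With one row per farm per day, `n_rows` should come out close to the number of farms times the number of days in the date range. A noticeably smaller number is the first hint that some days are missing.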
Look at the dates. This data has temporal structure -- it's ordered by time, with seasonal patterns. The sensor readings from October through March (the growing season) are what drive the harvest that follows. Keep that in mind.
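To make the seasonal structure concrete, here is one way the October through March window could be isolated per farm. Treat it as an illustrative sketch only: the column names (`date`, `farm_id`) and the choice to label each window by the year in which it ends are assumptions, not something the project materials prescribe.

```python
import pandas as pd


def growing_season_means(df: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Average readings per farm over each October-March window.

    A window is labeled by the year in which it ends, so October 2022
    through March 2023 becomes season_end_year 2023. Column names are
    assumptions; adjust them to the real file.
    """
    dates = pd.to_datetime(df["date"])
    in_window = dates.dt.month.isin([10, 11, 12, 1, 2, 3])
    season_df = df.loc[in_window].copy()
    window_dates = dates[in_window]
    # October-December readings belong to the window that ends the next year.
    season_df["season_end_year"] = (
        window_dates.dt.year + (window_dates.dt.month >= 10).astype(int)
    )
    grouped = season_df.groupby(["farm_id", "season_end_year"])[value_cols]
    return grouped.mean().reset_index()


# Example usage (column name `temperature` is hypothetical):
# df = pd.read_csv("materials/sensor-data.csv")
# print(growing_season_means(df, ["temperature"]))
```

Grouping by window rather than by calendar year keeps a single growing season from being split across two years.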
Direct Claude to check for gaps in the data. Some farms have fewer rows than expected from December through February -- those are the sensor outages Valentina mentioned. The data isn't broken; it's just incomplete during the heavy rain periods.
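A gap check of this kind can be sketched as follows. It assumes columns named `farm_id` and `date` (hypothetical names) and that every farm should have exactly one row per day across the dataset's full date range. Claude may take a different approach, so use this only as a reference point.

```python
import pandas as pd


def missing_days_by_month(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing daily rows per farm and calendar month.

    Assumes columns `farm_id` and `date` (hypothetical names) and one
    expected row per farm per day over the dataset's full date range.
    """
    dates = pd.to_datetime(df["date"]).dt.normalize()
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    rows = []
    for farm, farm_dates in dates.groupby(df["farm_id"]):
        # Days in the expected range with no reading for this farm.
        missing = expected.difference(pd.DatetimeIndex(farm_dates.unique()))
        counts = missing.to_period("M").value_counts()
        for month, count in counts.items():
            rows.append(
                {"farm_id": farm, "month": str(month), "missing_days": int(count)}
            )
    result = pd.DataFrame(rows, columns=["farm_id", "month", "missing_days"])
    return result.sort_values(["farm_id", "month"]).reset_index(drop=True)


# Example usage:
# df = pd.read_csv("materials/sensor-data.csv")
# print(missing_days_by_month(df))
```

If the outages match what Valentina described, the missing days should cluster in the December through February months rather than being spread evenly across the year.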
Step 5: Profile the Harvest Records
Direct Claude to load and profile materials/harvest-records.csv. This file has one row per farm per harvest period. Four harvest periods across two years: 2022-H2, 2023-H1, 2023-H2, 2024-H1.
Look at the yield numbers across farms. Most farms produce between 1,500 and 2,500 kilos per harvest. But two farms -- farm_05 and farm_09 -- show noticeably different numbers in their most recent harvest. Lower yield, but check the quality scores: they're higher. These are the Gesha farms Valentina mentioned. The variety change affected both volume and quality.
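One way to surface farms like these is to compare each farm's most recent harvest against the average of its earlier ones. The sketch below assumes columns named `farm_id` and `harvest_period`, plus a value column such as `yield_kg` or `quality_score`. All of these names are hypothetical, so match them to the real header of harvest-records.csv.

```python
import pandas as pd


def latest_vs_earlier(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Compare each farm's most recent harvest to the mean of its earlier ones.

    Assumes columns `farm_id` and `harvest_period` (hypothetical names) and
    period labels such as "2023-H1", which sort chronologically as strings.
    """
    table = df.pivot(index="farm_id", columns="harvest_period", values=value_col)
    table = table[sorted(table.columns)]
    latest = table.iloc[:, -1]
    earlier = table.iloc[:, :-1].mean(axis=1)
    summary = pd.DataFrame({
        "earlier_mean": earlier,
        "latest": latest,
        "pct_change": (latest - earlier) / earlier * 100,
    })
    # Largest drops first.
    return summary.sort_values("pct_change")


# Example usage (value column names are assumptions):
# harvests = pd.read_csv("materials/harvest-records.csv")
# print(latest_vs_earlier(harvests, "yield_kg"))
# print(latest_vs_earlier(harvests, "quality_score"))
```

Running it once for yield and once for quality puts the two effects side by side: a farm that changed variety should show a negative change in yield and a positive change in quality.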
This data profile tells you what you're working with: twelve farms, four harvests, temporal sensor readings with gaps, and a variety change that affects two farms. Every decision you make downstream -- features, splitting, outlier handling -- will reference what you've found here.
Check: The student can describe the dataset's temporal structure (daily readings, seasonal harvests) and name at least two data quality issues (sensor gaps, variety change).