Step 1: Project setup
Open a terminal and start Claude Code:
cd ~/dev
claude
Paste this prompt:
Set up my project:
1. Create ~/dev/data-science/p8
2. Download the project materials from https://learnbydirectingai.dev/materials/datascience/p8/materials.zip and extract them into that folder
3. Read CLAUDE.md -- it's the project governance file
Claude will create the folder, download and extract the materials, and read through the project context. The CLAUDE.md is one you author at session start -- standard practice since P7.
Step 2: Read Budi's message
Open the project in the platform. Budi Hartono owns a shrimp farm in Sidoarjo, East Java. Eight ponds, vannamei shrimp for export. He is texting you after a referral at a tech meetup.
Short message. Two data sources. One problem: they are in separate systems and he cannot connect them. He wants to know if water quality explains why some harvests are better.
His last line -- "My friend said something about SQL" -- is a hint about what comes next.
Step 3: Reply to Budi
Write your own message to Budi. There are no suggested replies -- you decide how to open the conversation.
Ask about the data. What does he have, where is it, how is it organized? Ask about his goal -- what specific questions does he want answered?
Through the conversation, discover:
- He has sensor readings (hourly, 3 ponds, 6 months) and production records (per-cycle, 8 ponds, 2 years)
- The sensor system uses IDs (SID-001, SID-003, SID-006) but the production records use names (Pond A through Pond H)
- He knows which pond has which sensor -- the mapping is in his head, not documented
That ID mismatch matters. Two systems, two naming conventions, no documented link between them.
Step 4: Profile the sensor data
Direct AI to load and profile materials/sensor-readings.csv. Ask for the shape, column names and types, date range, the distinct sensor IDs, and basic statistics for the water quality columns.
The data covers January through June 2025. Three sensors taking hourly readings: pH, dissolved oxygen, temperature, salinity. Some rows are missing -- gaps from power outages during heavy rain, about 2% of the data.
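As a rough idea of what such a profile can look like, here is a minimal sketch using only the Python standard library. The column names and the sample rows are assumptions made up for illustration; the real file may be laid out differently, and in a real session the AI would more likely reach for pandas. The sketch counts rows, finds the date range, and counts missing hourly readings per sensor. It also works out how many rows full hourly coverage would give, assuming the data runs from January 1 through June 30, 2025.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical sample standing in for materials/sensor-readings.csv.
# Column names and values are assumptions, not the real file's contents.
SAMPLE = """timestamp,sensor_id,ph,dissolved_oxygen,temperature,salinity
2025-01-01 00:00,SID-001,7.9,5.6,28.4,15.2
2025-01-01 01:00,SID-001,7.8,5.1,28.1,15.3
2025-01-01 03:00,SID-001,7.8,4.4,27.9,15.3
2025-01-01 00:00,SID-003,8.1,6.0,28.8,14.9
2025-01-01 01:00,SID-003,8.0,5.7,28.6,15.0
2025-01-01 02:00,SID-003,8.0,5.2,28.3,15.0
"""


def profile(rows):
    """Return row count, date range, and missing hourly readings per sensor."""
    by_sensor = {}
    for row in rows:
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M")
        by_sensor.setdefault(row["sensor_id"], set()).add(ts)
    all_ts = [ts for seen in by_sensor.values() for ts in seen]
    start, end = min(all_ts), max(all_ts)
    # Hours each sensor should have reported between first and last reading.
    expected = int((end - start) / timedelta(hours=1)) + 1
    missing = {sid: expected - len(seen) for sid, seen in by_sensor.items()}
    return {
        "rows": len(rows),
        "start": start,
        "end": end,
        "expected_per_sensor": expected,
        "missing": missing,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile(rows)
print(report)

# Rows that full hourly coverage would give, assuming three sensors
# reporting from 2025-01-01 00:00 through 2025-06-30 23:00.
full_hours = int((datetime(2025, 7, 1) - datetime(2025, 1, 1)) / timedelta(hours=1))
full_rows = full_hours * 3
print(full_hours, full_rows)
```

Comparing the actual row count against the full-coverage figure is a quick way to check how much data the outages cost.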
Look at the dissolved oxygen values. The range matters for shrimp survival -- below 4.0 mg/L causes stress, below 3.0 causes mortality. Some readings approach those thresholds.
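One way to make "approach those thresholds" concrete is to label each reading against the two cutoffs and count the labels per sensor. The sketch below assumes the thresholds stated above (4.0 and 3.0 mg/L); the readings are invented for illustration and the function name is hypothetical.

```python
from collections import Counter

STRESS_MG_L = 4.0      # below this, shrimp are stressed (threshold from the text)
MORTALITY_MG_L = 3.0   # below this, mortality occurs (threshold from the text)


def classify_do(mg_per_l):
    """Label one dissolved oxygen reading against the two thresholds."""
    if mg_per_l < MORTALITY_MG_L:
        return "mortality"
    if mg_per_l < STRESS_MG_L:
        return "stress"
    return "ok"


# Hypothetical readings (sensor_id, dissolved oxygen in mg/L); not real data.
readings = [
    ("SID-001", 5.6), ("SID-001", 4.1), ("SID-001", 3.8),
    ("SID-003", 6.0), ("SID-003", 3.9), ("SID-003", 2.9),
]

# Count how many readings fall into each band, per sensor.
counts = Counter((sid, classify_do(do)) for sid, do in readings)
for (sid, label), n in sorted(counts.items()):
    print(sid, label, n)
```

A reading of exactly 3.0 or 4.0 is not "below" the threshold, so this sketch places it in the milder band. Whether that boundary choice is right is a question for Budi.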
Step 5: Profile the production records
Direct AI to load and profile materials/production-records.csv. Eight ponds, four harvest cycles over two years. Each row is one pond in one cycle: stocking density, survival rate, average weight, feed conversion ratio, total yield.
Look at the survival rates across ponds. Some ponds consistently outperform others. Is that noise, or is there a pattern?
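A simple first look at "consistently" is to compare each pond's mean survival with how often it was the best pond in a cycle. The sketch below uses invented numbers for three ponds only, and the tuple layout is an assumption rather than the real file's schema. A pond with a high mean that also wins most cycles looks like a pattern; a pond whose mean comes from one good cycle looks more like noise.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (pond, cycle, survival rate in %) rows; not real data.
records = [
    ("Pond A", 1, 70.0), ("Pond A", 2, 72.0), ("Pond A", 3, 68.0), ("Pond A", 4, 71.0),
    ("Pond B", 1, 80.0), ("Pond B", 2, 83.0), ("Pond B", 3, 79.0), ("Pond B", 4, 82.0),
    ("Pond C", 1, 75.0), ("Pond C", 2, 66.0), ("Pond C", 3, 81.0), ("Pond C", 4, 70.0),
]

by_pond = defaultdict(list)
by_cycle = defaultdict(list)
for pond, cycle, survival in records:
    by_pond[pond].append(survival)
    by_cycle[cycle].append((survival, pond))

# Mean survival per pond across all cycles.
means = {pond: mean(vals) for pond, vals in by_pond.items()}

# Number of cycles in which each pond had the highest survival rate.
wins = {pond: 0 for pond in by_pond}
for cycle, entries in by_cycle.items():
    best_pond = max(entries)[1]
    wins[best_pond] += 1

for pond in sorted(means):
    print(pond, means[pond], wins[pond])
```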
Step 6: Identify the join challenge
The sensor data has sensor IDs. The production records have pond names. No column connects them directly. This is the first cross-source join problem -- two datasets that describe the same ponds but use different identifiers.
Ask Budi about the mapping. He provides it from memory: SID-001 is Pond C, SID-003 is Pond E, SID-006 is Pond G. Three of eight ponds have sensor data.
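Once Budi's mapping is written down, it can serve as a small lookup table that links the two sources. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The mapping values come from Budi's answer above; the table and column names are hypothetical, and the production table holds only pond names and cycle numbers, with no measurements. A LEFT JOIN from the production ponds shows which ponds have a sensor and which do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_pond_map (
    sensor_id TEXT PRIMARY KEY,
    pond_name TEXT UNIQUE
);
CREATE TABLE production (
    pond_name TEXT,
    cycle INTEGER
);
""")

# The mapping Budi gave from memory, now documented as data.
conn.executemany(
    "INSERT INTO sensor_pond_map VALUES (?, ?)",
    [("SID-001", "Pond C"), ("SID-003", "Pond E"), ("SID-006", "Pond G")],
)

# Eight ponds, four cycles each (structure only; table layout is an assumption).
conn.executemany(
    "INSERT INTO production VALUES (?, ?)",
    [(f"Pond {letter}", cycle) for letter in "ABCDEFGH" for cycle in range(1, 5)],
)

# Every production pond, with its sensor ID where one exists (NULL otherwise).
coverage = conn.execute("""
    SELECT p.pond_name, m.sensor_id
    FROM (SELECT DISTINCT pond_name FROM production) AS p
    LEFT JOIN sensor_pond_map AS m ON m.pond_name = p.pond_name
    ORDER BY p.pond_name
""").fetchall()

covered = [(pond, sid) for pond, sid in coverage if sid is not None]
uncovered = [pond for pond, sid in coverage if sid is None]
print(covered)
print(uncovered)
```

Keeping the mapping in its own table, rather than renaming values in either source, leaves both original datasets untouched and makes the link easy to correct if Budi's memory turns out to be wrong.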
Step 7: Plan the approach
Before starting the analysis, think about what kind of question Budi is asking. He wants to know if water quality explains harvest outcomes. Is that description, inference, or prediction?
It is mostly descriptive and exploratory -- which parameters correlate with which outcomes? There may be an inferential component (does water quality explain the difference between sensor and non-sensor ponds?) but with only three sensor-equipped ponds, the sample limits what you can claim.
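To see why three ponds limit the claim, consider what happens to a correlation computed from three points. The sketch below is purely illustrative: the numbers are invented, not taken from Budi's data, and the helper name is hypothetical. It computes a Pearson correlation between a per-pond mean dissolved oxygen value and a per-pond survival rate, then changes a single pond's outcome and recomputes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-pond summaries for three sensor-equipped ponds; not real data.
mean_do = [4.8, 5.5, 6.1]          # mean dissolved oxygen, mg/L
survival = [70.0, 78.0, 83.0]      # survival rate, %

r_original = pearson_r(mean_do, survival)

# Same ponds, but one pond has a single bad outcome.
survival_changed = [70.0, 78.0, 66.0]
r_changed = pearson_r(mean_do, survival_changed)

print(round(r_original, 3), round(r_changed, 3))
```

With these made-up numbers, one changed value moves the correlation from strongly positive to negative. That is the sense in which a three-pond sample supports description but not confident inference.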
Direct AI to outline the analytical approach before executing anything. Plan first, then work.
Check: Both datasets profiled. The student can state: sensor data has ~12,900 rows covering 3 ponds over 6 months; production records have 32 rows covering 8 ponds over 2 years. The sensor ID vs. pond name mismatch is identified.