Learn by Directing AI
Unit 1

Meet the client and the data

Step 1: Project setup

Open a terminal and start Claude Code:

cd ~/dev
claude

Paste this prompt:

Set up my project:
1. Create ~/dev/data-science/p3
2. Download the project materials from https://learnbydirectingai.dev/materials/datascience/p3/materials.zip and extract them into that folder
3. Read CLAUDE.md -- it's the project governance file

Claude will create the folder, download and extract the materials, and read through CLAUDE.md. That file has the full project context: the client, the deliverable, the tech stack, the ticket list, and the verification guidance.

Once Claude confirms it has read CLAUDE.md, you are set up.

Step 2: Read Somchai's email

Open materials/client-email.md.

Somchai Rattanapong is the Director of Operations for Baan Suan Hotels, a group of five boutique properties across Thailand: two beach resorts (Koh Samui, Krabi), a city hotel (Bangkok), a cultural property (Chiang Mai), and a nature retreat (Khao Yai).

His problem is specific: three data systems that do not talk to each other. Booking data lives in one place, guest review scores from TripAdvisor and Booking.com in another, and monthly revenue reports in a third. His board compares raw numbers across properties that serve completely different markets and draws conclusions that Somchai knows are wrong.

He wants three things: a fair comparison across properties, a way to tell real differences in guest satisfaction from noise, and something he can present to the board.

Step 3: Reply to Somchai

Below the email, you will see reply options. Pick the one that fits -- something that confirms you will start by reviewing the data and come back with an approach.

Somchai responds within a few hours. He is measured and professional: he confirms the data exports are ready, mentions that each property manager formats things "in their own way," and asks you to flag anything unusual in the data before proceeding.

Step 4: Profile each dataset independently

This project has three data sources. Before combining them, you need to know what each one contains.

Direct AI to load materials/bookings.csv first. Ask for the shape, column names and types, null counts, and a sample of values. Then do the same for materials/reviews.csv and materials/revenue.csv -- one at a time, not all at once.
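For reference, the profile you ask for might look something like the sketch below. It assumes AI reaches for pandas, which the lesson does not specify, and the function name profile_csv is hypothetical. Use it to judge whether the output you get back covers all four items: shape, column types, null counts, and a sample.

```python
import pandas as pd


def profile_csv(path, sample_rows=5):
    """Load one CSV and return a basic profile of it.

    Hypothetical helper for illustration; not part of the project materials.
    """
    df = pd.read_csv(path)
    return {
        # (row count, column count)
        "shape": df.shape,
        # column name -> inferred type, as a readable string
        "dtypes": df.dtypes.astype(str).to_dict(),
        # column name -> number of missing values
        "null_counts": df.isna().sum().to_dict(),
        # first few rows, to see what the values actually look like
        "sample": df.head(sample_rows),
    }
```

Calling profile_csv("materials/bookings.csv") and then repeating the call for the other two files keeps the three profiles separate, which matches the one-at-a-time approach above.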

Read the data dictionaries alongside each profile: materials/bookings-dictionary.md, materials/reviews-dictionary.md, materials/revenue-dictionary.md. The dictionaries tell you what each column represents. The profile tells you what the data actually looks like.

Pay attention to the date columns. Each source stores dates in its own format. Note these formats -- they will matter when you combine the sources.
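One way to pin down a date format is to test a column against a few candidate formats and see which one parses every value. The sketch below is a minimal illustration, assuming pandas; the candidate list and the function name match_rates are assumptions, and the real formats in the three files may differ from any of them.

```python
import pandas as pd

# Hypothetical candidates; extend this list with whatever the profile suggests.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%b-%Y"]


def match_rates(values, formats=CANDIDATE_FORMATS):
    """Return, for each format, the share of non-null values that parse."""
    s = pd.Series(values).dropna().astype(str)
    rates = {}
    for fmt in formats:
        # errors="coerce" turns values that do not fit the format into NaT
        parsed = pd.to_datetime(s, format=fmt, errors="coerce")
        rates[fmt] = float(parsed.notna().mean()) if len(s) else 0.0
    return rates
```

A format that scores 1.0 is a candidate, not a proof: a value such as 01/02/2024 parses under both day-first and month-first formats, so check a few values where the day is greater than 12 before trusting the result.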

Step 5: Read the data dictionaries

Open each data dictionary and compare what the three sources contain:

  • Bookings tells you about reservations: when they were booked, what room type, what rate, whether the guest showed up
  • Reviews tells you about guest satisfaction: scores, text reviews, which platform, reviewer nationality
  • Revenue tells you about money: monthly totals by category, occupancy rates

Some information overlaps (property name, dates). Some is unique to one source (review text only in reviews, occupancy rate only in revenue). The analysis will need data from all three.

Think about which columns could serve as join keys -- the fields you would use to link records across sources. Property name is obvious. Time period is trickier: bookings are per-reservation, reviews are per-review, revenue is monthly.
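The time-period problem can be sketched as follows: roll the per-event sources up to one row per property per month, then join on property and month. This is an illustration under assumptions, not the project's method. It assumes pandas, and every column name here (property, checkin_date, rate, month, total_revenue) is hypothetical, as are the toy rows, which are made up to show the mechanics and are not taken from Somchai's data.

```python
import pandas as pd


def to_property_month(df, date_col, date_format, value_col, agg):
    """Aggregate per-event rows to one row per property per month."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], format=date_format)
    # Represent the month as a "YYYY-MM" string so it joins cleanly
    out["month"] = dates.dt.strftime("%Y-%m")
    return out.groupby(["property", "month"], as_index=False)[value_col].agg(agg)


# Toy rows for illustration only (hypothetical columns and values).
bookings = pd.DataFrame({
    "property": ["Krabi", "Krabi", "Bangkok"],
    "checkin_date": ["2024-01-05", "2024-01-20", "2024-01-11"],
    "rate": [100.0, 200.0, 150.0],
})
revenue = pd.DataFrame({
    "property": ["Krabi", "Bangkok"],
    "month": ["2024-01", "2024-01"],
    "total_revenue": [9000.0, 12000.0],
})

monthly_bookings = to_property_month(
    bookings, "checkin_date", "%Y-%m-%d", "rate", "mean"
)

# indicator=True adds a _merge column showing which side each row came from,
# which makes property-months missing from one source easy to spot.
combined = monthly_bookings.merge(
    revenue, on=["property", "month"], how="outer", indicator=True
)
```

An outer join with the indicator column is a deliberate choice at this stage: rows marked left_only or right_only point to property names or months that do not line up across sources, which is exactly the kind of thing Somchai asked you to flag.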

Step 6: Record observations

Before moving on, note:

  • What each source tells you that the others do not
  • Which potential join keys you have identified
  • Any data quality issues you spotted during profiling (nulls, unexpected values, format differences)
  • What questions can only be answered by combining the sources

These observations feed directly into the methodology memo you will start filling in during the next unit.

✓ Check

Check: Each dataset loaded successfully. You can name the row count, column count, and date format for each source, and you have identified the null patterns in each.