Step 1: Project setup
Open a terminal and start Claude Code:
cd ~/dev
claude
Paste this prompt:
Set up my project:
1. Create ~/dev/data-science/p1
2. Download the project materials from https://learnbydirectingai.dev/materials/datascience/p1/materials.zip and extract them into that folder
3. Read CLAUDE.md — it's the project governance file
Claude will create the folder, download and extract the materials, and read through CLAUDE.md. That file describes the client, the deliverables, the tech stack, and the ticket list for the entire project. Every ticket, every verification target, and every file reference points back to it.
Once Claude confirms it has read CLAUDE.md, you are set up.
Step 2: Read the brief
Open materials/client-email.md. This is Wanjiku's email.
Wanjiku Muthoni runs a small veterinary clinic in Nairobi's Kilimani neighbourhood. She sees 25-30 pets a day by appointment, and people keep not showing up. Her receptionist Grace has been tracking every appointment for eighteen months — about 8,000 rows of data. Wanjiku has a staff meeting next month and wants real numbers: what is the actual no-show rate, are there patterns by day or visit type, and is it getting worse?
That is the entire brief. Four questions, one dataset, one deadline. Everything you compute in this project connects back to what Wanjiku needs to present to her team.
Step 3: Reply to Wanjiku
Below the email, you will see reply options. Pick the one that fits — something that confirms you have read the brief and will start with the data.
Wanjiku responds warmly. She is glad someone is looking at the numbers instead of guessing. She mentions that Grace put the spreadsheet together carefully, and that she is available by email if questions come up.
This is your first client interaction. The work starts from what the client needs, not from what the tools can do.
Step 4: The data dictionary
Open materials/data-dictionary.md. This is the column contract — it tells you exactly what each column means, what values are allowed, and how Grace records each field.
Eight columns:
| Column | What it means |
|---|---|
| date | When the appointment was scheduled (YYYY-MM-DD) |
| time_slot | Morning, Afternoon, or Evening |
| day_of_week | Monday through Saturday (clinic closed Sundays) |
| visit_type | Consultation, Vaccination, Dental, or Surgery |
| pet_species | Dog, Cat, or Rabbit |
| client_tenure | New or Returning |
| appointment_status | Show, No-show, or Cancelled |
| appointment_length | Standard (45 min) or Extended (90 min) |
Pay attention to appointment_status. Three categories: Show, No-show, and Cancelled. "No-show" means the client did not show up and gave no advance notice. "Cancelled" means they called ahead. That distinction matters later when you compute the no-show rate — cancellations are not no-shows.
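To see why the distinction matters, here is a minimal sketch on a ten-row toy series (invented values, not Grace's data). It assumes the rate is taken over all scheduled appointments; the brief does not fix the denominator, so treat that as a choice to confirm rather than a given.

```python
import pandas as pd

# Toy data for illustration only: 7 shows, 2 no-shows, 1 cancellation.
status = pd.Series(
    ["Show"] * 7 + ["No-show"] * 2 + ["Cancelled"],
    name="appointment_status",
)

# Count each category separately before computing any rate.
counts = status.value_counts()

# Only "No-show" rows count as no-shows.
# Assumption: the denominator is every scheduled appointment.
no_show_rate = (status == "No-show").mean()

# The mistake to avoid: lumping cancellations in with no-shows.
lumped_rate = status.isin(["No-show", "Cancelled"]).mean()

print(counts)
print(f"No-show rate: {no_show_rate:.0%}")
print(f"No-show + Cancelled (not the same thing): {lumped_rate:.0%}")
```

On this toy series the two figures are 20% and 30%. A single cancellation in ten rows is enough to move the headline number by ten points.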
The data dictionary is a contract. Before you compute anything, you check that the actual data matches what the dictionary promises. Column names, data types, allowed values. If the data and the dictionary disagree, you stop and figure out why before moving forward.
Step 5: Load and check the data
Direct Claude to load the dataset and verify it against the data dictionary. Ask it to:
- Load materials/appointments.csv into a pandas DataFrame
- Show the shape (rows and columns)
- Show the first few rows with df.head()
- Show the dtypes
- Check that column names match the data dictionary
- Check that categorical columns contain only the allowed values
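The checks in this list can be sketched roughly as follows. This is one possible shape, not necessarily what Claude will write. The names ALLOWED_VALUES, EXPECTED_COLUMNS, and check_against_dictionary are hypothetical, and the sketch assumes appointment_length is stored as the bare label ("Standard" or "Extended"). It runs on a two-row toy CSV with a deliberate typo so you can see a failure; for the real file you would use pd.read_csv("materials/appointments.csv").

```python
import io
import pandas as pd

# Allowed values transcribed from the data dictionary table.
# Assumption: appointment_length holds "Standard"/"Extended" without the
# "(45 min)" annotation. Adjust if the file stores it differently.
ALLOWED_VALUES = {
    "time_slot": {"Morning", "Afternoon", "Evening"},
    "day_of_week": {"Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"},
    "visit_type": {"Consultation", "Vaccination", "Dental", "Surgery"},
    "pet_species": {"Dog", "Cat", "Rabbit"},
    "client_tenure": {"New", "Returning"},
    "appointment_status": {"Show", "No-show", "Cancelled"},
    "appointment_length": {"Standard", "Extended"},
}
EXPECTED_COLUMNS = ["date"] + list(ALLOWED_VALUES)


def check_against_dictionary(df):
    """Return a list of mismatches; an empty list means every check passed."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    extra = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    for col, allowed in ALLOWED_VALUES.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - allowed
            if unexpected:
                problems.append(f"{col}: unexpected values {sorted(unexpected)}")
    return problems


# Toy rows for illustration only; "Evning" is a deliberate typo.
toy_csv = (
    "date,time_slot,day_of_week,visit_type,pet_species,"
    "client_tenure,appointment_status,appointment_length\n"
    "2024-01-08,Morning,Monday,Vaccination,Dog,New,Show,Standard\n"
    "2024-01-09,Evning,Tuesday,Dental,Cat,Returning,No-show,Extended\n"
)
toy = pd.read_csv(io.StringIO(toy_csv))

print(toy.shape)    # (rows, columns)
print(toy.head())   # first few rows
print(toy.dtypes)   # one dtype per column
problems = check_against_dictionary(toy)
print(problems)
```

Returning a list of problems, rather than stopping at the first one, lets you see every disagreement between the data and the dictionary in a single pass.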
Each column in a dataset has a data type (dtype) that tells the computer how to store and interpret it: int64 means whole numbers, object usually means text, and datetime64 means dates. When a date column shows as object, the data was loaded as plain text instead of being recognized as a date. A categorical column contains a fixed set of possible values, such as payment method (cash, card, mobile) or region (north, south, east, west). In this dataset, columns like visit_type and day_of_week are categorical.
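A short sketch of the date-as-text situation, using a two-row toy CSV rather than the real file. By default pd.read_csv leaves the date column as text; pd.to_datetime converts it afterwards. Passing parse_dates=["date"] to read_csv is an alternative that does the conversion at load time. Depending on your pandas version, text columns may be reported under a dedicated string dtype rather than object.

```python
import io
import pandas as pd

# Toy rows for illustration only.
toy_csv = "date,visit_type\n2024-01-08,Dental\n2024-01-09,Surgery\n"

df = pd.read_csv(io.StringIO(toy_csv))
print(df.dtypes)  # date is still text at this point

# Record whether pandas recognized the column as a date on load.
was_date_on_load = pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert explicitly, using the YYYY-MM-DD format from the data dictionary.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df.dtypes)  # date is now a datetime column

# Categorical columns are easiest to inspect by listing their distinct values.
print(sorted(df["visit_type"].unique()))
```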
Compare what Claude shows you against the data dictionary. Do the column names match exactly? Do the dtypes make sense — dates as dates, categories as strings? Are the allowed values what Grace documented?
This is the first analytical act: checking reality against documentation. It is not a formality. If the column names do not match, every downstream reference breaks. If a category has an unexpected value, every count and percentage is wrong. The data dictionary is only useful if someone checks it.
✓ Check: The dataset should have approximately 8,000 rows and 8 columns. Column names and dtypes match the data dictionary.