Step 1: Read the Brief
Open materials/emeka-brief.md. This is the email from Emeka Okafor, who runs customer retention at Tunde Mobile — an MVNO in Lagos. His team makes 200 calls a week trying to stop subscribers from leaving, and right now they're guessing who to call.
What he wants: a model that predicts which subscribers will churn, ranked by risk, served as an API his retention team can query weekly. He also wants to know which features drive the predictions — what signals separate the subscribers who leave from the ones who stay.
Read the brief carefully. The details matter. Everything you direct AI to build flows from what Emeka actually asked for.
Also open materials/tickets.md. This is the ticket breakdown for the entire project — every task, already scoped. You won't create tickets; you'll work through them. Scan the full list now so you know where the project is headed. The first few tickets are what you'll work on in this unit.
Step 2: Respond to Emeka
Open the chat with Emeka. He's waiting to hear that someone is picking this up. Send a reply confirming you understand the problem: subscriber churn, ranked predictions, API delivery. If you have initial questions about the data or his team's workflow, this is the time to ask.
Emeka will reply — he's glad someone's on it. His retention team has been asking for this for months. He'll confirm the data is a clean export from their billing system and ask when he can expect a first look at results.
Keep his tone in mind as you work. He's warm, optimistic, moves fast, and uses "kindly" without thinking about it. When you communicate results later, you're writing to this person.
Step 3: Set Up the Project
Open your terminal and start Claude Code:
cd ~/dev
claude
Paste this prompt:
Set up my project:
1. Create ~/dev/ml/p1
2. Download the project materials from https://learnbydirectingai.dev/materials/ml/p1/materials.zip and extract them into that folder
3. Read CLAUDE.md — it's the project governance file
Claude Code will download the project files, unpack them into a folder, and set up the workspace. After it finishes, you will have a materials/ folder containing the data and documentation for this project.
Take a minute to look at what's in materials/. You should see emeka-brief.md (the email you already read), data-dictionary.md, subscribers.csv, tickets.md, and CLAUDE.md. These are your working inputs for the entire project.
Open materials/data-dictionary.md. This describes every column in the subscriber dataset — name, type, description, expected range. You'll check the actual data against this in a few minutes.
Step 4: Profile the Data
When you give Claude Code an instruction that involves code, it writes and runs the code, then shows you the output. You direct what to build; Claude handles the implementation.
Direct Claude to load materials/subscribers.csv and profile it. You want the basics: row count, column names and types, summary statistics for numeric columns, missing value counts, and the distribution of the churn column.
Something like: "Load subscribers.csv and give me a full data profile — shape, dtypes, describe, missing values, and churn class distribution."
What you type is everything Claude knows about this task. If you ask for "a data profile," Claude will decide what that means based on its training — which might not include everything you need. Be specific about what you want to see.
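If you want to see what a profile like that involves under the hood, here is a minimal pandas sketch. It runs on a hypothetical miniature of the dataset — every column name except churn is an assumption for illustration; the real schema comes from subscribers.csv and the data dictionary.

```python
import pandas as pd

# Hypothetical stand-in for materials/subscribers.csv.
# Column names other than "churn" are invented for this sketch.
df = pd.DataFrame({
    "tenure_months": [12, 3, 48, 60, 7, 24],
    "monthly_charge": [5.0, 3.5, None, 8.0, 4.2, 6.1],
    "contract_type": ["prepaid", "prepaid", "postpaid",
                      "postpaid", "prepaid", "prepaid"],
    "churn": [0, 1, 0, 0, 1, 0],
})

print(df.shape)        # (rows, columns)
print(df.dtypes)       # numeric vs categorical (object) columns
print(df.describe())   # summary statistics for numeric columns
print(df.isna().sum()) # missing-value count per column
print(df["churn"].value_counts(normalize=True))  # class distribution
```

This is the same checklist as the prompt above — shape, dtypes, describe, missing values, class distribution — so you can sanity-check Claude's output against what each call actually reports.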
Look at the output. A few things to notice:
The class distribution. Roughly 92% of subscribers didn't churn. Only about 8% did. That ratio matters more than any other number on the screen. A model that predicts "no churn" for every single subscriber would be 92% accurate — and would catch zero of the people Emeka's team needs to call. Keep this in mind. It comes back in Unit 3 when you evaluate the model.
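To make that concrete, here is a small sketch of the "always predict no churn" baseline on synthetic labels with the same 8% positive rate. The numbers are illustrative, not from the actual dataset.

```python
import numpy as np

# Synthetic labels matching the ~8% churn rate; 1 = churned.
y_true = np.array([0] * 92 + [1] * 8)
y_pred = np.zeros_like(y_true)  # baseline: predict "no churn" for everyone

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of actual churners caught

print(accuracy)  # 0.92 -- looks impressive
print(recall)    # 0.0  -- catches none of the people Emeka's team needs to call
```

High accuracy with zero recall on the positive class is exactly the trap an imbalanced dataset sets, which is why the churn ratio matters more than any other number in the profile.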
The column types. Some columns are numeric (tenure, charges). Others are categorical (contract type, payment method). These will need different preprocessing before they can go into a model — that's Unit 2's problem, but noticing the types now is part of understanding the data.
Missing values. Check which columns have them and how many. A column with 2% missing is a different situation than a column with 40% missing. The strategy for handling them depends on what the column is and why values might be absent.
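One way to see that difference at a glance is to compute the missing percentage per column. A minimal sketch, on invented columns — the threshold of 30% below is an arbitrary illustration, not a rule:

```python
import pandas as pd

# Hypothetical frame; real column names come from subscribers.csv.
df = pd.DataFrame({
    "tenure_months": [12, None, 48, 60],    # 1 of 4 missing  -> 25%
    "data_usage_gb": [None, None, 1.2, None],  # 3 of 4 missing -> 75%
})

missing_pct = df.isna().mean() * 100
print(missing_pct)

# Columns with heavy missingness usually need a different strategy
# (drop, or model the missingness) than light imputation.
heavy = missing_pct[missing_pct > 30].index.tolist()
print(heavy)  # ['data_usage_gb']
```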
Step 5: Review Against the Data Dictionary
Open materials/data-dictionary.md side by side with the profile output. Check:
- Do the columns in the data match the columns in the dictionary? Any missing? Any extra?
- Are the types what you'd expect? A column described as categorical should show up as object, not int64.
- Are the ranges reasonable? If tenure_months is supposed to be 1-72 and the max is 720, something is wrong.
- Does the row count match what the dictionary says to expect?
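The checks above can be sketched as a small validation loop. The spec below is a hypothetical encoding of what a data dictionary entry might say — the real column names, types, and ranges live in materials/data-dictionary.md.

```python
import pandas as pd

# Hypothetical dictionary spec -- names and ranges are assumptions
# for illustration, not the real contents of data-dictionary.md.
spec = {
    "tenure_months": {"dtype": "int64", "min": 1, "max": 72},
    "contract_type": {"dtype": "object"},
}

df = pd.DataFrame({
    "tenure_months": [5, 720, 36],  # 720 is out of range -- a red flag
    "contract_type": ["prepaid", "postpaid", "prepaid"],
})

problems = []
for col, rules in spec.items():
    if col not in df.columns:
        problems.append(f"missing column: {col}")
        continue
    if str(df[col].dtype) != rules["dtype"]:
        problems.append(f"{col}: dtype {df[col].dtype}, expected {rules['dtype']}")
    if "min" in rules and (df[col].lt(rules["min"]).any()
                           or df[col].gt(rules["max"]).any()):
        problems.append(f"{col}: values outside {rules['min']}-{rules['max']}")

print(problems)  # ['tenure_months: values outside 1-72']
```

You could hand a loop like this to Claude as a description of what "review against the dictionary" means, rather than writing it yourself — the point is knowing which checks to direct.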
This is the first professional act in an ML project: understanding what you have before you do anything with it. Every decision downstream — how to handle missing values, how to encode categories, how to split the data, what metrics to use — depends on what the data actually looks like, not what you assume it looks like.
✓ Check: The data profile shows the dataset has the expected number of rows and columns from the data dictionary. The churn class distribution shows approximately 8% positive class.