Learn by Directing AI
Unit 1

Priya's Matching Problem

Step 1: Set up the project

Open a terminal, navigate to your dev directory, and start Claude Code.

cd ~/dev
claude

Paste this setup prompt:

Create the folder ~/dev/ml/p6. Download the project materials from https://learnbydirectingai.dev/materials/ml/p6/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude downloads the materials, extracts them, and reads the governance file. Once it finishes, you have a project workspace with the placement dataset, a Pipeline template, and a fairness audit guide.

Step 2: Read Priya's message

Priya reached out on Slack. She runs MedConnect Staffing in Bangalore -- 400 nurses placed per quarter across 80 hospitals in South India. Her team does matching manually, comparing skills, certifications, location preferences, and shift availability against hospital requirements. It works, but it is slow.

She wants a model that scores nurse-hospital match quality so her team works from ranked lists instead of spreadsheets. Three years of placement data are ready. And she mentioned something that will matter later: equitable placement. She does not want a system that automates existing patterns without questioning them.

Step 3: Talk to Priya

Open a chat with Priya. She is available to clarify what a successful match looks like, what her team struggles with most, and what equitable placement means to her.

Some things worth asking about: how her team currently decides between candidates, which hospitals are hardest to fill, what makes a placement succeed or fail, and what she means by equitable placement. Her answers will sharpen your understanding of the delivery target.

Priya is formal in writing but animated when she talks about healthcare equity. She will go on tangents about nursing shortages in rural Karnataka and Tamil Nadu. She gives useful detail when you ask specific questions.

Step 4: Profile the dataset

Profile materials/placement-data.csv. Ask Claude to load the data and show you the shape, column types, distributions, and missing values.

This dataset is different from what you have worked with before. Some columns contain numbers (years of experience, satisfaction ratings). Some contain categorical values (region, specialization, shift preference). And some contain free text -- nurse bios and hospital requirement notes are written in plain English, 20 to 60 words each.

That mix matters. You cannot apply the same preprocessing to a numeric column, a categorical column, and a paragraph of text. Each type needs its own treatment. Recognizing what type each column is -- and knowing they need different pipelines -- is the first step toward building a system that handles heterogeneous data.
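The triage described above can be sketched in plain Python. The rows and column names here are hypothetical stand-ins for placement-data.csv; because free text and categories are both strings, the text columns have to be named by hand after you have inspected the data:

```python
# Hypothetical sample rows mirroring the dataset's mix of column types.
# Real column names in placement-data.csv may differ.
rows = [
    {"years_experience": 7, "region": "South India",
     "nurse_bio": "ICU nurse with pediatric certification."},
    {"years_experience": 2, "region": "West India",
     "nurse_bio": "Experienced in emergency triage."},
]

def split_columns(rows, text_columns):
    """Partition columns into numeric, categorical, and free-text groups
    so each group can be routed to its own preprocessing pipeline."""
    sample = rows[0]
    numeric = [c for c, v in sample.items() if isinstance(v, (int, float))]
    text = [c for c in sample if c in text_columns]
    categorical = [c for c in sample if c not in numeric and c not in text]
    return numeric, categorical, text

numeric, categorical, text = split_columns(rows, {"nurse_bio"})
print(numeric, categorical, text)
```

The split is the point: each group feeds a different preprocessing path, so getting the assignment right here saves rework later.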

Step 5: Identify data quality issues

Look closer at the distributions. Three things should stand out.

First, the nurse regions. The data covers six regions: South India, West India, North India, East India, Northeast India, and Central India. Note how the records are distributed across them.
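One way to see the regional distribution is a simple frequency count. The region values below are hypothetical; the real counts come from the region column of placement-data.csv:

```python
from collections import Counter

# Hypothetical sample of nurse region values -- substitute the real
# region column from placement-data.csv.
regions = [
    "South India", "South India", "South India", "South India",
    "West India", "North India", "South India", "East India",
]

counts = Counter(regions)
total = len(regions)
# Share of records per region, sorted most common first.
shares = {region: count / total for region, count in counts.most_common()}
print(shares)
```

If one region dominates the shares, that imbalance is worth writing down now: it will shape both the model's behavior and the fairness audit later.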

Second, the satisfaction ratings. Ask Claude to show the distribution of hospital_satisfaction_rating. Most ratings cluster at 4 and 5. Very few hospitals ever rate a placement below 4. When nearly everyone gets a high rating, the ratings stop being useful for distinguishing good matches from adequate ones.
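The skew check is a one-liner once you frame it as a threshold share. The ratings below are made up for illustration; the real values come from hospital_satisfaction_rating:

```python
# Hypothetical ratings illustrating the skew check -- replace with the
# hospital_satisfaction_rating column from placement-data.csv.
ratings = [5, 4, 5, 5, 4, 3, 5, 4, 4, 5]

def high_rating_share(values, threshold=4):
    """Fraction of ratings at or above the threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)

share = high_rating_share(ratings)
print(f"{share:.0%} of ratings are 4 or 5")
```

A share above 0.8 means the rating barely separates good matches from adequate ones, which constrains how you can use it as a target or feature.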

Third, missing values. Some nurse bios and hospital requirement notes are empty. How many? Where? This matters because the text columns are part of the feature set -- missing text means missing information the model could have used.
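Counting the missing text is straightforward if you treat None and blank strings the same way. The rows and column names here are hypothetical placeholders for the dataset's free-text fields:

```python
# Hypothetical rows -- None and empty strings stand in for missing bios
# and requirement notes in placement-data.csv.
rows = [
    {"nurse_bio": "ICU nurse, 5 years.", "hospital_notes": ""},
    {"nurse_bio": None, "hospital_notes": "Needs night-shift coverage."},
    {"nurse_bio": "Pediatric specialist.", "hospital_notes": None},
]

def missing_text_counts(rows, columns):
    """Count rows where a text field is None or blank, per column."""
    return {
        col: sum(1 for r in rows if not (r.get(col) or "").strip())
        for col in columns
    }

print(missing_text_counts(rows, ["nurse_bio", "hospital_notes"]))
```

Treating whitespace-only strings as missing matters here: a bio of " " carries no more information than an empty cell.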

Step 6: Document findings

Write up what you found: the row count, the column types (numeric, categorical, text), the regional distribution, the rating skew, and the missing value patterns. This profiling document is the foundation for every decision that follows -- Pipeline design, feature selection, and evaluation strategy all depend on understanding what the data looks like.

✓ Check

Check: The profiling output shows the correct number of rows, identifies the text and tabular columns separately, and flags the satisfaction rating skew (>80% of ratings are 4 or 5).
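The check above can be turned into a few assertions over your profiling summary. The numbers in this dict are placeholders, not the real answers; substitute the values from your own profiling run:

```python
# Placeholder profiling summary -- all values here are hypothetical.
# Fill in the real numbers from your run against placement-data.csv.
profile = {
    "row_count": 1200,                        # hypothetical
    "text_columns": ["nurse_bio", "hospital_notes"],   # names assumed
    "tabular_columns": ["years_experience", "region"], # names assumed
    "share_rated_4_or_5": 0.86,               # hypothetical
}

assert profile["row_count"] > 0
# Text and tabular columns are identified separately and do not overlap.
assert profile["text_columns"] and profile["tabular_columns"]
assert set(profile["text_columns"]).isdisjoint(profile["tabular_columns"])
# The skew flag from the check: more than 80% of ratings are 4 or 5.
assert profile["share_rated_4_or_5"] > 0.80
print("profiling checks passed")
```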