Learn by Directing AI
Unit 2

Profile the data

Step 1: Summary statistics

The dataset is loaded and matches the data dictionary. Before computing anything for Wanjiku, you need to know what the data actually looks like — ranges, distributions, anything unexpected.

Direct Claude to compute summary statistics for all columns. For categorical columns (visit_type, day_of_week, time_slot, pet_species, client_tenure, appointment_status), ask for value counts. For the date column, ask for the range.
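For a sense of what Claude might run in response, here is a minimal pandas sketch. The four-row DataFrame, the category values, and the column name appointment_date are made up for illustration; the real names and values come from the data dictionary.

```python
import pandas as pd

# Tiny made-up sample standing in for the clinic data.
# "appointment_date" and every value below are illustrative assumptions.
df = pd.DataFrame({
    "appointment_date": pd.to_datetime(
        ["2023-01-09", "2023-06-14", "2024-03-01", "2024-06-28"]
    ),
    "visit_type": ["Checkup", "Vaccination", "Checkup", "Surgery"],
    "day_of_week": ["Monday", "Wednesday", "Friday", "Friday"],
    "appointment_status": ["Show", "No-show", "Show", "Cancelled"],
})

categorical_columns = ["visit_type", "day_of_week", "appointment_status"]

# One value-count table per categorical column.
# dropna=False keeps blanks visible as their own row instead of hiding them.
value_counts = {
    col: df[col].value_counts(dropna=False) for col in categorical_columns
}
for col, counts in value_counts.items():
    print(f"\n{col}\n{counts.to_string()}")

# Earliest and latest dates in the date column.
date_min = df["appointment_date"].min()
date_max = df["appointment_date"].max()
print(f"\nDates run from {date_min.date()} to {date_max.date()}")
```

On the real dataset the same pattern would extend to all six categorical columns listed above.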

Check the output against the data dictionary. Do the categorical columns contain only the allowed values? Does the date range cover roughly 18 months? Are there any values that do not belong — a visit_type that Grace did not document, a day_of_week with a typo?
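The comparison against the data dictionary can also be done mechanically rather than by eye. The sketch below assumes you have transcribed the allowed values into a Python dict. The day list is an assumption, the helper unexpected_values is hypothetical, and the typo is planted on purpose so there is something to find.

```python
import pandas as pd

# Allowed values as they might be transcribed from the data dictionary.
# The day_of_week list is an illustrative assumption.
allowed = {
    "day_of_week": {
        "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"
    },
    "appointment_status": {"Show", "No-show", "Cancelled"},
}

# Made-up rows; "Wendesday" is a deliberate typo.
df = pd.DataFrame({
    "day_of_week": ["Monday", "Tuesday", "Wendesday", "Friday"],
    "appointment_status": ["Show", "No-show", "Show", "Cancelled"],
})


def unexpected_values(frame, allowed_values):
    """Return, per column, the observed values that are not in the allowed set."""
    problems = {}
    for col, ok in allowed_values.items():
        observed = set(frame[col].dropna().unique())
        extra = observed - ok
        if extra:
            problems[col] = sorted(extra)
    return problems


problems = unexpected_values(df, allowed)
print(problems)
```

An empty result means every observed category is documented. Anything else is a question for Grace, not something to silently correct.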

Summary statistics are not a formality. They are the first moment where you find out whether the data is what you were told it is. A category that should not exist, a date outside the expected range, a numeric column where you expected a string — these show up here, and they are much cheaper to catch now than after you have built an analysis on top of them.

Step 2: Missing values

Direct Claude to check for missing values across all columns. Ask for the count and percentage of missing values per column.

What matters is not just the count but the pattern. A few missing values scattered randomly across columns are routine. Missing values concentrated in one column (say, 200 out of 8,000 rows missing client_tenure) mean that column's analysis needs a decision about how to handle the gaps. Missing values concentrated in a time period (all the blanks are from September) might mean Grace changed how she recorded data partway through.
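Both views, the per-column totals and the pattern over time, fit in a few lines of pandas. This is a sketch on made-up rows. The column name appointment_date is an assumption, and the September gap is planted to mirror the example above.

```python
import pandas as pd

# Made-up rows; the two blanks in client_tenure both fall in September.
df = pd.DataFrame({
    "appointment_date": pd.to_datetime(
        ["2023-08-30", "2023-09-04", "2023-09-12", "2023-09-25", "2023-10-02"]
    ),
    "client_tenure": ["New", None, None, "Returning", "Returning"],
    "visit_type": ["Checkup", "Checkup", "Vaccination", "Checkup", "Surgery"],
})

# Count and percentage of missing values per column.
missing_count = df.isna().sum()
missing_pct = (df.isna().mean() * 100).round(1)
missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(missing_summary)

# Pattern check: in which months do the client_tenure blanks fall?
month = df["appointment_date"].dt.strftime("%Y-%m").rename("month")
missing_by_month = df["client_tenure"].isna().groupby(month).sum()
print(missing_by_month)
```

If the blanks pile up in one month, as they do in this toy data, that is a note for later, not something to repair yet.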

At this stage, you are not fixing anything. You are noting what is there so nothing surprises you downstream.

Step 3: The appointment status distribution

One column matters more than the others right now: appointment_status. This is the column Wanjiku's entire question revolves around.

Direct Claude to show the value counts for appointment_status — how many Show, how many No-show, how many Cancelled.

Look at the counts. The majority should be Show, since most people do keep their appointments. The No-show and Cancelled counts tell you the rough scale of the problem before you compute any rates. If No-show is, say, 800 out of 8,000, that is about one row in ten, and you are already getting a feel for the magnitude. That feel matters: it becomes a reference point when you compute the actual rate in the next unit.

Notice that Cancelled is a separate category from No-show. A cancellation means the client called ahead. A no-show means they simply did not appear. That distinction drives a denominator decision you will face shortly.
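A minimal sketch of the status tally, using ten made-up rows. The percent_of_rows column is deliberately a share of all rows, not a no-show rate, because the denominator question is still open at this point.

```python
import pandas as pd

# Made-up statuses, only to show the shape of the output.
status = pd.Series(
    ["Show"] * 7 + ["No-show"] * 2 + ["Cancelled"] * 1,
    name="appointment_status",
)

# Raw counts per category, largest first.
counts = status.value_counts()

# Share of all rows per category, as a percentage.
shares = status.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"count": counts, "percent_of_rows": shares})
print(summary)
```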

Step 4: Undirected profiling

Now try something. Ask Claude to "profile the data" or "do an EDA" without specifying what to focus on.

What comes back is typically a wall of plots: histograms for every column, bar charts for every categorical variable, maybe a correlation heatmap and some scatter plots. Fifteen or twenty visualizations, each individually fine, collectively unfocused. You see pet_species plotted against appointment_length, day_of_week crossed with time_slot, a histogram of dates, and a box plot of something that does not have a meaningful box plot.

This is what happens when AI fills in the gaps. You asked it to profile the data without saying what question the profile should serve. AI does not know about Wanjiku or her staff meeting or the no-show problem — it only knows what you told it. So it profiles everything, because everything is equally relevant when there is no question to anchor the work.

The output looks thorough. It is not useful. Most of those plots tell you nothing about why people miss appointments at Wanjiku's clinic. The volume of output can feel like progress, but volume is not insight.

Open materials/analysis-specification.md. Section 2 lists the variables Wanjiku specifically asked about: day_of_week, time_slot, visit_type, and client_tenure. Those are the variables that matter for her question. A useful profile focuses there.

Step 5: Focused profiling

Direct Claude to show the distributions of the four variables from the analysis specification: day_of_week, time_slot, visit_type, and client_tenure. Ask for bar charts — one per variable.

The difference is immediate. Four clean charts, each showing the distribution of a variable that connects to Wanjiku's question. You can see whether appointments are spread evenly across days or concentrated on certain days. You can see the balance between visit types. You can see how many new clients versus returning clients are in the data.
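A focused request like this tends to turn into something close to the sketch below: one bar chart per variable, nothing else. The rows and category values are made up, and the use of matplotlib with a 2x2 grid is an assumption about how Claude might lay it out.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; all category values are illustrative assumptions.
df = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Tuesday", "Friday", "Friday", "Friday"],
    "time_slot": ["Morning", "Afternoon", "Morning", "Morning", "Afternoon", "Afternoon"],
    "visit_type": ["Checkup", "Vaccination", "Checkup", "Surgery", "Checkup", "Checkup"],
    "client_tenure": ["New", "Returning", "Returning", "New", "Returning", "Returning"],
})

# The four variables named in the analysis specification.
focus_columns = ["day_of_week", "time_slot", "visit_type", "client_tenure"]

# One bar chart per variable, arranged in a 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
for ax, col in zip(axes.flat, focus_columns):
    counts = df[col].value_counts()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(col)
    ax.set_ylabel("appointments")
fig.tight_layout()
```

To look at the result, follow this with fig.savefig("focused_profile.png") or, in a notebook, display fig directly.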

This is the profile that matters. Not because the other variables are unimportant — they might matter later — but because these are the variables the client asked about. The analysis specification exists for exactly this reason: it tells you what to focus on so you do not waste time profiling everything and understanding nothing.

The difference between what you got in Step 4 and what you got here is the difference between giving AI a task and giving AI a task with context. The prompt you wrote — which variables to show, how to show them — is the context. When you leave it out, AI substitutes its own defaults. When you include it, you get output that connects to the question you are actually trying to answer.

✓ Check: The appointment_status column should have three categories (Show, No-show, Cancelled). The overall distribution should show the majority as Show. Missing values should be minimal (under 2%) if present at all.