Learn by Directing AI
Unit 2

Clean the data

Step 1: Read the preparation requirements

Open materials/project-plan.md, Section 2 (Data Preparation). This section tells you what the data needs before a model can use it: missing values handled, types correct, decisions documented.

In P1, the data was clean. Grace is organized. But this extended dataset includes three new months, and the data dictionary notes that a temporary receptionist covered during part of that period. The data quality may not be what you are used to.

Step 2: Load and check the dataset

Direct AI to load materials/appointments-extended.csv and show the shape, first few rows, and dtypes. Check against materials/data-dictionary.md.

The dataset should have approximately 9,500 rows and the same 8 columns from P1: date, time_slot, day_of_week, visit_type, pet_species, client_tenure, appointment_status, appointment_length.

Verify that the column names match and that the dtypes make sense for each column. This is the same first step as P1: check the data against its documentation before computing anything.
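As a rough idea of what this load-and-check step might look like in pandas, here is a minimal sketch. It reads a three-row stand-in built in memory so it runs on its own; the row values are invented for illustration and are not taken from the real file. In the project, the same calls would point at materials/appointments-extended.csv instead.

```python
import io

import pandas as pd

# The 8 column names listed in this step.
EXPECTED_COLUMNS = [
    "date", "time_slot", "day_of_week", "visit_type", "pet_species",
    "client_tenure", "appointment_status", "appointment_length",
]

# Hypothetical stand-in for materials/appointments-extended.csv.
# The values below are invented; only the header matches the lesson.
sample_csv = io.StringIO(
    "date,time_slot,day_of_week,visit_type,pet_species,"
    "client_tenure,appointment_status,appointment_length\n"
    "2024-01-08,09:00,Monday,checkup,dog,new,completed,30\n"
    "2024-01-08,09:30,Monday,vaccination,cat,returning,no_show,15\n"
    "2024-01-09,10:00,Tuesday,checkup,dog,returning,completed,30\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)    # (rows, columns)
print(df.head())   # first few rows
print(df.dtypes)   # one dtype per column

# Compare the file's columns with the documented ones.
missing_cols = sorted(set(EXPECTED_COLUMNS) - set(df.columns))
unexpected_cols = sorted(set(df.columns) - set(EXPECTED_COLUMNS))
print("documented but absent:", missing_cols)
print("present but undocumented:", unexpected_cols)
```

Both lists should come back empty when the file matches its documentation. Note that read_csv loads date as text unless told otherwise, which is worth spotting here because Step 6 depends on it.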

Step 3: Find the missing values

Direct AI to check for missing values across all columns. Show which columns have them, how many, and what percentage.
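One possible shape for that per-column summary is sketched below. The four-row frame is a made-up example, not the real dataset; the point is the pairing of isna().sum() for counts with isna().mean() for rates.

```python
import numpy as np
import pandas as pd

# Invented four-row example; the real dataset has far more rows.
df = pd.DataFrame({
    "visit_type": ["checkup", None, "surgery", "checkup"],
    "pet_species": ["dog", "cat", None, None],
    "appointment_length": [30.0, 45.0, np.nan, 30.0],
})

missing_count = df.isna().sum()                   # blanks per column
missing_pct = (df.isna().mean() * 100).round(1)   # same, as a percentage

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only columns that actually have blanks, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```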

P1's data had none. This dataset will have some. The question is not just "how many" but "where." Are the missing values scattered randomly, or concentrated in a specific time period?

Direct AI to show missing values by month. If they cluster in the newer months, that matches the data dictionary's note about the temporary receptionist. If they are spread across the full dataset, something else is going on.
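A by-month breakdown could be sketched as follows, assuming the date column parses cleanly. The six rows are invented so that the blanks cluster in the last month, which is the pattern the data dictionary's note would predict.

```python
import pandas as pd

# Invented example: blanks are placed in the later months on purpose.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-19", "2024-02-02",
             "2024-02-16", "2024-03-01", "2024-03-15"],
    "visit_type": ["checkup", "surgery", "checkup", None, None, "checkup"],
    "pet_species": ["dog", "cat", "dog", "cat", None, None],
})

# A "YYYY-MM" label per row, used as the grouping key.
month = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")

cols = ["visit_type", "pet_species"]
is_missing = df[cols].isna()

missing_by_month = is_missing.groupby(month).sum()          # counts
rate_by_month = is_missing.groupby(month).mean().round(2)   # share of rows

print(missing_by_month)
print(rate_by_month)
```

Looking at rates as well as counts matters when months have different numbers of appointments: a busy month can have more blanks in absolute terms while still being cleaner in proportion.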

Step 4: Investigate the pattern

Look at which columns have missing values and what the pattern tells you.

AI commonly handles missing data by dropping every row that has a blank field. That is fast and produces a clean dataset. But if missing values are concentrated in specific months, dropping those rows removes a disproportionate share of the data from that period, possibly all of it. The model then has little or nothing to learn from for those months.
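To see how uneven the damage from a blanket drop can be, one option is to compare row counts per month before and after. This is a sketch on invented data with a precomputed month column (a hypothetical helper column, not one of the 8 in the dataset).

```python
import pandas as pd

# Invented example: January is complete, March has blanks.
df = pd.DataFrame({
    "month": ["2024-01"] * 3 + ["2024-03"] * 3,
    "visit_type": ["checkup", "surgery", "checkup", None, "checkup", None],
    "appointment_length": [30, 60, 30, 30, 45, 30],
})

rows_before = df.groupby("month").size()
# dropna() with no arguments removes any row that has at least one blank.
rows_after = (
    df.dropna()
      .groupby("month")
      .size()
      .reindex(rows_before.index, fill_value=0)  # keep months that vanish entirely
)

comparison = pd.DataFrame({
    "before": rows_before,
    "after_dropna": rows_after,
    "share_lost": (1 - rows_after / rows_before).round(2),
})
print(comparison)
```

The overall loss in this toy frame is a third of the rows, but all of it falls on one month. That concentration is what the overall missing-value rate hides.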

Check materials/verification-targets.md for the expected missing value rate.

Step 5: Make cleaning decisions

For each column with missing values, you have options:

  • Drop the rows. Simple, but you lose data. If the missing rows are informative (they represent a specific time period with different patterns), dropping them biases the analysis.
  • Impute with the mean or median. Fills in a reasonable value, but compresses the variance of that column. If the column matters for prediction, mean imputation weakens the model's ability to detect its effect.
  • Impute with the mode (most common value). Works for categorical columns where mean does not apply.
  • Flag missingness as a feature. Create a new column indicating whether the original value was missing. Preserves the information that "this record had incomplete data."

Each choice produces a different dataset, and a different dataset produces different predictions. This is not a technical preliminary. It is analytical work.

Direct AI to handle each column's missing values using the strategy you choose. Be explicit about which strategy and why.
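The four options above can be sketched side by side on a tiny invented frame. Which strategy goes with which column here is an arbitrary choice for illustration, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd

# Invented example with one blank in each column.
df = pd.DataFrame({
    "visit_type": ["checkup", "checkup", None, "surgery"],
    "appointment_length": [30.0, np.nan, 60.0, 30.0],
})

# Option 1: drop every row with a blank.
dropped = df.dropna()

filled = df.copy()

# Option 4: record missingness BEFORE filling, or the information is gone.
filled["appointment_length_missing"] = filled["appointment_length"].isna()

# Option 2: impute a numeric column with its median.
median_length = filled["appointment_length"].median()
filled["appointment_length"] = filled["appointment_length"].fillna(median_length)

# Option 3: impute a categorical column with its most common value.
most_common = filled["visit_type"].mode().iloc[0]
filled["visit_type"] = filled["visit_type"].fillna(most_common)

# Imputing a central value shrinks the column's spread.
std_before = df["appointment_length"].std()
std_after = filled["appointment_length"].std()

print(len(df), "rows originally;", len(dropped), "after dropping")
print(filled)
print("std before:", round(std_before, 2), "after:", round(std_after, 2))
```

The order matters in one place: the flag column has to be created before the fill, because afterwards there are no blanks left to detect.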

Step 6: Convert types

Direct AI to ensure all columns have the correct data types for modeling. Dates should be datetime objects. Categorical columns (visit_type, pet_species, client_tenure) should be stored as categories with consistent labels, ready to be encoded for the model. Numeric columns should be numeric.

Type conversion sounds like housekeeping. It is not. A date stored as a string cannot be sorted chronologically. A categorical column stored as free text cannot be reliably grouped. The model needs types that match what it will do with them.
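A minimal sketch of these conversions in pandas follows. The messy labels in visit_type are invented to show why free text groups badly, and normalizing them with strip and lower is an assumption about what "consistent" means here; the data dictionary is the authority on the real labels.

```python
import pandas as pd

# Invented example: everything arrives as text, labels are inconsistent.
df = pd.DataFrame({
    "date": ["2024-03-01", "2024-01-15", "2024-02-10"],
    "visit_type": ["Checkup", "checkup ", "surgery"],
    "appointment_length": ["30", "45", "60"],
})

# As free text, "Checkup" and "checkup " count as two different groups.
groups_before = df["visit_type"].nunique()

# errors="coerce" turns unparseable entries into NaT/NaN instead of raising,
# so re-check for missing values after converting.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["appointment_length"] = pd.to_numeric(df["appointment_length"], errors="coerce")

# Normalize labels (an assumed cleanup rule), then store as a categorical.
df["visit_type"] = df["visit_type"].str.strip().str.lower().astype("category")
groups_after = df["visit_type"].nunique()

print(df.dtypes)
print("visit_type groups:", groups_before, "->", groups_after)
print(df.sort_values("date"))  # now sorts chronologically
```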

Step 7: Document the preparation

Write a preparation log: what was found (which columns had missing values, how many, what pattern), what was decided (which strategy for each column, and why), and what changed (how many rows remain, what columns were modified or added).

This log is not a formality. It is the evidence that the preparation was deliberate. If someone asks "why does the dataset have 9,200 rows instead of 9,500?" the preparation log answers that question.

Direct AI to save the log. Commit with a meaningful message describing the cleaning decisions.
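One way to keep the log honest is to generate its numbers from the data rather than typing them by hand. The sketch below does that on an invented four-row frame; the decisions and their reasons are hypothetical placeholders, and the plain-text layout is only one possible format.

```python
import numpy as np
import pandas as pd

# Invented raw data with one blank per column.
raw = pd.DataFrame({
    "visit_type": ["checkup", None, "surgery", "checkup"],
    "appointment_length": [30.0, 45.0, np.nan, 30.0],
})

# Hypothetical decisions, written down next to the code that applies them.
decisions = {
    "visit_type": "filled with mode (categorical column)",
    "appointment_length": "dropped rows (placeholder reason)",
}

clean = raw.copy()
clean["visit_type"] = clean["visit_type"].fillna(clean["visit_type"].mode().iloc[0])
clean = clean.dropna(subset=["appointment_length"])

# Build the log from measured values: found, decided, changed.
lines = ["Preparation log", "", "Found:"]
for col, n in raw.isna().sum().items():
    if n > 0:
        lines.append(f"  {col}: {n} missing ({n / len(raw):.1%})")
lines += ["", "Decided:"]
for col, reason in decisions.items():
    lines.append(f"  {col}: {reason}")
lines += ["", "Changed:", f"  rows: {len(raw)} -> {len(clean)}"]

log_text = "\n".join(lines)
print(log_text)
```

Saving it is then a single write of log_text to a file in the project, at whatever path and name you choose.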

✓ Check

Check: All missing values handled (none remaining or explicitly flagged). All dtypes correct for modeling. A preparation log documenting each decision exists.
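The first part of this check can be automated. The helper below is a hypothetical sketch: it assumes flag columns are named by adding "_missing" to the original column name, which is a naming convention chosen for this example, not something the project materials specify.

```python
import numpy as np
import pandas as pd

def unhandled_missing(df, flag_suffix="_missing"):
    """Return columns that still contain blanks and have no companion flag column."""
    return [
        col for col in df.columns
        if df[col].isna().any() and f"{col}{flag_suffix}" not in df.columns
    ]

# Invented example: one column is flagged, one was overlooked.
clean = pd.DataFrame({
    "visit_type": ["checkup", "surgery"],
    "appointment_length": [30.0, np.nan],
    "appointment_length_missing": [False, True],
    "pet_species": ["dog", None],
})

problems = unhandled_missing(clean)
print("still unhandled:", problems)
```

An empty list means every remaining blank is covered by a flag. Whether the dtypes are right and the log exists still needs a human look.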