P2: First Prediction Model -- Muthoni Veterinary Clinic

Client

Wanjiku Muthoni, Owner and Head Veterinarian at Muthoni Veterinary Clinic, Nairobi, Kenya. Same client as P1. The descriptive analysis from P1 is complete -- the no-show rate chart is on the wall behind reception. Now Wanjiku wants prediction: which upcoming appointments are likely to be no-shows?

What you are building

A prediction model for appointment no-shows. Using an extended version of the P1 dataset (21 months, ~9,500 rows), build a linear regression model that predicts no-show probability for each appointment from features such as day of week, time slot, visit type, and client tenure. The deliverable is a list of upcoming appointments ranked by predicted no-show risk, plus a client-facing summary Wanjiku can use for scheduling decisions.
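
A minimal sketch of the fit-and-rank step described above, run on a small synthetic frame rather than the real dataset. The column names (`day_of_week`, `visit_type`, `client_tenure_months`, `no_show`) and the helper `rank_by_risk` are assumptions for illustration; the real column names are defined in data-dictionary.md. Because linear regression is fit to a 0/1 target, its predictions are risk scores that can fall slightly outside 0-1, so the sketch clips them before ranking.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; column names are hypothetical, not taken from data-dictionary.md.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "day_of_week": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], size=n),
    "visit_type": rng.choice(["checkup", "vaccination_followup", "surgery"], size=n),
    "client_tenure_months": rng.integers(0, 60, size=n),
    "no_show": rng.integers(0, 2, size=n),
})


def rank_by_risk(history, upcoming, target="no_show"):
    """Fit a linear model on past appointments and rank upcoming ones by predicted risk."""
    # One-hot encode both frames together so they end up with identical columns.
    combined = pd.concat([history.drop(columns=[target]), upcoming], keys=["hist", "new"])
    encoded = pd.get_dummies(combined, drop_first=True).astype(float)
    X_hist, X_new = encoded.loc["hist"], encoded.loc["new"]

    model = LinearRegression().fit(X_hist, history[target])

    ranked = upcoming.copy()
    # Linear regression on a 0/1 target can predict outside [0, 1]; clip for readability.
    ranked["predicted_risk"] = np.clip(model.predict(X_new), 0.0, 1.0)
    return ranked.sort_values("predicted_risk", ascending=False).reset_index(drop=True)


history = df.iloc[:150]                              # rows with a known outcome
upcoming = df.iloc[150:].drop(columns=["no_show"])   # rows to score
ranked = rank_by_risk(history, upcoming)
print(ranked.head(10))
```

The synthetic target is random noise, so the scores here carry no meaning; the point is only the shape of the pipeline: encode, fit on history, score upcoming rows, sort descending.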

Tech stack

Python 3.11+ (conda "ds" environment)
Jupyter Notebook
pandas
scikit-learn (LinearRegression, train_test_split, metrics)
matplotlib / seaborn
scipy (assumption checking)
Git / GitHub

File structure

materials/
  CLAUDE.md                 -- this file (project governance)
  client-email.md           -- Wanjiku's follow-up email
  project-plan.md           -- prediction pipeline structure
  data-dictionary.md        -- column definitions for the dataset
  appointments-extended.csv -- 21 months of appointment data (~9,500 rows)
  verification-targets.md   -- expected values for all checks
  scripts/
    generate_appointments_extended.py -- dataset generation (not student-facing)

Student-created files (during the project):

Jupyter notebook with the full analysis
Ranked appointment list (CSV or markdown)
Client summary document
Preparation log (cleaning decisions)
Decision record (temporal split)

Key material references

data-dictionary.md -- column contract for the dataset
project-plan.md -- structured prediction pipeline (problem framing through communication)
verification-targets.md -- expected values to check AI output against

Ticket list

T1: Project setup and client follow-up -- download materials, read email, reply to Wanjiku
T2: Data cleaning -- handle missing values, convert types, document all decisions in preparation log
T3: Build prediction model -- linear regression with train/test split, catch leakage
T4: Evaluate model -- naive baseline comparison, coefficient interpretation, residual analysis
T5: Communicate results -- ranked appointment list and client-facing summary
T6: Client delivery, feedback, decision record, commit and push
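
For T3 and T4, a temporal split and a naive-baseline comparison might look like the sketch below. It uses synthetic data with a made-up relationship between tenure and no-shows; `appointment_date`, `client_tenure_months`, and the helper `temporal_split` are hypothetical names, and the 80/20 cut is an assumption, not a value from project-plan.md.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def temporal_split(df, date_col, test_fraction=0.2):
    """Train on the earliest rows and test on the most recent ones (no shuffling)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "appointment_date": pd.date_range("2023-01-01", periods=n, freq="D"),
    "client_tenure_months": rng.integers(0, 60, size=n),
})
# Made-up relationship for the demo: shorter tenure means more no-shows.
p_no_show = 0.35 - 0.004 * df["client_tenure_months"]
df["no_show"] = (rng.random(n) < p_no_show).astype(int)

train, test = temporal_split(df, "appointment_date")
features = ["client_tenure_months"]

model = LinearRegression().fit(train[features], train["no_show"])
# Naive baseline: always predict the training-set mean no-show rate.
baseline = DummyRegressor(strategy="mean").fit(train[features], train["no_show"])

r2_model = r2_score(test["no_show"], model.predict(test[features]))
r2_baseline = r2_score(test["no_show"], baseline.predict(test[features]))
print(f"temporal-split R^2: model={r2_model:.3f}, naive baseline={r2_baseline:.3f}")
```

A constant prediction can never score above zero R-squared on the test set, which is why the baseline target is "near 0". Swapping `temporal_split` for scikit-learn's shuffled `train_test_split` on the real data is the comparison that should expose the inflated score.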

Verification targets

See verification-targets.md for all expected values. Key targets:

Temporal-split R-squared: 0.10-0.20
Random-split R-squared: >0.30 (leakage signal -- a score this high is inflated and wrong; catch it and use the temporal split instead)
Naive baseline R-squared: near 0
Vaccination follow-ups: highest predicted no-show risk
Missing values: 3-5% in last 3 months only
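
One way to check the missing-values target is to compute the share of incomplete rows per calendar month. The sketch below does this on synthetic data; `appointment_date`, `time_slot`, and the helper `missing_rate_by_month` are hypothetical names, and the injected gaps are only there to give the function something to find.

```python
import numpy as np
import pandas as pd


def missing_rate_by_month(df, date_col):
    """Share of rows per calendar month that have at least one missing value."""
    has_missing = df.drop(columns=[date_col]).isna().any(axis=1)
    months = df[date_col].dt.to_period("M")
    return has_missing.groupby(months).mean()


# Synthetic stand-in data covering six months; column names are hypothetical.
dates = pd.date_range("2024-01-01", "2024-06-30", freq="D")
df = pd.DataFrame({"appointment_date": dates, "time_slot": "morning"})

# Blank out every 25th row in the final three months (April to June) only.
recent = df.index[df["appointment_date"] >= "2024-04-01"]
df.loc[recent[::25], "time_slot"] = np.nan

rates = missing_rate_by_month(df, "appointment_date")
print(rates.round(3))
```

On the real dataset, the months before the final three should show a rate of zero, and the exact expected figures are in verification-targets.md.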

Commit convention

Commit after each ticket with a meaningful message describing what was done and verified. Examples:

"Clean dataset -- 247 missing values handled, preparation log complete"
"Add temporal train/test split -- verified against target, leakage corrected"
"Generate ranked appointment list -- temporal model predictions, top 10 highest risk"