P2: First Prediction Model -- Muthoni Veterinary Clinic
Client
Wanjiku Muthoni, Owner and Head Veterinarian at Muthoni Veterinary Clinic, Nairobi, Kenya. Same client as P1. The descriptive analysis from P1 is complete -- the no-show rate chart is on the wall behind reception. Now Wanjiku wants prediction: which upcoming appointments are likely to be no-shows?
What you are building
A prediction model for appointment no-shows. Using an extended version of the same dataset (21 months, ~9,500 rows), build a linear regression model that predicts no-show probability for each appointment based on features like day of week, time slot, visit type, and client tenure. The deliverable is a ranked list of upcoming appointments by predicted no-show risk, plus a client-facing summary Wanjiku can use for scheduling decisions.
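A minimal sketch of the fit-predict-rank flow described above, on synthetic data. The column names (`day_of_week`, `visit_type`, `client_tenure_months`, `no_show`) are hypothetical stand-ins; the real names are defined in data-dictionary.md. Clipping predictions to the 0-1 range is an assumption added here, since a linear regression on a 0/1 target can produce values outside that range.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the appointment history.
# Column names are hypothetical; see data-dictionary.md for the real ones.
rng = np.random.default_rng(0)
n = 200
history = pd.DataFrame({
    "day_of_week": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], size=n),
    "visit_type": rng.choice(["checkup", "vaccination_followup", "surgery"], size=n),
    "client_tenure_months": rng.integers(0, 60, size=n),
})
history["no_show"] = (rng.random(n) < 0.2).astype(int)  # 1 = no-show

CATEGORICAL = ["day_of_week", "visit_type"]
FEATURES = CATEGORICAL + ["client_tenure_months"]

def make_features(df, columns=None):
    """One-hot encode categoricals; optionally align to the training columns."""
    X = pd.get_dummies(df[FEATURES], columns=CATEGORICAL, dtype=float)
    if columns is not None:
        # Categories absent from this batch become all-zero columns.
        X = X.reindex(columns=columns, fill_value=0.0)
    return X

X_train = make_features(history)
model = LinearRegression().fit(X_train, history["no_show"])

# Pretend these ten rows are upcoming appointments (no outcome known yet).
upcoming = (
    history.drop(columns="no_show")
    .sample(10, random_state=1)
    .reset_index(drop=True)
)
X_upcoming = make_features(upcoming, columns=X_train.columns)

# Assumption: clip raw predictions so they can be read as a 0-1 risk score.
upcoming["predicted_risk"] = np.clip(model.predict(X_upcoming), 0.0, 1.0)

# Highest predicted risk first: this is the shape of the ranked deliverable.
ranked = upcoming.sort_values("predicted_risk", ascending=False).reset_index(drop=True)
print(ranked)
```

In the real notebook the model would be fit on the training portion of appointments-extended.csv only, and the ranked frame could be written out with `ranked.to_csv(...)`.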
Tech stack
- Python 3.11+ (conda "ds" environment)
- Jupyter Notebook
- pandas
- scikit-learn (LinearRegression, train_test_split, metrics)
- matplotlib / seaborn
- scipy (assumption checking)
- Git / GitHub
File structure
materials/
  CLAUDE.md -- this file (project governance)
  client-email.md -- Wanjiku's follow-up email
  project-plan.md -- prediction pipeline structure
  data-dictionary.md -- column definitions for the dataset
  appointments-extended.csv -- 21 months of appointment data (~9,500 rows)
  verification-targets.md -- expected values for all checks
scripts/
  generate_appointments_extended.py -- dataset generation (not student-facing)
Student-created files (during the project):
- Jupyter notebook with the full analysis
- Ranked appointment list (CSV or markdown)
- Client summary document
- Preparation log (cleaning decisions)
- Decision record (temporal split)
Key material references
- data-dictionary.md -- column contract for the dataset
- project-plan.md -- structured prediction pipeline (problem framing through communication)
- verification-targets.md -- expected values to check AI output against
Ticket list
- T1: Project setup and client follow-up -- download materials, read email, reply to Wanjiku
- T2: Data cleaning -- handle missing values, convert types, document all decisions in preparation log
- T3: Build prediction model -- linear regression with train/test split, catch leakage
- T4: Evaluate model -- naive baseline comparison, coefficient interpretation, residual analysis
- T5: Communicate results -- ranked appointment list and client-facing summary
- T6: Client delivery, feedback, decision record, commit and push
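The difference between the random split and the temporal split in T3 can be sketched as follows, on synthetic data. The column name `appointment_date` and the 80/20 cutoff are assumptions for illustration; neither is specified in this file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic appointments spread over roughly 21 months.
# `appointment_date` is a hypothetical column name.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=630, freq="D")
df = pd.DataFrame({
    "appointment_date": dates[rng.integers(0, len(dates), size=1000)],
    "no_show": rng.integers(0, 2, size=1000),
})

# Random split: rows are shuffled, so training rows can be dated
# after test rows.
rand_train, rand_test = train_test_split(df, test_size=0.2, random_state=0)

# Temporal split: sort by date, train on the earliest 80% of rows,
# test on the most recent 20% (the 80/20 ratio is an assumption).
df_sorted = df.sort_values("appointment_date").reset_index(drop=True)
cut = int(len(df_sorted) * 0.8)
temp_train = df_sorted.iloc[:cut]
temp_test = df_sorted.iloc[cut:]

print("random:   train max", rand_train["appointment_date"].max(),
      "| test min", rand_test["appointment_date"].min())
print("temporal: train max", temp_train["appointment_date"].max(),
      "| test min", temp_test["appointment_date"].min())
```

With the temporal split, no training row is dated after the earliest test row, which mirrors how the model will be used: fit on past appointments, applied to future ones.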
Verification targets
See verification-targets.md for all expected values. Key targets:
- Temporal-split R-squared: 0.10-0.20
- Random-split R-squared: >0.30 (leakage signal -- this score is inflated and must not be reported; catch it and use the temporal split instead)
- Naive baseline R-squared: near 0
- Vaccination follow-ups: highest predicted no-show risk
- Missing values: 3-5%, in the last 3 months only
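As a rough illustration of why the naive baseline target is "near 0": if the baseline is taken to be "always predict the training-set mean" (an assumption; verification-targets.md defines the actual baseline), its test R-squared is zero or slightly negative by construction. The sketch below uses synthetic 0/1 outcomes and scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic 0/1 no-show outcomes with roughly a 20% rate (illustrative only).
rng = np.random.default_rng(0)
y_train = (rng.random(800) < 0.2).astype(int)
y_test = (rng.random(200) < 0.2).astype(int)

# The mean-predicting baseline ignores features, so placeholder inputs suffice.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Assumption: "naive baseline" means predicting the training mean everywhere.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))

# A constant prediction can never beat the test-set mean, so R-squared <= 0;
# it is only slightly negative when train and test means are close.
print(f"Naive baseline R-squared: {baseline_r2:.4f}")
```

The model from T3 is worth keeping only if its temporal-split R-squared clearly exceeds this baseline value.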
Commit convention
Commit after each ticket with a meaningful message describing what was done and verified. Examples:
- "Clean dataset -- 247 missing values handled, preparation log complete"
- "Add temporal train/test split -- verified against target, leakage corrected"
- "Generate ranked appointment list -- temporal model predictions, top 10 highest risk"