CLAUDE.md

P2: First Prediction Model -- Muthoni Veterinary Clinic

Client

Wanjiku Muthoni, Owner and Head Veterinarian at Muthoni Veterinary Clinic, Nairobi, Kenya. Same client as P1. The descriptive analysis from P1 is complete -- the no-show rate chart is on the wall behind reception. Now Wanjiku wants prediction: which upcoming appointments are likely to be no-shows?

What you are building

A prediction model for appointment no-shows. Using an extended version of the P1 dataset (21 months, ~9,500 rows), build a linear regression model that predicts a no-show probability for each appointment from features such as day of week, time slot, visit type, and client tenure. The deliverable is a list of upcoming appointments ranked by predicted no-show risk, plus a client-facing summary Wanjiku can use for scheduling decisions.
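As a rough illustration of the deliverable, the sketch below fits a linear regression on a toy history and ranks a few upcoming appointments by predicted risk. The column names (`appointment_id`, `day_of_week`, `visit_type`, `no_show`) and all values are assumptions made for this example; the real columns are defined in data-dictionary.md. Because linear regression output is not bounded to the 0-1 range, the sketch clips predictions before treating them as a risk score, which is one possible choice rather than a requirement of the project.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with hypothetical column names; not taken from the real dataset.
history = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Tue", "Tue", "Fri", "Fri"],
    "visit_type": ["checkup", "vaccination_followup"] * 3,
    "no_show": [0, 1, 0, 1, 0, 1],
})
upcoming = pd.DataFrame({
    "appointment_id": [101, 102, 103],
    "day_of_week": ["Mon", "Fri", "Tue"],
    "visit_type": ["checkup", "vaccination_followup", "checkup"],
})

feature_cols = ["day_of_week", "visit_type"]

# One-hot encode the categorical features as floats.
X_train = pd.get_dummies(history[feature_cols], dtype=float)

# Align upcoming rows to the training columns so a category that is
# absent from the upcoming list does not change the column layout.
X_new = pd.get_dummies(upcoming[feature_cols], dtype=float).reindex(
    columns=X_train.columns, fill_value=0.0
)

model = LinearRegression().fit(X_train, history["no_show"])

# Clip raw predictions into [0, 1] before presenting them as risk.
upcoming["predicted_risk"] = model.predict(X_new).clip(0, 1)

# Highest predicted risk first.
ranked = upcoming.sort_values("predicted_risk", ascending=False).reset_index(drop=True)
print(ranked)
```

The `reindex` step matters in practice: an upcoming week may lack a category that appeared in training, and the model expects the same columns in the same order.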

Tech stack

  • Python 3.11+ (conda "ds" environment)
  • Jupyter Notebook
  • pandas
  • scikit-learn (LinearRegression, train_test_split, metrics)
  • matplotlib / seaborn
  • scipy (assumption checking)
  • Git / GitHub

File structure

materials/
  CLAUDE.md                 -- this file (project governance)
  client-email.md           -- Wanjiku's follow-up email
  project-plan.md           -- prediction pipeline structure
  data-dictionary.md        -- column definitions for the dataset
  appointments-extended.csv -- 21 months of appointment data (~9,500 rows)
  verification-targets.md   -- expected values for all checks
  scripts/
    generate_appointments_extended.py -- dataset generation (not student-facing)

Student-created files (during the project):

  • Jupyter notebook with the full analysis
  • Ranked appointment list (CSV or markdown)
  • Client summary document
  • Preparation log (cleaning decisions)
  • Decision record (temporal split)
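One input the preparation log might record is how missingness is distributed over time, before any cleaning decision is made. The sketch below is a minimal example under stated assumptions: the helper `missing_rate_by_month` and the column names are hypothetical, and it counts a row as missing if any non-date column is empty, which may differ from the definition used in verification-targets.md.

```python
import pandas as pd

def missing_rate_by_month(df, date_col):
    """Share of rows with at least one missing value, per calendar month."""
    has_missing = df.drop(columns=[date_col]).isna().any(axis=1)
    months = df[date_col].dt.to_period("M")
    return has_missing.groupby(months).mean()

# Toy frame with hypothetical column names; values are illustrative only.
toy = pd.DataFrame({
    "appointment_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
    ),
    "time_slot": ["morning", "afternoon", None, "morning"],
    "client_tenure_months": [12.0, 3.0, 8.0, 24.0],
})

rates = missing_rate_by_month(toy, "appointment_date")
print(rates)
```

On the real data, months with a rate of zero and months with a nonzero rate can then be listed side by side in the log.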

Key material references

  • data-dictionary.md -- column contract for the dataset
  • project-plan.md -- structured prediction pipeline (problem framing through communication)
  • verification-targets.md -- expected values to check AI output against

Ticket list

  • T1: Project setup and client follow-up -- download materials, read email, reply to Wanjiku
  • T2: Data cleaning -- handle missing values, convert types, document all decisions in preparation log
  • T3: Build prediction model -- linear regression with a temporal train/test split; catch the leakage that a random split introduces
  • T4: Evaluate model -- naive baseline comparison, coefficient interpretation, residual analysis
  • T5: Communicate results -- ranked appointment list and client-facing summary
  • T6: Client delivery, feedback, decision record, commit and push
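The temporal split referred to in T3 can be sketched as follows. This is a minimal example, not the project's prescribed implementation: the helper `temporal_split`, the column name `appointment_date`, and the cutoff date are all assumptions chosen for illustration. The idea is that every training row predates every test row, which a random split does not guarantee.

```python
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Rows strictly before `cutoff` go to train; rows on or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    ordered = df.sort_values(date_col)
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test

# Toy frame: one row per month for 21 months (hypothetical column names).
appointments = pd.DataFrame({
    "appointment_date": pd.date_range("2023-01-01", periods=21, freq="MS"),
    "no_show": [0, 1] * 10 + [0],
})

# Hold out the final three months as the test period.
train, test = temporal_split(appointments, "appointment_date", "2024-07-01")

# Sanity check: no training row is later than any test row.
assert train["appointment_date"].max() < test["appointment_date"].min()
print(len(train), len(test))
```

The choice of cutoff is a judgment call and belongs in the decision record.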

Verification targets

See verification-targets.md for all expected values. Key targets:

  • Temporal-split R-squared: 0.10-0.20
  • Random-split R-squared: >0.30 (leakage signal -- the score is inflated by the wrong split; catch it and use the temporal split)
  • Naive baseline R-squared: near 0
  • Vaccination follow-ups: highest predicted no-show risk
  • Missing values: 3-5% in last 3 months only
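To make the naive-baseline target concrete, the sketch below scores a model that always predicts the training-period mean, using scikit-learn's `DummyRegressor` and `r2_score`. The toy outcome values are invented for illustration, and the helper `in_target_range` is a hypothetical convenience for checking a score against the 0.10-0.20 band listed above. A constant prediction cannot beat the test-set mean, so its R-squared on held-out data is zero or slightly negative.

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy 0/1 outcomes standing in for the real no-show column.
y_train = [0, 0, 0, 1] * 50      # 25% no-shows in the training period
y_test = [0, 0, 0, 0, 1] * 20    # 20% no-shows in the test period

# DummyRegressor ignores the features, so placeholder columns are enough.
X_train = [[0]] * len(y_train)
X_test = [[0]] * len(y_test)

# Naive baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))

def in_target_range(r2, low=0.10, high=0.20):
    """True when an R-squared value falls inside the stated target band."""
    return low <= r2 <= high

# Slightly negative here, because the test-period rate differs
# from the training-period rate.
print(round(baseline_r2, 4))
print(in_target_range(0.15), in_target_range(0.35))
```

A fitted model is only worth reporting to the client if its temporal-split score clearly exceeds this baseline.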

Commit convention

Commit after each ticket with a meaningful message describing what was done and verified. Examples:

  • "Clean dataset -- 247 missing values handled, preparation log complete"
  • "Add temporal train/test split -- verified against target, leakage corrected"
  • "Generate ranked appointment list -- temporal model predictions, top 10 highest risk"