
Prediction Project Plan

This document structures the prediction pipeline. Work through each stage in order -- each builds on the previous.

1. Problem framing

Wanjiku's P1 question was descriptive: "what are the no-show patterns?" P2's question is predictive: "which of tomorrow's appointments are likely to be no-shows?"

These are fundamentally different kinds of questions. A descriptive analysis looks backward -- it summarizes what happened. A prediction looks forward -- it estimates what will happen. The data is the same, but the approach changes completely:

  • How you prepare the data: Missing values must be handled because the model cannot process them. In P1, you reported missing values. In P2, you decide what to do about them.
  • How you evaluate the result: Prediction uses holdout performance -- R-squared and RMSE on data the model has not seen. The descriptive analysis in P1 used confidence intervals and hypothesis tests.
  • What "correct" means: A prediction model is correct if it predicts well on future data, not if it fits the past well.
  • What you communicate: Wanjiku does not need R-squared. She needs a ranked list of risky appointments and an honest answer to "how accurate is this?"

2. Data preparation

The extended dataset has 21 months of appointment data (~9,500 rows). The first 18 months are the same data from P1. The last 3 months are new -- added since the P1 analysis.

Requirements:

  • Load the dataset and verify it against the data dictionary
  • Check for missing values -- the new months may have data quality issues
  • For each column with missing values, decide: drop the rows, impute (mean, median, mode), or flag missingness as a separate feature
  • Document every cleaning decision and why you made it
  • Convert data types: dates to datetime, categoricals properly encoded
  • Save a preparation log listing what was found, what was decided, and how many rows remain

Every cleaning decision changes the dataset the model will see. Different decisions produce different predictions. This is not a technical preliminary -- it is analytical work.
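The audit-decide-log loop above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the real dataset: the column names `appointment_date` and `no_show`, the values, and the specific imputation choices are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the appointment data (columns and values assumed).
df = pd.DataFrame({
    "appointment_date": ["2024-01-03", "2024-01-04", None, "2024-01-05", "2024-01-06"],
    "visit_type": ["vaccination", None, "checkup", "checkup", "checkup"],
    "client_tenure": [2.0, np.nan, 5.0, 1.0, 4.0],
    "no_show": [1, 0, 0, 1, 0],
})

log = []
rows_before = len(df)

# Decision 1: drop rows with no appointment date -- undated rows cannot be
# placed in a temporal train/test split later.
df = df.dropna(subset=["appointment_date"])
log.append(f"Dropped {rows_before - len(df)} rows missing appointment_date")

# Decision 2: flag missingness as its own feature, then impute the
# categorical visit_type with the mode.
df["visit_type_missing"] = df["visit_type"].isna().astype(int)
df["visit_type"] = df["visit_type"].fillna(df["visit_type"].mode()[0])
log.append("Imputed visit_type with mode; added visit_type_missing flag")

# Decision 3: impute the numeric client_tenure with the median.
df["client_tenure"] = df["client_tenure"].fillna(df["client_tenure"].median())
log.append("Imputed client_tenure with median")

# Convert types: dates to datetime.
df["appointment_date"] = pd.to_datetime(df["appointment_date"])

log.append(f"{len(df)} of {rows_before} rows remain")
print("\n".join(log))
```

The printed log doubles as the preparation log the requirements ask for: what was found, what was decided, and how many rows remain.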

3. Model building

Build a linear regression model predicting no-show probability from the available features: day_of_week, time_slot, visit_type, pet_species, client_tenure.

Split the data into training and test sets. The model learns patterns from the training set and is evaluated on the test set (data it has never seen).

Important -- temporal discipline: This dataset has a time dimension. Appointments are dated. When splitting into train and test, think about what the model is allowed to "see." The test set should represent genuinely future data that the model could not have learned from.

Fit the model on the training set. Evaluate on the test set using R-squared and RMSE.
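The split-fit-evaluate sequence, including the temporal cutoff, might look like the sketch below. Only the feature names come from this plan; the synthetic data, the assumed data-generating process, and the 80% cutoff are illustrative choices, and one-hot encoding of the categoricals is one reasonable option among several.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
n = 600

# Synthetic appointments (feature names from the plan; values assumed).
df = pd.DataFrame({
    "appointment_date": pd.date_range("2023-01-01", periods=n, freq="D"),
    "day_of_week": rng.integers(0, 7, n),
    "time_slot": rng.choice(["morning", "afternoon"], n),
    "visit_type": rng.choice(["vaccination", "checkup", "surgery"], n),
    "pet_species": rng.choice(["dog", "cat"], n),
    "client_tenure": rng.uniform(0, 10, n),
})
# Assumed data-generating process, just so the example has signal.
p = (0.2 + 0.15 * (df["visit_type"] == "vaccination")
         - 0.10 * (df["time_slot"] == "morning")
         - 0.01 * df["client_tenure"])
df["no_show"] = rng.binomial(1, p.clip(0.01, 0.99))

# Temporal discipline: train on everything before the cutoff date,
# test on the later stretch the model could not have "seen."
cutoff = df["appointment_date"].quantile(0.8)
train = df[df["appointment_date"] <= cutoff]
test = df[df["appointment_date"] > cutoff]

features = ["day_of_week", "time_slot", "visit_type", "pet_species", "client_tenure"]
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features]).reindex(columns=X_train.columns, fill_value=0)

model = LinearRegression().fit(X_train, train["no_show"])
pred = model.predict(X_test)

rmse = mean_squared_error(test["no_show"], pred) ** 0.5
r2 = r2_score(test["no_show"], pred)
print(f"Test RMSE: {rmse:.3f}  R-squared: {r2:.3f}")
```

Note the `reindex` on the test dummies: it guarantees the test matrix has exactly the training columns even if a category is absent from the later months.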

4. Evaluation

A model's R-squared is just a number until you have something to compare it against.

  • Naive baseline: What would happen if you predicted the overall no-show rate for every appointment (no model, just the average)? Compute the baseline R-squared and RMSE. The model is only useful if it meaningfully outperforms this baseline.
  • Coefficient interpretation: Look at the model's coefficients. Do they make sense? Vaccination follow-ups should increase no-show risk. Morning slots should decrease it. Returning clients should decrease it. If a coefficient has the wrong sign or an implausible magnitude, something is off.
  • Residual analysis: Plot the residuals (actual minus predicted). Are they randomly scattered, or is there a pattern? Systematic patterns mean the model is missing something.
  • Metric selection: R-squared and RMSE are the standard starting metrics. RMSE tells you how far off the model typically is in practical units.

The same regression model could be used for prediction ("what will happen?") or for inference ("what drives no-shows?"). The evaluation is different for each purpose. For prediction: RMSE on the holdout. For inference: coefficient significance and assumption checks. Be clear about which purpose you are evaluating.
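The baseline comparison can be written in a few lines. The sketch below uses made-up actuals and model predictions (nothing here comes from the real model); it shows that predicting the overall rate for every appointment yields an R-squared of exactly zero, which is the bar the model has to clear.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed stand-ins for test-set outcomes and model predictions.
actual = rng.binomial(1, 0.25, 500).astype(float)
model_pred = np.clip(0.25 + 0.3 * (actual - 0.25) + rng.normal(0, 0.1, 500), 0, 1)

# Naive baseline: predict the overall no-show rate for every appointment.
baseline_pred = np.full_like(actual, actual.mean())

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def r_squared(y, p):
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

print(f"Baseline  RMSE {rmse(actual, baseline_pred):.3f}  R2 {r_squared(actual, baseline_pred):.3f}")
print(f"Model     RMSE {rmse(actual, model_pred):.3f}  R2 {r_squared(actual, model_pred):.3f}")

# Residual check: systematic structure in (actual - predicted)
# means the model is missing something.
residuals = actual - model_pred
```

Predicting the mean makes the residual sum of squares equal the total sum of squares, so the baseline R-squared is zero by construction; any useful model must land meaningfully above it.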

5. Communication

Wanjiku does not want a coefficient table or an R-squared number. She wants:

  1. A ranked list of upcoming appointments by predicted no-show probability, with the highest-risk appointments at the top. This is what Grace uses for reminder calls.
  2. A summary explaining: what the model does, how accurate it is (in terms Wanjiku understands -- not R-squared, but "the model is typically off by about X percentage points"), and what the ranked list means for her scheduling.

Translate the model's performance into a practical answer for her associate vet: "how accurate is this?" The answer should be honest -- what the model gets right, where it is uncertain, and what it cannot predict.
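The deliverable described above can be sketched as a ranked table plus one plain-language sentence. Everything concrete in this example is invented for illustration: the client names, the predicted probabilities, and the RMSE figure are hypothetical placeholders, not outputs of any real model.

```python
import pandas as pd

# Assumed: tomorrow's appointments with model-predicted no-show probabilities.
upcoming = pd.DataFrame({
    "client": ["A. Mwangi", "B. Otieno", "C. Njeri", "D. Kamau"],
    "time_slot": ["morning", "afternoon", "afternoon", "morning"],
    "predicted_no_show": [0.12, 0.61, 0.34, 0.08],
})

# Ranked list for the reminder calls: highest-risk appointments first.
ranked = upcoming.sort_values("predicted_no_show", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))

# Translate holdout performance into plain language (RMSE value assumed).
test_rmse = 0.38
print(f"The model's risk estimates are typically off by about "
      f"{test_rmse * 100:.0f} percentage points.")
```

The second `print` is the kind of sentence Wanjiku needs: the same holdout RMSE, restated in percentage points rather than as a metric name.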