Learn by Directing AI
Unit 4

Evaluate and interpret

Step 1: Compute the naive baseline

The model's R-squared is a number. On its own, it tells you nothing. Is 0.15 good? Bad? Useless?

You need a reference point. Direct AI to compute a naive baseline: predict the overall no-show rate (the mean) for every appointment. No model. No features. Just the average.

Calculate the baseline's R-squared and RMSE. The baseline should have R-squared near zero -- predicting the same number for every appointment explains almost none of the variation.
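A minimal sketch of that baseline, using NumPy with synthetic no-show labels standing in for the real appointment data (the ~25% no-show rate here is a made-up placeholder):

```python
import numpy as np

# Hypothetical no-show labels (1 = no-show, 0 = showed up); in practice
# these come from the clinic's appointment data.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.25, size=500)  # placeholder ~25% no-show rate

# Naive baseline: predict the overall mean for every appointment.
baseline_pred = np.full_like(y, y.mean(), dtype=float)

# RMSE of the baseline.
rmse = np.sqrt(np.mean((y - baseline_pred) ** 2))

# R-squared = 1 - SS_res / SS_tot. When the prediction is the mean of the
# same data, SS_res equals SS_tot, so R-squared is exactly zero here.
ss_res = np.sum((y - baseline_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"Baseline RMSE: {rmse:.3f}, R-squared: {r2:.3f}")
```

On a held-out test set the baseline's R-squared lands near zero rather than exactly zero, because the training mean is not quite the test mean.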

Step 2: Compare against the model

Now compare: your model's R-squared against the baseline's R-squared.

The model should meaningfully outperform the mean prediction. If the model explains 15% of the variance and the baseline explains 0%, that is real improvement. Without this comparison, you cannot judge whether the model is useful. A model that looks modest might be a substantial improvement over nothing.
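The comparison can be sketched with scikit-learn. The features and effect sizes below are hypothetical stand-ins for the clinic data; the point is the side-by-side R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic appointments; real features come from the dataset.
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))  # stand-ins for appointment features
p = 0.25 + 0.10 * X[:, 0] - 0.05 * X[:, 1] - 0.08 * X[:, 2]
y = rng.binomial(1, np.clip(p, 0, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model: linear regression on the features.
model = LinearRegression().fit(X_train, y_train)
model_r2 = r2_score(y_test, model.predict(X_test))

# Baseline: predict the training-set mean for every test appointment.
baseline_r2 = r2_score(y_test, np.full(len(y_test), y_train.mean()))

print(f"Model R^2: {model_r2:.3f}  Baseline R^2: {baseline_r2:.3f}")
```

The gap between the two numbers, not the model's R-squared alone, is what tells you the features carry signal.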

Check this against materials/verification-targets.md.

Step 3: Interpret the coefficients

Direct AI to show the model's coefficients. These are the model's claims about the relationship between each feature and no-show probability.

Read them for domain sense. Does each coefficient match what you know about the data?

  • Vaccination follow-ups should increase no-show risk (positive coefficient).
  • Morning time slots should decrease risk (negative coefficient) -- people who book mornings tend to show up.
  • Returning clients should decrease risk (negative coefficient) -- repeat clients are more reliable.

If a coefficient has the wrong sign or an implausible magnitude, something is off. AI optimizes for statistical fit, not for whether the result makes sense in Wanjiku's clinic. Domain knowledge is the check that catches what statistics alone cannot.
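One way to lay the coefficients out for a domain-sense read, pairing each with its feature name. The feature names and synthetic data here are hypothetical; the real design matrix comes from the prepared clinic dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical binary features standing in for the real ones.
features = ["is_vaccination", "is_morning", "is_returning_client"]
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(800, 3)).astype(float)
p = 0.20 + 0.12 * X[:, 0] - 0.06 * X[:, 1] - 0.09 * X[:, 2]
y = rng.binomial(1, np.clip(p, 0, 1))

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name and sort for reading.
coefs = pd.Series(model.coef_, index=features).sort_values()
print(coefs)
print(f"Intercept: {model.intercept_:.3f}")
```

Read down the list and check each sign against the expectations above before trusting any of the magnitudes.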

Step 4: The dual nature of regression

The same regression model can serve two different purposes. "What predicts no-shows?" is a prediction question -- you evaluate it by how well the model performs on unseen data (RMSE, R-squared on the test set). "What drives no-shows?" is an inference question -- you evaluate it by whether the coefficients are statistically significant and the assumptions hold.

AI often conflates these. It might report coefficient p-values when you asked for prediction accuracy, or report RMSE when you wanted to know which features matter most. Be clear about which question you are answering. For Wanjiku, the primary question is prediction: she wants a ranked list, not a research paper about no-show causes.
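The two framings can be made concrete on one fitted model: the same OLS fit yields an RMSE (the prediction question) and coefficient p-values (the inference question). This sketch computes the p-values by hand from the least-squares fit, on hypothetical synthetic data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: binary no-show outcome, two binary features.
rng = np.random.default_rng(3)
n = 800
X = rng.integers(0, 2, size=(n, 2)).astype(float)
y = rng.binomial(1, np.clip(0.20 + 0.12 * X[:, 0] - 0.08 * X[:, 1], 0, 1))

# Fit OLS by least squares (design matrix with an intercept column).
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Prediction framing: how large are the errors?
# (In-sample for brevity; use a held-out split in practice.)
rmse = np.sqrt(np.mean(resid ** 2))

# Inference framing: are coefficients distinguishable from zero?
dof = n - A.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
p_values = 2 * stats.t.sf(np.abs(beta / se), dof)

print(f"RMSE (prediction question): {rmse:.3f}")
print(f"p-values (inference question): {np.round(p_values, 4)}")
```

Both numbers come from one fit; which one you report depends on which question you are answering.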

Step 5: Check the residuals

Direct AI to plot the residuals -- the differences between the actual values and the model's predictions. A residual plot shows what the model missed.

If the residuals scatter randomly around zero, the model has captured what it can from the available features. If there are patterns -- a curve, a fan shape, clusters of large errors in one range -- the model is missing something systematic.

Direct AI to check for patterns. Random scatter is what you want. Systematic structure means the model could be improved, but for a first linear regression on this data, moderate scatter is expected.
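A minimal residual plot with matplotlib, again on hypothetical synthetic data in place of the clinic appointments:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data for the clinic appointments.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = rng.binomial(1, np.clip(0.25 + 0.10 * X[:, 0], 0, 1)).astype(float)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
residuals = y - pred  # actual minus predicted

# Residuals vs. predicted values: random scatter around zero is the goal;
# curves, fans, or clusters of large errors signal systematic misses.
fig, ax = plt.subplots()
ax.scatter(pred, residuals, alpha=0.4, s=12)
ax.axhline(0, linestyle="--")
ax.set_xlabel("Predicted no-show probability")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residuals vs. predicted")
fig.savefig("residuals.png")
```

Note that with a binary outcome the points fall in two bands (one for no-shows, one for shows), so expect visible structure from the outcome itself; the pattern to worry about is structure relative to the predicted value.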

Step 6: Describe the residuals

Direct AI to describe what the residual plot shows in plain language. "The residuals are roughly randomly scattered with slightly wider spread at higher predicted values" is a diagnostic observation. It tells you the model is working reasonably but is less precise for appointments the model considers high-risk.

This is the beginning of professional diagnostic communication. Describing what a plot shows -- not just generating it -- is how you articulate confidence in the result.

Direct AI to self-review: "Check whether the coefficient signs match what we would expect for a veterinary clinic in Nairobi with the no-show patterns we found in P1." This is a specific self-review prompt. A vague prompt like "does this look right?" would produce reassurance, not analysis.

✓ Check

Check: Naive baseline R-squared should be near zero. Model R-squared should be meaningfully higher. Coefficients should have domain-plausible signs. Residual plot generated and interpreted.