Learn by Directing AI
Unit 3

Drift Detection

Step 1: The drift taxonomy

The CI/CD pipeline gates model quality -- it blocks bad models from deploying. But what about a good model that stops being good because the world changed?

Three kinds of drift matter:

Data drift means the inputs have changed. The features coming into the model in production look different from the features in the training data. The merged hospitals standardized their requirements. The Coimbatore hospitals switched to 12-hour shifts. The regional distribution of applicants shifted. All of these are data drift -- the input distributions changed.

Concept drift means the relationship between inputs and outputs has changed. Even if the features look the same, the correct answer has changed. A nurse who was a good match for a hospital six months ago might not be a good match today because the hospital changed what it values.

Prediction drift means the model's outputs have changed -- the distribution of match scores looks different from what it used to produce.

Data drift is what you can detect from input distributions alone. Concept drift requires ground truth labels -- you need to know whether the model's predictions were actually right, and that information may arrive weeks after the prediction was made. For now, you build data drift detection. Concept drift comes later.
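The distinction above can be made concrete with a tiny sketch: data drift shows up in the input distribution itself, with no labels required. The feature values and proportions below are made up purely for illustration.

```python
# A minimal sketch of why data drift is detectable without ground truth:
# compare a feature's distribution in training vs. production.
# Values here are illustrative, not from the real dataset.
from collections import Counter

def distribution(values):
    """Return each category's share of the total."""
    counts = Counter(values)
    total = len(values)
    return {k: v / total for k, v in counts.items()}

# Training baseline: mostly 8-hour shifts.
reference = ["8h"] * 80 + ["12h"] * 20
# Production window: a large swing toward 12-hour shifts.
current = ["8h"] * 45 + ["12h"] * 55

ref_dist = distribution(reference)
cur_dist = distribution(current)

# The shift is visible from the inputs alone -- no labels needed.
delta = {k: cur_dist[k] - ref_dist.get(k, 0.0) for k in cur_dist}
print(delta)  # 12h share grew by ~0.35
```

Detecting concept drift, by contrast, would require knowing whether each prediction turned out to be correct -- information this comparison never touches.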

Step 2: Open the drift configuration template

Open materials/drift-config-template.py. This is an Evidently AI configuration template. It defines: where the reference data is (training baseline), where the current data is (production), which features to monitor, and what thresholds to use.

The template has placeholder sections. You fill in the feature selection, threshold configuration, and subgroup analysis.
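The actual template contents aren't reproduced here, so the following is a hypothetical sketch of the kind of structure such a config defines -- `DRIFT_CONFIG` and its keys are illustrative names, not the template's real ones. The two file paths come from Step 5.

```python
# Hypothetical sketch of a drift-config structure; the real
# materials/drift-config-template.py may be organized differently.
DRIFT_CONFIG = {
    "reference_data": "materials/placement-data-training.csv",   # training baseline
    "current_data": "materials/placement-data-production.csv",   # production window
    "features": [
        # TODO: fill in during Step 3 (which features to monitor)
    ],
    "thresholds": {
        # TODO: configurable per-feature thresholds -- not hard-coded defaults
    },
    "subgroups": {
        # TODO: e.g. slice the drift checks by hospital region
    },
}
```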

Step 3: Select features to monitor

Not all features matter equally. The question is: which features, if they changed, would most affect match quality?

Think about what the model relies on. Shift patterns affect scheduling compatibility. Hospital requirements text drives the matching logic. Minimum experience thresholds filter candidates. Nurse region distribution affects the pool of available matches.

Write explicit constraints for Claude: "Monitor these features individually: shift_pattern, hospital_requirements, min_experience_required, nurse_region, hospital_region. Use configurable thresholds, not hard-coded defaults. Include subgroup-level monitoring by hospital region."

Step 4: Implement with Evidently AI

Direct Claude to implement the drift detection using Evidently AI. Evidently computes statistical drift measures (PSI for categorical features, the KS test for numerical features) that compare the reference and current distributions.
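The two statistics named above can be sketched directly, assuming numpy and scipy are available -- `psi` below is a hand-rolled helper, and the feature values are synthetic stand-ins for the real columns.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(ref_probs, cur_probs, eps=1e-6):
    """Population Stability Index between two categorical distributions."""
    ref = np.asarray(ref_probs) + eps  # eps guards against log(0)
    cur = np.asarray(cur_probs) + eps
    return float(np.sum((cur - ref) * np.log(cur / ref)))

# Categorical feature (e.g. shift_pattern): compare category shares.
ref_shares = [0.80, 0.20]  # 8h vs 12h in training
cur_shares = [0.45, 0.55]  # 8h vs 12h in production
print(psi(ref_shares, cur_shares))  # well above the 0.25 convention

# Numerical feature (e.g. min_experience_required): two-sample KS test.
rng = np.random.default_rng(0)
ref_years = rng.normal(3.0, 1.0, 500)
cur_years = rng.normal(4.5, 1.0, 500)  # requirement crept upward
stat, p_value = ks_2samp(ref_years, cur_years)
print(stat, p_value)  # large statistic, tiny p-value: distributions differ
```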

AI commonly gets monitoring configuration wrong in a specific way: it monitors all features uniformly without considering which ones matter most. A feature that changed but has no impact on predictions is noise. A feature that changed and directly affects match quality is a signal.

Tell Claude to include per-region monitoring. Aggregate drift statistics can hide subgroup-specific changes. The Coimbatore hospitals show a dramatic shift pattern change, but if you only look at the overall shift_pattern distribution, the change is diluted by the 78 hospitals that did not change.
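One way to sketch the subgroup monitoring described above, assuming pandas and the column names from Step 3 (`hospital_region`, `shift_pattern`). `psi_by_region` is an illustrative helper, and the data is synthetic -- chosen so one region's shift would be diluted in the aggregate.

```python
# Subgroup-level drift: compute PSI per hospital region rather than only
# on the aggregate distribution, so a local shift is not averaged away.
import numpy as np
import pandas as pd

def psi(ref_shares, cur_shares, eps=1e-6):
    ref = ref_shares + eps
    cur = cur_shares + eps
    return float(np.sum((cur - ref) * np.log(cur / ref)))

def psi_by_region(ref_df, cur_df, feature, region_col="hospital_region"):
    scores = {}
    for region in ref_df[region_col].unique():
        ref_s = ref_df.loc[ref_df[region_col] == region, feature].value_counts(normalize=True)
        cur_s = cur_df.loc[cur_df[region_col] == region, feature].value_counts(normalize=True)
        # Align categories so ones missing on either side count as zero share.
        ref_s, cur_s = ref_s.align(cur_s, fill_value=0.0)
        scores[region] = psi(ref_s.to_numpy(), cur_s.to_numpy())
    return scores

# Synthetic example: one region flips shift pattern, the rest are unchanged.
ref = pd.DataFrame({
    "hospital_region": ["Coimbatore"] * 20 + ["Chennai"] * 80,
    "shift_pattern": ["8h"] * 100,
})
cur = pd.DataFrame({
    "hospital_region": ["Coimbatore"] * 20 + ["Chennai"] * 80,
    "shift_pattern": ["12h"] * 20 + ["8h"] * 80,
})
print(psi_by_region(ref, cur, "shift_pattern"))
# Coimbatore shows a large PSI; Chennai stays near zero.
```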

Step 5: Run drift detection

Run the detection against materials/placement-data-production.csv using materials/placement-data-training.csv as the reference baseline.

The default PSI threshold of 0.25 is a convention. Is it right for healthcare staffing? A threshold that is too tight triggers constant alerts about normal variation -- alert fatigue that teaches the team to ignore warnings. A threshold that is too loose lets significant degradation go undetected until placements go wrong.

For MedConnect, the cost of a missed drift is bad placements affecting real nurses and hospitals. The cost of a false alarm is investigation time for Priya's operations team. Calibrate accordingly.
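One possible calibration pattern, sketched with made-up numbers: a lower "investigate" level that tolerates some false alarms because the cost is only investigation time, and a higher "alert" level for drift that likely affects placements. `classify_drift` is a hypothetical helper, not part of Evidently, and the threshold values are illustrative, not tuned.

```python
# Two-level, configurable thresholds instead of a single hard-coded cutoff.
THRESHOLDS = {"investigate": 0.10, "alert": 0.25}

def classify_drift(psi_score, thresholds=THRESHOLDS):
    if psi_score >= thresholds["alert"]:
        return "alert"        # likely degradation: act now
    if psi_score >= thresholds["investigate"]:
        return "investigate"  # worth a look, not an emergency
    return "stable"

print(classify_drift(0.05))  # stable
print(classify_drift(0.18))  # investigate
print(classify_drift(0.40))  # alert
```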

Step 6: Verify the implementation

Check the drift report. Features affected by the hospital changes should show elevated drift scores. Features that did not change should show stable scores.

Verify that the implementation monitors subgroups -- not just aggregate distributions. Verify that thresholds are configurable, not hard-coded. Cross-check by running the same data through a different drift calculation approach (manual PSI calculation or a different statistical test) to confirm the results are reasonable.
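The manual PSI cross-check mentioned above can be sketched as follows, assuming numpy: bin the current data using the reference quantiles, then compare bin shares. `psi_numeric` is a hand-rolled helper for an independent sanity check of the library's output, and the data here is synthetic.

```python
# Manual PSI for a numerical feature, as an independent cross-check.
import numpy as np

def psi_numeric(reference, current, bins=10, eps=1e-6):
    # Bin edges from reference quantiles, so each reference bin holds ~10%.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    ref_shares = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_shares = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_shares - ref_shares) * np.log(cur_shares / ref_shares)))

rng = np.random.default_rng(42)
same = rng.normal(3.0, 1.0, 2000)
shifted = rng.normal(4.5, 1.0, 2000)
print(psi_numeric(same, same))     # ~0: identical distributions
print(psi_numeric(same, shifted))  # large: clear drift
```

If this independent calculation broadly agrees with the library's per-feature scores, the implementation is probably sound; a large disagreement means one of the two is misconfigured.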

Share the findings with Priya: the drift monitoring detected that data from the merged hospitals and the Coimbatore hospitals has shifted significantly from what the model was trained on.

Priya responds with relief -- this is exactly what she wanted. Then she asks the practical question: "So what do we do when this happens?"

✓ Check

Check: Run drift detection on the production dataset. The system detects drift in the features affected by the hospital mergers and shift pattern changes while reporting stable features as stable.