The Brief
Priya is back. The matching model you built for MedConnect Staffing is in production and working -- placement times are down 40%. Her team is making better matches faster.
But the world has changed. Three hospitals merged their staffing requirements under a new management group. Two hospitals in Coimbatore switched from 8-hour to 12-hour shift patterns. Her operations team says the match scores for those hospitals feel "off" since the changes.
She has two problems. First: how does she know when the model stops working? Right now the only signal is her team noticing bad matches -- after placements have already been made. Second: her CTO Ravi wants model updates to go through automated review before going live. No bad model should reach production.
Your Role
You are building the infrastructure that governs the model's lifecycle -- a CI/CD pipeline with GitHub Actions that blocks bad models from deploying, and a drift detection system that catches when production data shifts away from what the model was trained on.
Templates provide structure for the GitHub Actions workflow and the drift detection configuration. Guides are gone -- you fill the templates with your own judgment about which metrics to gate on, which features to monitor, what thresholds to set, and what the team should do when an alert fires. You plan the work before starting, using Claude's plan mode to map out the dependency chain.
What's New
Last time you built the matching model itself -- pipelines, transfer learning, a fairness audit. The model was the challenge.
This time the model is the starting point. The challenge is making it reliable in production. CI/CD with GitHub Actions automates evaluation: the pipeline runs the eval suite on every push and blocks deployment when the model does not pass. Drift detection with Evidently AI monitors whether the production data still looks like the training data. And the response plan connects alerts to decisions -- what to do when something changes, not just that something changed.
The hard part is not the configuration. It is designing the system: which metrics should block deployment, which features matter enough to monitor, how sensitive the alerts should be, and what the team does when they fire.
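The deployment gate the pipeline enforces can be as simple as a script the workflow runs after the eval suite: compare each metric to its threshold and exit nonzero so the GitHub Actions job fails and the deploy step never runs. A minimal sketch -- the metric names and threshold values here are illustrative placeholders, not the ones the provided eval suite uses; choosing them is part of the work:

```python
import sys

# Illustrative thresholds -- picking the real values is a design
# decision for the eval suite, not prescribed by the brief.
THRESHOLDS = {
    "match_accuracy": 0.85,  # minimum acceptable
    "roc_auc": 0.80,
}

def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their threshold."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]

if __name__ == "__main__":
    # In CI these would be loaded from the eval suite's output file.
    metrics = {"match_accuracy": 0.88, "roc_auc": 0.79}
    failures = gate(metrics, THRESHOLDS)
    if failures:
        print(f"Blocking deployment, below threshold: {failures}")
        sys.exit(1)  # nonzero exit fails the Actions step
    print("All gates passed")
```

The design question the brief poses lives entirely in that `THRESHOLDS` dict: which metrics belong there, and how strict each floor should be.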
Tools
- Python -- scripting, drift analysis
- GitHub Actions -- CI/CD automation (new)
- Evidently AI -- drift detection (new)
- scikit-learn -- evaluation suite (familiar)
- MLflow -- experiment tracking (familiar)
- Git / GitHub -- branch-based workflows (deepening)
- Claude Code -- AI direction, plan mode (new use)
Materials
You receive:
- Production placement data with recent records from the changed hospitals
- Training baseline data for drift comparison
- A GitHub Actions workflow template with placeholder steps for evaluation and gating
- An evaluation suite script with configurable thresholds
- A drift detection configuration template for Evidently AI
- A response plan template with severity levels and response procedures
- A project governance file (CLAUDE.md)
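Under the hood, drift detection of the kind Evidently performs comes down to comparing a production feature's distribution against the training baseline with a statistical test or divergence measure. The idea can be sketched with a population stability index (PSI) in plain Python -- the 10-bin setup and the 0.2 alert threshold are common rules of thumb, not values from the provided template, and the shift-hours data below is invented for illustration:

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population stability index between two samples of one feature.

    PSI near 0 means the distributions match; > 0.2 is a common
    rule-of-thumb threshold for significant drift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def fractions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1  # clamp values outside the baseline range
        # small epsilon so empty bins don't blow up the log below
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    p, q = fractions(baseline), fractions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Example: 12-hour shifts appearing where the baseline was mostly 8-hour,
# like the two Coimbatore hospitals -- PSI lands well above 0.2.
baseline_shift_hours = [8.0] * 90 + [12.0] * 10
current_shift_hours = [8.0] * 40 + [12.0] * 60
print(psi(baseline_shift_hours, current_shift_hours))
```

Evidently's presets run checks like this per column and aggregate the results; the judgment calls the brief asks for -- which features to monitor and how sensitive the threshold should be -- are exactly the choice of columns and cutoffs.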