Step 1: What CI/CD means here
The matching model is deployed. People use it to make real placement decisions. When you update the model -- retrain it, adjust features, change thresholds -- that update needs to go through the same quality checks every time. Not because someone remembers to run the eval suite, but because the pipeline refuses to deploy anything that has not passed.
That is what CI/CD does for an ML system. Continuous integration runs the evaluation suite automatically on every push. Continuous deployment only proceeds if the model passes. The eval gate is what separates an automated pipeline from an automated mistake. Without it, every push deploys -- including the ones that degrade match quality for Priya's hospitals.
Step 2: Open the workflow template
Open materials/ci-cd-workflow-template.yml. This is a GitHub Actions workflow file -- it defines what happens when you push code to the repository.
The template has the structure: trigger, job, steps. The trigger fires on pushes to main and feature/* branches. The job runs on a GitHub-hosted runner. The steps are scaffolded: checkout, Python setup, dependency installation, and then placeholders where you fill in the evaluation logic and the gate.
Read through the file. Notice the comments indicating where you need to add the eval suite and the threshold check.
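The trigger/job/steps shape described above can be sketched as follows. This is illustrative, not the actual template -- the file in materials/ is the source of truth, and action versions may differ:

```yaml
name: model-eval-gate

on:
  push:
    branches: [main, 'feature/*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      # Placeholder: run the eval suite and capture its output
      # Placeholder: fail the job if any metric falls below threshold
```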
Step 3: Write constraints before generating
Before asking Claude to configure the workflow, write explicit constraints. These shape what Claude produces.
Tell Claude:
- Pin all dependency versions in requirements.txt (already done in the materials -- verify it)
- The eval step must run evaluation-suite.py and capture the output
- The gate step must exit 1 if any metric falls below threshold
- Use GitHub Secrets for any credentials -- no plaintext API keys
- Install only what the eval job needs -- not the full training environment
These constraints prevent the most common mistakes AI makes with CI/CD pipelines. The typical failure is a workflow that looks gated but is not: the eval step runs, but the pipeline proceeds regardless of the result.
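A gate that actually blocks comes down to the exit code. Here is a minimal Python sketch -- the metric names and threshold values are hypothetical placeholders, not the real evaluation-suite.py:

```python
import sys

# Hypothetical metric names and thresholds -- you choose the real values in Step 4
THRESHOLDS = {"recall": 0.55, "precision": 0.60}

def gate(metrics: dict) -> int:
    """Return an exit code: 0 if every metric clears its threshold, 1 otherwise."""
    failures = [
        f"{name}={metrics[name]:.3f} < {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics[name] < limit
    ]
    if failures:
        print("EVAL GATE FAILED:", "; ".join(failures))
        return 1
    print("Eval gate passed.")
    return 0

# In CI, the script would end with: sys.exit(gate(computed_metrics))
```

GitHub Actions marks a step as failed on any nonzero exit code and stops the job there, so the `sys.exit(1)` path is what makes the gate real rather than decorative.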
Step 4: Configure the eval gate
Open materials/evaluation-suite.py. This script computes metrics for the matching model: accuracy, recall, precision, F1, per-region recall, and a fairness gap.
The thresholds at the top of the file are placeholders. You decide what the gates should be. Recall matters for Priya's team -- they need to catch good matches. The fairness gap matters because of the P6 audit. Precision matters because false positives waste the operations team's time.
Set thresholds that reflect what matters for MedConnect. This is a design decision, not a lookup. A recall threshold of 0.55 means the model must catch at least 55% of good matches. Is that high enough for Priya? Is it so high that a newly retrained model could never pass?
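The per-region recall and fairness gap metrics can be sketched like this. The source does not pin down the fairness-gap formula, so this uses one common definition (max minus min per-region recall) as an assumption -- check evaluation-suite.py for the actual one:

```python
from collections import defaultdict

def per_region_recall(records):
    """records: iterable of (region, y_true, y_pred) tuples with binary labels."""
    tp = defaultdict(int)  # true positives per region
    fn = defaultdict(int)  # false negatives per region
    for region, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                tp[region] += 1
            else:
                fn[region] += 1
    return {r: tp[r] / (tp[r] + fn[r]) for r in set(tp) | set(fn)}

def fairness_gap(recalls):
    """One common definition: highest minus lowest per-region recall."""
    return max(recalls.values()) - min(recalls.values())
```

A large gap means the model serves some regions much better than others -- exactly what the P6 audit is looking for.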
GitHub Actions compute minutes are limited on free tiers. The eval suite should be quick -- load model, load test data, compute metrics, check thresholds, exit. Do not include full training in the CI pipeline. Training happens locally or on dedicated infrastructure.
Step 5: Test the pipeline
Push a branch and observe the GitHub Actions run. The workflow triggers on push.
The first thing you will notice: the runner is not your machine. It has no access to your local files, environment variables, cached dependencies, or any running services. Every dependency must be declared in requirements.txt. Every path must be relative. Every secret must be configured through GitHub Secrets.
This is the notebook-to-production gap showing up in infrastructure. The model works locally because your environment has everything. The runner starts from zero.
Watch the steps execute. If the eval gate passes, you see green checks. If it fails, you see which step failed and why.
Now test the failure path. Modify a threshold to be unreachable -- set recall to 0.99. Push again. The pipeline should fail at the eval gate step with a clear error message and exit code 1.
Step 6: Verify AI's output
Cross-check the workflow with a second model. Open a fresh context and ask it to review the GitHub Actions workflow YAML. Specifically: does the eval gate actually exit 1 on failure? Are dependency versions pinned? Are there any secrets in plaintext? Does the caching configuration (if any) cache the right directory?
AI generates CI/CD pipelines that look right but contain silent failures. A workflow where the eval step always succeeds -- regardless of metrics -- is worse than no workflow at all. It gives false confidence that the gate is working.
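Two YAML patterns worth hunting for in review, because either one makes a failing eval step report success (illustrative snippets, not taken from the template):

```yaml
# Anti-pattern 1: continue-on-error lets the job proceed past a failed step
- name: Run eval suite
  run: python evaluation-suite.py
  continue-on-error: true

# Anti-pattern 2: "|| true" forces a zero exit code, hiding the failure
- name: Run eval suite
  run: python evaluation-suite.py || true
```

Either pattern produces a pipeline that runs the evals and deploys anyway -- the silent failure described above.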
Check: Push a branch with a deliberately failing model (or unreachable threshold). The GitHub Actions run must fail and block the merge.