Step 1: What CI/CD means here
The matching model is deployed. People use it to make real placement decisions. When you update the model -- retrain it, adjust features, change thresholds -- that update needs to go through the same quality checks every time. Not because someone remembers to run the eval suite, but because the pipeline refuses to deploy anything that has not passed.
That is what CI/CD does for an ML system. Continuous integration runs the evaluation suite automatically on every push. Continuous deployment only proceeds if the model passes. The eval gate is what separates an automated pipeline from an automated mistake. Without it, every push deploys -- including the ones that degrade match quality for Priya's hospitals.
Step 2: Open the workflow template
Open materials/ci-cd-workflow-template.yml. This is a GitHub Actions workflow file -- it defines what happens when you push code to the repository.
The template has the structure: trigger, job, steps. The trigger fires on pushes to main and feature/* branches. The job runs on a GitHub-hosted runner. The steps are scaffolded: checkout, Python setup, dependency installation, and then placeholders where you fill in the evaluation logic and the gate.
Read through the file. Notice the comments indicating where you need to add the eval suite and the threshold check.
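The trigger/job/steps shape described above can be sketched as follows. This is illustrative, not the actual template -- the file in materials/ is the source of truth, and action versions may differ:

```yaml
name: model-eval-gate

on:
  push:
    branches: [main, 'feature/*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      # Placeholder: run the eval suite and capture its output
      # Placeholder: fail the job if any metric falls below threshold
```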
Step 3: Write constraints before generating
Before asking Claude to configure the workflow, write explicit constraints. These shape what Claude produces.
Tell Claude:
- Pin all dependency versions in requirements.txt (already done in the materials -- verify it)
- The eval step must run evaluation-suite.py and capture the output
- The gate step must exit 1 if any metric falls below threshold
- Use GitHub Secrets for any credentials -- no plaintext API keys
- Install only what the eval job needs -- not the full training environment
These constraints prevent the most common mistakes AI makes with CI/CD pipelines. The typical failure is a workflow that looks gated but is not: the eval step runs, but the pipeline proceeds regardless of the result.
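A gate that actually blocks comes down to the exit code. Here is a minimal Python sketch -- the metric names and threshold values are hypothetical placeholders, not the real evaluation-suite.py:

```python
import sys

# Hypothetical metric names and thresholds -- you choose the real values in Step 4
THRESHOLDS = {"recall": 0.55, "precision": 0.60}

def gate(metrics: dict) -> int:
    """Return an exit code: 0 if every metric clears its threshold, 1 otherwise."""
    failures = [
        f"{name}={metrics[name]:.3f} < {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics[name] < limit
    ]
    if failures:
        print("EVAL GATE FAILED:", "; ".join(failures))
        return 1
    print("Eval gate passed.")
    return 0

# In CI, the script would end with: sys.exit(gate(computed_metrics))
```

GitHub Actions marks a step as failed on any nonzero exit code and stops the job there, so the `sys.exit(1)` path is what makes the gate real rather than decorative.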
Step 4: Configure the eval gate
Open materials/evaluation-suite.py. This script computes metrics for the matching model: accuracy, recall, precision, F1, per-region recall, and a fairness gap.
The thresholds at the top of the file are placeholders. You decide what the gates should be. Recall matters for Priya's team -- they need to catch good matches. The fairness gap matters because of the P6 audit. Precision matters because false positives waste the operations team's time.
Set thresholds that reflect what matters for MedConnect. This is a design decision, not a lookup. A recall threshold of 0.55 means the model must catch at least 55% of good matches. Is that high enough for Priya? Is it so high that a newly retrained model could never pass?
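The per-region recall and fairness gap metrics can be sketched like this. The source does not pin down the fairness-gap formula, so this uses one common definition (max minus min per-region recall) as an assumption -- check evaluation-suite.py for the actual one:

```python
from collections import defaultdict

def per_region_recall(records):
    """records: iterable of (region, y_true, y_pred) tuples with binary labels."""
    tp = defaultdict(int)  # true positives per region
    fn = defaultdict(int)  # false negatives per region
    for region, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                tp[region] += 1
            else:
                fn[region] += 1
    return {r: tp[r] / (tp[r] + fn[r]) for r in set(tp) | set(fn)}

def fairness_gap(recalls):
    """One common definition: highest minus lowest per-region recall."""
    return max(recalls.values()) - min(recalls.values())
```

A large gap means the model serves some regions much better than others -- exactly what the P6 audit is looking for.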
GitHub Actions compute minutes are limited on free tiers. The eval suite should be quick -- load model, load test data, compute metrics, check thresholds, exit. Do not include full training in the CI pipeline. Training happens locally or on dedicated infrastructure.
Step 5: Test the pipeline
Push a branch and observe the GitHub Actions run. The workflow triggers on push.
The first thing you will notice: the runner is not your machine. It has no access to your local files, environment variables, cached dependencies, or any running services. Every dependency must be declared in requirements.txt. Every path must be relative. Every secret must be configured through GitHub Secrets.
This is the notebook-to-production gap showing up in infrastructure. The model works locally because your environment has everything. The runner starts from zero.
Watch the steps execute. If the eval gate passes, you see green checks. If it fails, you see which step failed and why.
Now test the failure path. Modify a threshold to be unreachable -- set recall to 0.99. Push again. The pipeline should fail at the eval gate step with a clear error message and exit code 1.
Step 6: Verify AI's output
Cross-check the workflow with a second model. Open a fresh context and ask it to review the GitHub Actions workflow YAML. Specifically: does the eval gate actually exit 1 on failure? Are dependency versions pinned? Are there any secrets in plaintext? Does the caching configuration (if any) cache the right directory?
AI generates CI/CD pipelines that look right but contain silent failures. A workflow where the eval step always succeeds -- regardless of metrics -- is worse than no workflow at all. It gives false confidence that the gate is working.
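Two YAML patterns worth hunting for in review, because either one makes a failing eval step report success (illustrative snippets, not taken from the template):

```yaml
# Anti-pattern 1: continue-on-error lets the job proceed past a failed step
- name: Run eval suite
  run: python evaluation-suite.py
  continue-on-error: true

# Anti-pattern 2: "|| true" forces a zero exit code, hiding the failure
- name: Run eval suite
  run: python evaluation-suite.py || true
```

Either pattern produces a pipeline that runs the evals and deploys anyway -- the silent failure described above.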
Check: Push a branch with a deliberately failing model (or unreachable threshold). The GitHub Actions run must fail and block the merge.