Learn by Directing AI
Unit 4

Experiment Tracking Infrastructure

Step 1: Review the Experiment Tracking Tickets

Open materials/tickets.md and read T-07 through T-13. These cover two related concerns: experiment tracking as infrastructure (not ad hoc logging), and reproducibility.

You used MLflow in P2 to log experiments. That was basic -- you logged runs and compared metrics. P3 makes it infrastructure: systematic logging where every parameter and metric is tracked and verified to be accurate, and where the same configuration reproduces the same results across runs.

Emeka's data team wants to experiment with different model settings. For that to work, every experiment needs to be tracked so they can compare results and know which settings produced the best outcome. And it needs to be reproducible -- running the same settings twice should give the same results.

Step 2: Set Up Structured MLflow Logging

Direct Claude to configure an MLflow experiment with systematic logging. Something like: "Set up an MLflow experiment called 'churn-infrastructure'. Write a training script that logs all hyperparameters, all evaluation metrics (including per-segment prepaid/postpaid recall), the model artifact, and the data version. Use mlflow.start_run() as a context manager."

Review what Claude produces carefully. AI-generated MLflow logging code often contains bugs that fail silently -- the run completes and values are logged, but the values are wrong. Check for these patterns:

  • mlflow.log_param("n_estimators", 100) with a hardcoded value instead of mlflow.log_param("n_estimators", n_estimators) referencing the actual variable
  • mlflow.log_metric("accuracy", train_accuracy) logging training accuracy as if it were test accuracy
  • Missing mlflow.end_run() or no context manager, causing runs to bleed into each other

Direct Claude to do a self-review: "List every mlflow.log_param and mlflow.log_metric call and verify each logs the actual variable, not a hardcoded value. Verify all runs use context managers."
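Part of that self-review can be mechanized with a quick static scan. A sketch, assuming the training script's source is available as a string (the regex only catches the simplest case -- a bare numeric or string literal passed directly to log_param -- so it supplements the review rather than replacing it):

```python
import re

# Flags mlflow.log_param("name", <literal>) calls where the second
# argument is a bare number or quoted string instead of a variable.
HARDCODED = re.compile(
    r'mlflow\.log_param\(\s*"[^"]+"\s*,\s*(\d+(?:\.\d+)?|"[^"]*")\s*\)'
)

def find_hardcoded_params(source: str) -> list[str]:
    """Return the literal values passed directly to mlflow.log_param."""
    return HARDCODED.findall(source)

script = '''
mlflow.log_param("n_estimators", 100)        # hardcoded -- flagged
mlflow.log_param("max_depth", max_depth)     # variable -- fine
'''
print(find_hardcoded_params(script))  # ['100']
```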

Step 3: Add Reproducibility Infrastructure

Direct Claude to add random seeds and version pinning. Specifically:

  • Set random seeds for numpy, scikit-learn, and Python's random module
  • Pin all library versions in requirements.txt (replace the unpinned versions in api-baseline/requirements.txt with exact pins like scikit-learn==1.6.1)
  • Ensure the data split uses a fixed random_state
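The steps above can be sketched with a hypothetical set_seeds helper. Note that scikit-learn has no global seed of its own -- it draws from numpy's, and estimators and train_test_split additionally accept an explicit random_state:

```python
import random

import numpy as np

SEED = 42

def set_seeds(seed: int = SEED) -> None:
    random.seed(seed)      # Python's random module
    np.random.seed(seed)   # numpy (scikit-learn draws from this by default)

set_seeds()
a = np.random.rand(3)

set_seeds()
b = np.random.rand(3)

assert np.array_equal(a, b)  # same seed, same draws

# Estimators and splits still need an explicit random_state:
#   train_test_split(X, y, random_state=SEED)
#   RandomForestClassifier(random_state=SEED)
```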

Open the updated requirements.txt and verify every dependency has an exact version. No >=, no unpinned entries. An unpinned scikit-learn that auto-updates from 1.6 to 1.7 can silently change model training behavior.
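That verification can also be scripted. A sketch of a check that every requirement line is an exact == pin (the sample content is illustrative; comments and blank lines are skipped):

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not exact == pins."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "==" not in line or any(op in line for op in (">=", "<=", "~=", ">", "<")):
            bad.append(line)
    return bad

reqs = "scikit-learn==1.6.1\nmlflow>=2.0\npandas\n"
print(unpinned(reqs))  # ['mlflow>=2.0', 'pandas']
```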

Step 4: Run a Controlled Experiment

Train two model variants with different hyperparameters. For example, one with n_estimators=100 and another with n_estimators=200, or different max_depth values. Log both to MLflow.

After both runs complete, open the MLflow UI and compare them. You should see two runs with different parameters and different metrics. The per-segment metrics (prepaid recall, postpaid recall) should be visible -- connecting back to the evaluation work from P2.

Step 5: Verify Reproducibility

Run the same configuration twice with the same seeds and verify identical results. Direct Claude to run the training script with one set of hyperparameters, note the metrics, then run it again with the same settings.

Compare the two runs in MLflow. The metrics should be identical. If they're not, something is non-deterministic -- a missing seed, an unpinned dependency, a data split without a fixed random_state. Track down the source and fix it.
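A sketch of that comparison, assuming you've pulled the two runs' metrics into dicts (e.g. via mlflow.search_runs or copied from the UI; the metric names and values here are hypothetical). Exact equality is the point: with seeds set and versions pinned, the numbers should match to the last digit.

```python
def diff_metrics(run_a: dict[str, float], run_b: dict[str, float]) -> dict:
    """Return metrics whose values differ -- empty dict means reproducible."""
    keys = run_a.keys() | run_b.keys()
    return {k: (run_a.get(k), run_b.get(k))
            for k in keys if run_a.get(k) != run_b.get(k)}

first  = {"accuracy": 0.87, "prepaid_recall": 0.81, "postpaid_recall": 0.84}
second = {"accuracy": 0.87, "prepaid_recall": 0.79, "postpaid_recall": 0.84}

print(diff_metrics(first, second))  # {'prepaid_recall': (0.81, 0.79)}
```

An empty result passes the reproducibility test; any entry names exactly which metric drifted, which narrows the hunt for the missing seed or unpinned dependency.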

This is the reproducibility test: can someone else run the same code with the same settings and get the same results? If yes, the system is reproducible. If not, "the model scored 0.87" is a claim nobody else can verify.

✓ Check

Check: Two experiment runs in MLflow show different hyperparameters and metrics. Running the same configuration twice produces identical metric values (reproducibility verified). No mlflow.log_param call uses a hardcoded value.