Step 1: Set Up MLflow Experiment Tracking
Open materials/tickets.md and find the training tickets (T14-T19). The first task is setting up MLflow as experiment tracking infrastructure.
In P1, MLflow was there for basic logging. This time you're using it as the system that makes model comparison possible. Every run gets logged -- parameters, metrics, model artifacts. When you have three or four runs with different settings, MLflow is how you answer "which one actually worked better for prepaid customers?" without relying on memory.
Direct Claude to set up an MLflow experiment for this project. Configure the experiment name, set the tracking URI, and verify that a test run can be logged and viewed.
Step 2: Train the Baseline
Direct Claude to train a logistic regression model on the preprocessed training data. Log everything to MLflow: the model type, any parameters, and the evaluation metrics -- overall and per segment.
Compute per-segment metrics using the evaluation plan from Unit 2. The logistic regression gives you the floor. If the next model can't beat logistic regression on prepaid recall, the added complexity isn't justified.
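The per-segment computation is the part worth seeing concretely. This sketch uses synthetic stand-in data and an assumed segment labeling ("prepaid"/"postpaid"); swap in your preprocessed training data and log the resulting numbers to MLflow with `mlflow.log_metric`.

```python
# Baseline sketch: logistic regression with recall computed separately
# per customer segment. The data and segment labels below are synthetic
# stand-ins -- use your preprocessed data in practice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def per_segment_recall(model, X, y, segments):
    """Recall computed separately for each segment label."""
    preds = model.predict(X)
    return {
        seg: recall_score(y[segments == seg], preds[segments == seg])
        for seg in np.unique(segments)
    }

# Tiny synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
segments = np.where(rng.random(200) < 0.5, "prepaid", "postpaid")

model = LogisticRegression().fit(X, y)
recalls = per_segment_recall(model, X, y, segments)
print(recalls)
```

The point of the helper is that it slices predictions by segment *before* computing the metric, so the prepaid number can never be averaged away by the postpaid majority.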
Step 3: Train and Tune the RandomForest
Direct Claude to train a RandomForest classifier. Log to MLflow. Then tune hyperparameters using cross-validation.
Here's where to pay attention: on a dataset this size, the variance between cross-validation folds can exceed the difference between hyperparameter settings. If fold 1 gives 0.72 and fold 3 gives 0.68 for the same hyperparameters, and the best hyperparameters score 0.71 average while the second-best score 0.70 -- the tuning found noise, not signal.
Direct Claude to report both the mean cross-validation score and the per-fold spread for each setting. Check whether the fold variance is larger than the performance difference between settings. Left unprompted, AI runs exhaustive grid searches without ever making this check. The student who looks at the spread knows whether the tuning actually found something.
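The spread check can be sketched in a few lines. The grid values and synthetic data here are illustrative; the structure (compare best-vs-second-best gap against the worst per-fold spread) is the part to keep.

```python
# Sketch of the fold-spread check: is the gap between hyperparameter
# settings larger than the variance across folds? Grid and data are
# illustrative stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

results = {}
for max_depth in (3, 6):  # illustrative grid
    model = RandomForestClassifier(n_estimators=100, max_depth=max_depth,
                                   random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    results[max_depth] = (scores.mean(), scores.max() - scores.min())

means = sorted(m for m, _ in results.values())
setting_gap = means[-1] - means[-2]   # best mean vs second-best mean
max_spread = max(spread for _, spread in results.values())

print(f"gap between settings: {setting_gap:.3f}, worst fold spread: {max_spread:.3f}")
if max_spread > setting_gap:
    print("Fold variance exceeds the tuning gap -- likely noise, not signal.")
```

If the spread dwarfs the gap, report that honestly rather than declaring a winner.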
Step 4: Per-Segment Evaluation
With the best model selected, run the evaluation you designed in Unit 2. Compute metrics for prepaid customers and postpaid customers separately.
This is the moment that answers Emeka's question. Look at the prepaid recall. Compare it against the logistic regression baseline. Did it improve? If overall recall is 0.70 but prepaid recall is 0.45, the model is still failing the group Emeka cares about most.
Step 5: Update Emeka
Open the chat with Emeka and share the per-segment results. He'll want to know: does the model catch prepaid customers now?
If prepaid recall improved, he'll be pleased. But he'll push: "What was the difference? What changed from last time?" He wants to understand what drove the improvement -- not just the numbers.
If prepaid recall is still low, he'll be direct: "That's the whole reason we're doing this again." Either way, the per-segment numbers force an honest conversation. Overall metrics would let you report good results while the prepaid gap persists.
This is also where Emeka might ask about feature importance per segment: "Can we get the model to tell us WHY a prepaid customer is likely to churn?" This is a reasonable extension. You can examine feature importances for the prepaid subset within the existing framework.
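A RandomForest's built-in `feature_importances_` are global, so one way to get segment-specific answers is permutation importance computed on the prepaid rows only. This is a sketch on synthetic data; the segment labels and feature names are assumptions.

```python
# Sketch: permutation importance restricted to the prepaid subset, as
# one way to ask what drives prepaid predictions. Data is a synthetic
# stand-in; segment labels are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
segments = np.where(rng.random(300) < 0.5, "prepaid", "postpaid")

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

prepaid = segments == "prepaid"
result = permutation_importance(model, X[prepaid], y[prepaid],
                                scoring="recall", n_repeats=10,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

Scoring on recall for the prepaid subset keeps the importance question aligned with the metric Emeka actually cares about, rather than overall accuracy.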
Check: MLflow shows at least two experiment runs with logged parameters and metrics. The per-segment evaluation shows prepaid recall separately from postpaid recall.