Learn by Directing AI
Unit 3

Training and Evaluation

Step 1: Review the Training and Evaluation Tickets

Check materials/CLAUDE.md if you need a reminder of the project structure. Open materials/tickets.md and find the training and evaluation tickets. Three tasks: train a RandomForestClassifier, compute the confusion matrix and classification report, and check recall on the churn class against the target.

The target is specific: churn class recall >= 0.55. That number comes from Emeka. His retention team can act on predictions only if the model catches enough of the subscribers who are actually leaving. A model that misses most churners is not useful, no matter what the other numbers say.

Two parameters matter before you start. The tickets specify class_weight='balanced' and random_state=42. The first tells the RandomForest that the minority class (churn) matters as much as the majority class — without it, the model optimizes for the 92% and largely ignores the 8%. The second makes the result reproducible. Anyone who runs the same code on the same data gets the same model.

Step 2: Train the Model

Direct Claude to train the RandomForestClassifier on the training set. Be specific in your prompt: include class_weight='balanced' and random_state=42. Something like: "Train a RandomForestClassifier on the training data with class_weight='balanced' and random_state=42. Use the preprocessed features and the churn target column."
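What Claude writes should look roughly like the sketch below. This is a minimal version using a synthetic 92/8 dataset in place of the project's preprocessed features — the variable names and the train/test split are assumptions, not the project's actual code.

```python
# Sketch of the training step. Synthetic imbalanced data stands in for
# the real preprocessed features; names like X_train are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Roughly a 92/8 class split, mirroring the churn imbalance.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The two parameters the tickets specify:
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
```

The two ticket parameters are the part worth checking in whatever Claude produces; everything else here is scaffolding.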

Training always produces a model. The question is never "did it work?" — it ran, it produced output, it will make predictions. The question is whether those predictions meet the criteria you defined before training started. That distinction matters for every model you build after this one.

If your prompt doesn't mention class_weight='balanced', Claude will use the default — equal weight for both classes. The model will optimize for overall accuracy, which on a 92/8 split means learning to predict "no churn" almost every time. It would still produce a model. It would still report a number. The number would just be the wrong one to care about.

Step 3: Generate the Evaluation

Direct Claude to compute the confusion matrix and classification report on the test set. Not the training set — the test set. The training set is what the model learned from. Evaluating on it tells you how well the model memorized, not how well it generalizes.

Ask for both outputs explicitly: "Generate the confusion matrix and classification report on the test set. Show me the full output."
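The evaluation code behind that prompt is short. A sketch, again on synthetic stand-in data — the essential detail is that both metrics are computed from predictions on the test split:

```python
# Sketch of the evaluation step: predict on held-out data, then
# compute both outputs. Data and variable names are stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # test set, not training set
cm = confusion_matrix(y_test, y_pred)   # 2x2 grid: rows = actual, cols = predicted
print(cm)
print(classification_report(y_test, y_pred, target_names=["no churn", "churn"]))
```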

The confusion matrix is a 2x2 grid. Each cell counts one combination of prediction and reality: predicted churn and actually churned, predicted churn but didn't, and so on.


Look at the classification report next.

The report shows precision, recall, and F1 for each class, plus overall accuracy. The accuracy number will catch your eye first — it will be somewhere around 90%. That number includes both classes, and the "no churn" class is 92% of the data. A model that predicts "no churn" for everyone would get 92% accuracy. The accuracy line is not where the answer lives.
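The "predict no churn for everyone" baseline is easy to demonstrate. A sketch using scikit-learn's `DummyClassifier` on synthetic 92/8 data (the dataset is a stand-in for the project's):

```python
# The accuracy trap, demonstrated: always predicting the majority
# class scores ~92% accuracy but catches zero churners.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

acc = accuracy_score(y_test, y_pred)          # around 0.92
churn_recall = recall_score(y_test, y_pred)   # 0.0 — misses every churner
print(acc, churn_recall)
```

High accuracy, zero recall: the exact failure mode the ticket's target is designed to rule out.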

Step 4: Interpret the Results

Find the recall value for the churn class. That is the number that answers Emeka's question: of the subscribers who actually churned, what fraction did the model catch?

Now look at the confusion matrix cells and think about what each one means for the retention team.

True positives — bottom right — are churners the model correctly identified. These are the subscribers Emeka's team will call, and the calls will matter. False negatives — bottom left — are churners the model missed. These subscribers will leave and nobody will reach out. False positives — top right — are stable subscribers the model flagged as churners. Emeka's team calls them, but they were never going to leave. Wasted effort, but not a lost customer. True negatives — top left — are stable subscribers correctly left alone. The largest cell, and the least interesting.
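The cell positions above follow scikit-learn's layout: rows are actual labels, columns are predicted, with "no churn" (0) first. A toy sketch with made-up labels shows how to unpack the four cells by name:

```python
# Toy labels (1 = churn) to show scikit-learn's cell layout.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]  # what actually happened
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]  # what the model said

# ravel() flattens the 2x2 grid in row order: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```

Unpacking by name this way makes the retention-team interpretation explicit instead of relying on remembering which corner is which.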

The trade-off is between false negatives and false positives. Catching more churners (higher recall) means flagging more stable subscribers too (more false positives). Emeka's team has limited capacity — they can make about 200 calls a week. Every false positive takes a slot that could have gone to a real churner.

This is why recall has a specific target instead of "as high as possible." The model needs to catch enough churners to be useful without flooding the team with false alarms. Check: does the churn recall meet the >= 0.55 target?
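The check itself is one line of arithmetic on the confusion-matrix cells. The cell counts below are made up for illustration:

```python
# Churn recall from confusion-matrix cells, checked against the
# ticket's target. The counts 130/70 are hypothetical.
tp, fn = 130, 70                 # churners caught vs. churners missed
churn_recall = tp / (tp + fn)    # fraction of actual churners caught

print(churn_recall)              # 0.65
print(churn_recall >= 0.55)      # True — target met
```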

One thing to watch for: Claude may present accuracy as the headline result. If the summary leads with "the model achieved 90% accuracy," that framing is technically correct and practically misleading. The metric that matters for this problem is churn recall, and the reason is the class imbalance you saw in Unit 1. A model's value depends on what you need it to do, not on the number that looks best.

Step 5: Extract Feature Importances

Direct Claude to extract and display the feature importances from the trained RandomForest. Something like: "Show the feature importances from the trained model, sorted from most to least important."
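The extraction is a couple of lines on the fitted model. A sketch on synthetic data — the feature names here are hypothetical stand-ins for the project's columns:

```python
# Sketch of extracting and ranking feature importances. The data and
# the column names are stand-ins, not the project's real features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=4, random_state=42)
names = ["contract_type", "tenure_months", "monthly_charges", "payment_method"]

model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X, y)

# feature_importances_ aligns with the column order and sums to 1.
ranked = sorted(
    zip(names, model.feature_importances_), key=lambda t: t[1], reverse=True
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```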

Feature importances tell you which columns in the dataset had the most influence on the model's predictions. This is not the same as causation — a feature being important to the model does not mean it causes churn. But it tells you what the model is paying attention to, which is useful for two reasons.

First, it's a sanity check. If the model considers subscriber_id the most important feature, something went wrong — the model is memorizing individual records, not learning patterns. The important features should be things that plausibly relate to churn behavior: contract type, tenure, charges, payment method.

Second, Emeka will ask. He wants to know what drives churn in his subscriber base. The feature importances are a first-pass answer. Not a definitive one — RandomForest importances are biased toward high-cardinality features and don't capture interactions — but a useful starting point for a conversation with his retention team about where to focus.

Direct Claude to present the importances in a way you could share with Emeka. Business terms, not column names. "Contract type and tenure are the strongest predictors of churn" means more to him than "contract_type importance: 0.23, tenure_months importance: 0.19."

Step 6: Update Emeka

Send Emeka the results. Translate what you found into his language: how many churners the model catches, what the false positive rate means for his team's 200 weekly calls, and which features drive the predictions.

Emeka will respond — he gets excited fast. Something like: "So you're telling me we can catch 65 out of 100 customers who are about to leave? That's already better than guessing." He'll ask two things. First, about the false positives: how many stable subscribers will his team waste calls on? That's the precision number for the churn class, framed as a ratio. Second, about the feature importances: if contract type is the biggest driver, should his team focus retention offers on month-to-month subscribers?

Both are good questions. The false positive question connects the confusion matrix to his team's real capacity constraint. The feature importance question is where you explain that importance means "the model pays attention to this" — not "this causes churn." Month-to-month subscribers churn more, but offering them a discount might not be what keeps them. That's a business decision, not a model output.


✓ Check

✓ Check: Churn class recall >= 0.55 on the test set.