Learn by Directing AI
Unit 2

Evaluation Design

Step 1: Review the Evaluation Ticket

Open materials/tickets.md and find the evaluation design tickets (T05-T07). The task: choose your primary and secondary evaluation metrics, define how you'll evaluate prepaid and postpaid customers separately, establish baselines, and set success thresholds.

This is the first time you're making these decisions. In P1, the evaluation criteria were handed to you -- recall >= 0.55 on the churn class. This time you decide what "success" looks like and why.

Step 2: Get AI's Metric Recommendations

Direct Claude to recommend evaluation metrics for a churn prediction problem with class imbalance and a segment gap. Something like: "Given an imbalanced churn dataset where overall churn is ~8% and prepaid customers churn at ~12% while postpaid customers churn at ~4%, what evaluation metrics should I use? Why?"

Read what Claude recommends. Does it mention class imbalance? Does it suggest looking at prepaid and postpaid separately, or only overall metrics? AI commonly defaults to metrics that look good in aggregate while hiding segment-level failures. Check whether the recommendations address the specific problem Emeka described.

Now try AI self-review. Ask Claude: "Review your metric recommendations. For each metric you suggested, explain specifically how it would reveal or hide the prepaid churn gap." This is a targeted self-review prompt -- it forces Claude to evaluate its own output against a specific criterion rather than giving a generic "looks good."

Step 3: Choose Your Metrics

Make your decisions. Your primary metric should capture whether the model catches customers who are about to leave. Your secondary metric should capture whether the model wastes the retention team's time on false positives.

Write the rationale for each choice. Not just "use F1" but why -- connect the metric to the business problem. "We chose recall as the primary metric because Emeka's retention team can only call 200 people a week and every missed churner is a subscriber they could have saved. We chose precision as the secondary metric because every false positive is a wasted call."
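Once you've chosen, the two metrics are a few lines to compute. A minimal sketch using scikit-learn, with toy `y_true`/`y_pred` arrays standing in for your actual test labels and model predictions:

```python
# Sketch: computing a recall (primary) and precision (secondary) metric.
# y_true and y_pred are placeholder arrays, not real churn data.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = churned
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model's predictions

# Recall: of the customers who actually churned, how many did we flag?
recall = recall_score(y_true, y_pred)
# Precision: of the customers we flagged, how many actually churned?
precision = precision_score(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Note the asymmetry in what each number hides: recall ignores wasted calls, precision ignores missed churners. That's why you need both, with a stated priority.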

Step 4: Design Per-Segment Evaluation

The overall metrics could show solid performance while the model completely fails on prepaid customers. You need to evaluate the two segments separately.

Direct Claude to set up an evaluation plan that reports metrics for prepaid customers and postpaid customers independently. The plan should specify: which metrics to compute per segment, what the segment column is, and what counts as improvement over the baseline for each segment.

This is disaggregated evaluation. The P1 model's overall metrics probably looked fine. The problem was that "overall" averaged away the prepaid failure. Separating the segments makes the gap visible.
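The disaggregated report itself is a simple groupby. A sketch assuming a DataFrame with hypothetical columns `segment`, `y_true`, and `y_pred` (your actual column names may differ):

```python
# Sketch of disaggregated evaluation: one metric row per segment,
# never just the aggregate. Column names here are assumptions.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

df = pd.DataFrame({
    "segment": ["prepaid", "prepaid", "prepaid", "postpaid", "postpaid", "postpaid"],
    "y_true":  [1, 1, 0, 1, 0, 0],
    "y_pred":  [1, 0, 0, 1, 0, 0],
})

# Compute the same metrics independently within each segment.
for segment, grp in df.groupby("segment"):
    r = recall_score(grp["y_true"], grp["y_pred"])
    p = precision_score(grp["y_true"], grp["y_pred"], zero_division=0)
    print(f"{segment}: recall={r:.2f}, precision={p:.2f}")
```

On real data, a gap like prepaid recall 0.50 versus postpaid recall 1.00 is exactly the failure an overall number averages away.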

Step 5: Establish Baselines

Before you train anything, you need to know the floor. Direct Claude to compute two baselines:

First, the majority-class predictor. What happens if the model just predicts "no churn" for every subscriber? On this dataset, that's roughly 92% accurate. And it catches exactly zero of the customers who are about to leave. This is why accuracy is meaningless here.

Second, a logistic regression baseline. This is the simplest real model. Log the overall metrics and the per-segment metrics. These numbers define what the real model has to beat -- not zero, but the logistic regression floor.
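A sketch of the logistic regression baseline. Everything here is a placeholder (synthetic features, a made-up train/test split); the point is the shape of the step, including `class_weight="balanced"` so the minority churn class isn't ignored:

```python
# Sketch: logistic regression baseline on synthetic imbalanced data.
# X, y, and the split are stand-ins for the real pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
# Imbalanced toy labels with signal in the first feature.
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 2.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights so the rare churn class matters.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
pred = baseline.predict(X_te)

print("baseline recall:   ", round(recall_score(y_te, pred), 2))
print("baseline precision:", round(precision_score(y_te, pred, zero_division=0), 2))
```

Log these numbers, overall and per segment, before training anything bigger: they are the floor the real model must clear.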

✓ Check

Check: The evaluation design includes per-segment metrics (prepaid and postpaid evaluated separately) and at least one baseline score to compare the final model against.