Learn by Directing AI
Unit 4

Production Monitoring

Step 1: The delayed ground truth problem

In P7, drift detection answered: has the input data changed? That question can be answered immediately -- compare today's features to the training distribution.

Recommendation quality is different. A recommendation made today cannot be evaluated for accuracy today. The customer might click in ten minutes, purchase in three days, and return the item in two weeks. Each of those events changes the answer to "was the recommendation good?"

Open materials/production-recommendations-sample.csv. This is 30 days of production recommendation events. Check the ground_truth_available column. About 60% of recommendations have confirmed outcomes. The remaining 40% are still waiting -- the customer has not acted yet, or the return window has not closed.

This is the delayed ground truth problem. Any monitoring system that reports recommendation accuracy using only the 60% with confirmed outcomes is biased toward customers who decide fast. Customers who browse, consider, and buy later are invisible until their purchases arrive.
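
The bias is easy to see in a sketch. Assuming a pandas DataFrame shaped like the sample CSV (only `ground_truth_available` is named in the file description; the other columns here are illustrative):

```python
import pandas as pd

# Toy cohort: fast deciders have confirmed outcomes, slow deciders are pending.
# Only ground_truth_available mirrors the sample CSV; other columns are illustrative.
events = pd.DataFrame({
    "clicked":                [1, 1, 1, 0, 1, 0, 1, 1],
    "ground_truth_available": [True, True, True, True, True, False, False, False],
    "purchased":              [1, 1, 0, 0, 1, None, None, None],  # pending = unknown
})

# A naive monitor silently restricts itself to the confirmed subset.
confirmed = events[events["ground_truth_available"]]
purchase_rate = confirmed["purchased"].mean()
coverage = len(confirmed) / len(events)

print(f"purchase rate (confirmed only): {purchase_rate:.1%}")
print(f"ground truth coverage: {coverage:.1%}")
```

The purchase rate here describes only the fast-deciding portion of the cohort. Whatever metric the monitoring reports, the coverage figure should be reported next to it.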

Step 2: Design the monitoring strategy

Ask Claude to design a monitoring strategy for the recommendation system. The strategy needs to answer three questions: what to monitor, how to handle the ground truth delay, and who the monitoring serves.

What to monitor: recommendation clicks (immediate), purchases (delayed days), returns (delayed weeks). How: metrics update as ground truth arrives over time, not just at the moment of recommendation. Who: Max's merchandising team needs to read this dashboard and make decisions from it.

AI commonly generates monitoring code that treats all metrics as immediately available -- computing accuracy on the full dataset without distinguishing between confirmed and pending outcomes. Check for this pattern and direct Claude to handle the temporal gap.

Step 3: Build disaggregated monitoring

Aggregate monitoring is a trap. "Overall recommendation click-through rate: 23.5%" sounds healthy. But break it down by customer segment -- age group, sustainability preference, geographic region -- and the picture changes.

Ask Claude to build disaggregated monitoring that tracks recommendation performance separately for each customer segment. The production data has a built-in signal: the 25-34 age group shows degraded click-through in the last 10 days while other segments remain stable. Aggregate monitoring will not catch this. Disaggregated monitoring will.

Direct Claude to compute per-segment metrics, not just overall averages. A model that works well for most customers but fails for one segment has a problem that needs attention.
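
A minimal sketch of the disaggregation, with invented segment labels and click data:

```python
import pandas as pd

# Illustrative events; segment labels echo the age groups in the text.
events = pd.DataFrame({
    "segment": ["18-24", "18-24", "25-34", "25-34", "25-34", "35-44", "35-44"],
    "clicked": [1, 0, 0, 0, 1, 1, 1],
})

# The aggregate number looks unremarkable...
overall_ctr = events["clicked"].mean()

# ...while the per-segment breakdown exposes the weak segment.
segment_ctr = events.groupby("segment")["clicked"].mean()

print(f"overall CTR: {overall_ctr:.0%}")
print(segment_ctr)
```

The `groupby` is the whole trick: one extra line of code, and the failing segment stops hiding behind the average.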

Step 4: Add fairness monitoring

The model may have been fair at deployment. That does not mean it stays fair. As the catalog changes -- new brands, seasonal collections, products going out of stock -- the recommendations shift. Those shifts do not affect all customer segments equally.

Ask Claude to add fairness monitoring that tracks whether recommendation quality degrades differently across segments over time. This is not a one-time audit. It is ongoing monitoring that catches disparate impact as it develops.

A model that recommended sustainable fashion equally to all age groups last month might start under-recommending to older customers if the new seasonal collection skews younger. The monitoring should detect this before the merchandising team notices it in sales numbers.

Step 5: Build the monitoring dashboard

The dashboard is a communication artifact. Max's merchandising team will read it. They did not build the model. They do not know what PSI means. They do not care about KL divergence.

Ask Claude to build a dashboard that translates metrics into business language. "Recommendation conversion dropped 7% for new customers this month" communicates. "PSI: 0.31 on segment_new" does not.

AI commonly generates dashboards with technically correct charts that are unintelligible to non-technical audiences. Direct Claude to use business-language labels, clear visual hierarchy, and segment breakdowns that a merchandising team can act on.
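
One concrete way to direct this: ask for a translation layer that turns metric deltas into sentences. A hypothetical sketch (function name and wording are invented for illustration):

```python
def business_summary(metric: str, segment: str, old: float, new: float) -> str:
    """Turn a metric change into a sentence a merchandising team can act on.

    The phrasing is illustrative -- the point is having a translation layer
    between raw metrics and dashboard labels.
    """
    change = (new - old) / old
    direction = "dropped" if change < 0 else "rose"
    return (f"Recommendation {metric} {direction} {abs(change):.0%} "
            f"for {segment} this month")

print(business_summary("conversion", "new customers", 0.14, 0.13))
```

The sketch assumes a nonzero baseline; a real version would also handle new segments with no prior data.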

Step 6: Design the ground truth pipeline

The monitoring needs to update as outcomes arrive. A recommendation made on December 1 might not have a confirmed purchase until December 8 and might not have a return status until December 22.

Ask Claude to design a pipeline that updates recommendation quality metrics as ground truth arrives over time. The dashboard should show which metrics are based on confirmed data and which are provisional.
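
The core of such a pipeline is recomputing metrics "as of" a date, using only outcomes confirmed by then. A sketch with invented column names and dates:

```python
import pandas as pd

# Illustrative events with the date each outcome became (or will become) known.
events = pd.DataFrame({
    "rec_date":       pd.to_datetime(["2024-12-01", "2024-12-01", "2024-12-02"]),
    "purchase_known": pd.to_datetime(["2024-12-08", None, "2024-12-05"]),
    "purchased":      [1, None, 0],
})

def purchase_rate_as_of(df, as_of):
    """Recompute the metric from outcomes confirmed by `as_of`,
    and report how much of the cohort is still provisional."""
    as_of = pd.Timestamp(as_of)
    confirmed = df[df["purchase_known"].notna() & (df["purchase_known"] <= as_of)]
    rate = confirmed["purchased"].mean() if len(confirmed) else None
    return rate, len(df) - len(confirmed)

print(purchase_rate_as_of(events, "2024-12-06"))
print(purchase_rate_as_of(events, "2024-12-10"))
```

Running the same function on two dates shows the metric changing as ground truth arrives, which is exactly what the dashboard's confirmed-versus-provisional labels need to convey.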

Step 7: Verify with simulated shift

Test the monitoring system. The production data in materials/production-recommendations-sample.csv contains a simulated drift: the 25-34 age group's click-through rate drops from ~28% to ~15% in the last 10 days. Other segments remain stable.

Run the monitoring on this data. The aggregate view should look stable -- overall click-through rate barely changes. The disaggregated view should catch the problem.
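
To sanity-check what the disaggregated view should report, here is a self-contained simulation of the same drift pattern (synthetic data standing in for the CSV; segment names, rates, and the 30-day window come from the text, everything else is invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the sample CSV's built-in drift: the 25-34
# segment's CTR falls from ~28% to ~15% in the last 10 of 30 days.
rows = []
for day in range(1, 31):
    for segment in ["18-24", "25-34", "35-44"]:
        p = 0.15 if (segment == "25-34" and day > 20) else 0.28
        for c in rng.binomial(1, p, size=500):
            rows.append({"day": day, "segment": segment, "clicked": c})
events = pd.DataFrame(rows)

recent = events["day"] > 20
overall_change = (events.loc[~recent, "clicked"].mean()
                  - events.loc[recent, "clicked"].mean())

# Per-segment drop between the baseline window and the last 10 days.
drop = (events[~recent].groupby("segment")["clicked"].mean()
        - events[recent].groupby("segment")["clicked"].mean())
flagged = drop[drop > 0.05]  # illustrative alert threshold

print(f"overall CTR change: {overall_change:.1%}")  # diluted by stable segments
print(flagged)
```

With only three equal-sized segments the dilution is partial; in production data with more segments, the aggregate number moves even less while the per-segment drop stays just as visible.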

Max reviews the dashboard at this point. He responds to the business-language metrics: "That is exactly what my merchandising team needs -- they can actually read this." He asks about seasonal transitions: "What happens when the summer collection replaces spring?" This connects to the ground truth delay -- summer collection performance data will not be available until customers have had time to purchase and return items.

✓ Check

Check: Simulated data shift triggers a visible alert in the dashboard, and disaggregated metrics show the shift affecting specific customer segments differently.