Learn by Directing AI
Unit 3

The accuracy trap

Step 1: Build a baseline model

Direct AI to fit a logistic regression on the prepared data. Use a train/test split -- 80/20 is standard. Ask for the accuracy score on the test set.

The model will report an accuracy somewhere in the low 90s. It looks excellent.
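A minimal sketch of this baseline fit. The real barrel dataset isn't shown here, so this uses synthetic stand-in data; the feature shapes, the 92/8 class split, and the label encoding (1 = Reserve, 0 = Standard) are assumptions from the text.

```python
# Baseline logistic regression on synthetic stand-in data
# (assumption: real barrel features are replaced by random ones here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
y = (rng.random(n) < 0.08).astype(int)          # ~8% Reserve (1), rest Standard (0)
X = rng.normal(size=(n, 4)) + y[:, None] * 0.5  # weakly informative features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # the standard 80/20 split

model = LogisticRegression()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.2f}")  # high, driven by the majority class
```

On data this imbalanced, the number printed will sit near the majority-class share regardless of how well the model handles Reserve barrels.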

Step 2: Check what the model actually predicts

That accuracy number is lying to you. Direct AI to generate the confusion matrix for the test set predictions.

The confusion matrix is a 2x2 grid. It shows four things: barrels correctly predicted as Standard (true negatives), barrels correctly predicted as Reserve (true positives), Standard barrels the model falsely called Reserve (false positives), and Reserve barrels the model missed (false negatives).

Read the matrix. The model predicts nearly every barrel as Standard. It achieves high accuracy because 92% of barrels ARE Standard -- predicting "Standard" for everything is right 92% of the time. But look at the actual Reserve barrels: of the Reserve barrels in the test set, the model catches almost none.
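A sketch of this check, on the same kind of synthetic stand-in data as the baseline (the real barrel features are an assumption; labels use 1 = Reserve, 0 = Standard).

```python
# Confusion matrix for the test-set predictions, on synthetic stand-in data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
n = 1000
y = (rng.random(n) < 0.08).astype(int)          # ~8% Reserve (1), rest Standard (0)
X = rng.normal(size=(n, 4)) + y[:, None] * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn convention: rows are actual classes, columns are predicted.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

The four counts sum to the size of the test set; on imbalanced data like this, the true-negative cell dwarfs the true-positive cell.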

Step 3: Understand why

The class imbalance -- only 8% of barrels are Reserve -- means a model that always predicts "Standard" gets 92% accuracy. The logistic regression learned that "Standard" is the safe bet. It achieves a high score by ignoring the minority class entirely.
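You can see this with arithmetic alone. A tiny sketch, assuming a 100-barrel test set with the 92/8 split from the text:

```python
import numpy as np

y_test = np.array([0] * 92 + [1] * 8)  # 92 Standard (0), 8 Reserve (1)
y_pred = np.zeros_like(y_test)         # "model" that always predicts Standard

acc = (y_pred == y_test).mean()
print(acc)  # 0.92 -- high accuracy while catching zero Reserve barrels
```

No model was needed at all; the class proportions alone set the score.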

This is the accuracy trap: on imbalanced data, accuracy measures how well the model predicts the majority class, not how well it does its job. Luciana does not need a model that correctly identifies barrels she already knows are standard. She needs a model that catches the barrels that deserve Reserve.

AI defaults to reporting accuracy as the evaluation metric. On balanced data, this works. On imbalanced data, it hides exactly the information that matters.

Step 4: Precision and recall

Accuracy compressed the confusion matrix into a single misleading number. Precision and recall separate the two questions that matter:

Precision answers: of the barrels the model flagged as Reserve, how many actually were? If it flagged 30 and 25 were truly Reserve, precision is 83%. The other 5 are false alarms -- standard barrels Luciana would taste unnecessarily.

Recall answers: of the barrels that actually were Reserve, how many did the model catch? If there were 30 Reserve barrels and the model found 25, recall is 83%. The other 5 are missed -- Reserve barrels that slipped through without extra attention.

Direct AI to compute precision and recall for the Reserve class specifically. The numbers will tell a very different story from the accuracy score.
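A sketch of that computation, reconstructing the worked example above (30 barrels flagged, 25 truly Reserve, 30 actual Reserve in total; the surrounding true-negative count is an arbitrary filler, and labels use 1 = Reserve):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# tp=25 (flagged and truly Reserve), fp=5 (false alarms),
# fn=5 (missed Reserve), tn=65 (filler: correctly ignored Standard barrels)
y_true = np.array([1] * 25 + [0] * 5 + [1] * 5 + [0] * 65)
y_pred = np.array([1] * 25 + [1] * 5 + [0] * 5 + [0] * 65)

p = precision_score(y_true, y_pred, pos_label=1)  # 25 / (25 + 5)
r = recall_score(y_true, y_pred, pos_label=1)     # 25 / (25 + 5)
print(f"precision={p:.0%}  recall={r:.0%}")       # 83% each
```

The `pos_label` argument is what makes these scores about the Reserve class specifically rather than about Standard.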

Step 5: Connect to Luciana's priorities

Luciana said: "I'd rather taste ten extra barrels than miss one great one."

That is a recall priority. She wants to catch all the Reserve barrels even if it means some false alarms. Missing a Reserve barrel costs the price premium. Tasting an extra standard barrel costs time. She has told you which error she can live with.

This is not a technical optimization. The precision-recall trade-off is a business decision, and it belongs to the client. AI optimizes for F1 by default, which balances precision and recall equally. But Luciana's costs are not equal.

Step 6: Introduce the F1 score and its limitation

Direct AI to compute the F1 score. F1 is the harmonic mean of precision and recall -- it balances the two equally. If precision is 0.80 and recall is 0.60, F1 is 0.69.

F1 is useful when both types of errors cost the same. For Luciana, they do not. A metric that splits the difference between catching Reserve barrels and avoiding false alarms is a compromise she did not ask for. F1 is a starting point for comparison, not the final word on which model serves the client.
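The harmonic mean in the example above works out as a one-liner:

```python
precision, recall = 0.80, 0.60

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.69
```

Note the harmonic mean punishes whichever of the two is lower -- the arithmetic mean of 0.80 and 0.60 would be 0.70, and a model with recall near zero drags F1 near zero no matter how high precision is.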

Step 7: Update the methodology memo

Update materials/methodology-memo-template.md with the evaluation strategy: which metrics you used (precision, recall, confusion matrix -- not accuracy alone) and why. Note Luciana's recall priority and what that means for evaluation.

✓ Check

Confusion matrix interpreted. Accuracy explained as misleading. Precision and recall computed. Luciana's priorities mapped to metrics.