Wine Quality Classification for Bodega Moretti
Client
Luciana Moretti, Owner and Winemaker at Bodega Moretti -- a small family winery in Lujan de Cuyo, Mendoza, Argentina. The winery produces Malbec and Cabernet Sauvignon from three vineyard plots at different altitudes, about 15,000 bottles per year.
What you are building
A classification model that predicts which wine barrels are likely to score high enough (90+) for Reserve designation based on production data (fermentation temperature, altitude, rainfall, soil analysis, barrel aging). The model helps Luciana focus her tasting time on borderline cases instead of tasting through barrels that are obviously standard.
Tech stack
- Python 3.11+ (conda "ds" environment)
- Jupyter Notebook
- pandas
- scikit-learn (LogisticRegression, DecisionTreeClassifier, confusion_matrix, classification_report, roc_auc_score, roc_curve)
- matplotlib / seaborn
- scipy
- Git / GitHub
File structure
p4/
materials/
CLAUDE.md (this file)
voicemail-transcript.md
barrel-data.csv
data-dictionary.md
methodology-memo-template.md
scripts/
generate_barrel_data.py
images/
platform-config/
living-client.md
senior-colleague.md
inline-checks.md
understanding-goals.md
Key materials
- materials/barrel-data.csv -- 5 years of barrel production data (~3,000 samples)
- materials/data-dictionary.md -- column definitions and notes
- materials/voicemail-transcript.md -- Luciana's initial voicemail
- materials/methodology-memo-template.md -- template for documenting analytical approach
Tickets
- T1: Profile data and identify class distribution (~8% Reserve)
- T2: Clean data, encode features, detect and remove proxy features
- T3: Build baseline classifier and expose the accuracy trap
- T4: Compare models (logistic regression vs decision tree) with correct metrics, tune threshold
- T5: Cross-check methodology, translate findings to barrel/winery language
- T6: Deliver findings to Luciana, handle scope extension, decision record, push to GitHub
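As a rough illustration of what T1 through T3 are meant to surface, the sketch below runs on synthetic data, not on barrel-data.csv. Only `panel_score` and `export_status` are names taken from this brief. The `altitude_m` column, the "export"/"domestic" labels, and every number are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for barrel-data.csv; every value here is made up.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({"panel_score": rng.normal(85.5, 3.2, n).round(1)})
# Assumption: export_status is derived from the score; the labels are hypothetical.
df["export_status"] = np.where(df["panel_score"] >= 90, "export", "domestic")
df["altitude_m"] = rng.uniform(900, 1200, n)  # hypothetical feature column

# T1: class distribution of the Reserve target (panel_score >= 90).
y = (df["panel_score"] >= 90).astype(int)
reserve_rate = y.mean()

# T2: proxy check. A feature whose categories each contain only one class
# splits the target perfectly and is a leakage suspect.
purity = pd.crosstab(df["export_status"], y, normalize="index").max(axis=1)
is_proxy = bool((purity == 1.0).all())

# T3: accuracy trap. Always predicting the majority class looks accurate
# but finds no Reserve barrels at all.
X = df[["altitude_m"]]
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
accuracy = accuracy_score(y, pred)
recall = recall_score(y, pred)

print(f"Reserve rate: {reserve_rate:.1%}")
print(f"export_status looks like a proxy: {is_proxy}")
print(f"Baseline accuracy: {accuracy:.1%}, Reserve recall: {recall:.1%}")
```

On the real data the proxy check may be less clean than in this toy case, so treat a near-perfect split as a prompt to read data-dictionary.md rather than as proof.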
Verification guidance
- Class distribution should be approximately 8% Reserve (panel_score >= 90)
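- The export_status column is a proxy feature -- it is determined BY the panel score rather than predictive of it, so leaving it in leaks the target into the features. It must be removed before model fitting.
- Evaluation must use precision, recall, confusion matrix, and ROC/AUC -- NOT accuracy alone. With roughly 8% Reserve barrels, a model that labels every barrel as standard is still about 92% accurate while catching no Reserve barrels, so accuracy alone says little.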
- Threshold should be tuned to favor recall (catching Reserve barrels) over precision, matching Luciana's stated priority.
- Findings must be translated into winery language (barrels, not statistical terms).
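One way the checks above could fit together is sketched below, again on synthetic data. The feature columns (`altitude_m`, `barrel_months`), the invented score formula, the "export"/"domestic" labels, and the 90% recall target are all assumptions for illustration. Luciana's actual recall priority should come from the voicemail and follow-up conversations, not from this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for barrel-data.csv; columns and relationships are invented.
rng = np.random.default_rng(42)
n = 3000
df = pd.DataFrame({
    "altitude_m": rng.uniform(900, 1200, n),   # hypothetical column
    "barrel_months": rng.integers(6, 25, n),   # hypothetical column
})
df["panel_score"] = (
    85.6
    + 0.02 * (df["altitude_m"] - 1050)
    + 0.3 * (df["barrel_months"] - 15)
    + rng.normal(0, 2, n)
)
df["export_status"] = np.where(df["panel_score"] >= 90, "export", "domestic")

# Target is Reserve (panel_score >= 90); drop the score and its proxy from the features.
y = (df["panel_score"] >= 90).astype(int)
X = df.drop(columns=["panel_score", "export_status"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Choose the threshold from out-of-fold predictions on the training split.
oof_proba = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)
target_recall = 0.90  # hypothetical priority, not a figure from the client
reaches_target = recall[:-1] >= target_recall  # recall[:-1] lines up with thresholds
threshold = thresholds[reaches_target].max()   # highest cut-off that still reaches the target

# Fit on the training split and evaluate once on the held-out barrels.
model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]
test_pred = (test_proba >= threshold).astype(int)
cm = confusion_matrix(y_test, test_pred)
auc = roc_auc_score(y_test, test_proba)

print(f"Threshold: {threshold:.3f}, ROC AUC: {auc:.3f}")
print(f"Reserve barrels caught: {cm[1, 1]} of {cm[1].sum()}")
print(f"Standard barrels flagged for tasting: {cm[0, 1]} of {cm[0].sum()}")
```

Picking the threshold from out-of-fold predictions keeps the test split untouched until the final check. The last two print lines are closer to the winery language asked for above than a raw precision or recall figure.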
Commit convention
Commit after completing each ticket. Use descriptive commit messages that explain what was done and why.