Wine Quality Classification for Bodega Moretti

Client

Luciana Moretti, Owner and Winemaker at Bodega Moretti -- a small family winery in Lujan de Cuyo, Mendoza, Argentina. She produces Malbec and Cabernet Sauvignon from three vineyard plots at different altitudes, about 15,000 bottles per year in total.

What you are building

A classification model that predicts which wine barrels are likely to score high enough (90+) for Reserve designation based on production data (fermentation temperature, altitude, rainfall, soil analysis, barrel aging). The model helps Luciana focus her tasting time on borderline cases instead of tasting through barrels that are obviously standard.

Tech stack

Python 3.11+ (conda "ds" environment)
Jupyter Notebook
pandas
scikit-learn (LogisticRegression, DecisionTreeClassifier, confusion_matrix, classification_report, roc_auc_score, roc_curve)
matplotlib / seaborn
scipy
Git / GitHub

File structure

p4/
  materials/
    CLAUDE.md (this file)
    voicemail-transcript.md
    barrel-data.csv
    data-dictionary.md
    methodology-memo-template.md
    scripts/
      generate_barrel_data.py
    images/
  platform-config/
    living-client.md
    senior-colleague.md
    inline-checks.md
    understanding-goals.md

Key materials

materials/barrel-data.csv -- 5 years of barrel production data (~3,000 samples)
materials/data-dictionary.md -- column definitions and notes
materials/voicemail-transcript.md -- Luciana's initial voicemail
materials/methodology-memo-template.md -- template for documenting analytical approach

Tickets

T1: Profile data and identify class distribution (~8% Reserve)
T2: Clean data, encode features, detect and remove proxy features
T3: Build baseline classifier and expose the accuracy trap
T4: Compare models (logistic regression vs decision tree) with correct metrics, tune threshold
T5: Cross-check methodology, translate findings to barrel/winery language
T6: Deliver findings to Luciana, handle scope extension, decision record, push to GitHub
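
To illustrate the accuracy trap named in T3, here is a minimal sketch on synthetic labels, not the real barrel data. It assumes roughly 8% Reserve and uses scikit-learn's DummyClassifier as a stand-in majority-class baseline; all variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic stand-in for the Reserve label (assumption: not the real barrel data).
rng = np.random.default_rng(42)
n_barrels = 3000
y = (rng.random(n_barrels) < 0.08).astype(int)  # 1 = Reserve, roughly 8% of barrels
X = np.zeros((n_barrels, 1))                    # features are ignored by this baseline

# Baseline that always predicts the majority class ("standard").
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred, zero_division=0)
cm = confusion_matrix(y, y_pred)  # rows = actual, columns = predicted

print(f"Reserve share: {y.mean():.3f}")
print(f"Accuracy:      {accuracy:.3f}")
print(f"Recall:        {recall:.3f}")
print(cm)
```

On labels like these the baseline's accuracy equals the share of standard barrels, while its recall on Reserve barrels is zero. That gap is why T4 compares models on recall, precision, and ROC/AUC instead.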

Verification guidance

Class distribution should be approximately 8% Reserve (panel_score >= 90)
The export_status column is a proxy feature -- it is determined BY the panel score, not predictive of it. It must be removed before model fitting.
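Evaluation must use precision, recall, confusion matrix, and ROC/AUC -- NOT accuracy alone. With about 8% Reserve, a model that labels every barrel standard is about 92% accurate while finding no Reserve barrels, so accuracy by itself is misleading.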
The classification threshold should be tuned to favor recall (catching Reserve barrels) over precision, matching Luciana's stated priority.
Findings must be translated into winery language (barrels, not statistical terms).
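
The checks above could be sketched roughly as follows. This is an illustration on synthetic data, not the real barrel-data.csv. Only panel_score and export_status are named in this brief; the feature columns, the export_status values, and the 0.90 recall target are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the barrel data (assumption). Feature names are hypothetical.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "ferment_temp_c": rng.normal(26, 2, n),
    "altitude_m": rng.choice([950, 1050, 1150], n),
    "aging_months": rng.integers(6, 19, n),
})
# Made-up score formula, chosen so that only a small minority of barrels reach 90+.
df["panel_score"] = (
    80
    + 0.015 * (df["altitude_m"] - 950)
    + 0.5 * (df["aging_months"] - 6)
    - 0.4 * (df["ferment_temp_c"] - 26)
    + rng.normal(0, 3, n)
)
# Proxy column: assigned AFTER scoring, so it leaks the label (values are hypothetical).
df["export_status"] = np.where(df["panel_score"] >= 90, "export", "domestic")
df["reserve"] = (df["panel_score"] >= 90).astype(int)

share = df["reserve"].mean()
print(f"Reserve share: {share:.3f}")
# A proxy shows up as a near-perfect split against the label.
print(pd.crosstab(df["export_status"], df["reserve"]))

# Drop the score, the proxy, and the label itself before fitting.
X = df.drop(columns=["panel_score", "export_status", "reserve"])
y = df["reserve"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

# Choose the highest threshold that still reaches the target recall.
target_recall = 0.90  # hypothetical; the real target comes from Luciana's priority
precision, recall, thresholds = precision_recall_curve(y_test, proba)
meets_target = recall[:-1] >= target_recall  # last curve point has no threshold
threshold = thresholds[meets_target].max()

y_pred = (proba >= threshold).astype(int)
cm = confusion_matrix(y_test, y_pred)  # rows = actual, columns = predicted
print(f"AUC: {auc:.3f}  threshold: {threshold:.3f}")
print(cm)
```

Lowering the threshold this way trades precision for recall: more standard barrels get flagged for tasting, but fewer Reserve barrels are missed. The confusion matrix counts can be read directly as barrels, which helps with the translation into winery language.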

Commit convention

Commit after completing each ticket. Use descriptive commit messages that explain what was done and why.