Step 1: Read the methodology memo template
Open materials/methodology-memo-template.md. This is the same kind of document you used in the last project -- a living record of your analytical decisions. Fill in the Data Source section now: the dataset, its size, the class distribution you found in Unit 1.
You will update the remaining sections as you work through the project.
Step 2: Handle missing values
Direct AI to check for null values across all columns. The dataset has realistic missing patterns -- some null fermentation temperatures from sensor failures, some missing soil analysis values.
Decide how to handle each: impute with the column median, drop the rows, or flag them. The cleaning decisions here are familiar from your previous projects. Document your choices in the methodology memo.
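A minimal sketch of this step, using illustrative column names (`fermentation_temp`, `soil_ph`) and made-up values rather than the actual dataset -- it surveys nulls, imputes with the median, and keeps a flag column so the decision stays auditable:

```python
import pandas as pd
import numpy as np

# Hypothetical sample mirroring the dataset's missing patterns
# (column names and values are illustrative, not the real schema)
df = pd.DataFrame({
    "fermentation_temp": [26.1, np.nan, 27.4, 25.8, np.nan],
    "soil_ph": [6.8, 6.5, np.nan, 6.9, 6.7],
})

# 1. Survey nulls per column before deciding anything
print(df.isna().sum())

# 2. Flag the sensor-failure gaps, then impute with the column median
df["fermentation_temp_imputed"] = df["fermentation_temp"].isna()
df["fermentation_temp"] = df["fermentation_temp"].fillna(
    df["fermentation_temp"].median()
)
```

Flag-then-impute preserves the information that a value was missing, which is worth recording in the memo alongside the imputation choice itself.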
Step 3: Examine features for proxy variables
This step is new. Direct AI to list all columns and, for each one, answer this question: would this information be available BEFORE Luciana tastes the barrel?
Production inputs like fermentation temperature, altitude, and rainfall are available before tasting. They describe what happened during growing and aging. These are legitimate predictors.
But some columns describe what happened AFTER tasting. If a column's value is determined by the quality score, including it as a feature hands the model the answer instead of making it predict. AI will happily include these columns because they improve metrics -- and they improve metrics precisely because they contain the answer.
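One way to make this audit concrete is to write it down as data. The sketch below uses invented column names (including a hypothetical post-tasting column, `reserve_price_tier`) purely to show the partition:

```python
# Illustrative audit: tag each column with whether its value exists
# BEFORE tasting. Column names here are assumptions, not the real schema.
feature_timing = {
    "fermentation_temp": "before",   # production input
    "altitude": "before",            # vineyard property
    "rainfall": "before",            # growing-season weather
    "reserve_price_tier": "after",   # hypothetical post-tasting column
}

legit = [c for c, when in feature_timing.items() if when == "before"]
leaky = [c for c, when in feature_timing.items() if when == "after"]
print("keep:", legit)
print("drop:", leaky)
```

Forcing an explicit before/after label for every column makes it harder for a leaky feature to slip through on the strength of its metrics.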
Step 4: Catch the proxy feature
Look at the data dictionary again. One column in this dataset is a proxy -- its value is determined by the panel score, not predictive of it. If you include it, the model achieves artificially high performance because it is reading the outcome, not predicting it.
Direct AI to check whether any column's values can be derived directly from panel_score. When you find it, remove it. Document in the methodology memo what the column was, why it was a proxy, and why you removed it even though it would have improved the model.
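One simple derivability check, sketched on synthetic data: if every `panel_score` value maps to exactly one value of a column, that column is a deterministic function of the outcome. The `grade_band` column below is a hypothetical stand-in for the proxy, not the dataset's actual column:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(75, 100, size=200)
df = pd.DataFrame({
    "panel_score": scores,
    "rainfall": rng.normal(600, 50, size=200),       # legitimate input
    "grade_band": np.where(scores >= 90, "A", "B"),  # hypothetical proxy
})

# A column is suspect if each panel_score maps to exactly one of its values
suspects = [
    col for col in df.columns.drop("panel_score")
    if (df.groupby("panel_score")[col].nunique() == 1).all()
]
print("suspect proxy columns:", suspects)
```

This catches exact functional dependence; a near-perfect correlation check would additionally flag noisy proxies. Either way, the finding and the removal both belong in the methodology memo.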
Step 5: Encode categorical features
The dataset includes categorical columns: vineyard plot (Alto, Medio, Bajo), grape variety (Malbec, Cabernet Sauvignon), and oak type (French, American).
Direct AI to encode these for modeling. For vineyard plot, consider that the three categories have a natural altitude ordering -- Alto is highest, Bajo is lowest. Ordinal encoding preserves that ordering. One-hot encoding does not. For grape variety and oak type (no natural order), one-hot encoding is appropriate.
Document the encoding choices and the reasoning.
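A sketch of both encodings on a toy frame (the real dataset's column names may differ): ordinal for the altitude-ordered plot, one-hot for the unordered categoricals.

```python
import pandas as pd

# Toy frame with the three categorical columns from this step
# (column names are illustrative)
df = pd.DataFrame({
    "vineyard_plot": ["Alto", "Medio", "Bajo", "Alto"],
    "grape_variety": ["Malbec", "Cabernet Sauvignon", "Malbec", "Malbec"],
    "oak_type": ["French", "American", "French", "American"],
})

# Ordinal encoding preserves the altitude ordering Bajo < Medio < Alto
plot_order = {"Bajo": 0, "Medio": 1, "Alto": 2}
df["vineyard_plot_enc"] = df["vineyard_plot"].map(plot_order)

# One-hot encoding for categories with no natural order
df = pd.get_dummies(df, columns=["grape_variety", "oak_type"])
print(df.head())
```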
Step 6: Create the target variable
The panel scores are continuous (1-100). Classification needs a binary target: Reserve (90+) or Standard (below 90). Direct AI to create a binary column from panel_score and verify the distribution matches what you found during profiling -- approximately 8% Reserve.
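A sketch of the target creation and the distribution check. The scores here are synthetic, tuned so that roughly the expected share lands at 90+; the real verification compares against your Unit 1 profiling:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
# Synthetic panel scores -- the real ones come from the dataset
panel_score = np.clip(rng.normal(83, 4.5, size=1000).round(), 1, 100)
df = pd.DataFrame({"panel_score": panel_score})

# Binary target: Reserve (90+) vs Standard (below 90)
df["is_reserve"] = (df["panel_score"] >= 90).astype(int)

# Verify the class balance against the profiling result (~8% Reserve)
print("Reserve share:", df["is_reserve"].mean())
```

If the share comes out far from what profiling found, the threshold or the score column is wrong; stop and check before modeling.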
Check: Proxy feature removed. Encoding documented. Target ~8% Reserve. No downstream features remain.