Step 1: Build the honest model
Direct AI to fit a regression model on the prepared data for the top 200 SKUs. Use the engineered features -- the lagged social media mentions, the calendar features, the historical sales lags -- with the temporal train/test split from Unit 2.
A random forest or gradient boosting regressor works well for this kind of data. The model choice matters less than the preparation. Both models will produce reasonable predictions on properly prepared data. Both will produce misleading predictions on leaked data.
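A minimal sketch of what the AI-generated fitting code might look like. The column names (`mentions_lag1`, `sales_lag7`, and so on) and the synthetic data are illustrative stand-ins for the prepared table from Unit 2, not the real schema:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the prepared SKU-day table (illustrative columns).
n_days = 365
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=n_days),
    "mentions_lag1": rng.poisson(20, n_days).astype(float),
    "mentions_lag7": rng.poisson(20, n_days).astype(float),
    "sales_lag1": rng.normal(200, 30, n_days),
    "sales_lag7": rng.normal(200, 30, n_days),
})
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek
df["demand"] = (
    0.8 * df["sales_lag7"] + 2.0 * df["mentions_lag1"]
    + 10 * np.sin(2 * np.pi * df["month"] / 12)
    + rng.normal(0, 15, n_days)
)

# Temporal split: train on the past, test on the future. Never shuffle.
features = ["mentions_lag1", "mentions_lag7", "sales_lag1", "sales_lag7",
            "month", "dayofweek"]
cutoff = df["date"].iloc[int(n_days * 0.8)]
train, test = df[df["date"] <= cutoff], df[df["date"] > cutoff]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[features], train["demand"])
preds = model.predict(test[features])
```

The key line is the split: rows are divided by date, not at random, so the model is always evaluated on days it has never seen.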
Step 2: Evaluate on the temporal test set
Direct AI to compute MAE (mean absolute error) and RMSE (root mean squared error) on the temporal test set. These are the same metrics you used for regression in P2, but now applied to demand quantities instead of scores.
Read the numbers. They will not be as low as you might expect if you are used to classification metrics like accuracy. Demand forecasting is harder than classification -- you are predicting a quantity, not a category. An MAE of 40-50 units means the forecast is typically off by 40-50 units per SKU per day. Whether that is good depends on the context: for a SKU selling 200 units a day, it is reasonable. For a SKU selling 10 units a day, it is not.
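A short sketch of the metric computation, with hypothetical actuals and forecasts for one SKU's test window. Dividing MAE by mean demand gives the context mentioned above in one number:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted daily demand for one SKU.
y_true = np.array([210, 195, 230, 180, 205, 220])
y_pred = np.array([200, 210, 200, 195, 215, 200])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Put MAE in context: error as a share of average daily demand.
relative_error = mae / y_true.mean()
print(f"MAE: {mae:.1f} units, RMSE: {rmse:.1f} units, "
      f"{relative_error:.0%} of mean demand")
```

The same absolute MAE reads very differently at 8% of mean demand versus 80%, which is why the relative figure belongs next to the raw one.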
Step 3: Compare performance by product type
Evaluate the model separately for seasonal products and trend-driven products.
Direct AI to compute MAE for each group. Seasonal products should forecast significantly better -- their patterns are regular and learnable from history. Trend-driven products will forecast worse -- viral events are inherently unpredictable from historical data alone.
This split is not a failure of the model. It is an honest assessment of what the data can and cannot predict. Seasonal sunscreen demand follows a pattern the model can learn. A serum that goes viral because of one TikTok video does not.
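The per-group evaluation is a one-line groupby once predictions carry a product-type label. This sketch uses made-up numbers chosen to show the expected pattern -- small, regular errors for seasonal SKUs, large ones for trend-driven SKUs:

```python
import pandas as pd

# Hypothetical test-set rows labeled by product type (illustrative values).
results = pd.DataFrame({
    "product_type": ["seasonal"] * 4 + ["trend"] * 4,
    "actual":   [100, 120, 110, 105, 40, 300, 25, 180],
    "forecast": [ 95, 125, 108, 110, 90,  60, 70,  50],
})

# MAE per product type: mean absolute error within each group.
mae_by_type = (
    (results["actual"] - results["forecast"]).abs()
    .groupby(results["product_type"])
    .mean()
)
print(mae_by_type)
```

Seasonal errors cluster near the forecast; trend errors blow up whenever a viral spike was not in the history -- exactly the asymmetry the step is meant to surface.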
Step 4: Check feature importances
Direct AI to show the feature importances from the model. For seasonal products, calendar features and historical sales should dominate. For trend-driven products, lagged social media mentions should carry more weight.
If same-day mention features appear among the important features, something went wrong in the preparation. Go back and verify those columns were removed from the feature set.
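The importance check can be automated. This self-contained sketch fits a small model on synthetic data (column names are assumptions) and then asserts that no same-day mention column slipped into the feature set:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Illustrative feature matrix; names are assumptions, not the real schema.
features = ["mentions_lag1", "mentions_lag7", "sales_lag7", "month", "dayofweek"]
X = pd.DataFrame(rng.normal(size=(300, len(features))), columns=features)
y = 3 * X["sales_lag7"] + X["mentions_lag1"] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
print(importances)

# Guard against leakage: same-day mention columns must not be features.
leaky = [c for c in X.columns if "mentions" in c and "lag" not in c]
assert not leaky, f"Same-day mention features present: {leaky}"
```

Encoding the check as an assertion means the pipeline fails loudly if a later data refresh reintroduces the leaky columns, instead of silently producing inflated metrics.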
Step 5: Build the cheating comparison
Now build a model deliberately designed to leak. Include the original same-day social media mentions as features. Use a random train/test split instead of the temporal split. Fit the same model type.
Evaluate on the randomly split test set. The metrics will be dramatically better than the honest model -- potentially two to three times better on MAE. This model looks excellent on paper.
But it is cheating. The same-day mentions give the model the answer. The random split lets it memorize temporal patterns. In production, when Eunji's buying team tries to use this model to order inventory next week, it will fall apart because next week's social media mentions and next week's sales do not exist yet.
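The honest-versus-cheating gap can be reproduced in miniature. In this sketch (synthetic data, illustrative magnitudes), demand is driven mostly by same-day mentions; the honest model only sees yesterday's mentions with a temporal split, while the cheating model sees the same-day column and gets a random split:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
mentions = rng.poisson(30, n).astype(float)
df = pd.DataFrame({
    "mentions": mentions,                   # same-day: not available at prediction time
    "mentions_lag1": np.roll(mentions, 1),  # yesterday's: honestly available
    "demand": 5 * mentions + rng.normal(0, 5, n),
}).iloc[1:]  # drop the first row, whose lag wrapped around

# Honest model: lagged feature only, temporal split (past -> future).
split = int(len(df) * 0.8)
honest = RandomForestRegressor(n_estimators=100, random_state=0)
honest.fit(df[["mentions_lag1"]][:split], df["demand"][:split])
honest_mae = mean_absolute_error(
    df["demand"][split:], honest.predict(df[["mentions_lag1"]][split:])
)

# Cheating model: same-day feature included, random split.
Xtr, Xte, ytr, yte = train_test_split(
    df[["mentions", "mentions_lag1"]], df["demand"], random_state=0
)
cheat = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
cheat_mae = mean_absolute_error(yte, cheat.predict(Xte))

print(f"honest MAE: {honest_mae:.1f}   cheating MAE: {cheat_mae:.1f}")
```

The cheating model's MAE is far lower, for exactly the wrong reason: it was handed the answer. The numbers here are synthetic, but the mechanism is the same one at work in the real comparison.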
Step 6: Document the gap
The difference between the honest model and the cheating model is the leakage penalty. Document this in the methodology memo.
The honest model is worse on paper, but it is the only model that tells the truth about what you can actually predict. The cheating model's superior numbers are an artifact of using information that would not be available at prediction time.
This is the core lesson of preparation as methodology: the preparation decisions -- which features to include, how to time them, how to split the data -- determine whether the model is trustworthy. The model choice matters less. A simple model on honest data outperforms a sophisticated model on leaked data in production.
Update the methodology memo's "Leakage Assessment" section with this comparison. Note the metric gap, what caused it, and why the honest model is the one to deploy.
Check: MAE/RMSE on temporal test. Performance split by product type. No same-day features. Leakage gap demonstrated.