The Brief
Eunji Cho runs merchandising analytics for Glow Republic, a mid-size Korean beauty retailer in Seoul. Twelve physical stores, an e-commerce platform, about 2,000 SKUs from 150 K-beauty brands.
K-beauty trends move fast. A product can go from unknown to viral in two weeks because of one TikTok video. Glow Republic is either overstocked on yesterday's trend or out of stock on today's. Last quarter: 180 million won in expired inventory write-offs and 250 million won in missed sales from stockouts.
Eunji has two years of daily sales data, social media mention counts from Instagram and TikTok, influencer tag data, and seasonal calendars. She wants to predict demand by SKU at least a week ahead so the buying team can order the right quantities. And she wants to know which products are starting to trend before they peak.
Your Role
You build a demand forecasting model from sales and social media data. But the model is not the hard part. The hard part is preparing the data correctly -- because the social media data contains a trap, and how you handle it determines whether the model produces honest predictions or impressive-looking numbers that fall apart in production.
The analysis specification is less detailed this time. You decide what preparation steps are needed, what features to engineer, and how to split the data. The methodology memo template is still here. Cross-model review is still available. What is gone is the step-by-step preparation guidance. The preparation design is yours.
What's New
Last time, you built a classification model on imbalanced data, discovered that accuracy is meaningless when one class dominates, and learned to evaluate with precision, recall, and ROC curves. You caught a proxy feature and tuned a threshold to match the client's priorities.
This time, the data has a different kind of problem. The preparation decisions -- which features to include, how to align them in time, how to split the data -- are where the analysis lives or dies. A model trained on data that leaks future information will look excellent in validation and predict nothing in production. This is not about building a better model. It is about preparing data that makes any model trustworthy.
The hard part is recognizing what is wrong with the data before you see it in the results.
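One concrete form the trap takes: splitting time-series data randomly scatters future days into the training set, so the model effectively sees tomorrow while learning to predict today. A minimal sketch of the alternative, a temporal holdout, using hypothetical data (the column names and date range are illustrative, not from the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: 24 months of daily sales for one SKU.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=730, freq="D")
df = pd.DataFrame({"date": dates, "units_sold": rng.poisson(20, size=730)})

# Temporal split: hold out the final four weeks instead of random rows,
# so every training example predates every evaluation example.
cutoff = df["date"].max() - pd.Timedelta(days=28)
train = df[df["date"] <= cutoff]
test = df[df["date"] > cutoff]

assert train["date"].max() < test["date"].min()  # no future rows leak into training
```

The same principle applies to cross-validation: a time-series-aware splitter (e.g. scikit-learn's `TimeSeriesSplit`) is the analogue of this holdout, while `train_test_split` with shuffling is the trap.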
Tools
- Python 3.11+ via your conda "ds" environment
- Jupyter Notebook for the analysis
- pandas for data handling
- scikit-learn for regression models, feature importance, and MAE/RMSE evaluation
- scipy for statistical checks
- matplotlib / seaborn for visualization
- Claude Code as the AI you direct
- Git / GitHub for version control
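Since the tools list names MAE and RMSE as the evaluation metrics, here is a minimal sketch of computing both with scikit-learn (the forecast numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical forecasts vs. actual daily unit sales for one SKU.
y_true = np.array([12, 15, 9, 20, 18])
y_pred = np.array([10, 14, 11, 25, 16])

mae = mean_absolute_error(y_true, y_pred)           # average absolute miss, in units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses more heavily
print(f"MAE: {mae:.2f} units, RMSE: {rmse:.2f} units")
```

Both metrics are in the same units as the target (units sold), which makes them easy to explain to the buying team; RMSE rising faster than MAE is a signal that a few SKUs are being missed badly.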
Materials
You receive:
- Two datasets: daily sales data (~145,000 rows, 200 SKUs over 24 months) and daily social media mention counts
- A data dictionary explaining both datasets
- A methodology memo template with a new "Preparation Decisions" section for documenting feature engineering, temporal splitting, leakage assessment, and data quality decisions
- A project governance file (CLAUDE.md) for Claude Code
- Eunji's Slack message explaining what she needs
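The two datasets above would typically be joined on date and SKU, with the social signal lagged so that a forecast for day t uses only mention counts available before the order is placed. A sketch, assuming hypothetical column names (`date`, `sku_id`, `units_sold`, `mentions`) that the real data dictionary may name differently:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-08", "2024-03-09", "2024-03-10"]),
    "sku_id": ["SKU001"] * 3,
    "units_sold": [14, 22, 31],
})
social = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
    "sku_id": ["SKU001"] * 3,
    "mentions": [120, 480, 950],
})

# Shift the social signal forward 7 days: the row for day t now carries
# the mention count from day t-7, i.e. information that already existed
# a week before the sale being predicted.
social_lag7 = (
    social.assign(date=social["date"] + pd.Timedelta(days=7))
          .rename(columns={"mentions": "mentions_lag7"})
)

features = sales.merge(social_lag7, on=["date", "sku_id"], how="left")
```

Joining same-day mentions instead would quietly feed the model the viral spike it is supposed to anticipate, which is exactly the kind of leakage the memo's "Preparation Decisions" section exists to document.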