Learn by Directing AI
Unit 2

The leakage trap

Step 1: Jamie's warning

Before you start building, a message comes in from Jamie Park, a senior data scientist on the team:

hey -- before you build anything, check whether the social media features are lagged or same-day. same-day is leakage.

One line, no explanation. But it points at something specific: the social media data, meaning the Instagram and TikTok mention counts you profiled in the last unit. Are those counts from the same day as the sales, or from an earlier period?

Step 2: Investigate the timing

Go back to the data dictionary. It says the social media counts are same-day: the count for Tuesday includes mentions that occurred on Tuesday.

Now think about what that means for prediction. If you use Tuesday's mention count to predict Tuesday's sales, what are you actually doing?

Both numbers -- Tuesday's mentions and Tuesday's sales -- spike because of the same underlying event. A viral TikTok video goes live on Tuesday morning. By Tuesday afternoon, mentions are up and sales are up. One did not cause the other. They both happened because of the same thing.

Using same-day mentions to predict same-day sales is like using the answer to predict the answer. The model will look excellent. The predictions will be worthless.
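To see why the model "looks excellent," here is a minimal simulation. Everything in it is made up for illustration (the event rate, the spike sizes, the noise levels); it is not the course data. A hidden viral event lifts both mentions and sales on the same day, and neither series influences the other.

```python
import numpy as np

# Synthetic illustration only: all numbers here are assumptions.
rng = np.random.default_rng(seed=0)
n_days = 365

# Hidden common cause: roughly one day in ten has a viral event.
viral = rng.random(n_days) < 0.1

# Mentions and sales each respond to the event, never to each other.
mentions = 100 + 400 * viral + rng.normal(0, 20, n_days)
sales = 50 + 80 * viral + rng.normal(0, 10, n_days)

same_day_corr = np.corrcoef(mentions, sales)[0, 1]
print(f"same-day correlation: {same_day_corr:.2f}")
```

The correlation comes out high even though the simulation contains no causal link from mentions to sales. A model given same-day mentions would exploit exactly this pattern, but at prediction time that column would not exist yet.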

Step 3: Understand what leakage means

Data leakage happens when your model has access to information it would not have at the time of prediction. If Eunji needs to predict next week's demand, she cannot use next week's social media counts -- they do not exist yet.

Same-day social media data is a form of leakage because at the time of prediction (before sales happen), the same-day mentions have not occurred yet. The only legitimate social media features are lagged -- yesterday's mentions, last week's mentions, the rolling average from the past seven days. These happened before the prediction window.
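The availability rule can be written down as a small sketch. The helper name usable_mention_dates and the seven-day window are assumptions for illustration: for a given target day, only mention counts from strictly earlier days are candidates.

```python
from datetime import date, timedelta

def usable_mention_dates(target_day, window_days=7):
    """Hypothetical helper: days whose mention counts already exist
    before target_day begins (most recent first)."""
    return [target_day - timedelta(days=k) for k in range(1, window_days + 1)]

target = date(2025, 7, 8)  # example target day, chosen arbitrarily
usable = usable_mention_dates(target)

# The target day itself is never in the list: that would be same-day leakage.
assert target not in usable
assert max(usable) < target
print(usable[0], "to", usable[-1])
```

If the prediction is made further ahead of the target day, the usable window moves back by the same amount.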

This is different from the proxy features you found in P4. There, a variable was causally downstream of the target (barrel quality tier was determined by the same tasting that produced the score). Here, two variables (mentions and sales) are both effects of the same cause (a trend event). The leakage is subtler, but the result is the same: fake accuracy.

Step 4: Create lagged features

Direct AI to create lagged versions of the social media mention counts. You need at least:

  • mentions_lag_1d -- yesterday's total mentions (Instagram + TikTok)
  • mentions_lag_7d -- mentions from seven days ago
  • mentions_lag_7d_avg -- rolling average of total mentions over the seven days before the current day (the current day itself excluded)

These are the legitimate social media features. They represent information that would actually be available when the buying team places orders.
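When you review what AI produces, it helps to know roughly what correct code looks like. The following is a minimal pandas sketch, not the course's reference solution. The input column names (date, instagram_mentions, tiktok_mentions) and the tiny example values are assumptions, and it assumes exactly one row per calendar day with no gaps. If your data has one row per product per day, the shifts would need to be done within each product group.

```python
import pandas as pd

# Hypothetical daily data; column names and values are assumptions.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "instagram_mentions": [10, 12, 9, 30, 25, 14, 11, 13, 40, 22],
    "tiktok_mentions": [5, 7, 6, 50, 35, 10, 8, 9, 60, 30],
})

# Shifting by rows only equals shifting by days if rows are sorted and daily.
df = df.sort_values("date").reset_index(drop=True)
total = df["instagram_mentions"] + df["tiktok_mentions"]

df["mentions_lag_1d"] = total.shift(1)   # yesterday's total
df["mentions_lag_7d"] = total.shift(7)   # total from seven days ago

# Shift first, then roll: the window covers the previous seven days
# and never includes the current day.
df["mentions_lag_7d_avg"] = total.shift(1).rolling(window=7).mean()

print(df[["date", "mentions_lag_1d", "mentions_lag_7d", "mentions_lag_7d_avg"]])
```

The detail to look for is the shift(1) before rolling. A plain rolling(7).mean() on the unshifted totals includes the current day and quietly reintroduces the leak. The first rows are NaN because no earlier history exists for them.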

After creating the lagged features, direct AI to remove the original same-day mention columns from the feature set. Those columns stay in the raw data for reference but must not enter the model.
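One way this step might look, as a sketch: the raw frame keeps every column, and the feature matrix is built separately with the same-day columns dropped. The column names, the units_sold target, and the placeholder values are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame after lag creation; names and values are placeholders.
raw = pd.DataFrame({
    "date": pd.date_range("2025-01-08", periods=3, freq="D"),
    "instagram_mentions": [10, 12, 9],
    "tiktok_mentions": [5, 7, 6],
    "mentions_lag_1d": [14.0, 15.0, 19.0],
    "mentions_lag_7d": [11.0, 13.0, 12.0],
    "mentions_lag_7d_avg": [12.5, 13.1, 13.9],
    "units_sold": [40, 44, 39],
})

SAME_DAY_COLS = ["instagram_mentions", "tiktok_mentions"]
TARGET = "units_sold"

# drop() returns a new frame, so raw still holds the same-day columns.
X = raw.drop(columns=SAME_DAY_COLS + [TARGET, "date"])
y = raw[TARGET]

# Guard: fail loudly if a same-day column slips back into the features.
leaked = sorted(set(SAME_DAY_COLS) & set(X.columns))
assert not leaked, f"same-day columns in feature set: {leaked}"
print(list(X.columns))
```

The guard is cheap to keep in the pipeline. If a later change to the feature list brings a same-day column back, the run stops instead of producing inflated metrics.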

Step 5: Understand temporal splits

Direct AI to explain why a random train/test split is wrong for time-series data.

On a random split, rows from 2025 can end up in training while rows from 2024 end up in test. The model has "seen" the future during training. This leaks temporal information -- seasonal patterns, trend evolutions, demand shifts -- into training, producing accuracy that collapses on genuinely unseen data.
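A quick sketch makes the problem visible. It uses a bare frame of daily dates for 2024 and 2025 (an assumption standing in for the real dataset) and a random 75/25 split, then compares the date ranges on each side.

```python
import pandas as pd

# Stand-in for the real data: one row per day across 2024 and 2025.
df = pd.DataFrame({"date": pd.date_range("2024-01-01", "2025-12-31", freq="D")})

# Random split: 25% of rows go to test regardless of date.
test = df.sample(frac=0.25, random_state=0)
train = df.drop(test.index)

print("latest training date:", train["date"].max().date())
print("earliest test date:  ", test["date"].min().date())

# Training rows dated after the earliest test row are "future" rows.
future_rows = (train["date"] > test["date"].min()).sum()
print("training rows later than the earliest test row:", future_rows)
```

With a random split, nearly all training rows fall after the earliest test row, so the date ranges of the two sets overlap almost completely.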

A temporal split trains on the past and tests on the future. Train on months 1-18 (January 2024 through June 2025), test on months 19-24 (July through December 2025). Every training observation comes before every test observation.

AI commonly defaults to random splits on time-series data. Every evaluation metric is computed on the test set, so this single preparation error can invalidate all of them. Check how AI splits the data before you trust any evaluation metric.

Step 6: Implement the temporal split

Direct AI to split the data temporally. Train on everything before July 2025. Test on July 2025 onward. Verify the split: ask for the earliest and latest dates in training and test. Every training date should fall strictly before the earliest test date.
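A minimal sketch of the split and its verification follows. The date column name and the bare daily frame are assumptions; swap in the real dataset.

```python
import pandas as pd

# Stand-in for the real data: one row per day across 2024 and 2025.
df = pd.DataFrame({"date": pd.date_range("2024-01-01", "2025-12-31", freq="D")})

cutoff = pd.Timestamp("2025-07-01")
train = df[df["date"] < cutoff]    # January 2024 through June 2025
test = df[df["date"] >= cutoff]    # July through December 2025

# Verification: print the boundaries, then enforce them.
print("train:", train["date"].min().date(), "to", train["date"].max().date())
print("test: ", test["date"].min().date(), "to", test["date"].max().date())

assert train["date"].max() < test["date"].min(), "training overlaps test"
assert len(train) + len(test) == len(df), "rows lost or duplicated"

train_months = train["date"].dt.to_period("M").nunique()
test_months = test["date"].dt.to_period("M").nunique()
print(train_months, "training months,", test_months, "test months")
```

The strict comparison against the cutoff puts July 1 in the test set and June 30 in training, so no date can land on both sides.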

Open the methodology memo template (materials/methodology-memo-template.md). Begin filling in the "Temporal Splitting" subsection under Preparation Decisions. Document what you chose, why, and what the alternative (random split) would have done wrong.

✓ Check

Check: Same-day leakage identified. Lagged features created. Temporal split verified.