Train/Test Splits & Leakage

Advanced · 7 min read

In time series, a careless split leaks the future into the past. The pitfall that fakes great models.

To trust an ML model, you train it on one set of data and test it on another it never saw. In time-series (market) data, doing this carelessly leaks the future into the past — the single most common way ML trading models fake spectacular results that collapse live.

Data leakage is any way the model accidentally sees information from the future it’s trying to predict. In markets it’s insidious because it produces gorgeous test results that are completely fake. The classic trap: random train/test splitting (standard in most ML) shuffles data freely, so the model trains on 2022 data to “predict” 2019. That is impossible in reality, and it inflates accuracy enormously. The fix is to always split chronologically (train on the past, test only on the future) and never shuffle time.

But leakage hides in subtler places too: a feature computed using future information (e.g. normalising by the full-period mean, which includes the future), restated fundamentals (figures revised after the date they describe, so the revised value was not known at the time), or overlapping windows that bleed test info into training. Because leakage makes a model look amazing, the rule is brutal: if your ML model looks too good, assume leakage until proven otherwise. A model is only as trustworthy as the strict separation in time between what it learned from and what it’s tested on.
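The chronological split described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `chronological_split`, the `(timestamp, value)` row layout, and the sample dates are all assumptions made for the example, and timestamps are assumed to be unique.

```python
from datetime import date, timedelta


def chronological_split(rows, train_frac=0.8):
    """Split (timestamp, value) rows so every training row precedes every test row."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    ordered = sorted(rows, key=lambda r: r[0])  # oldest first; never shuffle
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily series: 10 consecutive dates with placeholder values.
start = date(2020, 1, 1)
rows = [(start + timedelta(days=i), float(i)) for i in range(10)]

train, test = chronological_split(rows, train_frac=0.8)

# The property that matters: the newest training row is older than the oldest test row.
assert max(t for t, _ in train) < min(t for t, _ in test)
print(len(train), len(test))
```

Sorting before cutting means the function still behaves if the rows arrive out of order, which is the situation a shuffled pipeline tends to create.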

What it is — the model accidentally accessing future information, faking great test results that fail live.
The classic trap — random train/test splits shuffle time, training on the future to predict the past.
The fix — split chronologically (train past, test future); never shuffle time-series data.
Subtle leaks — features using future data (full-period normalisation), restated fundamentals, overlapping windows.
The rule — a too-good ML model almost certainly has leakage; hunt for it before believing the result.
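The full-period normalisation leak from the list above is easy to reproduce. The sketch below, using only Python's standard `statistics` module, estimates the mean and standard deviation from the training period alone and then applies them to the test period. The helper names `fit_scaler` and `apply_scaler` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate mean and standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("training values are constant; cannot scale")
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Scale values with statistics that were fixed in advance."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values in time order; the last two form the test period.
series = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
train, test = series[:4], series[4:]

mu, sigma = fit_scaler(train)                # uses the past only
test_scaled = apply_scaler(test, mu, sigma)  # test data never influences mu or sigma

leaky_mu = mean(series)                      # full-period mean: includes the test period
print(mu, leaky_mu)
```

The two means differ because the full-period version has already absorbed the test values, which is information the model could not have had at training time.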

Example: A model shows 90% directional accuracy, which is astonishing. The cause: a random 80/20 split let it train on data dated after the test points, leaking the future. Re-split chronologically (train ≤2020, test 2021+) and accuracy drops to a realistic ~53%. The “90%” was leakage, not skill: exactly the trap that fuels false ML-trading hype.
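One way to catch the split problem in this example is to count how many test rows are dated on or before the latest training row. The sketch below is only a diagnostic under simple assumptions: integers stand in for ordered trading days, and the function name `count_leaky_test_rows` is hypothetical. It does not reproduce the accuracy figures quoted above.

```python
import random


def count_leaky_test_rows(train_times, test_times):
    """Count test rows dated on or before the latest training timestamp."""
    latest_train = max(train_times)
    return sum(1 for t in test_times if t <= latest_train)


# Hypothetical timeline: integers 0..99 stand in for ordered trading days.
times = list(range(100))

# Random 80/20 split (the trap): shuffle, then cut.
shuffled = times[:]
random.Random(0).shuffle(shuffled)
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Chronological 80/20 split (the fix): cut without shuffling.
chrono_train, chrono_test = times[:80], times[80:]

random_leaks = count_leaky_test_rows(rand_train, rand_test)
chrono_leaks = count_leaky_test_rows(chrono_train, chrono_test)
print(random_leaks, chrono_leaks)
```

Any non-zero count means the model was trained on data from after at least one of the points it is being tested on.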

Key takeaway: Data leakage (the model seeing future info) is the top way ML trading models fake great results. The classic cause is random train/test splits (shuffling time); the fix is strict chronological splitting (train past, test future). Subtle leaks hide in features and restated data. Rule: a too-good model has leakage until proven otherwise.

FAQs

How do I make sure my ML model has no leakage?

Split strictly by time (train on the past, validate on the future), compute every feature point-in-time (no future data in normalisation, no restated fundamentals), use purged/embargoed time-series cross-validation, and treat suspiciously good results as a red flag to investigate. Leakage is sneaky and usually flatters results, so disciplined chronological hygiene and healthy skepticism are essential.
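As a rough illustration of the cross-validation idea in this answer, the sketch below generates walk-forward splits with a gap before each test block. It is a simplified variant, not the full purged and embargoed k-fold scheme: training data always comes strictly before the test block, and the `gap` plays the role of purging. An embargo matters mainly in schemes that also train on data after the test block, which this sketch does not do. The function name `walk_forward_splits` and its parameters are assumptions for the example.

```python
def walk_forward_splits(n_samples, n_folds, gap):
    """Yield (train_idx, test_idx) with training strictly before each test block.

    The `gap` samples immediately before the test block are dropped from
    training so that labels or windows spanning the boundary cannot overlap
    the test period.
    """
    if n_folds < 1 or gap < 0:
        raise ValueError("n_folds must be >= 1 and gap must be >= 0")
    block = n_samples // (n_folds + 1)  # the first block is reserved for training
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        test_start = k * block
        # The last fold absorbs any remainder so no sample is left unused.
        test_end = n_samples if k == n_folds else test_start + block
        train_end = max(0, test_start - gap)
        yield list(range(train_end)), list(range(test_start, test_end))


# Hypothetical setup: 20 samples in time order, 3 folds, 2-sample gap.
splits = list(walk_forward_splits(n_samples=20, n_folds=3, gap=2))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

The gap should be at least as long as the longest label horizon or feature window, otherwise a training row near the boundary can still overlap the test period.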