Train/Test Splits & Leakage
In time series, a careless split leaks the future into the past. The pitfall that fakes great models.
To trust an ML model, you train it on one set of data and test it on another it never saw. In time-series (market) data, doing this carelessly leaks the future into the past, and that is the single most common way ML trading models fake spectacular results that collapse live.
- What it is: the model accidentally accessing future information, faking great test results that fail live.
- The classic trap: random train/test splits shuffle time, training on the future to predict the past.
- The fix: split chronologically (train on the past, test on the future); never shuffle time-series data.
- Subtle leaks: features using future data (full-period normalisation), restated fundamentals, overlapping windows.
- The rule: a too-good ML model almost certainly has leakage; hunt for it before believing the result.
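The fix and the normalisation leak from the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the price series is made up, the helper names `chronological_split`, `fit_scaler`, and `apply_scaler` are hypothetical, and the rows are assumed to be sorted oldest first.

```python
from statistics import mean, pstdev


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_scaler(train_values):
    """Estimate mean and standard deviation from the given values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero on a constant series
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Standardise values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical daily closing prices, oldest first (illustrative data only).
prices = [100.0, 101.0, 99.0, 102.0, 104.0, 103.0, 107.0, 110.0, 108.0, 112.0]

# Train on the past, test on the future: no shuffling.
train, test = chronological_split(prices, train_fraction=0.8)

# Point-in-time normalisation: statistics come from the training period only
# and are then reused, unchanged, on the test period.
mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)

# Leaky variant, shown only for comparison: fitting on the full period
# lets test-period prices influence the training features.
leaky_mu, leaky_sigma = fit_scaler(prices)
```

The leaky statistics differ from the training-only ones because they already contain the test period, which is exactly the information the model would not have had at training time.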
How do I make sure my ML model has no leakage?
Split strictly by time (train on the past, validate on the future), compute every feature point-in-time (no future data in normalisation, no restated fundamentals), use purged/embargoed time-series cross-validation, and treat suspiciously good results as a red flag to investigate. Leakage is sneaky and usually flatters results, so disciplined chronological hygiene and healthy skepticism are essential.
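Purging and embargoing can be sketched as follows. This is a simplified illustration of the idea, not any library's implementation: the function name `purged_kfold_indices` is hypothetical, samples are assumed to be sorted by time, and each sample's label is assumed to be built from the next `horizon` observations (for example, a forward return). Training samples whose label windows could overlap the test fold are purged, and a further `embargo` observations after the fold are skipped.

```python
def purged_kfold_indices(n_samples, n_folds, horizon, embargo):
    """Yield (train_idx, test_idx) pairs for contiguous, time-ordered folds.

    Assumption: the label of sample i uses observations i .. i + horizon.
    A training sample is kept only if its label window cannot overlap the
    label windows of the test fold, and it does not fall in the embargo
    period that follows them.
    """
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    if horizon < 0 or embargo < 0:
        raise ValueError("horizon and embargo must be non-negative")

    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        stop = start + fold_size + (1 if k < remainder else 0)
        test_idx = list(range(start, stop))

        # Purge before the fold: drop samples whose labels reach into it.
        before = range(0, max(0, start - horizon))

        # Purge after the fold (test labels reach `horizon` past its end),
        # then skip a further `embargo` observations.
        after = range(min(n_samples, stop + horizon + embargo), n_samples)

        yield list(before) + list(after), test_idx
        start = stop


# Illustrative settings: 20 samples, 4 folds, labels look 2 steps ahead,
# 1 extra observation of embargo.
splits = list(purged_kfold_indices(n_samples=20, n_folds=4, horizon=2, embargo=1))

for train_idx, test_idx in splits:
    print("test", test_idx[0], "to", test_idx[-1], "| train size", len(train_idx))
```

For a strictly walk-forward evaluation, where training data never follow the test fold, only the purge before each fold applies; the `after` range can simply be dropped.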