WealthJot.ai

Train/Test Splits & Leakage

Advanced · 7 min read

In time series, a careless split leaks the future into the past. The pitfall that fakes great models.

To trust an ML model, you train it on one set of data and test it on another it has never seen. In time-series (market) data, doing this carelessly leaks the future into the past, which is the single most common way ML trading models fake spectacular results that collapse live.

Data leakage is any way the model accidentally sees information from the future it’s trying to predict. In markets it’s insidious because it produces gorgeous test results that are completely fake. The classic trap: random train/test splitting (standard in most ML) shuffles data freely, so the model trains on 2022 data to “predict” 2019, which is impossible in reality and inflates accuracy enormously. The fix is to always split chronologically (train on the past, test only on the future) and never shuffle time.

But leakage hides in subtler places too: a feature computed using future information (e.g. normalising by the full-period mean, which includes the future), restated fundamentals (figures revised after the fact, so the model sees values that were not available on the original date), or overlapping windows that bleed test info into training. Because leakage makes a model look amazing, the rule is brutal: if your ML model looks too good, assume leakage until proven otherwise. A model is only as trustworthy as the strict separation in time between what it learned from and what it’s tested on.
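The chronological split and the point-in-time normalisation described above can be sketched as follows. This is a minimal illustration, not a production pipeline: the function names (`chronological_split`, `fit_scaler`, `scale`), the toy rows and the 2021-01-01 cutoff are all assumptions made for the example.

```python
from statistics import mean, pstdev

def chronological_split(rows, cutoff):
    """Split (iso_date, value) rows so every training row is dated before the cutoff."""
    ordered = sorted(rows, key=lambda r: r[0])  # ISO dates sort correctly as strings
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test

def fit_scaler(train_values):
    """Compute mean and standard deviation from TRAINING values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a zero spread
    return mu, sigma

def scale(values, mu, sigma):
    """Apply statistics learned on the past to any later data."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy series: (date, feature value)
rows = [
    ("2022-01-31", 11.0),
    ("2019-01-31", 1.0),
    ("2021-01-31", 5.0),
    ("2020-01-31", 3.0),
]

train, test = chronological_split(rows, cutoff="2021-01-01")
mu, sigma = fit_scaler([v for _, v in train])     # never sees 2021+ values
train_z = scale([v for _, v in train], mu, sigma)
test_z = scale([v for _, v in test], mu, sigma)   # test reuses the training statistics
```

The point of the design is that the mean and standard deviation are fitted on the training rows alone. Normalising by statistics computed over all four rows would let the 2021 and 2022 values shape the features the model trains on.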
Example: A model shows 90% directional accuracy, which looks astonishing. The cause: a random 80/20 split let it train on data dated after the test points, leaking the future. Re-split chronologically (train ≤2020, test 2021+) and accuracy drops to a realistic ~53%. The “90%” was leakage, not skill: exactly the trap that fuels false ML-trading hype.
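One cheap guard against the situation in this example is to count how many training observations are dated on or after the earliest test observation. The sketch below assumes hypothetical yearly timestamps and a helper named `lookahead_count`; neither comes from the article.

```python
import random

def lookahead_count(train_times, test_times):
    """Count training observations dated on or after the earliest test observation."""
    first_test = min(test_times)
    return sum(1 for t in train_times if t >= first_test)

years = list(range(2010, 2025))   # 15 hypothetical yearly observations
cut = int(len(years) * 0.8)       # 80/20 split point

# Random 80/20 split: shuffling ignores time order
shuffled = years[:]
random.Random(0).shuffle(shuffled)
rand_train, rand_test = shuffled[:cut], shuffled[cut:]

# Chronological 80/20 split: the first 80% trains, the last 20% tests
chron_train, chron_test = years[:cut], years[cut:]

print("random split look-ahead rows:", lookahead_count(rand_train, rand_test))
print("chronological look-ahead rows:", lookahead_count(chron_train, chron_test))
```

The chronological count is zero by construction. The shuffled count depends on the seed and is typically positive, and any positive value means the model trained on data from after its own test period.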
Key takeaway: Data leakage (the model seeing future info) is the top way ML trading models fake great results. The classic cause is random train/test splits (shuffling time); the fix is strict chronological splitting (train past, test future). Subtle leaks hide in features and restated data. Rule: a too-good model has leakage until proven otherwise.
FAQs
How do I make sure my ML model has no leakage?

Split strictly by time (train on the past, validate on the future), compute every feature point-in-time (no future data in normalisation, no restated fundamentals), use purged/embargoed time-series cross-validation, and treat suspiciously good results as a red flag to investigate. Leakage is sneaky and usually flatters results, so disciplined chronological hygiene and healthy skepticism are essential.
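As a rough illustration of time-series cross-validation with a purge gap, the sketch below builds expanding walk-forward folds in which training always stops a fixed number of rows before the test block. The function name `walk_forward_splits` and its parameters are assumptions for this example. It is also a simplification: the single `gap` stands in for both purging and the embargo, and it is not the full purged K-fold procedure.

```python
def walk_forward_splits(n_samples, n_folds, test_size, gap):
    """Return (train_idx, test_idx) pairs where training ends `gap` rows before each test block."""
    splits = []
    for k in range(n_folds):
        # Test blocks are laid out back to back at the end of the series
        test_start = n_samples - (n_folds - k) * test_size
        test_end = test_start + test_size
        train_end = test_start - gap  # rows in the gap are dropped (purged)
        if train_end <= 0:
            continue  # not enough history for this fold
        train_idx = list(range(0, train_end))
        test_idx = list(range(test_start, test_end))
        splits.append((train_idx, test_idx))
    return splits

# Hypothetical setup: 20 rows, 3 folds, 4 test rows each, 2-row gap
splits = walk_forward_splits(n_samples=20, n_folds=3, test_size=4, gap=2)
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

A sensible choice for `gap` is at least the label horizon: if each label looks several rows ahead, the last training rows before the test block would otherwise have labels that overlap the test window.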