WealthJot.ai

The Multiple-Comparisons Trap

Advanced · 7 min read

Test 1,000 strategies and a few look great by pure luck. Why a low p-value is not enough.

The multiple-comparisons trap is one of the most underappreciated ways quant traders fool themselves: if you test enough strategies, some will look brilliant purely by chance, even with no real edge at all. It’s overfitting’s sneaky cousin, hiding in the search process itself.

The intuition: give enough monkeys enough keyboards and one types a sentence. If you backtest 1,000 random strategies, statistics all but guarantees that some will show spectacular returns by sheer luck, just as flipping 1,000 coins 10 times each will, more often than not, produce at least one “amazing” all-heads run (each coin has a 1-in-1,024 chance). At a p < 0.05 bar, roughly 50 of 1,000 strategies with zero edge would be expected to pass by chance alone. The danger is that you then proudly present the winner (“look, 40% CAGR, p < 0.05!”) while forgetting the 999 you discarded. A p-value or great backtest from one strategy means something; the best of thousands means almost nothing, because you selected it for luck. This is why naive data-mining, grinding through millions of indicator combinations until something “works”, reliably produces strategies that die live.

The defences: drastically reduce the number of things you try (start from a hypothesis, not a brute-force search), adjust your significance bar for how many tests you ran, and always validate the winner on fresh out-of-sample data.
  • The trap — test enough strategies and some look great by luck alone; you then keep the lucky winner and forget the rest.
  • Why p-values mislead — “significant” for one test is meaningless for the best of thousands (you selected for luck).
  • Data-mining danger — brute-forcing millions of combinations reliably finds flukes that fail live.
  • The fix — fewer tests (hypothesis-first), correct for the number of trials, and validate the winner out-of-sample.
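The arithmetic behind the trap can be sketched in a few lines. This is a minimal illustration, assuming the tests are independent and that each zero-edge strategy passes with probability equal to the significance level; real strategy variants are usually correlated, so treat the numbers as rough guides. The function names are hypothetical, chosen for this example.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of zero-edge tests that pass by chance alone."""
    return n_tests * alpha


def prob_at_least_one_fluke(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n independent zero-edge tests passes."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 1,000 zero-edge strategies judged at p < 0.05 (assumes independence).
print(expected_false_positives(1000, 0.05))
print(prob_at_least_one_fluke(1000, 0.05))

# 1,000 coins flipped 10 times each: chance of all heads is 0.5 ** 10.
p_all_heads = 0.5 ** 10
print(expected_false_positives(1000, p_all_heads))
print(prob_at_least_one_fluke(1000, p_all_heads))
```

Under these assumptions, the first pair of calls shows why a lone p < 0.05 winner out of 1,000 trials carries so little information: dozens of passes are expected even when nothing works.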
Example: You run an optimiser over 5,000 indicator combinations and the best shows a gorgeous 35% CAGR. Exciting, until you realise that with 5,000 tries, a few fantastic-looking results are statistically expected even from pure noise. Re-run that “winner” on untouched data and it will most likely turn out ordinary or negative: it was the luckiest of 5,000, not the best.
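A toy simulation makes the same point. The sketch below is not a real backtest: it assumes each "strategy" is nothing but random monthly returns with zero true edge, and the parameters (36 months per sample, 5% monthly volatility, the seed) are arbitrary choices for illustration. It picks the best in-sample performer and then looks at that same strategy on fresh data.

```python
import random
import statistics


def best_of_noise(n_strategies=5000, n_months=36, vol=0.05, seed=0):
    """Pick the best in-sample performer among pure-noise strategies.

    Returns (in-sample mean, out-of-sample mean) of monthly returns for
    the selected strategy. Every strategy has a true mean return of zero.
    """
    rng = random.Random(seed)
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_strategies):
        in_sample = [rng.gauss(0.0, vol) for _ in range(n_months)]
        out_sample = [rng.gauss(0.0, vol) for _ in range(n_months)]
        mean_in = statistics.fmean(in_sample)
        if mean_in > best_in:
            # Selection uses in-sample data only; the out-of-sample
            # result is recorded but never used to choose the winner.
            best_in = mean_in
            best_out = statistics.fmean(out_sample)
    return best_in, best_out


in_mean, out_mean = best_of_noise()
print(f"winner, in-sample mean monthly return:     {in_mean:.4f}")
print(f"winner, out-of-sample mean monthly return: {out_mean:.4f}")
```

Because the winner was selected for its in-sample luck, its in-sample average sits far above zero, while its out-of-sample average is just one more draw from noise centred on zero.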
Key takeaway: Test enough strategies and some shine by pure luck, so the best of thousands is usually a fluke, not an edge (a low p-value on a hand-picked winner is meaningless). Defend by testing fewer things (hypothesis-first), correcting for the number of trials, and validating the winner on fresh data.
FAQs
Isn’t running lots of backtests how you find good strategies?

Exploration is fine, but *unconstrained* searching invites the multiple-comparisons trap — the more you try, the more luck contaminates your “best.” Far better to start from a *hypothesis* with an economic reason, test a *small* number of variations, and reserve out-of-sample data to validate. If you must search broadly, account for the number of trials and treat any winner with heavy skepticism.
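One common way to account for the number of trials is to tighten the significance bar. The article does not name a specific method; the sketch below uses the Bonferroni and Šidák corrections as examples, both of which are conservative and work best when tests are roughly independent. The function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value bar that keeps the family-wise error rate <= alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def sidak_threshold(alpha: float, n_tests: int) -> float:
    """Slightly less strict bar, exact when the tests are independent."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def survives_correction(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if the winner's p-value still passes after Bonferroni correction."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


# A winner with p = 0.01 looks strong alone, but not as the best of 5,000.
print(bonferroni_threshold(0.05, 5000))
print(sidak_threshold(0.05, 5000))
print(survives_correction(0.01, n_tests=5000))
print(survives_correction(0.01, n_tests=1))
```

The important input is an honest count of everything you tried, including variants you discarded along the way. A corrected threshold complements, rather than replaces, validation on out-of-sample data.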