The Multiple-Comparisons Trap

Advanced · 7 min read

Test 1,000 strategies and a few look great by pure luck. Why a low p-value is not enough.

The multiple-comparisons trap is one of the most underappreciated ways quant traders fool themselves: if you test enough strategies, some will look brilliant purely by chance — even with no real edge at all. It’s overfitting’s sneaky cousin, hiding in the search process itself.

The intuition: give enough monkeys enough keyboards and one types a sentence. If you backtest 1,000 random strategies, it is all but certain that some will show spectacular returns by sheer luck, just as flipping 1,000 coins 10 times each will, on average, produce about one “amazing” all-heads run (each coin has a 1-in-1,024 chance). At a 5% significance level, around 50 of 1,000 strategies with no edge at all would still come out “significant.” The danger is that you then proudly present the winner (“look, 40% CAGR, p < 0.05!”) while forgetting the 999 you discarded. A p-value or great backtest from one pre-specified strategy means something; the best of thousands means almost nothing, because you selected it for luck.

This is why naive data-mining (grinding through millions of indicator combinations until something “works”) reliably produces strategies that die live. The defences: drastically reduce the number of things you try (start from a hypothesis, not a brute-force search), adjust your significance bar for how many tests you ran, and always validate the winner on fresh out-of-sample data.
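The arithmetic behind the coin and backtest intuition can be checked in a few lines. This is a minimal sketch that assumes every trial is independent and has no real edge; the function names are illustrative, not from any library.

```python
def expected_flukes(n_trials: int, p_single: float) -> float:
    """Average number of trials that 'succeed' purely by chance."""
    return n_trials * p_single


def prob_at_least_one_fluke(n_trials: int, p_single: float) -> float:
    """Chance that at least one of n independent trials succeeds by luck."""
    return 1.0 - (1.0 - p_single) ** n_trials


# Coin example: 1,000 coins, 10 flips each. All-heads has probability 0.5**10.
p_all_heads = 0.5 ** 10
coin_expected = expected_flukes(1000, p_all_heads)      # a little under 1
coin_any = prob_at_least_one_fluke(1000, p_all_heads)   # roughly 0.62

# Backtest example: 1,000 edge-free strategies, each judged at p < 0.05.
bt_expected = expected_flukes(1000, 0.05)               # 50 false "discoveries"
bt_any = prob_at_least_one_fluke(1000, 0.05)            # indistinguishable from 1

print(f"coins: expect {coin_expected:.2f} all-heads runs, P(any) = {coin_any:.2f}")
print(f"backtests: expect {bt_expected:.0f} 'significant' flukes, P(any) = {bt_any:.6f}")
```

Real strategy variants are usually correlated, not independent, so these figures are a guide to the scale of the problem and should not be read as exact counts.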

The trap — test enough strategies and some look great by luck alone; you then keep the lucky winner and forget the rest.
Why p-values mislead — “significant” for one test is meaningless for the best of thousands (you selected for luck).
Data-mining danger — brute-forcing millions of combinations reliably finds flukes that fail live.
The fix — fewer tests (hypothesis-first), correct for the number of trials, and validate the winner out-of-sample.
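The text does not name a specific way to correct for the number of trials. Two standard options are the Bonferroni and Šidák corrections, sketched here as an assumption about what such a correction could look like. The p-values are made up for illustration, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance bar; valid without assuming independence."""
    return alpha / n_tests


def sidak_threshold(alpha: float, n_tests: int) -> float:
    """Slightly looser per-test bar; assumes the tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def survivors(p_values: list[float], alpha: float = 0.05) -> list[int]:
    """Indices of strategies that still pass after a Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Made-up p-values for five strategy variants (illustration only).
example_p_values = [0.04, 0.20, 0.0004, 0.65, 0.03]
passing = survivors(example_p_values)  # cutoff is 0.05 / 5 = 0.01

print("bar for 1,000 tests:", bonferroni_threshold(0.05, 1000))
print("variants still significant:", passing)
```

The count that matters is every variant you actually tried, including the ones you discarded along the way, not only the ones you kept. Both corrections are conservative, which suits the goal here: a winner has to clear a much higher bar once thousands of trials are involved.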

Example: You run an optimiser over 5,000 indicator combinations and the best shows a gorgeous 35% CAGR. Exciting, until you realise that with 5,000 tries, a few fantastic-looking results are statistically expected even from pure noise. Re-run that “winner” on untouched data and it will most likely look ordinary or negative: it was the luckiest of 5,000, not the best.
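A small simulation makes the same point. The sketch below assumes each "strategy" is nothing but zero-mean Gaussian noise with 1% daily volatility over 250 trading days (all three numbers are illustrative assumptions). It selects the best in-sample performer and then scores that same strategy on a fresh, independent period.

```python
import random


def simulate_best_of_n(n_strategies: int = 5000, n_days: int = 250,
                       daily_vol: float = 0.01, seed: int = 0) -> tuple[float, float]:
    """Return (in-sample, out-of-sample) summed returns of the in-sample winner.

    Every strategy is pure noise, so none has any edge by construction.
    """
    rng = random.Random(seed)
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_strategies):
        # Summed daily returns: a rough, non-compounded one-year return.
        in_sample = sum(rng.gauss(0.0, daily_vol) for _ in range(n_days))
        out_sample = sum(rng.gauss(0.0, daily_vol) for _ in range(n_days))
        if in_sample > best_in:
            best_in, best_out = in_sample, out_sample
    return best_in, best_out


winner_in, winner_out = simulate_best_of_n()
print(f"winner in-sample:     {winner_in:+.1%}")
print(f"winner out-of-sample: {winner_out:+.1%}")
```

Under these assumptions the in-sample winner typically shows a return in the tens of percent, while its out-of-sample result is just another draw from the same zero-mean noise. Changing `seed` changes the exact numbers but not the pattern.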

Key takeaway: Test enough strategies and some shine by pure luck, so the best of thousands is usually a fluke, not an edge (a low p-value on a hand-picked winner is meaningless). Defend by testing fewer things (hypothesis-first), correcting for the number of trials, and validating the winner on fresh data.

FAQs

Isn’t running lots of backtests how you find good strategies?

Exploration is fine, but *unconstrained* searching invites the multiple-comparisons trap — the more you try, the more luck contaminates your “best.” Far better to start from a *hypothesis* with an economic reason, test a *small* number of variations, and reserve out-of-sample data to validate. If you must search broadly, account for the number of trials and treat any winner with heavy skepticism.