Overfitting

Building a model that explains past data perfectly and future data not at all.

What it is

Overfitting occurs when a model is fitted too closely to the noise in a training dataset, capturing patterns that are specific to that sample rather than general features of the underlying process. An overfit model performs excellently on data it has seen before and poorly on data it has not, and performance on unseen data is the only kind that matters in real applications.

Every dataset contains two components: signal (real, repeating patterns) and noise (random variation specific to that sample). Overfitting means mistaking the noise for signal. The model learns the history rather than the process.

The classic example

Imagine fitting a polynomial to a scatterplot of ten data points. A degree-9 polynomial can pass through all ten points perfectly — zero training error. But when new data arrives, the polynomial oscillates wildly between points and predicts nothing useful. A simpler line fits less well in-sample but generalizes dramatically better.

The insight: model complexity should be matched to the amount of data available and to the genuine complexity of the underlying process. A degree-9 polynomial has ten coefficients, one for each of the ten data points. That leaves zero residual degrees of freedom: the fit has no way to separate signal from noise, so it reproduces both exactly.
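The polynomial example can be reproduced with a short simulation. The sketch below is only illustrative and rests on assumptions that are not in the text: a linear true process (y = 2x + 1), Gaussian noise with standard deviation 0.5, ten evenly spaced training points, and test points placed midway between them. The function names are hypothetical. It averages errors over many simulated samples so that the comparison does not depend on one lucky or unlucky draw.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolant(xs, ys):
    """Lagrange form of the unique degree-(n-1) polynomial through n points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def average_errors(n_trials=200, noise_sd=0.5, seed=0):
    """Average train/test MSE of both models over many simulated samples."""
    rng = random.Random(seed)
    signal = lambda x: 2.0 * x + 1.0              # assumed true process
    train_x = [i / 9 for i in range(10)]          # ten points in [0, 1]
    test_x = [(i + 0.5) / 9 for i in range(9)]    # midpoints between them
    totals = {"line_train": 0.0, "line_test": 0.0,
              "poly_train": 0.0, "poly_test": 0.0}
    for _ in range(n_trials):
        train_y = [signal(x) + rng.gauss(0, noise_sd) for x in train_x]
        test_y = [signal(x) + rng.gauss(0, noise_sd) for x in test_x]
        line = fit_line(train_x, train_y)
        poly = fit_interpolant(train_x, train_y)  # degree 9, ten coefficients
        totals["line_train"] += mse(line, train_x, train_y)
        totals["line_test"] += mse(line, test_x, test_y)
        totals["poly_train"] += mse(poly, train_x, train_y)
        totals["poly_test"] += mse(poly, test_x, test_y)
    return {name: total / n_trials for name, total in totals.items()}

errors = average_errors()
for name, value in errors.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the interpolating polynomial should show essentially zero training error and a test error several times that of the straight line, with most of the damage coming from the oscillation near the ends of the range.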

In quantitative finance

Quantitative finance is arguably the domain where overfitting has been most costly. A trading strategy with 23 parameters, optimized on five years of daily data (roughly 1,250 observations), is almost certainly overfit. Each free parameter gives the optimizer another way to fit noise. With many parameters and limited data, the optimizer will find combinations that worked in the sample purely by chance.

A backtest Sharpe ratio of 3.0 sounds exceptional. After accounting for data-mining bias — the fact that this strategy was selected from a larger set of tested strategies — the expected out-of-sample Sharpe might be 0.3. The difference is overfitting.

The severity of the problem scales with the number of strategies tested against the same dataset. If a team tests 100 strategies and publishes the best performer, the apparent performance is substantially inflated, even if no individual researcher was dishonest. This is one reason academic finance has a replication problem: published factors that worked in the original sample often fail to replicate, because they were selected from a large candidate set using the same data on which their performance was reported.
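The selection effect is easy to see in a simulation. The sketch below is a minimal illustration under stated assumptions, not a model of any real strategy set: 100 strategies with no edge at all, independent Gaussian daily returns with zero mean and 1% volatility, 1,250 days of history, and a Sharpe ratio annualized with 252 trading days. The function names and parameters are hypothetical.

```python
import math
import random

def annualized_sharpe(returns, periods_per_year=252):
    """Sample mean over sample standard deviation, scaled to a yearly horizon."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)

def best_of_n(n_strategies=100, n_days=1250, daily_vol=0.01, seed=0):
    """Backtest Sharpe of the best skill-free strategy, and its later Sharpe."""
    rng = random.Random(seed)

    def skill_free_returns():
        # Assumption: zero true edge, so returns are pure noise.
        return [rng.gauss(0.0, daily_vol) for _ in range(n_days)]

    in_sample = [annualized_sharpe(skill_free_returns())
                 for _ in range(n_strategies)]
    best_in_sample = max(in_sample)
    # The winner has no real edge, so its future returns are simply
    # another draw from the same zero-mean distribution.
    out_of_sample = annualized_sharpe(skill_free_returns())
    return best_in_sample, out_of_sample

best, later = best_of_n()
print(f"best backtest Sharpe among 100 skill-free strategies: {best:.2f}")
print(f"Sharpe of that strategy on fresh data: {later:.2f}")
```

Every strategy here has a true Sharpe ratio of zero, so whatever the best backtest shows is selection bias alone. The simulated figures will not match the 3.0 and 0.3 used as an illustration above; the point is only the gap between the selected backtest and fresh data.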

In machine learning for finance

Neural networks, random forests, and gradient boosting algorithms are powerful precisely because they are highly flexible — but flexibility is a liability without regularization and proper train/test splits. A neural network with millions of parameters can memorize any financial dataset. The question is whether it has learned anything that will hold in the future.

The standard corrective is out-of-sample testing: train on one time period, validate on a later one that was not used in any optimization step. But in finance, even this is compromised if the researcher has seen the out-of-sample period and adjusted the model accordingly. True out-of-sample testing requires a dataset that no one involved in model development has ever looked at.
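A chronological split can be sketched as follows. This is one possible scheme, an expanding-window walk-forward split, chosen here for illustration; the function name and its parameters are hypothetical. Each test block lies strictly after the data used to train for it, so no future observation leaks into fitting.

```python
def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs in time order.

    Training always starts at observation 0 and grows with each fold;
    the test block is the next `fold_size` observations after it.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_obs:
        raise ValueError("need n_folds >= 1 and 1 <= min_train < n_obs")
    fold_size = (n_obs - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        yield range(0, train_end), range(train_end, train_end + fold_size)

# Example: five years of daily data, first 500 days reserved for training.
for train_idx, test_idx in walk_forward_splits(n_obs=1250, n_folds=5, min_train=500):
    print(f"train [0, {train_idx.stop})  test [{test_idx.start}, {test_idx.stop})")
```

A split like this only prevents leakage inside the fitting procedure. It does nothing about the problem described above: if the researcher inspects the test blocks and then revises the model, those blocks are no longer out of sample.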

The right way to think about it

Parsimony is not a limitation; it is a feature. A simple model with few parameters and strong out-of-sample performance tells you more than a complex model with perfect in-sample performance. Occam's razor, the principle of preferring simpler explanations, has a Bayesian justification: a simpler model is compatible with fewer possible datasets, so the predictions it makes are sharper, and when those predictions succeed they provide stronger evidence of genuine signal.

Practical correctives include cross-validation (testing on held-out folds of the data), regularization (penalizing model complexity during fitting), and the Bonferroni correction for multiple testing (requiring stronger statistical evidence when many hypotheses have been tested).
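The Bonferroni correction is simple enough to state in a few lines: with m hypotheses tested at overall significance level alpha, each individual test must clear alpha / m. The sketch below uses hypothetical function names and made-up p-values. The `n_tests` argument is an assumption added for illustration: it lets the caller count every hypothesis that was tried, including those that were never reported.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

def significant_after_bonferroni(p_values, alpha=0.05, n_tests=None):
    """Flag which p-values survive the correction.

    n_tests defaults to len(p_values); pass a larger number when more
    hypotheses were tested than are being reported.
    """
    m = len(p_values) if n_tests is None else n_tests
    threshold = bonferroni_threshold(alpha, m)
    return [p <= threshold for p in p_values]

# Two reported strategies (illustrative p-values), selected from 100 tested.
reported = [0.01, 0.0003]
print(significant_after_bonferroni(reported, alpha=0.05, n_tests=100))
# prints [False, True]: the per-test threshold is 0.05 / 100 = 0.0005
```

A p-value of 0.01 looks convincing in isolation but does not survive once the 100 attempts behind it are counted.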

One thing most people get wrong

The more parameters a model has relative to data points, the more it memorizes rather than learns. Yet practitioners often respond to poor out-of-sample performance by adding complexity, which usually makes overfitting worse. The counterintuitive move (removing parameters, constraining the model, accepting a worse in-sample fit) is often what improves out-of-sample performance. Regularization techniques such as Lasso and Ridge regression do exactly this: they deliberately introduce bias to reduce variance, accepting a slightly worse in-sample fit in exchange for more stable out-of-sample predictions. The bias-variance tradeoff is the central tension in statistical modeling, and overfitting is what happens when variance dominates.
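The trade of bias for variance can be measured directly in the simplest possible case. The sketch below is a toy setting chosen for illustration, not a general Ridge implementation and not Lasso: a single feature, no intercept, a true slope of 1.0, ten observations per sample, and noise with standard deviation 2.0. All names and parameter values are hypothetical. In this setting the Ridge estimate has the closed form sum(x*y) / (sum(x^2) + lam), and lam = 0 recovers ordinary least squares.

```python
import random
import statistics

def ridge_slope(xs, ys, lam):
    """One-feature, no-intercept Ridge estimate; lam = 0 gives least squares."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def bias_and_variance(lam, true_slope=1.0, n_obs=10, noise_sd=2.0,
                      n_trials=2000, seed=0):
    """Estimate bias and variance of the slope over many simulated samples."""
    rng = random.Random(seed)  # same seed, so every lam sees the same samples
    estimates = []
    for _ in range(n_trials):
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [true_slope * x + rng.gauss(0, noise_sd) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    bias = statistics.fmean(estimates) - true_slope
    variance = statistics.pvariance(estimates)
    return bias, variance

for lam in (0.0, 5.0):
    bias, variance = bias_and_variance(lam)
    print(f"lam={lam}: bias={bias:+.3f}  variance={variance:.3f}  "
          f"mse={bias ** 2 + variance:.3f}")
```

With the penalty switched on, the estimate should be pulled toward zero (a negative bias here) while its variance across samples drops by more than the squared bias rises, so the total error falls. How large a penalty helps depends on the noise level and sample size, which is why the penalty is usually chosen by validation rather than fixed in advance.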