The Brier Score and Calibration

A proper scoring rule that rewards both accuracy and honest confidence.

What it is

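The Brier score is a proper scoring rule for probabilistic forecasts of binary outcomes. It is the mean squared error between predicted probabilities and actual outcomes. A forecaster who assigns 100% to every event that occurs and 0% to every event that does not receives a Brier score of 0. With the single-probability formula used here, the worst possible score is 1, earned by being fully confident and wrong every time. (Brier's original formulation sums the squared error over both outcome classes, which doubles every value and gives a range of 0 to 2.)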

For a single prediction, Brier score = (forecast probability - actual outcome)²

Where the outcome is 1 if the event occurred and 0 if it did not.

A 70% confidence forecast on an event that occurs receives: (0.70 - 1)² = 0.09. The same 70% forecast on an event that does not occur receives: (0.70 - 0)² = 0.49.

Lower is better. A forecast of 50%, the equivalent of a coin flip, always scores (0.50 - outcome)² = 0.25, whichever way the event resolves. That makes 0.25 the score of a forecaster who never commits, and a calibrated forecaster with genuine information should beat it on average.
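The arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a reference implementation; the function name brier_score and the sample outcomes are illustrative assumptions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    total = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes))
    return total / len(forecasts)


# The two single-prediction examples from the text.
print(round(brier_score([0.70], [1]), 4))  # 0.09
print(round(brier_score([0.70], [0]), 4))  # 0.49

# A flat 50% forecast scores 0.25 no matter how the events resolve.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1]))  # 0.25
```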

What calibration means

A forecaster is well-calibrated if, among all the predictions they made at X% confidence, X% of those events actually happened. Among all the times they said "70% likely," 70% of events occurred. Among all the times they said "90% likely," 90% of events occurred.
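One way to check calibration in practice is to group forecasts by their stated probability and compare each group with the frequency actually observed. The sketch below assumes forecasts are stated to whole percentage points; the function name calibration_table and the sample data are hypothetical.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Map each stated probability to (count, observed frequency of the event)."""
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        # Assumption: forecasts are given to whole percentage points.
        groups[round(f, 2)].append(o)
    return {p: (len(os), sum(os) / len(os)) for p, os in sorted(groups.items())}


# Hypothetical record: ten forecasts at 70% (seven occurred)
# and ten forecasts at 90% (nine occurred).
forecasts = [0.70] * 10 + [0.90] * 10
outcomes = [1] * 7 + [0] * 3 + [1] * 9 + [0] * 1

for stated, (count, observed) in calibration_table(forecasts, outcomes).items():
    print(f"stated {stated:.0%}  n={count}  observed {observed:.0%}")
```

A well-calibrated record shows observed frequencies close to the stated ones in every group. With few forecasts per group, some gap is expected from chance alone.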

Calibration is different from accuracy. A forecaster can be perfectly calibrated while having no predictive power: if they say "50%" on everything and half of the events occur, they are well-calibrated but have told you nothing. Good forecasting requires both calibration (honesty about confidence) and resolution (ability to assign high confidence to things that actually happen and low confidence to things that do not).
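The split between calibration and resolution can be made concrete with the standard Murphy decomposition, which the text above does not name: Brier score = reliability - resolution + uncertainty, where reliability is the calibration error (lower is better) and uncertainty depends only on the base rate. The sketch below groups forecasts by exact stated value, in which case the identity holds exactly; the function name and the sample data are illustrative assumptions.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for 0/1 outcomes."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)  # group by exact stated probability
    reliability = sum(
        len(os) * (f - sum(os) / len(os)) ** 2 for f, os in groups.items()
    ) / n
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# Calibrated but uninformative: 50% on everything.
flat = murphy_decomposition([0.5] * 10, outcomes)

# Calibrated and informative: 80% on the first five, 20% on the last five.
sharp = murphy_decomposition([0.8] * 5 + [0.2] * 5, outcomes)

for name, (rel, res, unc) in (("flat", flat), ("sharp", sharp)):
    print(name, round(rel, 4), round(res, 4), round(unc, 4), round(rel - res + unc, 4))
```

Both hypothetical forecasters have zero calibration error. Only the second has resolution, and that is what pulls the score below 0.25.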

The Brier score reflects both at once. A proper scoring rule, and the Brier score is one, has the mathematical property that your expected score is best when you report your true belief. Because lower is better for the Brier score, that means the expected score is minimized, not maximized, by honesty. Inflating or deflating your stated confidence to game the score can only worsen your expected score.
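The properness claim can be checked numerically. If the event truly occurs with probability p and the forecaster states q, the expected Brier score is p(q - 1)² + (1 - p)q², which equals (q - p)² + p(1 - p) and is therefore lowest at q = p. The sketch below searches a grid of stated probabilities; the true belief of 0.70 and the name expected_brier are illustrative assumptions.

```python
def expected_brier(true_p, stated_q):
    """Expected Brier score when the event has probability true_p
    and the forecaster reports stated_q."""
    return true_p * (stated_q - 1) ** 2 + (1 - true_p) * stated_q ** 2


true_p = 0.70  # hypothetical true belief
grid = [i / 100 for i in range(101)]  # stated probabilities 0.00 .. 1.00

best_q = min(grid, key=lambda q: expected_brier(true_p, q))
print(best_q)  # 0.7

# Shading the report in either direction raises (worsens) the expected score.
for q in (0.50, 0.70, 0.90):
    print(q, round(expected_brier(true_p, q), 4))
```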

Superforecasters and what distinguishes them

Philip Tetlock's Good Judgment Project, which recruited tens of thousands of forecasters to predict world events, found that a small subset of forecasters — dubbed superforecasters — performed dramatically better than average, better than intelligence analysts with classified information, and roughly as well as prediction markets.

Superforecasters shared measurable characteristics: strong calibration (not systematically over- or under-confident), frequent updates as new information arrived, comfort with granular probabilities (saying 67% rather than "likely"), and a belief that predictions should be specific enough to be falsifiable.

Their Brier scores were consistently 0.10-0.15 on difficult geopolitical questions where the baseline was around 0.25. That gap is substantial: it reflects a real ability to extract signal from complex events.

Asymmetric penalties for overconfidence

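Because the Brier score squares the error, the penalty grows quadratically with the size of the miss. A confident forecast that turns out wrong costs far more than a hedged one, and far more than the same confidence earns when it turns out right.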

Saying 95% confident on an event that does not occur: (0.95 - 0)² = 0.9025. Saying 50% on the same event: (0.50 - 0)² = 0.25. Saying 5% on the same event: (0.05 - 0)² = 0.0025.

The forecaster who said 95% on a non-event receives a penalty 361 times that of the one who said 5% (0.9025 / 0.0025). Saying "I don't know" is quantifiably better than saying "I'm certain" and being wrong: 0.25 against 0.9025. This is why experienced forecasters are often reluctant to assign probabilities near 0% or 100%: the steep penalties at the extremes demand genuine certainty before claiming it.
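A short sketch makes the shape of the penalty visible by scoring each forecast both ways. The helper name penalty is an illustrative assumption; the three forecast values are the ones used above.

```python
def penalty(forecast, outcome):
    """Brier penalty for a single forecast against a 0/1 outcome."""
    return (forecast - outcome) ** 2


for forecast in (0.95, 0.50, 0.05):
    if_no = penalty(forecast, 0)   # the event does not occur
    if_yes = penalty(forecast, 1)  # the event occurs
    print(f"forecast {forecast:.2f}  penalty if no: {if_no:.4f}  if yes: {if_yes:.4f}")

# Ratio between the 95% and 5% forecasters when the event does not occur.
ratio = penalty(0.95, 0) / penalty(0.05, 0)
print(round(ratio))  # 361
```

Being right at 95% saves only about 0.25 relative to a 50% forecast, while being wrong at 95% costs about 0.65 more. That is the sense in which extreme confidence is risky.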

One thing most people get wrong

Most people confuse confidence with accuracy. Someone who says "I'm 90% confident" on things that happen 60% of the time sounds informed, and may even have useful information, but their predictions score worse than those of someone who says "60% likely" on the same events. The overstater has a positive edge but is wasting it through miscalibration.
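To put numbers on the comparison, the sketch below scores both forecasters on a hypothetical set of ten events, six of which occur, to match the 60% frequency in the example. The helper name mean_brier and the data are illustrative assumptions.

```python
def mean_brier(stated, outcomes):
    """Mean Brier score for one fixed stated probability across 0/1 outcomes."""
    return sum((stated - o) ** 2 for o in outcomes) / len(outcomes)


outcomes = [1] * 6 + [0] * 4  # hypothetical: 6 of 10 events occur

overstated = mean_brier(0.90, outcomes)  # says 90% every time
calibrated = mean_brier(0.60, outcomes)  # says 60% every time
coin_flip = mean_brier(0.50, outcomes)   # says 50% every time

print(round(overstated, 4), round(calibrated, 4), round(coin_flip, 4))  # 0.33 0.24 0.25
```

On this data the overstater scores worse than a flat 50% forecast, even though they correctly sensed that the events were more likely than not.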

The Brier score penalizes this correctly. Stated overconfidence is not a neutral error: it actively makes your predictions worse in expected value terms, because proper scoring rules are constructed precisely to reward honest confidence and punish inflation. Training yourself to say "65% likely" when you feel 65% likely — rather than rounding to "pretty sure" — is one of the highest-value calibration improvements a forecaster can make.