TL;DR
Ask a language model for the probability a startup succeeds. Then, separately, ask for the probability it fails. A coherent judge returns two numbers that sum to 100. They almost never do, and the gap leans one way.
Across 16 models, 14 are optimistic and only Anthropic’s two frontier models are pessimistic. The tilt is not noise: a controlled base-versus-chat probe shows alignment training installs it, and across languages model identity dominates over language.
The problem: direction hides from calibration
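The usual lens for “does this model know its probabilities” is calibration, and it is blind here. Standard calibration scores average the absolute gap between stated confidence and observed accuracy, so they report how large the miss is and discard which way it points.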
The second problem is ground truth. In naturalistic forecasting (“will this person stick to their new exercise routine?”) there is no label to score against. So OptimismBench scores the model against itself: Zhu & Griffiths (2025) showed that language models violate basic probability axioms, and we keep the sign of one such violation as the signal.
Inverted pairs
For each scenario we elicit two probabilities, the positive framing and its complement, and define the
The construction needs no ground truth and it cancels acquiescence: a model that simply agrees with both framings lands at zero. The paired-complement trick is borrowed from human psychophysics (Kahneman & Tversky, 1979); the contribution here is keeping the sign and reading it as directional valence.
Three real cases
The metric is abstract; the failure is not. Here are three scenarios from the benchmark exactly as posed, answered by an optimistic model (GPT-5.4) and a pessimistic one (Sonnet 4.6). The numbers are the models’ actual elicited probabilities. Across domains, the same text pulls the two models in opposite directions.
On the home-exercise scenario the two framings should sum to 100. GPT-5.4 answers 58 and 62, which sum to 120, so it leaves twenty points double-counted toward sticking with it. Sonnet answers 31 and 55, which sum to 86, so it leaves fourteen points unassigned, a shortfall that leans toward quitting. Neither model is wrong about any single number you could check; the bias only appears when you ask the same question both ways and keep the sign.
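The coherence check behind this example is plain arithmetic, and it can be sketched in a few lines. This is a minimal illustration, assuming both answers are percentages; `additivity_gap` is a hypothetical helper name, and it only measures how far the pair sits from summing to 100. It is not offered as the benchmark's Skew definition.

```python
def additivity_gap(p_first: float, p_second: float) -> float:
    """Return how far two complementary estimates (in percent) are from summing to 100.

    Positive means the pair over-assigns probability mass; negative means it
    under-assigns. Hypothetical helper, not the benchmark's Skew formula.
    """
    for p in (p_first, p_second):
        if not 0 <= p <= 100:
            raise ValueError(f"probability must be in [0, 100], got {p}")
    return p_first + p_second - 100


# The two answer pairs quoted for the home-exercise scenario.
gpt_gap = additivity_gap(58, 62)
sonnet_gap = additivity_gap(31, 55)

print(f"GPT-5.4 gap: {gpt_gap:+d} points")
print(f"Sonnet gap:  {sonnet_gap:+d} points")
```

A coherent pair such as 70 and 30 returns zero. Which outcome the excess or shortfall favors depends on which framing absorbed it, and that is the part the benchmark's signed measure is designed to capture.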
Directional bias is pervasive
Every one of the 16 headline models shows nonzero Skew. The pattern spans US commercial APIs, Chinese labs, and European releases. Only Anthropic’s Opus and Sonnet land below zero. The full table, with each row tinted by the magnitude of its bias, is the headline result.
The 95% confidence intervals for the two pessimistic rows exclude zero, and so do those for the optimists.
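The text does not say how the intervals were computed. One common way to get a 95% interval for a per-model mean is a percentile bootstrap over scenarios, sketched below as an assumption rather than as the benchmark's procedure. The function names and the per-scenario values are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumption: scenarios are resampled with replacement and the mean is
    recomputed each time. The source does not specify its interval method.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def excludes_zero(interval):
    """True when the whole interval sits on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


# Invented per-scenario values for one model, for illustration only.
per_scenario = [10, 12, 8, 15, 9, 11, 14, 7]
ci = bootstrap_ci(per_scenario)
print(ci, excludes_zero(ci))
```

An interval that excludes zero is what licenses calling a row optimistic or pessimistic rather than noisy.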
The smallest, cheapest models tend to be the most optimistic, and they are exactly the ones deployed at scale for cost. Human optimism bias is a well-studied trait (Sharot, 2011; Weinstein, 1980); these models inherit its shape.
Two ways to be biased
Two models with the same Skew can be biased through different mechanisms. Split it into two components, the good-side push and the bad-side push, each measuring how far one estimate sits above 50, and the mechanism shows.
Optimistic models like GPT-5.4-mini inflate both sides. Sonnet’s pessimism is one-sided: it underestimates good outcomes while keeping bad-outcome estimates near accurate. The bias shows up as suppressed opportunity rather than inflated risk.
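The two components can be computed directly from one answer pair. The sketch below follows the description above, where each push is how far an estimate sits above 50. The `Pushes` type, the `pushes` function, and the two answer pairs are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pushes:
    good: float  # how far the good-outcome estimate sits above 50
    bad: float   # how far the bad-outcome estimate sits above 50


def pushes(p_good: float, p_bad: float) -> Pushes:
    """Split one answer pair (in percent) into its two pushes."""
    return Pushes(good=p_good - 50, bad=p_bad - 50)


# Invented answer pairs, for illustration only.
both_sides = pushes(p_good=70, p_bad=60)   # both estimates sit above 50
one_sided = pushes(p_good=35, p_bad=50)    # only the good side moves

print(both_sides)
print(one_sided)
```

Reporting the two pushes side by side is what separates a model that raises both estimates from one that moves only a single side.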
Alignment is a lever
The within-provider gradient (smaller is more optimistic) is suggestive but confounded. To isolate post-training, OptimismBench runs controlled base-versus-chat pairs: same architecture, same pre-training, only the alignment step changes.
The shift is real and recipe-specific: every Qwen pair moves toward pessimism, every Llama pair toward optimism over a pessimistic base, and Gemma flips from strongly pessimistic to strongly optimistic. Alignment sets the direction; it is not a perturbation on a pre-existing fixed bias.
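The comparison for one base-versus-chat pair reduces to a signed difference. This is a minimal sketch, assuming positive Skew means optimistic and negative means pessimistic, as in the headline results. The function name and the input values are hypothetical.

```python
def alignment_shift(base_skew: float, chat_skew: float) -> dict:
    """Summarize how post-training moved a model's Skew.

    Assumes positive Skew is optimistic and negative Skew is pessimistic.
    """
    delta = chat_skew - base_skew
    if delta > 0:
        direction = "toward optimism"
    elif delta < 0:
        direction = "toward pessimism"
    else:
        direction = "no shift"
    # A sign change means the aligned model ends up on the other side of zero.
    flipped = base_skew * chat_skew < 0
    return {"delta": delta, "direction": direction, "flipped": flipped}


# Invented values, for illustration only.
print(alignment_shift(base_skew=-8.0, chat_skew=9.0))
print(alignment_shift(base_skew=6.0, chat_skew=2.0))
```

Tracking `flipped` separately from `delta` distinguishes a model that crosses zero from one that only moves along the same side.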
What this does not show
The four headline pairs span two families and one recipe each, so this is causal within those families, not a universal law. We treat the broader within-provider gradient as an empirical regularity to explain, not a mechanism we have pinned down.
Model, not language
Does the bias come from a language’s training corpus or from the model? Eleven models across six native-prompt languages answer cleanly.
Rows (models) differ wildly; columns (languages) barely move. The average spread between models within a language is 3.3 times the spread between languages within a model. Model identity dominates, and the language-level positivity gradient seen in human corpora does not transfer.
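The ratio can be reproduced from a model-by-language grid. The sketch below assumes "spread" means a population standard deviation, which the text does not state, and uses an invented grid. `spread_ratio` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def spread_ratio(skew: dict) -> float:
    """Ratio of between-model spread to between-language spread.

    `skew` maps model name -> {language -> Skew}. Spread is taken to be a
    population standard deviation, which is an assumption. Raises
    ZeroDivisionError if no model varies across languages.
    """
    models = list(skew)
    languages = list(skew[models[0]])
    # Spread between models, computed inside each language, then averaged.
    between_models = fmean(
        [pstdev([skew[m][lang] for m in models]) for lang in languages]
    )
    # Spread between languages, computed inside each model, then averaged.
    between_languages = fmean(
        [pstdev([skew[m][lang] for lang in languages]) for m in models]
    )
    return between_models / between_languages


# Invented grid, for illustration only.
grid = {
    "model_a": {"en": 12, "fr": 13, "de": 11},
    "model_b": {"en": -6, "fr": -5, "de": -7},
    "model_c": {"en": 3, "fr": 4, "de": 2},
}
print(round(spread_ratio(grid), 2))
```

A ratio well above 1 means the rows differ more than the columns, which is the pattern described above.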
Widen the panel to every model with full six-language coverage, which adds three models beyond the 16-model headline set.
Almost everything lands on the right (optimist) and low (robust): the bias is large and it does not wash out across languages. The pessimist side is sparse. Anthropic’s Sonnet and Opus are the two strong pessimists and also the most language-stable in the whole set (σ = 0.54 and 0.89). The only other model left of zero is Qwen3-32B, barely past neutral. The volatile corner is mostly small or mid-tier models, so instability across languages is the exception.
Why it matters for forecasting
Language models are being wired into forecasting pipelines, approaching human crowd accuracy on prediction markets (Halawi et al., 2024). A pipeline built on a +13 Skew model will systematically overestimate positive outcomes by roughly 13 points, and no aggregate calibration score on that model will warn you. Worse, users overrely on confident model outputs (Rathi et al., 2025), so the tilt propagates. Skew is cheap to compute, needs no labels, and works as a model-card number: pick the model whose bias profile matches your risk tolerance.
Explore the data
The benchmark and the full response corpus are released on HuggingFace. The data can be browsed by scenario, to see how the models split any of the 60 scenarios, or by model, to see its bias profile across every scenario. Native-language runs are included: English carries all nine models, and the other languages carry the six with full multilingual coverage. The per-model means are computed from the released response corpus.
Limitations
The cross-lingual evidence covers ten languages plus a six-model confirmation; four languages use a mixed-prompt setup, so the language-versus-prompt attribution is approximate for those entries. Skew measures directional self-inconsistency, not deviation from real-world outcomes, so it captures internal coherence rather than calibration against reality. The 60 scenarios are author-constructed and passed one round of internal review; external inter-rater validation is planned for the expanded release.
Conclusion
Helpfulness and probability direction turn out to be two outputs of the same training signal, and the side effect is invisible to every standard check. Expected calibration error (ECE) will not see it, a leaderboard score will not see it, and the model will state a 70% with the same fluency whether or not the complement says 15%.
The uncomfortable part is the ranking. The safest-sounding models are not the most neutral: Anthropic’s frontier pair is the only one that leans pessimistic, the small and cheap models lean most optimistic, and the gap between two models on the same question can reach 30 points. Alignment did not remove the bias; it chose its direction. Before you trust a model’s odds, ask which way its training bent them. That tilt is a single, label-free number you can now measure.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599
- Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024). Approaching Human-Level Forecasting with Language Models. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=FlcdW7NPRY
- Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263–292.
- Rathi, N., Jurafsky, D., & Zhou, K. (2025). Humans overrely on overconfident language models, across languages. Second Conference on Language Modeling. https://openreview.net/forum?id=QsQatTzATT
- Sharot, T. (2011). The Optimism Bias. Current Biology, 21(23), R941–R945.
- Weinstein, N. D. (1980). Unrealistic Optimism About Future Life Events. Journal of Personality and Social Psychology, 39(5), 806–820.
- Zhu, J.-Q., & Griffiths, T. L. (2025). Incoherent Probability Judgments in Large Language Models. https://arxiv.org/abs/2401.16646