OptimismBench

TL;DR

Ask a language model for the probability a startup succeeds. Then, separately, ask for the probability it fails. A coherent judge returns two numbers that sum to 100. They almost never do, and the gap leans one way.

Across 16 models, 14 are optimistic and only Anthropic’s two frontier models are pessimistic. The tilt is not noise: a controlled base-versus-chat probe shows alignment training installs it, and across languages model identity dominates over language.

The problem: direction hides from calibration

The usual lens for “does this model know its probabilities” is calibration, and it is blind here.

Expected calibration error (ECE) aggregates unsigned error (Guo et al., 2017), so a model with ECE = 5 could be uniformly optimistic, uniformly pessimistic, or neither. The sign, the very thing that tells you whether a forecasting agent will over- or under-promise, is averaged away.

The second problem is ground truth. In naturalistic forecasting (“will this person stick to their new exercise routine?”) there is no label to score against. So OptimismBench scores the model against itself: Zhu & Griffiths (2025) showed that language models violate basic probability axioms, and we keep the sign of one such violation as the signal.

Inverted pairs

For each scenario we elicit two probabilities, the positive framing (P⁺) and its complementary framing (P⁻), and define the Skew as the gap between them: Skew = P⁺ − (100 − P⁻). With P⁺ = 70 and P⁻ = 15 the pair sums to 85, so Skew = −15: fifteen points of probability mass are unaccounted for.

Skew calculator: "Probability of success?" P⁺ = 70; "Probability of failure?" P⁻ = 15.

Figure 1. The inverted-pair method. Skew is the probability mass that goes missing when the two framings disagree.

The construction needs no ground truth and it cancels acquiescence: a model that simply agrees with both framings lands at zero. The paired-complement trick is borrowed from human psychophysics (Kahneman & Tversky, 1979); the contribution here is keeping the sign and reading it as directional valence.
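As a minimal sketch of the scoring rule, the snippet below computes Skew for one inverted pair using the formula stated with Table 1, Skew = P⁺ − (100 − P⁻). The function name `skew` and the range check are assumptions made for illustration, not the benchmark's released code.

```python
def skew(p_pos: float, p_neg: float) -> float:
    """Signed Skew for one inverted pair, in percentage points.

    p_pos: elicited probability of the positive outcome (0-100).
    p_neg: elicited probability of the negative outcome (0-100).
    """
    for p in (p_pos, p_neg):
        if not 0 <= p <= 100:
            raise ValueError("probabilities must lie in [0, 100]")
    # Algebraically the same as p_pos + p_neg - 100:
    # zero when the two framings are coherent, signed otherwise.
    return p_pos - (100 - p_neg)


# The calculator example: 70 and 15 sum to 85, so 15 points are missing.
print(skew(70, 15))  # -15
# A coherent pair sums to 100 and scores zero.
print(skew(60, 40))  # 0
```

Keeping the sign, rather than taking the absolute value, is what separates this from an unsigned coherence score.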

Three real cases

The metric is abstract; the failure is not. Here are three scenarios from the benchmark exactly as posed, answered by an optimistic model (GPT-5.4) and a pessimistic one (Sonnet 4.6). The numbers are the models’ actual elicited probabilities. Across domains, the same text pulls the two models in opposite directions.

Case studies: same scenario, opposite tilt. Real Track B elicitations from GPT-5.4 and Sonnet 4.6 on three identical scenarios, each posed in two framings. A coherent judge has the two framings sum to 100; the leftover is Skew. GPT-5.4 overcounts the good side, Sonnet 4.6 the bad side, on identical text.

On the home-exercise scenario the two framings should sum to 100. GPT-5.4 answers 58 and 62, which sum to 120, so it double-counts twenty points, a tilt toward sticking with it. Sonnet answers 31 and 55, which sum to 86, so fourteen points go missing, a tilt toward quitting. Neither model is wrong about any single number you could check; the bias only appears when you ask the same question both ways and keep the sign.
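The arithmetic for the home-exercise scenario, written out with the Skew definition given with Table 1. Which of each model's two answers belongs to the positive framing is an assumption here (the first number is taken as P⁺); because Skew reduces to P⁺ + P⁻ − 100, the result does not depend on that choice.

```latex
\begin{align*}
\text{Skew} &= P^{+} - (100 - P^{-}) = P^{+} + P^{-} - 100 \\
\text{GPT-5.4:}\quad & 58 + 62 - 100 = +20 \\
\text{Sonnet 4.6:}\quad & 31 + 55 - 100 = -14
\end{align*}
```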

Directional bias is pervasive

Every one of the 16 headline models shows nonzero Skew. The pattern spans US commercial APIs, Chinese labs, and European releases. Only Anthropic’s Opus and Sonnet land below zero. The full table, with each row tinted by the magnitude of its bias, is the headline result.

Table 1. Track B Skew, per-item σ, and direction for all 16 models, rows tinted by |Skew|. Columns: Model, Provider, Size, Skew, σ, Direction.

All 16 significant at p < 0.002 (Bonferroni). Skew = P⁺ − (100 − P⁻); σ is the per-item standard deviation.

The 95% confidence intervals of the two pessimistic rows exclude zero, and so do those of every optimist.
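One way such a per-model check could be run is sketched below. The source does not say which test was used, so this is an assumption: a two-sided z-test on the mean per-item Skew (normal approximation), with a Bonferroni threshold of 0.05 divided by the number of models. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def skew_summary(per_item_skew, n_models=16, alpha=0.05):
    """Mean Skew, per-item sigma, 95% CI, and a Bonferroni-corrected z-test.

    per_item_skew: one Skew value per scenario for a single model.
    n_models: number of models tested at once (the Bonferroni divisor).
    """
    n = len(per_item_skew)
    m = mean(per_item_skew)
    sigma = stdev(per_item_skew)           # per-item standard deviation
    se = sigma / sqrt(n)                   # standard error of the mean
    z95 = NormalDist().inv_cdf(0.975)      # about 1.96
    ci95 = (m - z95 * se, m + z95 * se)
    # Two-sided p-value for the null hypothesis "mean Skew is zero".
    p = 2 * (1 - NormalDist().cdf(abs(m / se)))
    return {
        "mean": m,
        "sigma": sigma,
        "ci95": ci95,
        "p": p,
        "significant": p < alpha / n_models,
    }


# Hypothetical per-item Skew values, for illustration only.
print(skew_summary([12, 15, 9, 14, 11, 13, 16, 10]))
```

With only a handful of items a t-distribution would be the more careful choice; the normal approximation is used here to keep the sketch within the standard library.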

Figure 2. Forest plot of Track B Skew with 95% CIs across 16 models, ranked by Skew or grouped by provider.

The smallest, cheapest models tend to be the most optimistic, and they are exactly the ones deployed at scale for cost. Human optimism bias is a well-studied trait (Sharot, 2011; Weinstein, 1980); these models inherit its shape.

Two ways to be biased

Two models with the same Skew can be biased through different mechanisms. Split it into two components, the good-side push and the bad-side push, each measuring how far one estimate sits above 50, and the mechanism shows.

Figure 3. Valence decomposition: good-side push versus bad-side push for six models. Arrows trace each provider's small-to-large shift.

Optimistic models like GPT-5.4-mini inflate both sides. Sonnet’s pessimism is one-sided: it underestimates good outcomes while keeping bad-outcome estimates near accurate. The bias shows up as suppressed opportunity rather than inflated risk.
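A small sketch of the decomposition, assuming each push is simply the estimate minus 50, as the description above suggests. Under that assumption the two pushes add up to Skew, since (P⁺ − 50) + (P⁻ − 50) = P⁺ + P⁻ − 100. The function name is hypothetical, and the inputs reuse the home-exercise answers quoted earlier, with the first number taken as P⁺.

```python
def valence_push(p_pos: float, p_neg: float) -> tuple[float, float]:
    """Split one inverted pair into (good-side push, bad-side push).

    Each push is how far the corresponding estimate sits above 50;
    a negative push means the estimate sits below 50.
    """
    good_push = p_pos - 50
    bad_push = p_neg - 50
    return good_push, bad_push


for name, p_pos, p_neg in [("GPT-5.4", 58, 62), ("Sonnet 4.6", 31, 55)]:
    good, bad = valence_push(p_pos, p_neg)
    # The two pushes sum to the Skew of the pair.
    print(name, good, bad, good + bad)
# GPT-5.4 8 12 20
# Sonnet 4.6 -19 5 -14
```

On this single item the two mechanisms already look different: one model pushes both sides up, while the other pulls the good side down and leaves the bad side close to 50.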

Alignment is a lever

The within-provider gradient (smaller is more optimistic) is suggestive but confounded. To isolate post-training, OptimismBench runs controlled base-versus-chat pairs: same architecture, same pre-training, only the alignment step changes.

Figure 4. Base versus chat Skew across architectures. Each line is one architecture, showing where alignment moves its Skew from base checkpoint to chat: Qwen all shift down, Llama all shift up, Gemma flips.

The shift is real and recipe-specific: every Qwen pair moves toward pessimism, every Llama pair toward optimism over a pessimistic base, and Gemma flips from strongly pessimistic to strongly optimistic. Alignment sets the direction; it is not a perturbation on a pre-existing fixed bias.
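The comparison for one base-versus-chat pair can be sketched as a difference of two Skew values plus a sign-flip check. The function name and every number below are hypothetical, chosen only to mimic the three patterns described; they are not the measured values.

```python
def describe_shift(base_skew: float, chat_skew: float):
    """Return (delta, direction, flipped) for one base-to-chat pair."""
    delta = chat_skew - base_skew
    if delta > 0:
        direction = "toward optimism"
    elif delta < 0:
        direction = "toward pessimism"
    else:
        direction = "no change"
    # A flip means the base and chat checkpoints sit on opposite sides of zero.
    flipped = base_skew * chat_skew < 0
    return delta, direction, flipped


# Hypothetical (base, chat) Skew values, for illustration only.
pairs = {
    "shifts down": (6.0, 2.0),
    "shifts up from a pessimistic base": (-4.0, -1.0),
    "flips sign": (-9.0, 8.0),
}
for label, (base, chat) in pairs.items():
    print(label, describe_shift(base, chat))
```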

What this does not show

The four headline pairs span two families and one recipe each, so this is causal within those families, not a universal law. We treat the broader within-provider gradient as an empirical regularity to explain, not a mechanism we have pinned down.

Model, not language

Does the bias come from a language’s training corpus or from the model? Eleven models across six native-prompt languages answer cleanly.

Figure 5. Cross-lingual Skew heatmap across 11 models and 6 languages. Read by row and by column: rows vary, columns barely move. Inter-model variance is 3.3× inter-language variance.

Rows (models) differ wildly; columns (languages) barely move. The average spread between models within a language is 3.3 times the spread between languages within a model. Model identity dominates, and the language-level positivity gradient seen in human corpora does not transfer.
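A sketch of how such a ratio could be computed from a model-by-language Skew matrix: average the spread across models within each language, average the spread across languages within each model, and divide. The use of the population standard deviation as the spread measure is an assumption, and the matrix below is hypothetical data, not the benchmark's.

```python
from statistics import mean, pstdev


def spread_ratio(skew_by_model: dict[str, list[float]]) -> float:
    """Ratio of between-model spread to between-language spread.

    skew_by_model maps each model to its Skew per language,
    with languages in the same order for every model.
    """
    rows = list(skew_by_model.values())
    n_langs = len(rows[0])
    if any(len(row) != n_langs for row in rows):
        raise ValueError("every model needs one Skew value per language")
    # Column-wise: spread between models within each language, then averaged.
    between_models = mean(pstdev(col) for col in zip(*rows))
    # Row-wise: spread between languages within each model, then averaged.
    between_langs = mean(pstdev(row) for row in rows)
    # Raises ZeroDivisionError if no model varies across languages at all.
    return between_models / between_langs


# Hypothetical matrix: three models, three languages.
example = {
    "model_a": [12, 13, 11],
    "model_b": [3, 2, 4],
    "model_c": [-8, -9, -7],
}
print(spread_ratio(example) > 1)  # True: rows differ more than columns
```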

Widen the panel to every model with full six-language coverage, which adds three models beyond the 16-model headline set.

Collapsing each row to two numbers, its mean Skew and how much that Skew wobbles across the six languages, sorts the whole fleet into four corners. The useful reading is the bottom band: a model can be both biased and stable, the worst case for a deployer, because the tilt is large and cannot be averaged away by switching prompt language.

Figure 6. Bias-stability plane: how big the bias is and how stable it is across languages. Mean Skew across six native-prompt languages (horizontal) versus inter-language σ (vertical) for 17 models. Right of the line is optimist, left is pessimist; below the dashed line is language-robust. The strongly pessimistic, language-stable corner belongs to Anthropic.

Almost everything lands on the right (optimist) and low (robust): the bias is large and it does not wash out across languages. The pessimist side is sparse. Anthropic’s Sonnet and Opus are the two strong pessimists and also the most language-stable in the whole set (σ = 0.54 and 0.89). The only other model left of zero is Qwen3-32B, barely past neutral. The volatile corner is mostly small or mid-tier models, so instability across languages is the exception.
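Placing one model on the plane takes two numbers, which the sketch below computes from its per-language Skew values. The stability cutoff `sigma_cut` stands in for the dashed line, whose actual value is not given here, so it is a hypothetical parameter; the use of the population standard deviation and all sample values are likewise assumptions.

```python
from statistics import mean, pstdev


def bias_stability(skews_by_language: list[float], sigma_cut: float):
    """Return (mean Skew, inter-language sigma, corner label) for one model."""
    m = mean(skews_by_language)
    s = pstdev(skews_by_language)
    side = "optimist" if m > 0 else "pessimist" if m < 0 else "neutral"
    stability = "robust" if s < sigma_cut else "volatile"
    return m, s, f"{side}, {stability}"


# Hypothetical Skew values across six languages, for illustration only.
fleet = {
    "stable optimist": [10, 11, 9, 10, 12, 8],
    "stable pessimist": [-12, -12.5, -11.5, -12, -13, -11],
    "volatile": [2, 14, -6, 9, -3, 8],
}
for name, skews in fleet.items():
    print(name, bias_stability(skews, sigma_cut=3.0))
```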

Why it matters for forecasting

Language models are being wired into forecasting pipelines, approaching human crowd accuracy on prediction markets (Halawi et al., 2024). A pipeline built on a +13 Skew model will systematically overestimate positive outcomes by roughly 13 points, and no aggregate calibration score on that model will warn you. Worse, users overrely on confident model outputs (Rathi et al., 2025), so the tilt propagates. Skew is cheap to compute, needs no labels, and works as a model-card number: pick the model whose bias profile matches your risk tolerance.

Explore the data

The benchmark and the full response corpus are released on Hugging Face. Browse it directly below: pick any of the 60 scenarios to see how the models split it, or pick a model to see its bias profile across every scenario. Switch the prompt language to read the native-language runs (English carries all nine models; the other languages carry the six with full multilingual coverage), and toggle individual models on or off. The per-model means are computed from the released response corpus.

Dataset explorer: 🤗 seonglae/OptimismBench. Mean P(positive)/P(negative) per Track B scenario per model, with directional Skew, browsable by scenario or by model, with a prompt-language switch and per-model toggles. Each value is the mean of 10 runs per item. P(positive) and P(negative) are the two framings; a coherent judge sums to 100, the leftover is Skew. Per-model means are precomputed from the released 🤗 response corpus.
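The aggregation described above (average the runs per item, score each item, then average per model) can be sketched as follows. The record layout, with keys `model`, `scenario`, `framing`, and `probability`, is hypothetical and is not the schema of the released corpus; adapt the field names to whatever the dataset actually uses.

```python
from collections import defaultdict
from statistics import mean


def per_model_skew(records):
    """Mean Skew per model from raw elicitation records.

    Each record is a dict with hypothetical keys:
    model, scenario, framing ("positive" or "negative"), probability (0-100).
    """
    # Group repeated runs by (model, scenario, framing).
    runs = defaultdict(list)
    for r in records:
        runs[(r["model"], r["scenario"], r["framing"])].append(r["probability"])

    item_skews = defaultdict(list)
    for (model, scenario, framing), probs in runs.items():
        if framing != "positive":
            continue
        neg = runs.get((model, scenario, "negative"))
        if not neg:
            continue  # skip items that lack the complementary framing
        p_pos, p_neg = mean(probs), mean(neg)  # mean over runs
        item_skews[model].append(p_pos - (100 - p_neg))

    return {model: mean(values) for model, values in item_skews.items()}


# Hypothetical records: one model, two scenarios, two runs each.
records = [
    {"model": "m1", "scenario": "s1", "framing": "positive", "probability": 60},
    {"model": "m1", "scenario": "s1", "framing": "positive", "probability": 56},
    {"model": "m1", "scenario": "s1", "framing": "negative", "probability": 62},
    {"model": "m1", "scenario": "s1", "framing": "negative", "probability": 62},
    {"model": "m1", "scenario": "s2", "framing": "positive", "probability": 50},
    {"model": "m1", "scenario": "s2", "framing": "negative", "probability": 50},
]
print(per_model_skew(records))
```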

Limitations

The cross-lingual evidence covers ten languages plus a six-model confirmation; four languages use a mixed-prompt setup, so the language-versus-prompt attribution is approximate for those entries. Skew measures directional self-inconsistency, not deviation from real-world outcomes, so it captures internal coherence rather than calibration against reality. The 60 scenarios are author-constructed and passed one round of internal review; external inter-rater validation is planned for the expanded release.

Conclusion

Helpfulness and probability direction turn out to be two outputs of the same training signal, and the side-effect is invisible to every standard check. ECE will not see it, a leaderboard score will not see it, and the model will state a 70% with the same fluency whether or not the complement says 15%.

The uncomfortable part is the ranking. The safest-sounding models are not the most neutral: Anthropic’s frontier pair is the only one that leans pessimistic, the small and cheap models lean most optimistic, and the gap between two models on the same question can reach 30 points. Alignment did not remove the bias, it chose its direction. Before you trust a model’s odds, ask which way its training bent them; that is a one-number, label-free thing you can now measure.

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599
Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024). Approaching Human-Level Forecasting with Language Models. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=FlcdW7NPRY
Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263–292.
Rathi, N., Jurafsky, D., & Zhou, K. (2025). Humans overrely on overconfident language models, across languages. Second Conference on Language Modeling. https://openreview.net/forum?id=QsQatTzATT
Sharot, T. (2011). The Optimism Bias. Current Biology, 21(23), R941–R945.
Weinstein, N. D. (1980). Unrealistic Optimism About Future Life Events. Journal of Personality and Social Psychology, 39(5), 806–820.
Zhu, J.-Q., & Griffiths, T. L. (2025). Incoherent Probability Judgments in Large Language Models. https://arxiv.org/abs/2401.16646