The Confidence Manifold

How Language Models Secretly Know When They're Wrong

Interactive 3D manifold: 817 samples, PLS projection, correct vs. incorrect clusters.

Published Feb. 8, 2026

When a language model asserts that “the capital of Australia is Sydney,” does it know this is wrong? The output gives no indication; the answer is delivered with the same confident fluency as a correct one. But internally, LLMs encode a rich correctness signal that is never expressed in their outputs. It lives in a surprisingly low-dimensional subspace of just 3 to 8 dimensions, consistent across architectures. We call this subspace the confidence manifold. By studying 11 models across 5 architecture families, from 124M-parameter GPT-2 to 14B-parameter Qwen2.5, we reveal the shape of this hidden knowledge.


The Problem

Confident Hallucinations

When we assess an LLM’s confidence, we typically look at output signals: token probabilities, distribution entropy, or simply asking “are you sure?”

None of these work well.

Models can be confidently wrong. On misconception questions, models assert false beliefs with high probability. The output distribution reveals uncertainty (does the model hedge?), not correctness (is the answer right?).

Uncertainty ≠ Correctness.

The model knows when it’s wrong. It just doesn’t tell you. Press Play below to watch a simulated generation. The output confidence stays high, but the internal probe drops on hallucinated sentences.

Output vs Internal Confidence
Watch the model generate sentence by sentence. Output confidence (orange) reflects token probabilities. Internal probe confidence (gray) reveals what the model actually knows. Red highlights mark confident hallucinations.

Inside the Residual Stream

Linear Directions

Transformer residual streams encode abstract concepts as linear directions in activation space; king − man + woman ≈ queen. Previous work found such truth directions in activations: Burns et al. discovered them without supervision, Marks & Tegmark mapped their geometry, and Li et al. showed they’re causally relevant.

These works focused on detecting the signal. We ask: what is its geometry?

Extraction

For each question-answer pair, we extract the hidden state at the last token position across all transformer layers. Using Partial Least Squares regression, we project the high-dimensional vector (768–4096 dimensions) onto a subspace that maximally separates correct from incorrect.

p(\text{correct} \mid \mathbf{h}) = \sigma(\mathbf{w}^\top \mathbf{h} + b)

The probe is logistic regression with L2 regularization. No deep networks needed. Just a hyperplane.

But before probing, we need to reduce dimensionality. PLS finds directions that maximize covariance with the correctness label:

\mathbf{W}_k = \arg\max_{\mathbf{W}} \text{cov}(\mathbf{X}\mathbf{W}, \mathbf{y})^\top \text{cov}(\mathbf{X}\mathbf{W}, \mathbf{y})

Why PLS over PCA? PCA maximizes variance; it finds where activations spread most, regardless of correctness. PLS maximizes covariance with the label; it finds where correct and incorrect activations differ most. In a 4096-dimensional space, the highest-variance directions may have nothing to do with correctness.

Why not Sparse Autoencoders?

Individual SAE features explain less than 6% of variance. Confidence is distributed across many features, not localized to interpretable atoms, motivating the probe-based approach.

Confound checks

The probe detects semantics, not surface features. A length-only probe achieves just 0.54 AUC (r = −0.016). Paraphrase control (5 variants per answer) yields F-ratio 17.40: correctness drives separation 17× more than style.


3–8 Dimensions

This is our central finding.

Across all 11 models, the discriminative signal saturates at 3–8 PLS dimensions, yet full-dimensional probes perform significantly worse, confirming the signal lives in a compact subspace.

Dimension Sweep
Click models to toggle. Hover to highlight. AUC saturates at 3–8 PLS dimensions.

Why does performance decrease with more dimensions? The extras capture noise, not signal. PLS finds the maximally-separating subspace; beyond 3–8 components, what remains hurts generalization.

A 7B model with 4096-dimensional hidden states encodes correctness in just 5 dimensions, 0.12% of its representational capacity.

| Model | Params | Hidden Dim | Optimal k | AUC | AUC (32D) |
|---|---|---|---|---|---|
| GPT-2 | 124M | 768 | 3 | 0.758 | 0.621 |
| GPT-2-Med | 355M | 1,024 | 3 | 0.790 | 0.676 |
| GPT-2-Large | 774M | 1,280 | 3 | 0.791 | 0.673 |
| LLaMA-1B | 1B | 2,048 | 4 | 0.895 | 0.813 |
| Qwen2-1.5B | 1.5B | 1,536 | 8 | 0.873 | 0.810 |
| Gemma-2B | 2B | 2,304 | 3 | 0.861 | 0.751 |
| LLaMA-3.2-3B | 3B | 3,072 | 5 | 0.919 | 0.881 |
| Qwen2-7B | 7B | 3,584 | 5 | 0.896 | 0.843 |
| Mistral-7B | 7B | 4,096 | 5 | 0.902 | 0.849 |
| LLaMA-3.1-8B | 8B | 4,096 | 2 | 0.931 | 0.890 |
| Qwen2.5-14B | 14B | 5,120 | 2 | 0.951 | 0.939 |

The Output Gap

How do internal methods compare to standard output-based approaches? We tested 6 geometric classifiers in the 8D PLS subspace alongside 4 output-based baselines.

The Output Gap
Detection performance (AUC) of output-based methods and 6 geometric classifiers in 8D PLS space. Select a model to see individual results.

The gap is dramatic. Internal probes outperform the best output method by 23–37 percentage points (38–67% relative improvement). Even the training-free centroid classifier, using just cluster means in 8D PLS space, far exceeds any output signal. No nonlinear method improves over linear. The signal is a mean shift, not a covariance difference.

The centroid classifier

The simplest classifier that works: compute the mean of each class in PLS space, then classify by nearest centroid.

\hat{y} = \arg\min_{c \in \{0,1\}} \| \mathbf{z} - \boldsymbol{\mu}_c \|_2

No training loop. No hyperparameters. Just two mean vectors and a distance comparison. It matches the full logistic probe within 1–2% AUC.
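A minimal sketch of this classifier, assuming `Z` holds PLS-projected activations and `y` the correctness labels (toy Gaussian data here, not the paper’s):

```python
import numpy as np

def fit_centroids(Z, y):
    """Class means in PLS space: 0 = incorrect, 1 = correct."""
    return Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)

def predict(Z, mu_incorrect, mu_correct):
    """Label each row by its nearest centroid (1 = predicted correct)."""
    d0 = np.linalg.norm(Z - mu_incorrect, axis=1)
    d1 = np.linalg.norm(Z - mu_correct, axis=1)
    return (d1 < d0).astype(int)

# Toy usage: two shifted Gaussian clusters in an 8D "PLS space"
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
Z = rng.normal(size=(200, 8)) + 3.0 * y[:, None]
mu0, mu1 = fit_centroids(Z, y)
acc = (predict(Z, mu0, mu1) == y).mean()
print(f"accuracy: {acc:.2f}")
```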

Production scaling

This simplicity scales to production. Anthropic’s Constitutional Classifier++ uses linear probes on internal activations to reduce jailbreak success from 86% to 4.4% with just 1% compute overhead.

Unsupervised baselines

Unsupervised features (L2 norm, reconstruction error, LOF) reach at most 0.62 AUC. Some labels are necessary. But very few: centroid matches probe with just two mean vectors.

Manifold Curvature

Linear separability doesn’t mean the manifold is flat. A hyperplane can separate two clusters even if each cluster curves through activation space like a crumpled sheet of paper.

Across all models, geodesic/Euclidean ratios land at 1.39–1.50 in the PLS subspace. For comparison, PCA projections of the same activations give 1.24–1.28. The PLS directions that maximize correctness separation also reveal more curvature than the directions that maximize variance. The manifold isn’t just low-dimensional. It’s bent.
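One common way to estimate such ratios (an assumption here; the article does not spell out its estimator) is to approximate geodesic distances by shortest paths on a k-nearest-neighbor graph, Isomap-style, and divide by the straight-line distances. A toy sketch on a synthetic half-circle, not real activations:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(6)
theta = np.sort(rng.uniform(0, np.pi, 300))
Z = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # points on a half-circle

# Geodesic distance ~ shortest path along the k-NN graph of the point cloud
knn = kneighbors_graph(Z, n_neighbors=8, mode="distance")
geo = shortest_path(knn, directed=False)
euc = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)  # straight-line distance

mask = euc > 1e-9                  # skip self-pairs
ratio = np.mean(geo[mask] / euc[mask])
print(f"mean geodesic/Euclidean ratio: {ratio:.2f}")  # > 1.0: the arc bends
```

A flat point cloud gives a ratio near 1.0; curvature pushes it above 1, which is the sense in which ratios of 1.39–1.50 indicate a bent manifold.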

Manifold Curvature
Geodesic/Euclidean distance ratios across models. Ratios above 1.0 indicate non-flat geometry. PLS subspace consistently shows higher curvature than PCA baseline.

Why does this matter? If the surface were flat, PCA would capture it. The curvature explains why supervised PLS outperforms unsupervised dimensionality reduction. You need directions that follow the bend of the correctness signal, not just the spread of activations.


Layer Evolution

Not all layers encode correctness equally. We find a three-phase pattern across all architectures.

Layer-wise Signal
AUC and intrinsic dimension per layer for all 11 models. Click models to toggle. Switch between AUC and Intrinsic Dimension.

Phase I (0–30%): Token processing. Probe accuracy near chance. The model hasn’t yet integrated semantic meaning.

Phase II (30–70%): Semantic integration. The signal emerges through gradual rotation of the activation geometry. Intrinsic dimension compresses from 20–55D to 8–12D.

Phase III (70–100%): Stable encoding. Cross-layer similarity reaches 0.81. The signal has crystallized into its final form.

| Phase | Depth | Process | Dim | AUC |
|---|---|---|---|---|
| I | 0–30% | Token features | 20–55D | ~0.50 |
| II | 30–70% | Semantic integration | → 8–12D | 0.50 → 0.85 |
| III | 70–100% | Stable encoding | 8–12D | 0.85–0.97 |

Layer selection is 11× more impactful than dimension selection. Tuning the layer from 0% to optimal yields 77% AUC improvement; tuning dimensions from 1D to 5D yields only 7%.

Optimal depth varies: base models peak at ~100%, instruction-tuned at 43–75%. Instruction tuning pushes correctness encoding earlier.


Steering

Finding correlation doesn’t prove causation. We validate through activation steering:

\mathbf{h}' = \mathbf{h} + \alpha \cdot \hat{\mathbf{w}}

Add a scaled version of the learned direction, then let the model generate.
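The intervention itself is a one-line vector addition. A sketch with a stand-in direction `w` (in practice the addition is applied via a forward hook at the chosen layer of the model):

```python
import numpy as np

def steer(h, w, alpha):
    """h' = h + alpha * w_hat, with w normalized to unit length."""
    w_hat = w / np.linalg.norm(w)
    return h + alpha * w_hat

rng = np.random.default_rng(2)
h = rng.normal(size=4096)          # one hidden state (e.g. Mistral-7B width)
w = rng.normal(size=4096)          # stand-in for the learned confidence direction
h_conf = steer(h, w, alpha=+5.0)   # push toward confidence
h_unc = steer(h, w, alpha=-5.0)    # push toward uncertainty
# The intervention moves h by exactly |alpha| along the unit direction:
print(np.linalg.norm(h_conf - h), np.linalg.norm(h_unc - h))  # both ~ 5.0
```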

Activation Steering
Drag the slider to steer between uncertainty and confidence. Real data from Qwen2-7B-Instruct on TruthfulQA.

Learned direction: 10.9 percentage point swing (α=−5 to α=+5). Random and orthogonal controls: ~1.7pp. For comparison, Li et al.’s Inference-Time Intervention on TruthfulQA achieved ~12pp using 48 attention heads; we match this with a single direction.

The direction doesn’t just correlate with correctness. It controls it.

Practical implication

This means we can tune model confidence at inference time without retraining. Add the direction to reduce hallucinations; subtract it to probe model uncertainty. The cost is a single vector addition per layer, negligible compared to the forward pass.

Concept Erasure

Steering shows the direction is causally sufficient; adding it changes behavior. But is it necessary? We test this by erasing it.

We remove the top PLS direction from the hidden states and retrain the probe from scratch. The result: AUC drops to near chance across all models. Erasing a single direction in a 768–4096 dimensional space, under 0.2% of the representational capacity, is enough to eliminate the correctness signal.
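Mechanically, the erasure is an orthogonal projection: remove each hidden state’s component along the top direction, then refit. A sketch, with `w` as a stand-in for the fitted PLS direction:

```python
import numpy as np

def erase_direction(H, w):
    """Project each row of H onto the orthogonal complement of w."""
    w_hat = w / np.linalg.norm(w)
    return H - np.outer(H @ w_hat, w_hat)

rng = np.random.default_rng(3)
H = rng.normal(size=(100, 768))   # hidden states (rows)
w = rng.normal(size=768)          # stand-in for the top PLS direction
H_erased = erase_direction(H, w)
# After erasure, every hidden state is orthogonal to w:
print(np.abs(H_erased @ (w / np.linalg.norm(w))).max())  # ~ 0
```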

Concept Erasure
Probe AUC after erasing the top PLS direction. A single direction accounts for nearly all detectable signal.

This independently confirms what the dimension sweep suggests: the signal doesn’t just peak at low dimensions; it lives there. The PLS subspace isn’t a convenient summary. It’s where the information is.


Across Architectures

The structure is consistent across 5 families, 11 models, and a 113× parameter range. The two-cluster structure (correct vs. incorrect) appears universally. Switch between models in the 3D manifold at the top of this article.

Model Scaling
Best AUC vs parameters (log scale). Toggle between scaling curve and optimal depth. Base models peak at network end; instruct models peak earlier.

Two patterns emerge. First, larger models encode stronger signals, from 0.80 AUC at 124M to 0.97 at 3B. But the relationship saturates: 7B models don’t outperform 3B. The signal is extractable even from GPT-2.

Second, instruction tuning shifts the optimal layer earlier. Base models (GPT-2 family) peak at final layers (100% depth). The signal crystallizes at the very end. Instruction-tuned models peak at 43–75%. RLHF appears to push correctness encoding into middle layers, perhaps because the model learns to “commit” to an answer earlier.

Cross-model transfer retains 54–92% of performance.

Full 11-model results
| Model | Family | Params | Optimal Layer | Depth | Optimal k | AUC |
|---|---|---|---|---|---|---|
| GPT-2 | GPT | 124M | L11/12 | 100% | 3 | 0.80 |
| GPT-2 Medium | GPT | 355M | L23/24 | 100% | 3 | 0.84 |
| GPT-2 Large | GPT | 774M | L35/36 | 100% | 3 | 0.84 |
| Llama-3.2-1B | Llama | 1B | L8/16 | 56% | 4 | 0.93 |
| Qwen2-1.5B | Qwen | 1.5B | L16/28 | 61% | 8 | 0.91 |
| Gemma-2-2B | Gemma | 2B | L15/26 | 62% | 3 | 0.93 |
| Llama-3.2-3B | Llama | 3B | L12/28 | 43% | 5 | 0.97 |
| Qwen2-7B | Qwen | 7B | L20/28 | 75% | 5 | 0.94 |
| Mistral-7B | Mistral | 7B | L23/32 | 75% | 5 | 0.92 |
| Llama-3.1-8B | Llama | 8B | L16/32 | 50% | 2 | 0.92 |
| Qwen2.5-14B | Qwen | 14B | L32/48 | 67% | 2 | 0.95 |

The GPT-2 family peaks at the final layer. Larger models peak at 43–75% depth. All models saturate at 2–8 PLS dimensions despite hidden dimensions ranging from 768 to 5,120.

Few-shot Efficiency

The centroid classifier requires no training: just compute means of correct vs. incorrect activations. Even with limited labels, detection works:

Few-shot Efficiency
Probe AUC vs. number of labeled examples (GPT-2, 3 seeds). Just 25 examples achieve 0.69 AUC.

With just 25 labeled examples, the centroid achieves 0.69 AUC, 89% of full-data performance.


What This Means

LLMs have rich internal correctness representations:

The practical upshot: single-pass hallucination detection that requires no extra generation, no sampling, and no external model, just a matrix multiply and a distance comparison on activations you already have.

Detection pipeline in practice
  1. Extract the hidden state \mathbf{h} at the last token position, at the optimal layer
  2. Project to the PLS subspace: \mathbf{z} = \mathbf{W}_k^\top \mathbf{h} (k = 5 components)
  3. Classify by nearest centroid: compare \|\mathbf{z} - \boldsymbol{\mu}_\text{correct}\| vs \|\mathbf{z} - \boldsymbol{\mu}_\text{incorrect}\|
  4. Flag if classified as incorrect

Compute cost: under 1ms per token on CPU. The projection is a single matrix multiply (4096 × 5), and classification is two distance computations. No GPU required for inference; only for the initial extraction of the PLS directions.
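The whole pipeline fits in one small function. Everything here (`W_k`, the centroids, the names) is an illustrative placeholder for quantities fit offline as described above:

```python
import numpy as np

def flag_hallucination(h, W_k, mu_correct, mu_incorrect):
    """Steps 2-4: project, compare centroid distances, flag if 'incorrect' is nearer."""
    z = W_k.T @ h                                  # project: hidden dim -> k
    d_correct = np.linalg.norm(z - mu_correct)
    d_incorrect = np.linalg.norm(z - mu_incorrect)
    return d_incorrect < d_correct

# Toy usage with a random projection and made-up centroids
rng = np.random.default_rng(4)
W_k = rng.normal(size=(4096, 5))
mu_c, mu_i = np.ones(5), -np.ones(5)
# Construct h so that its projection lands exactly on the "incorrect" centroid
h = W_k @ np.linalg.solve(W_k.T @ W_k, mu_i)
print(flag_hallucination(h, W_k, mu_c, mu_i))  # True: flagged as hallucination
```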

Why PLS Matters

A natural baseline for finding a “correctness direction” is the difference-of-means (DoM) vector: \hat{\mathbf{d}} = \boldsymbol{\mu}_\text{correct} - \boldsymbol{\mu}_\text{incorrect}. It works in low-noise settings. But in full activation space, it fails.

On GPT-2, DoM achieves 0.47 AUC, below chance. The full 768-dimensional difference vector points in a direction dominated by noise. PCA preprocessing often makes things worse (4 of 5 models), because PCA preserves high-variance noise dimensions that have nothing to do with correctness.

Only supervised PLS reduction rescues the signal. By first projecting to the 5D subspace that maximizes covariance with correctness labels, centroid distance jumps to 0.77 AUC. The signal was always there; it was just drowned out in 768 irrelevant dimensions.

PLS vs Difference-of-Means
DoM performance in full space, PCA-reduced space, and PLS-reduced space across 5 models. PLS is the only reduction that makes detection viable.

Limitations

Generation gap. Teacher-forced probing outperforms probing on model-generated text.

Generation Gap
Teacher-forced vs. model-generated probe AUC across models. The gap is significant but the signal persists.

Domain specificity. Cross-domain transfer drops to near chance (0.47–0.51 AUC) with full-dimensional probes. This is our most important negative result: the probe does not find a universal “truth detector.” It finds a task-specific correctness pattern that must be recalibrated per domain.

Cross-dataset Transfer
Transfer matrix (GPT-2) and per-model in-domain vs cross-domain AUC. Switch between heatmap and bar views.

Probes trained on TruthfulQA transfer poorly to SciQ, CommonsenseQA, HaluEval, and FEVER. The diagonal dominates: each dataset’s signal is largely self-contained. This suggests models encode domain-specific correctness patterns, not a universal “truth” feature. However, PLS projection helps: on Qwen2-7B, projecting to 5D improves cross-domain transfer by 10–14% absolute AUC (0.61–0.79 vs 0.47–0.69 full-dim), by discarding dataset-specific variance.

The output gap. The signal exists internally but isn’t expressed. Why it remains unexpressed is the most important open question.

Models encode correctness with geometric precision internally, yet fail to express it. Understanding this gap, and closing it, may be the most direct path to self-correcting language models.