When a language model asserts that “the capital of Australia is Sydney,” does it know this is wrong? The output gives no indication — the answer is delivered with the same confident fluency as a correct one. But internally, LLMs encode a rich correctness signal that is never expressed in their outputs. It lives in a surprisingly low-dimensional subspace of just 3 to 8 dimensions, consistent across architectures. We call this subspace the confidence manifold. By studying 9 models across 5 architecture families, from 124M-parameter GPT-2 to 7B-parameter Mistral, we reveal the shape of this hidden knowledge.
The Problem
Confident Hallucinations
When we assess an LLM’s confidence, we typically look at output signals: token probabilities, distribution entropy, or simply asking “are you sure?”
None of these work well.
Models can be confidently wrong. On misconception questions, models assert false beliefs with high probability. The output distribution reveals uncertainty (does the model hedge?), not correctness (is the answer right?).
Uncertainty ≠ Correctness.
The model knows when it's wrong. It just doesn't tell you. In simulated generations, the output confidence stays high while the internal probe's score drops on hallucinated sentences.
Inside the Residual Stream
Linear Directions
Transformer residual streams encode abstract concepts as linear directions in activation space; prior work has extracted such directions for truthfulness and related properties.
These works focused on detecting the signal. We ask: what is its geometry?
Extraction
For each question-answer pair, we extract the hidden state at the last token position across all transformer layers. Using these last-token activations, we train a probe to predict whether the answer is correct.
The probe is logistic regression with L2 regularization. No deep networks needed. Just a hyperplane.
But before probing, we need to reduce dimensionality. PLS finds directions that maximize covariance with the correctness label: the first component is the unit vector w maximizing Cov(Xw, y).
Why PLS over PCA? PCA maximizes variance — it finds where activations spread most, regardless of correctness. PLS maximizes covariance with the label — it finds where correct and incorrect activations differ most. In a 4096-dimensional space, the highest-variance directions may have nothing to do with correctness.
Why not Sparse Autoencoders?
Individual SAE features explain less than 6% of variance. Confidence is distributed across many features, not localized to interpretable atoms, motivating the probe-based approach.
Confound checks
The probe detects semantics, not surface features. A length-only probe achieves just 0.54 AUC (r = −0.016). Paraphrase control (5 variants per answer) yields F-ratio 17.40: correctness drives separation 17× more than style.
3–8 Dimensions
This is our central finding.
Across all 9 models, the discriminative signal saturates at 3–8 PLS dimensions, and full-dimensional probes perform significantly worse, confirming the signal lives in a compact subspace.
Why does performance decrease with more dimensions? The extras capture noise, not signal. PLS finds the maximally-separating subspace; beyond 3–8 components, what remains hurts generalization.
A 7B model with 4096-dimensional hidden states encodes correctness in just 5 dimensions, 0.12% of its representational capacity.
| Model | Params | Hidden Dim | Optimal k | AUC (optimal k) | AUC (32D) |
|---|---|---|---|---|---|
| GPT-2 | 124M | 768 | 3 | 0.758 | 0.621 |
| Qwen2-1.5B | 1.5B | 1,536 | 8 | 0.873 | 0.810 |
| LLaMA-3.2-3B | 3B | 3,072 | 5 | 0.919 | 0.881 |
| Mistral-7B | 7B | 4,096 | 5 | 0.902 | 0.849 |
The Output Gap
How do internal methods compare to standard output-based approaches? We tested 6 geometric classifiers in the 8D PLS subspace alongside 4 output-based baselines.
The gap is dramatic. Internal probes outperform the best output method by 23–37 percentage points (38–67% relative improvement). Even the training-free centroid classifier, using just cluster means in 8D PLS space, far exceeds any output signal. No nonlinear method improves over linear. The signal is a mean shift, not a covariance difference.
The simplest classifier that works: compute the mean of each class in PLS space, then classify by nearest centroid.
No training loop. No hyperparameters. Just two mean vectors and a distance comparison. It matches the full logistic probe within 1–2% AUC.
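That classifier fits in a few lines. A minimal sketch (toy data, not our evaluation harness): estimate the two class means in the reduced space, then classify by Euclidean distance.

```python
import numpy as np

def centroid_classifier(Z_train, y_train):
    """Return a nearest-centroid decision function over PLS-space points."""
    mu_correct = Z_train[y_train == 1].mean(axis=0)
    mu_incorrect = Z_train[y_train == 0].mean(axis=0)
    def predict(Z):
        d_c = np.linalg.norm(Z - mu_correct, axis=1)
        d_i = np.linalg.norm(Z - mu_incorrect, axis=1)
        return (d_c < d_i).astype(int)   # 1 = closer to the "correct" mean
    return predict

# Toy 8D example with a pure mean shift between classes.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
Z = rng.normal(size=(500, 8)) + 2.0 * y[:, None] * np.eye(8)[0]
predict = centroid_classifier(Z, y)
acc = (predict(Z) == y).mean()
print(f"centroid accuracy: {acc:.2f}")
```

Because the signal is a mean shift rather than a covariance difference, this distance comparison is essentially the Bayes-optimal rule for the toy setup.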
Production scaling
This simplicity scales to production. Anthropic’s Constitutional Classifier++ uses linear probes on internal activations to reduce jailbreak success from 86% to 4.4% with just 1% compute overhead.
Unsupervised baselines
Unsupervised features (L2 norm, reconstruction error, LOF) reach at most 0.62 AUC. Some labels are necessary. But very few: centroid matches probe with just two mean vectors.
Layer Evolution
Not all layers encode correctness equally. We find a three-phase pattern across all architectures.
Phase I (0–30%): Token processing. Probe accuracy near chance. The model hasn’t yet integrated semantic meaning.
Phase II (30–70%): Semantic integration. The signal emerges through gradual rotation of the activation geometry.
Phase III (70–100%): Stable encoding. Cross-layer similarity reaches 0.81. The signal has crystallized into its final form.
| Phase | Depth | What happens | Intrinsic Dim | AUC |
|---|---|---|---|---|
| I — Token processing | 0–30% | Tokenization, positional encoding | 20–55D | ~0.50 |
| II — Semantic integration | 30–70% | Meaning builds, geometry rotates | drops to 8–12D | 0.50 → 0.85 |
| III — Stable encoding | 70–100% | Signal crystallized, high cross-layer similarity | 8–12D | 0.85–0.97 |
Layer vs dimension impact
Layer selection is 11× more impactful than dimension selection. Tuning the layer from 0% to optimal yields 77% AUC improvement; tuning dimensions from 1D to 5D yields only 7%.
Optimal depth by model type
Optimal depth varies: base models peak at ~100%, instruction-tuned at 43–75%. Instruction tuning pushes correctness encoding earlier.
Steering
Finding correlation doesn't prove causation. We validate through causal steering interventions.
Add a scaled version of the learned direction, then let the model generate.
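The intervention itself is a single vector addition to the residual stream at one layer. A minimal sketch (shapes and names are illustrative; in practice this runs inside a forward hook during generation):

```python
import numpy as np

def steer(hidden_state, direction, alpha):
    """Add a scaled copy of the learned correctness direction to the
    residual stream at one layer; alpha > 0 pushes toward 'correct'."""
    d = direction / np.linalg.norm(direction)   # unit-normalize the direction
    return hidden_state + alpha * d

# Shapes only: a (seq_len, hidden_dim) activation and a learned direction.
h = np.zeros((10, 4096))
v = np.random.default_rng(3).normal(size=4096)
h_plus = steer(h, v, alpha=+5.0)    # encourage correctness
h_minus = steer(h, v, alpha=-5.0)   # suppress it
```

The α = ±5 endpoints here match the sweep reported below.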
Learned direction: 10.9 percentage point swing (α=−5 to α=+5). Random and orthogonal controls: ~1.7pp. For comparison, Li et al.’s Inference-Time Intervention on TruthfulQA achieved ~12pp using 48 attention heads; we match this with a single direction.
The direction doesn’t just correlate with correctness. It controls it.
This means we can tune model confidence at inference time without retraining. Add the direction to reduce hallucinations; subtract it to probe model uncertainty. The cost is a single vector addition per layer — negligible compared to the forward pass.
Across Architectures
The structure is consistent across 5 families and a 56× parameter range. The two-cluster structure (correct vs. incorrect) appears universally.
Two patterns emerge. First, larger models encode stronger signals, from 0.80 AUC at 124M to 0.97 at 3B. But the relationship saturates: 7B models don’t outperform 3B. The signal is extractable even from GPT-2.
Second, instruction tuning shifts the optimal layer earlier. Base models (GPT-2 family) peak at final layers (100% depth). The signal crystallizes at the very end. Instruction-tuned models peak at 43–75%. RLHF appears to push correctness encoding into middle layers, perhaps because the model learns to “commit” to an answer earlier.
Cross-model transfer retains 54–92% of performance.
Full 9-model results
| Model | Family | Params | Optimal Layer (of total) | Depth | Optimal k | AUC |
|---|---|---|---|---|---|---|
| GPT-2 | GPT | 124M | L11/12 | 100% | 3 | 0.80 |
| GPT-2 Medium | GPT | 355M | L23/24 | 100% | 3 | 0.84 |
| GPT-2 Large | GPT | 774M | L35/36 | 100% | 3 | 0.84 |
| Llama-3.2-1B | Llama | 1B | L8/16 | 56% | 5 | 0.93 |
| Qwen2-1.5B | Qwen | 1.5B | L16/28 | 61% | 8 | 0.91 |
| Gemma-2-2B | Gemma | 2B | L15/26 | 62% | 5 | 0.93 |
| Llama-3.2-3B | Llama | 3B | L12/28 | 43% | 5 | 0.97 |
| Qwen2-7B | Qwen | 7B | L20/28 | 75% | 5 | 0.94 |
| Mistral-7B | Mistral | 7B | L23/32 | 75% | 5 | 0.92 |
Base models (GPT-2 family) consistently peak at the final layer. Instruction-tuned models peak at 43–75% depth — RLHF appears to push correctness encoding into middle layers. All models saturate at 3–8 PLS dimensions despite hidden dimensions ranging from 768 to 4,096.
Few-shot Efficiency
The centroid classifier requires no training: just compute means of correct vs. incorrect activations. Even with limited labels, detection works:
With just 25 labeled examples, the centroid achieves 0.69 AUC, 89% of full-data performance.
What This Means
LLMs have rich internal correctness representations:
- Low-dimensional: 3–8D subspace
- Linearly separable: hyperplane suffices
- Causal: steering works
- Universal: 9 models, 5 families
- Few-shot: 25 examples enough
The practical upshot: single-pass hallucination detection that requires no extra generation, no sampling, and no external model — just a matrix multiply and a distance comparison on activations you already have.
Detection pipeline in practice
- Extract the hidden state at the last token position of the optimal layer
- Project it into the PLS subspace (k = 5 components)
- Classify by nearest centroid: compare distances to the correct-class and incorrect-class means
- Flag the answer if it is classified as incorrect
Compute cost: under 1ms per token on CPU. The projection is a single matrix multiply (4096 × 5), and classification is two distance computations. No GPU required for inference — only for the initial extraction of the PLS directions.
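The four steps above reduce to a few lines at inference time. A sketch of the deployed check (the function name and placeholder parameters are illustrative; `W_pls`, `mu_correct`, and `mu_incorrect` would come from the offline PLS fit):

```python
import numpy as np

def hallucination_flag(h_last, W_pls, mu_correct, mu_incorrect):
    """Single-pass detector: project the last-token hidden state (d,)
    through the PLS loadings W_pls (d, k), then nearest-centroid classify.
    Returns True when the answer looks incorrect."""
    z = h_last @ W_pls                          # the one matrix multiply
    d_c = np.linalg.norm(z - mu_correct)
    d_i = np.linalg.norm(z - mu_incorrect)
    return d_i < d_c                            # closer to the 'incorrect' mean

# Shapes for a 4096-dim model with k = 5 (values here are placeholders).
rng = np.random.default_rng(4)
W = rng.normal(size=(4096, 5))
mu_c, mu_i = np.ones(5), -np.ones(5)
flag = hallucination_flag(rng.normal(size=4096), W, mu_c, mu_i)
print("flagged as hallucination" if flag else "looks correct")
```

Everything here is dense linear algebra on a single vector, which is why the per-token cost stays under a millisecond on CPU.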
Limitations
Generation gap. Probes on teacher-forced activations outperform probes on freely generated text.
Domain specificity. Cross-domain transfer drops to near chance (0.47–0.51 AUC) with full-dimensional probes. This is our most important negative result: the probe does not find a universal “truth detector.” It finds a task-specific correctness pattern that must be recalibrated per domain.
Probes trained on TruthfulQA transfer poorly to SciQ, CommonsenseQA, HaluEval, and FEVER. The diagonal dominates: each dataset’s signal is largely self-contained. This suggests models encode domain-specific correctness patterns, not a universal “truth” feature. However, PLS projection helps: on Qwen2-7B, projecting to 5D improves cross-domain transfer by 10–14% absolute AUC (0.61–0.79 vs 0.47–0.69 full-dim), by discarding dataset-specific variance.
The output gap. The signal exists internally but isn’t expressed. Why remains the most important open question.
Models encode correctness with geometric precision internally, yet fail to express it. Understanding this gap — and closing it — may be the most direct path to self-correcting language models.