When a language model asserts that “the capital of Australia is Sydney,” does it know this is wrong? The output gives no indication; the answer is delivered with the same confident fluency as a correct one. But internally, LLMs encode a rich correctness signal that is never expressed in their outputs. It lives in a surprisingly low-dimensional subspace of just 3 to 8 dimensions, consistent across architectures. We call this subspace the confidence manifold. By studying 11 models across 5 architecture families, from 124M-parameter GPT-2 to 14B-parameter Qwen2.5, we reveal the shape of this hidden knowledge.
The Problem
Confident Hallucinations
When we assess an LLM’s confidence, we typically look at output signals: token probabilities, distribution entropy, or simply asking “are you sure?”
None of these work well.
Models can be confidently wrong. On misconception questions, models assert false beliefs with high probability. The output distribution reveals uncertainty (does the model hedge?), not correctness (is the answer right?).
Uncertainty ≠ Correctness.
The model knows when it’s wrong. It just doesn’t tell you. Press Play below to watch a simulated generation. The output confidence stays high, but the internal probe drops on hallucinated sentences.
Inside the Residual Stream
Linear Directions
Transformer residual streams encode abstract concepts as linear directions in activation space, and prior work has probed such directions to detect truthfulness and correctness.
These works focused on detecting the signal. We ask: what is its geometry?
Extraction
For each question-answer pair, we extract the hidden state at the last token position across all transformer layers.
The probe is logistic regression with L2 regularization. No deep networks needed. Just a hyperplane.
But before probing, we need to reduce dimensionality. PLS finds directions that maximize covariance with the correctness label: the first weight vector solves max over unit vectors w of Cov(Xw, y), and later components repeat this on the deflated residual.
Why PLS over PCA? PCA maximizes variance; it finds where activations spread most, regardless of correctness. PLS maximizes covariance with the label; it finds where correct and incorrect activations differ most. In a 4096-dimensional space, the highest-variance directions may have nothing to do with correctness.
Why not Sparse Autoencoders?
Individual SAE features explain less than 6% of variance. Confidence is distributed across many features, not localized to interpretable atoms, motivating the probe-based approach.
Confound checks
The probe detects semantics, not surface features. A length-only probe achieves just 0.54 AUC (r = −0.016). Paraphrase control (5 variants per answer) yields F-ratio 17.40: correctness drives separation 17× more than style.
3–8 Dimensions
This is our central finding.
Across all 11 models, the discriminative signal saturates at 3–8 PLS dimensions, yet full-dimensional probes perform significantly worse, confirming the signal lives in a compact subspace.
Why does performance decrease with more dimensions? The extras capture noise, not signal. PLS finds the maximally-separating subspace; beyond 3–8 components, what remains hurts generalization.
A 7B model with 4096-dimensional hidden states encodes correctness in just 5 dimensions, 0.12% of its representational capacity.
| Model | Params | Hidden Dim | Optimal k | AUC (optimal k) | AUC (32D) |
|---|---|---|---|---|---|
| GPT-2 | 124M | 768 | 3 | 0.758 | 0.621 |
| GPT-2-Med | 355M | 1,024 | 3 | 0.790 | 0.676 |
| GPT-2-Large | 774M | 1,280 | 3 | 0.791 | 0.673 |
| LLaMA-3.2-1B | 1B | 2,048 | 4 | 0.895 | 0.813 |
| Qwen2-1.5B | 1.5B | 1,536 | 8 | 0.873 | 0.810 |
| Gemma-2B | 2B | 2,304 | 3 | 0.861 | 0.751 |
| LLaMA-3.2-3B | 3B | 3,072 | 5 | 0.919 | 0.881 |
| Qwen2-7B | 7B | 3,584 | 5 | 0.896 | 0.843 |
| Mistral-7B | 7B | 4,096 | 5 | 0.902 | 0.849 |
| LLaMA-3.1-8B | 8B | 4,096 | 2 | 0.931 | 0.890 |
| Qwen2.5-14B | 14B | 5,120 | 2 | 0.951 | 0.939 |
The Output Gap
How do internal methods compare to standard output-based approaches? We tested 6 geometric classifiers in the 8D PLS subspace alongside 4 output-based baselines.
The gap is dramatic. Internal probes outperform the best output method by 23–37 percentage points (38–67% relative improvement). Even the training-free centroid classifier, using just cluster means in 8D PLS space, far exceeds any output signal. No nonlinear method improves over linear. The signal is a mean shift, not a covariance difference.
The simplest classifier that works: compute the mean of each class in PLS space, then classify by nearest centroid.
No training loop. No hyperparameters. Just two mean vectors and a distance comparison. It matches the full logistic probe within 1–2% AUC.
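A minimal sketch of that rule, assuming you already have projected scores `Z` in PLS space and labels `y` (toy clusters stand in for real data here):

```python
import numpy as np

def centroid_classify(Z_train, y_train, Z_test):
    """Nearest-centroid rule: two mean vectors, one distance comparison."""
    mu_correct = Z_train[y_train == 1].mean(axis=0)
    mu_incorrect = Z_train[y_train == 0].mean(axis=0)
    d_c = np.linalg.norm(Z_test - mu_correct, axis=1)
    d_i = np.linalg.norm(Z_test - mu_incorrect, axis=1)
    return (d_c < d_i).astype(int)

# Toy check: two separated Gaussian clusters in a 5D "PLS space".
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(-1.0, 1.0, (200, 5)), rng.normal(+1.0, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
preds = centroid_classify(Z, y, Z)
print(f"accuracy: {(preds == y).mean():.2f}")
```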
Production scaling
This simplicity scales to production. Anthropic’s Constitutional Classifiers use linear probes on internal activations to reduce jailbreak success from 86% to 4.4% with just 1% compute overhead.
Unsupervised baselines
Unsupervised features (L2 norm, reconstruction error, LOF) reach at most 0.62 AUC. Some labels are necessary. But very few: centroid matches probe with just two mean vectors.
Manifold Curvature
Linear separability doesn’t mean the manifold is flat. A hyperplane can separate two clusters even if each cluster curves through activation space like a crumpled sheet of paper.
Across all models, geodesic/Euclidean ratios land at 1.39–1.50 in the PLS subspace. For comparison, PCA projections of the same activations give 1.24–1.28. The PLS directions that maximize correctness separation also reveal more curvature than the directions that maximize variance. The manifold isn’t just low-dimensional. It’s bent.
Why does this matter? If the surface were flat, PCA would capture it. The curvature explains why supervised PLS outperforms unsupervised dimensionality reduction. You need directions that follow the bend of the correctness signal, not just the spread of activations.
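One common way to estimate such a ratio (an illustrative sketch, not necessarily the paper's exact estimator) is to approximate geodesics with shortest paths on a k-nearest-neighbor graph and compare them to straight-line distances:

```python
# Geodesic/Euclidean ratio via shortest paths on a k-NN graph.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_euclidean_ratio(Z, k=10):
    """Mean ratio of graph-geodesic to straight-line distance over point pairs."""
    G = kneighbors_graph(Z, n_neighbors=k, mode="distance")
    geo = shortest_path(G, method="D", directed=False)
    euc = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    mask = np.isfinite(geo) & (euc > 0)  # skip diagonal and disconnected pairs
    return float(np.mean(geo[mask] / euc[mask]))

# A curved 1D manifold (a semicircular arc) embedded in 2D:
# geodesics along the arc exceed the straight chords, so the ratio is > 1.
t = np.linspace(0, np.pi, 200)
arc = np.stack([np.cos(t), np.sin(t)], axis=1)
print(f"arc ratio: {geodesic_euclidean_ratio(arc):.2f}")
```

For a flat manifold (points on a straight line) the same estimator returns a ratio of 1, which is why values of 1.39–1.50 indicate genuine bending.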
Layer Evolution
Not all layers encode correctness equally. We find a three-phase pattern across all architectures.
Phase I (0–30%): Token processing. Probe accuracy near chance. The model hasn’t yet integrated semantic meaning.
Phase II (30–70%): Semantic integration. The signal emerges through gradual rotation of the activation geometry.
Phase III (70–100%): Stable encoding. Cross-layer similarity reaches 0.81. The signal has crystallized into its final form.
| Phase | Depth | Process | Dim | AUC |
|---|---|---|---|---|
| I | 0–30% | Token features | 20–55D | ~0.50 |
| II | 30–70% | Semantic integration | → 8–12D | 0.50 → 0.85 |
| III | 70–100% | Stable encoding | 8–12D | 0.85–0.97 |
Layer selection is 11× more impactful than dimension selection. Tuning the layer from 0% to optimal yields 77% AUC improvement; tuning dimensions from 1D to 5D yields only 7%.
Optimal depth varies: base models peak at ~100%, instruction-tuned at 43–75%. Instruction tuning pushes correctness encoding earlier.
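The layer sweep itself is a simple loop: probe every layer and keep the best. A sketch on synthetic per-layer states, with the three-phase pattern planted by hand (zero signal early, ramping mid-stack, strong late):

```python
# Sweep layers for the best probe AUC. Synthetic stand-in for per-layer
# hidden states; the signal strength per layer mimics the three phases.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, d, n_layers = 600, 64, 12
y = rng.integers(0, 2, n)

# Phase I: no signal; Phase II: signal ramps up; Phase III: stable and strong.
strength = np.concatenate([np.zeros(4), np.linspace(0.2, 1.5, 4), np.full(4, 1.5)])
direction = np.ones(d) / np.sqrt(d)
layers = [rng.normal(size=(n, d)) + np.outer(s * (2 * y - 1), direction)
          for s in strength]

def layer_auc(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])

aucs = [layer_auc(X, y) for X in layers]
best = int(np.argmax(aucs))
print(f"best layer: {best} ({best / (n_layers - 1):.0%} depth), AUC {aucs[best]:.3f}")
```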
Steering
Finding correlation doesn’t prove causation. We validate through activation steering.
Add a scaled version of the learned direction to the hidden states, then let the model generate.
Learned direction: 10.9 percentage point swing (α=−5 to α=+5). Random and orthogonal controls: ~1.7pp. For comparison, Li et al.’s Inference-Time Intervention on TruthfulQA achieved ~12pp using 48 attention heads; we match this with a single direction.
The direction doesn’t just correlate with correctness. It controls it.
This means we can tune model confidence at inference time without retraining. Add the direction to reduce hallucinations; subtract it to probe model uncertainty. The cost is a single vector addition per layer, negligible compared to the forward pass.
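The intervention itself is trivially cheap. A hedged numpy sketch (hypothetical probe weights `w` and unit direction `d_hat`; in practice the addition happens inside the forward pass):

```python
import numpy as np

# Steering sketch: adding alpha * d_hat to a hidden state shifts a linear
# probe's score by alpha * (w . d_hat). Names here are illustrative.
rng = np.random.default_rng(3)
dim = 128
d_hat = rng.normal(size=dim)
d_hat /= np.linalg.norm(d_hat)                 # learned unit direction
w = 2.0 * d_hat + 0.1 * rng.normal(size=dim)   # probe roughly aligned with it

h = rng.normal(size=dim)                       # a hidden state mid-forward-pass
for alpha in (-5.0, 0.0, 5.0):
    h_steered = h + alpha * d_hat              # the intervention: one vector add
    print(f"alpha={alpha:+.0f}  probe score={w @ h_steered:+.2f}")
```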
Concept Erasure
Steering shows the direction is causally sufficient; adding it changes behavior. But is it necessary? We test this by erasing it.
We remove the top PLS direction from the hidden states and retrain the probe from scratch. The result: AUC drops to near chance across all models. Erasing a single direction in a 768–4096-dimensional space (under 0.2% of the representational capacity) is enough to eliminate the correctness signal.
This independently confirms what the dimension sweep suggests: the signal doesn’t just peak at low dimensions; it lives there. The PLS subspace isn’t a convenient summary. It’s where the information is.
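Erasure is a one-line projection: subtract each state's component along the direction, then retrain. A sketch on synthetic data with a single planted signal direction `u`:

```python
# Concept erasure sketch: project out one direction, retrain the probe,
# and watch AUC collapse to chance. Synthetic data, planted signal along u.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n, d = 1000, 128
y = rng.integers(0, 2, n)
u = rng.normal(size=d)
u /= np.linalg.norm(u)
X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, u)  # signal lives along u

def probe_auc(X, y):
    probe = LogisticRegression(max_iter=1000).fit(X[:700], y[:700])
    return roc_auc_score(y[700:], probe.predict_proba(X[700:])[:, 1])

# Erasure: remove the component along u from every hidden state.
X_erased = X - np.outer(X @ u, u)

print(f"before erasure: {probe_auc(X, y):.3f}")
print(f"after erasure:  {probe_auc(X_erased, y):.3f}")
```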
Across Architectures
The structure is consistent across 5 families, 11 models, and a 113× parameter range. The two-cluster structure (correct vs. incorrect) appears universally. Switch between models in the 3D manifold at the top of this article.
Two patterns emerge. First, larger models encode stronger signals, from 0.80 AUC at 124M to 0.97 at 3B. But the relationship saturates: 7B models don’t outperform 3B. The signal is extractable even from GPT-2.
Second, instruction tuning shifts the optimal layer earlier. Base models (GPT-2 family) peak at final layers (100% depth). The signal crystallizes at the very end. Instruction-tuned models peak at 43–75%. RLHF appears to push correctness encoding into middle layers, perhaps because the model learns to “commit” to an answer earlier.
Cross-model transfer retains 54–92% of performance.
Full 11-model results
| Model | Family | Params | Optimal Layer | Depth | Optimal k | AUC |
|---|---|---|---|---|---|---|
| GPT-2 | GPT | 124M | L11/12 | 100% | 3 | 0.80 |
| GPT-2 Medium | GPT | 355M | L23/24 | 100% | 3 | 0.84 |
| GPT-2 Large | GPT | 774M | L35/36 | 100% | 3 | 0.84 |
| Llama-3.2-1B | Llama | 1B | L8/16 | 56% | 4 | 0.93 |
| Qwen2-1.5B | Qwen | 1.5B | L16/28 | 61% | 8 | 0.91 |
| Gemma-2-2B | Gemma | 2B | L15/26 | 62% | 3 | 0.93 |
| Llama-3.2-3B | Llama | 3B | L12/28 | 43% | 5 | 0.97 |
| Qwen2-7B | Qwen | 7B | L20/28 | 75% | 5 | 0.94 |
| Mistral-7B | Mistral | 7B | L23/32 | 75% | 5 | 0.92 |
| Llama-3.1-8B | Llama | 8B | L16/32 | 50% | 2 | 0.92 |
| Qwen2.5-14B | Qwen | 14B | L32/48 | 67% | 2 | 0.95 |
GPT-2 family peaks at the final layer. Larger models peak at 43–75% depth. All models saturate at 2–8 PLS dimensions despite hidden dimensions ranging from 768 to 5,120.
Few-shot Efficiency
The centroid classifier requires no training: just compute means of correct vs. incorrect activations. Even with limited labels, detection works:
With just 25 labeled examples, the centroid achieves 0.69 AUC, 89% of full-data performance.
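The label-efficiency experiment is easy to mimic: estimate the two centroids from a small labeled subsample, then score everything by distance difference. A sketch on toy PLS-space clusters:

```python
# Few-shot centroid detection: estimate class means from n labeled examples,
# score by (distance to incorrect) - (distance to correct). Toy data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
Z = np.vstack([rng.normal(-1.0, 1.0, (500, 5)), rng.normal(+1.0, 1.0, (500, 5))])
y = np.array([0] * 500 + [1] * 500)

def centroid_auc(n_labels):
    idx = rng.choice(len(y), size=n_labels, replace=False)
    mu_c = Z[idx][y[idx] == 1].mean(axis=0)
    mu_i = Z[idx][y[idx] == 0].mean(axis=0)
    # Higher score = closer to the "correct" centroid.
    s = np.linalg.norm(Z - mu_i, axis=1) - np.linalg.norm(Z - mu_c, axis=1)
    return roc_auc_score(y, s)

results = {n: centroid_auc(n) for n in (25, 100, 1000)}
for n, a in results.items():
    print(f"{n:4d} labels: AUC {a:.3f}")
```

With well-separated clusters, even 25 labels pin down the two means well enough; on real activations the gap to full-data performance is the ~11% reported above.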
What This Means
LLMs have rich internal correctness representations:
- Low-dimensional: 3–8D subspace
- Linearly separable: hyperplane suffices
- Causal: steering works
- Universal: 11 models, 5 families
- Few-shot: 25 examples enough
The practical upshot: single-pass hallucination detection that requires no extra generation, no sampling, and no external model, just a matrix multiply and a distance comparison on activations you already have.
Detection pipeline in practice
- Extract hidden state at the last token position, optimal layer
- Project to the PLS subspace (k = 5 components)
- Classify by nearest centroid: compare distances to the correct and incorrect class means
- Flag if classified as incorrect
Compute cost: under 1ms per token on CPU. The projection is a single matrix multiply (4096 × 5), and classification is two distance computations. No GPU required for inference; only for the initial extraction of the PLS directions.
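The whole inference-time detector fits in a few lines. A sketch assuming precomputed training artifacts (`W`, the hidden-to-PLS projection, and the two centroids; the values below are placeholders, not learned):

```python
import numpy as np

# Minimal inference-time detector. Assumes artifacts from training:
# W: (hidden_dim x k) PLS projection; mu_c / mu_i: class centroids in PLS space.
# Random/placeholder values here; real ones come from the fitted pipeline.
rng = np.random.default_rng(5)
hidden_dim, k = 4096, 5
W = rng.normal(size=(hidden_dim, k))
mu_c = np.full(k, 1.0)    # hypothetical "correct" centroid
mu_i = np.full(k, -1.0)   # hypothetical "incorrect" centroid

def flag_hallucination(h):
    """One matrix multiply + two distance computations per answer."""
    z = h @ W  # project the last-token hidden state into the PLS subspace
    return np.linalg.norm(z - mu_i) < np.linalg.norm(z - mu_c)

# Craft a hidden state whose projection lands exactly on the "incorrect"
# centroid, so it must be flagged.
h_bad = W @ np.linalg.solve(W.T @ W, mu_i)
print("flagged:", flag_hallucination(h_bad))
```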
Why PLS Matters
A natural baseline for finding a “correctness direction” is the difference of means (DoM): subtract the mean incorrect activation from the mean correct activation and project onto the result.
On GPT-2, DoM achieves 0.47 AUC, below chance. The full 768-dimensional difference vector points in a direction dominated by noise. PCA preprocessing often makes things worse (4 of 5 models), because PCA preserves high-variance noise dimensions that have nothing to do with correctness.
Only supervised PLS reduction rescues the signal. By first projecting to the 5D subspace that maximizes covariance with correctness labels, centroid distance jumps to 0.77 AUC. The signal was always there; it was just drowned out in 768 irrelevant dimensions.
Limitations
Generation gap. Teacher-forced probing outperforms probing of freely generated text.
Domain specificity. Cross-domain transfer drops to near chance (0.47–0.51 AUC) with full-dimensional probes. This is our most important negative result: the probe does not find a universal “truth detector.” It finds a task-specific correctness pattern that must be recalibrated per domain.
Probes trained on TruthfulQA transfer poorly to SciQ, CommonsenseQA, HaluEval, and FEVER. The diagonal dominates: each dataset’s signal is largely self-contained. This suggests models encode domain-specific correctness patterns, not a universal “truth” feature. However, PLS projection helps: on Qwen2-7B, projecting to 5D improves cross-domain transfer by 10–14% absolute AUC (0.61–0.79 vs 0.47–0.69 full-dim), by discarding dataset-specific variance.
The output gap. The signal exists internally but isn’t expressed. Why remains the most important open question.
Models encode correctness with geometric precision internally, yet fail to express it. Understanding this gap, and closing it, may be the most direct path to self-correcting language models.