TL;DR
CorrSteer is a fully automated method for steering LLMs using Sparse Autoencoder features. It selects features by correlating task outcomes with SAE activations during generation, then validates through intervention. No contrastive datasets, no backward passes, no activation storage needed.
Key results: +3.3% MMLU, +27.2% HarmBench, with half the side effects of fine-tuning.
Why Steering Matters
Large language models have behaviors we want to control: reduce bias, improve factual accuracy, prevent harmful outputs. The standard approach is fine-tuning, but it is a blunt instrument. It changes thousands of parameters simultaneously, often introducing unintended side effects that are difficult to diagnose or reverse.
Sparse Autoencoders (SAEs) offer an alternative. They decompose neural activations into interpretable features: individual directions in activation space that correspond to human-understandable concepts like “refusal to harmful requests” or “mathematical reasoning.”
But existing SAE steering methods have limitations: they require contrastive datasets, large activation stores, or backward passes. And critically, they select features from context tokens rather than generation tokens, missing the features that actually drive output behavior.
The CorrSteer Method
CorrSteer solves these problems with a simple two-stage approach: correlate, then intervene.
Stage 1: Correlation-Guided Feature Selection
Given a dataset of prompts, we generate responses and capture SAE activations at each layer. For each feature $j$, we compute its Pearson correlation with task outcomes $y$:

$$r_j = \frac{\sum_i (a_{ij} - \bar{a}_j)(y_i - \bar{y})}{\sqrt{\sum_i (a_{ij} - \bar{a}_j)^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

where $a_{ij}$ is the pooled activation of feature $j$ on sample $i$ and $y_i \in \{0, 1\}$ indicates task success.
The key insight: we compute correlations on generation-time activations (the tokens the model produces), not context tokens (the prompt). This captures features that drive output behavior.
We use max-pooling across generated tokens to aggregate multi-token activations, capturing peak feature engagement.
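The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual code; the function names are ours:

```python
import numpy as np

def max_pool_activations(token_acts):
    """Max-pool one sample's per-token feature activations over the
    generated tokens, capturing peak feature engagement.
    token_acts: (n_generated_tokens, n_features)."""
    return np.max(token_acts, axis=0)

def correlate_features(pooled, outcomes):
    """Pearson correlation between each feature's pooled activation
    and binary task outcomes (1 = success, 0 = failure).
    pooled: (n_samples, n_features); outcomes: (n_samples,)."""
    a = pooled - pooled.mean(axis=0)          # center activations
    y = outcomes - outcomes.mean()            # center outcomes
    denom = np.sqrt((a ** 2).sum(axis=0) * (y ** 2).sum())
    # Guard against zero variance (dead features) to avoid divide-by-zero.
    return (a * y[:, None]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
```

Feature selection then amounts to taking the argmax (CorrSteer-S) or the per-layer top feature (CorrSteer-A) of the returned correlation vector.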
Stage 2: Coefficient Estimation
For each selected feature $j$, the steering coefficient is the mean activation across samples with positive outcomes:

$$c_j = \frac{1}{|S^+|} \sum_{i \in S^+} a_{ij}$$

where $S^+$ is the set of samples with successful outcomes. This anchors the coefficient to the feature's natural scale during successful generation, exploiting the non-negativity of SAE activations.
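As a sketch under the same illustrative names as before, coefficient estimation is a single masked mean:

```python
import numpy as np

def steering_coefficients(pooled, outcomes, selected):
    """Steering coefficient c_j: the mean pooled activation of feature j
    over samples with a positive outcome (illustrative sketch)."""
    positive = pooled[np.asarray(outcomes) == 1]  # successful samples only
    return {j: float(positive[:, j].mean()) for j in selected}
```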
Stage 3: Inference-Time Steering
At generation time, we add the steering vector to the residual stream at the selected layer:

$$h_t \leftarrow h_t + \sum_{j \in F} c_j d_j$$

where $d_j$ is the SAE decoder direction of feature $j$ and $F$ is the set of selected features. Steering is applied only to generation positions ($t > L_p$, where $L_p$ is the prompt length). The SAE itself is not needed at inference; only the pre-computed steering vectors are.
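In practice this would typically be a forward hook on the chosen layer (e.g., PyTorch's `register_forward_hook`); the sketch below shows only the core tensor operation, with illustrative names:

```python
import numpy as np

def apply_steering(hidden, prompt_len, steering_vector):
    """Add a pre-computed steering vector (the sum of c_j * d_j over
    selected features) to the residual stream, at generation positions
    only. hidden: (seq_len, d_model); prompt positions are untouched."""
    steered = hidden.copy()
    steered[prompt_len:] += steering_vector  # broadcast over generation positions
    return steered
```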
Three Variants
CorrSteer comes in three variants that trade off simplicity against performance:
- CorrSteer-S: The single most positively correlated feature across all layers. Simplest possible steering.
- CorrSteer-A: The top feature from each layer. Multi-layer steering captures distributed representations.
- CorrSteer-P: CorrSteer-A with validation-based pruning. Each feature is tested individually; only features that improve performance when amplified are retained. This identifies a minimal “steering subcircuit.”
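The pruning step of CorrSteer-P can be sketched as a per-feature validation filter. The `evaluate` callable (hypothetical, not from the paper's code) scores the model on a validation set when steered with the given features:

```python
def prune_features(features, evaluate, baseline):
    """CorrSteer-P-style pruning sketch: test each candidate feature
    individually and keep it only if steering with that feature alone
    beats the unsteered baseline score. `evaluate` is a hypothetical
    callable returning validation accuracy for a feature list."""
    return [f for f in features if evaluate([f]) > baseline]
```

The surviving features form the minimal "steering subcircuit" used at inference.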
Feature Activation Frequencies
How often do the selected features actually fire? The frequency distribution across layers reveals where the model’s most active steering features live, and how they differ between tasks.
High-frequency features fire on many inputs and represent broad patterns (e.g., formatting, punctuation), while low-frequency features are highly specific (e.g., particular mathematical operations, refusal phrases). The distribution shifts across layers: early layers tend to have high-frequency syntactic features, while deeper layers surface task-specific semantic features with lower but more discriminative activation rates.
Exploring the Feature Space
What do the selected features look like? Each SAE feature has a human-readable description from Neuronpedia, allowing us to inspect what the model is doing differently when steered.
Each point is an SAE feature. X = layer, Y = correlation with task success, Z = steering coefficient. Color encodes task. Rotate to explore; hover for feature descriptions.
Features cluster into interpretable categories:
- Structured-output features for multiple-choice tasks (MMLU, BBQ) encoding format patterns
- Refusal features for safety tasks (HarmBench) encoding rejection behavior
- Neutrality features for bias mitigation (BBQ) encoding balanced perspective
- Mathematical features appearing across tasks, consistent with the finding that math pre-training boosts broad accuracy
Results
We evaluate on 8 benchmarks across 5 categories: knowledge (MMLU, MMLU-Pro), reasoning (GSM8K), bias (BBQ), factuality (SimpleQA), and safety (HarmBench, XSTest).
CorrSteer-A matches fine-tuning accuracy on MMLU (55.48% vs 55.75%) while halving the Side Effect Ratio (0.21 vs 0.41). On HarmBench, CorrSteer achieves +27.2% improvement in harmful request refusal.
Side Effect Trade-offs
We introduce the Side Effect Ratio (SER): the fraction of changed answers that become incorrect. Lower SER means the method changes answers more precisely, converting wrong answers to right without breaking correct ones.
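A small sketch of this metric under one plausible reading (the paper's exact counting may differ; all names here are illustrative):

```python
def side_effect_ratio(base_answers, steered_answers, gold):
    """SER sketch: among answers the intervention changed, the fraction
    that went from correct to incorrect."""
    changed = [i for i in range(len(gold))
               if base_answers[i] != steered_answers[i]]
    if not changed:
        return 0.0
    broken = sum(1 for i in changed
                 if base_answers[i] == gold[i] and steered_answers[i] != gold[i])
    return broken / len(changed)
```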
On MMLU, CorrSteer-A changes 879 answers compared to fine-tuning’s 2,724, yet achieves comparable accuracy. CorrSteer-A outperforms CorrSteer-S on 5 of 8 tasks, indicating that multi-layer feature combinations produce gains beyond single-feature steering. Positive-only SAE methods (CorrSteer, SPARE, DSG) consistently show lower SER than fine-tuning, because sparse features provide targeted rather than global modifications.
Feature Interpretability
The layer-by-layer distribution of top features reveals how different tasks engage different parts of the network:
Safety Analysis
CorrSteer’s safety improvements come from discrimination, not indiscriminate refusal. The XSTest benchmark separates safe prompts (e.g., asking about historical events) from unsafe contrast prompts (similar wording but genuinely harmful intent).
Safe prompts like historical events, definitions, and figurative language show 0% over-refusal. The steered model correctly identifies these as benign. Meanwhile, contrast categories with genuinely harmful intent show refusal rates of 22-73%.
Cross-Task Transfer
Can features selected for one task improve performance on others?
Three patterns emerge:
- MMLU features transfer broadly to MMLU-Pro (+5.67%) and BBQ Ambig (+4.91%), sharing multiple-choice format
- BBQ features transfer to MMLU, indicating bias-mitigation features encode general QA capabilities
- Task-specific features outperform transferred features in most cases, confirming the value of targeted selection
Ablation Studies
Pooling Strategy Comparison
Max-pooling captures peak feature activation across generated tokens and succeeds across all tasks. Mean-pooling catastrophically fails on multi-token generation tasks (HarmBench: 0%, XSTest: 53.65%) because averaging dilutes sparse signals.
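A toy example makes the dilution effect concrete: a sparse feature that fires strongly on a single generated token survives max-pooling but nearly vanishes under mean-pooling.

```python
import numpy as np

# A sparse feature that fires strongly on just one of 50 generated tokens.
token_acts = np.zeros(50)
token_acts[7] = 8.0

max_pooled = token_acts.max()    # keeps the peak activation
mean_pooled = token_acts.mean()  # averages the spike toward zero
```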
Positive vs. Negative Features
Steering with negatively correlated features provides no improvement (Neg-S) or severe degradation (Neg-A: MMLU-Pro drops to 0.66%). SAE activations are non-negative; steering should amplify success, not suppress failure. The ablation chart above (tab 2) shows this clearly.
Control Experiments
Label permutation: Randomly shuffled correctness labels yield 6.24% MMLU accuracy (near chance), confirming features capture genuine task structure, not artifacts.
Random features: Random selection yields 6.29% MMLU accuracy, comparable to chance, confirming that correlation-based selection is essential, not just the steering mechanism itself.
Efficiency and Scalability
CorrSteer is lightweight and practical:
- Streaming correlation in constant memory per feature, scaling to the full SAE feature dictionary
- ~100 samples minimum, stable at ~4,000 samples
- No backward passes, no activation storage, no task-specific tuning
- Inference cost: only steering vectors needed; SAE not required at generation time
- Reversible: steering can be adjusted or removed without retraining
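The constant-memory correlation can be sketched with a standard one-pass Pearson accumulator that keeps only running sums per feature, never the activation history (a generic sketch, not the paper's implementation):

```python
class StreamingPearson:
    """One-pass Pearson correlation using running sums: O(1) memory per
    feature regardless of how many samples stream through."""

    def __init__(self):
        self.n = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        """Accumulate one (activation, outcome) pair."""
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y

    def corr(self):
        """Pearson r from the accumulated sufficient statistics."""
        num = self.n * self.sxy - self.sx * self.sy
        den = ((self.n * self.sxx - self.sx ** 2)
               * (self.n * self.syy - self.sy ** 2)) ** 0.5
        return num / den if den else 0.0
```

One such accumulator per feature replaces storing every activation, which is what lets the method avoid an activation store entirely.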
Conclusion
CorrSteer demonstrates that correlation-based feature selection from generation-time SAE activations provides an effective, interpretable, and low-side-effect approach to LLM steering. By treating correlation as a selection heuristic and intervention as the causal test, it bridges the gap between interpretability research and practical model control.
The method’s interpretability advantage is unique: every steering decision is traceable to specific, human-readable SAE features. This transparency is essential for deploying steering in safety-critical applications.
Resources:
- Paper: arXiv:2508.12535
- Code: github.com/seonglae/CorrSteer