CorrSteer


Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Published: Aug 18, 2025

TL;DR

CorrSteer is a fully automated method for steering LLMs using Sparse Autoencoder features. It selects features by correlating task outcomes with SAE activations during generation, then validates through intervention. No contrastive datasets, no backward passes, no activation storage needed.

Key results: +3.3% MMLU, +27.2% HarmBench, with half the side effects of fine-tuning.

Why Steering Matters

Large language models have behaviors we want to control: reduce bias, improve factual accuracy, prevent harmful outputs. The standard approach is fine-tuning, but it is a blunt instrument. It changes thousands of parameters simultaneously, often introducing unintended side effects that are difficult to diagnose or reverse.

Sparse Autoencoders (SAEs) offer an alternative. They decompose neural activations into interpretable features: individual directions in activation space that correspond to human-understandable concepts like “refusal to harmful requests” or “mathematical reasoning.”

But existing SAE steering methods have limitations: they require contrastive datasets, large activation stores, or backward passes. And critically, they select features from context tokens rather than generation tokens, missing the features that actually drive output behavior.

Steering in Action
See how CorrSteer changes model responses on safety-critical and knowledge tasks. Select different tasks to explore.

The CorrSteer Method

CorrSteer solves these problems with a simple two-stage approach: correlate, then intervene.

Stage 1: Correlation-Guided Feature Selection

Given a dataset of prompts, we generate responses and capture SAE activations at each layer. For each feature $z_i$, we compute its Pearson correlation with task outcomes $y$:

$$r_i = \frac{\text{Cov}(z_i, y)}{\sqrt{\text{Var}(z_i) \cdot \text{Var}(y)}}$$

The key insight: we compute correlations on generation-time activations (the tokens the model produces), not context tokens (the prompt). This captures features that drive output behavior.

We use max-pooling across generated tokens to aggregate multi-token activations, capturing peak feature engagement.
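
As a concrete illustration (a sketch, not the authors’ released code), the selection stage fits in a few lines of NumPy; the `select_features` name, the array shapes, and `top_k` are assumptions for this example:

```python
import numpy as np

def select_features(acts, outcomes, top_k=10):
    """Correlation-guided selection (illustrative sketch).

    acts: (n_samples, n_tokens, n_features) SAE activations captured on
          *generated* tokens; outcomes: (n_samples,) task outcomes y.
    """
    # Max-pool across generated tokens: keep each feature's peak engagement.
    pooled = acts.max(axis=1)                          # (n_samples, n_features)

    # Pearson correlation r_i between each feature and the outcome.
    z = (pooled - pooled.mean(axis=0)) / (pooled.std(axis=0) + 1e-8)
    y = (outcomes - outcomes.mean()) / (outcomes.std() + 1e-8)
    r = (z * y[:, None]).mean(axis=0)                  # (n_features,)

    # Select the features most positively correlated with success.
    return np.argsort(r)[::-1][:top_k], r
```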

Stage 2: Coefficient Estimation

For each selected feature, the steering coefficient is the mean activation across samples with positive outcomes:

$$c_i = \frac{1}{|\{j : y_j > 0\}|} \sum_{j : y_j > 0} z_{i,j}$$

This anchors the coefficient to the feature’s natural scale during successful generation, exploiting the non-negativity of SAE activations.
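
Continuing the sketch, the coefficient is a masked mean over the max-pooled activations; `pooled` and `feature_ids` carry over from the hypothetical `select_features` above:

```python
def estimate_coefficients(pooled, outcomes, feature_ids):
    """c_i = mean activation of feature i over positive-outcome samples."""
    pos = outcomes > 0                           # samples with successful outcomes
    return pooled[pos][:, feature_ids].mean(axis=0)
```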

Stage 3: Inference-Time Steering

At generation time, we add the steering vector to the residual stream:

$$\mathbf{x}'_t = \mathbf{x}_t + \sum_i c_i \cdot \mathbf{W}_{\text{dec}}[:, i]$$

Steering is applied only at generation positions ($t \geq n$, where $n$ is the prompt length). The SAE itself is not needed at inference; only the pre-computed steering vectors.
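
For intuition, here is a minimal PyTorch sketch of the intervention, assuming a Hugging Face-style decoder layer whose forward output is a tuple with hidden states first; the hook wiring and names are assumptions, not the paper’s implementation:

```python
import torch

def make_steering_hook(steer_vec, prompt_len):
    """Add the pre-computed steering vector to the residual stream,
    only at generation positions t >= prompt_len."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:
            # Prefill pass over the prompt: positions < prompt_len stay untouched.
            hidden[:, prompt_len:, :] += steer_vec
        else:
            # Incremental decoding: every position from here on is generated.
            hidden += steer_vec
        return output
    return hook

# steer_vec = sum_i c_i * W_dec[:, i], computed once offline:
# steer_vec = (W_dec[:, feature_ids] * coeffs).sum(dim=1)
# model.model.layers[layer_idx].register_forward_hook(make_steering_hook(steer_vec, n))
```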

Three Variants

CorrSteer comes in three variants that trade off simplicity against performance: CorrSteer-S steers with a single global feature, CorrSteer-A selects one feature per layer, and CorrSteer-P prunes the per-layer set via validation.

CorrSteer Variants
Toggle between S (single global feature), A (one per layer), and P (pruned via validation) to see which features are selected across layers.

Feature Activation Frequencies

How often do the selected features actually fire? The frequency distribution across layers reveals where the model’s most active steering features live, and how they differ between tasks.

Feature Activation Frequency Distribution
Each dot is a feature. X = layer, Y = activation frequency. Filled dots are positively correlated; rings are negatively correlated. Size encodes correlation strength.

High-frequency features fire on many inputs and represent broad patterns (e.g., formatting, punctuation), while low-frequency features are highly specific (e.g., particular mathematical operations, refusal phrases). The distribution shifts across layers: early layers tend to have high-frequency syntactic features, while deeper layers surface task-specific semantic features with lower but more discriminative activation rates.

Exploring the Feature Space

What do the selected features look like? Each SAE feature has a human-readable description from Neuronpedia, allowing us to inspect what the model is doing differently when steered.

Each point is an SAE feature. X = layer, Y = correlation with task success, Z = steering coefficient. Color encodes task. Rotate to explore; hover for feature descriptions.

Features cluster into interpretable categories:

Feature Explorer
Browse selected features by task and model. Click a feature card to see full details and Neuronpedia link.

Results

We evaluate on 8 benchmarks across 5 categories: knowledge (MMLU, MMLU-Pro), reasoning (GSM8K), bias (BBQ), factuality (SimpleQA), and safety (HarmBench, XSTest).

Performance Comparison
Accuracy across all methods on Gemma-2 2B. Hover for exact values with standard deviations.
Key Finding

CorrSteer-A matches fine-tuning accuracy on MMLU (55.48% vs 55.75%) while halving the Side Effect Ratio (0.21 vs 0.41). On HarmBench, CorrSteer achieves +27.2% improvement in harmful request refusal.

Side Effect Trade-offs

We introduce the Side Effect Ratio (SER): the fraction of changed answers that become incorrect. Lower SER means the method changes answers more precisely, converting wrong answers to right without breaking correct ones.
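
A minimal sketch of the metric under this definition, reading “become incorrect” as correct before steering and incorrect after; the function name and inputs are illustrative:

```python
import numpy as np

def side_effect_ratio(base_answers, steered_answers, gold):
    """SER = fraction of steering-changed answers that go correct -> incorrect."""
    base, steered, gold = map(np.asarray, (base_answers, steered_answers, gold))
    changed = base != steered                      # answers the steering altered
    broke = changed & (base == gold) & (steered != gold)
    return broke.sum() / max(changed.sum(), 1)     # guard against zero changes
```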

Side Effect Ratio Comparison
SER across methods and tasks. CorrSteer variants achieve lower SER than fine-tuning and CAA on most tasks.

On MMLU, CorrSteer-A changes 879 answers compared to fine-tuning’s 2,724, yet achieves comparable accuracy. CorrSteer-A outperforms CorrSteer-S on 5 of 8 tasks, indicating that multi-layer feature combinations produce gains beyond single-feature steering. Positive-only SAE methods (CorrSteer, SPARE, DSG) consistently show lower SER than fine-tuning, because sparse features provide targeted rather than global modifications.

Feature Interpretability

The layer-by-layer distribution of top features reveals how different tasks engage different parts of the network:

Layer x Task Feature Heatmap
Each cell shows the top feature's correlation for that layer-task combination. Click a cell to see the top features and their descriptions.

Safety Analysis

CorrSteer’s safety improvements come from discrimination, not indiscriminate refusal. The XSTest benchmark separates safe prompts (e.g., asking about historical events) from unsafe contrast prompts (similar wording but genuinely harmful intent).

Safety Discrimination by Category
XSTest breakdown: safe categories (left) show near-zero over-refusal. Unsafe contrast categories (right) show appropriate refusal rates.

Safe prompts like historical events, definitions, and figurative language show 0% over-refusal. The steered model correctly identifies these as benign. Meanwhile, contrast categories with genuinely harmful intent show refusal rates of 22-73%.

Cross-Task Transfer

Can features selected for one task improve performance on others?

Cross-Task Feature Transferability
Features from source tasks (rows) applied to target tasks (columns). Color intensity = accuracy gain over baseline.

Three patterns emerge:

  1. MMLU features transfer broadly to MMLU-Pro (+5.67%) and BBQ Ambig (+4.91%), sharing multiple-choice format
  2. BBQ features transfer to MMLU, indicating bias-mitigation features encode general QA capabilities
  3. Task-specific features outperform transferred features in most cases, confirming the value of targeted selection

Ablation Studies

Pooling Strategy Comparison

Max-pooling captures peak feature activation across generated tokens and succeeds across all tasks. Mean-pooling catastrophically fails on multi-token generation tasks (HarmBench: 0%, XSTest: 53.65%) because averaging dilutes sparse signals.
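
A toy numeric illustration of that dilution, with made-up activation values:

```python
import numpy as np

# A sparse feature firing strongly on a single token of a 100-token generation:
acts = np.zeros(100)
acts[42] = 8.0

print(acts.max())   # 8.0  -> max-pooling preserves the peak signal
print(acts.mean())  # 0.08 -> mean-pooling dilutes it by ~100x
```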

Pooling & Negative Feature Ablations
Left: pooling strategy comparison. Right: positive vs. negative correlation features. Toggle tabs to explore.

Positive vs. Negative Features

Steering with negatively correlated features provides no improvement (Neg-S) or severe degradation (Neg-A: MMLU-Pro drops to 0.66%). SAE activations are non-negative; steering should amplify success, not suppress failure. The ablation chart above (tab 2) shows this clearly.

Control Experiments

Controls confirm real signal

Label permutation: Randomly shuffled correctness labels yield 6.24% MMLU accuracy (near chance), confirming features capture genuine task structure, not artifacts.

Random features: Random selection yields 6.29% MMLU accuracy, comparable to chance, confirming that correlation-based selection is essential, not just the steering mechanism itself.
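
Both controls drop naturally into the Stage 1 sketch, assuming the hypothetical `select_features` and arrays from above:

```python
rng = np.random.default_rng(0)

# Label permutation: shuffle the outcomes, then rerun selection end-to-end.
perm_ids, _ = select_features(acts, rng.permutation(outcomes))

# Random features: bypass correlation and pick feature indices uniformly.
rand_ids = rng.choice(acts.shape[-1], size=10, replace=False)
```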

Efficiency and Scalability

CorrSteer is lightweight and practical: it requires no contrastive datasets, no backward passes, and no SAE at inference time.

Performance stabilizes at roughly 4,000 samples, with diminishing returns beyond that point.

Conclusion

CorrSteer demonstrates that correlation-based feature selection from generation-time SAE activations provides an effective, interpretable, and low-side-effect approach to LLM steering. By treating correlation as a selection heuristic and intervention as the causal test, it bridges the gap between interpretability research and practical model control.

The method’s interpretability advantage is unique: every steering decision is traceable to specific, human-readable SAE features. This transparency is essential for deploying steering in safety-critical applications.

