Abstract
Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate — not which ones actually change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. The learned policy identifies features that change model outputs when amplified, going beyond static activation analysis.
Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; and layer-wise comparison reveals syntactic features in early layers and semantic features in later layers.
On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves consistent improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
The Problem: Static Feature Analysis Falls Short
Sparse autoencoders have emerged as a powerful tool for decomposing the internal representations of language models. By training an encoder-decoder pair with a sparsity penalty, SAEs recover interpretable features from the residual stream — components that correspond to concepts like “mathematical reasoning” or “safety refusal patterns.”
But there’s a fundamental limitation: static analysis tells you which features are present, not which ones matter. Knowing that Feature 7708 (“mathematical operations”) activates on a math problem doesn’t tell you whether amplifying that feature would improve the model’s answer, make it worse, or have no effect at all.
This gap between activation analysis and causal intervention is where CRL enters. Instead of cataloging which features fire, we ask: which features, when amplified, actually change the model’s output?
The key distinction: Traditional SAE analysis is like reading a map — you can see the terrain. CRL is like walking through it — you discover which paths actually lead somewhere different.
Current approaches to SAE-based control face limitations:
- Contrastive methods require carefully curated positive/negative datasets for each feature
- Activation patching requires extensive storage and computation for exhaustive feature probing
- Manual coefficient tuning doesn’t generalize across layers or tasks
CRL addresses these by framing feature selection as a learned decision at every token, using reinforcement learning to discover which features are worth amplifying.
How CRL Works
Formulating Steering as an MDP
CRL treats the control of transformer representations as a Markov Decision Process over SAE features:
The steering intervention modifies the residual stream as

$$\tilde{h}_t = h_t + \alpha\, a_t^\top W_{\text{dec}}$$

where $h_t$ is the residual stream activation at token $t$, $a_t$ is the binary feature selection vector, $\alpha$ is the steering coefficient, and $W_{\text{dec}}$ is the SAE decoder matrix.
The MDP components are:
- State $s_t$: the residual stream activation $h_t$ at the intervention layer
- Action $a_t$: a sparse binary vector selecting SAE features (typically a single feature, $\|a_t\|_0 = 1$)
- Transition: deterministic, governed by the transformer’s forward pass with the steering intervention
- Reward: task-specific binary signal (correct/incorrect output)
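Concretely, the additive intervention defined above can be sketched in a few lines. The function name, toy dimensions, and decoder values below are illustrative, not the paper's implementation:

```python
import numpy as np

def steer(h_t, a_t, alpha, W_dec):
    """Apply the CRL steering intervention to one residual-stream vector.

    h_t   : (d_model,) residual stream activation at the intervention layer
    a_t   : (n_features,) binary feature selection vector (typically one-hot)
    alpha : scalar steering coefficient
    W_dec : (n_features, d_model) SAE decoder matrix
    """
    # Add the selected decoder direction(s), scaled by alpha.
    return h_t + alpha * (a_t @ W_dec)

# Toy example: 4 SAE features over a 3-dimensional residual stream.
W_dec = np.eye(4, 3)                  # hypothetical decoder directions
h = np.zeros(3)
a = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot: select feature 1
h_steered = steer(h, a, alpha=2.0, W_dec=W_dec)
print(h_steered)                      # feature 1's direction, scaled by alpha
```

Because the action is one-hot, the intervention reduces to adding a single decoder row, which is what keeps each step interpretable as "feature $i$ was amplified here."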
The Policy and Critic Networks
The policy network is a 2-layer MLP with Tanh activation that maps residual stream activations to feature selection logits: $z_t = W_2\,\tanh(W_1 h_t + b_1) + b_2$.
During training, actions are sampled from a categorical distribution over softmax probabilities for unbiased policy gradients. During evaluation, we use deterministic argmax for reproducibility.
The critic network estimates the state value function, enabling advantage estimation for PPO training. Crucially, the critic also serves as an analysis tool — its learned value estimates reveal where the policy succeeds and fails.
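A minimal sketch of the policy head and its two action modes, using plain NumPy and hypothetical dimensions in place of the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_logits(h, W1, b1, W2, b2):
    # 2-layer MLP with Tanh, mapping activations to feature-selection logits.
    return np.tanh(h @ W1 + b1) @ W2 + b2

def select_feature(h, params, deterministic=False):
    logits = policy_logits(h, *params)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if deterministic:
        # Evaluation: reproducible argmax over softmax probabilities.
        return int(np.argmax(probs))
    # Training: sample from the categorical distribution (unbiased gradients).
    return int(rng.choice(len(probs), p=probs))

# Hypothetical sizes: d_model=8, hidden=16, n_features=32.
d, hdim, n = 8, 16, 32
params = (rng.normal(size=(d, hdim)), np.zeros(hdim),
          rng.normal(size=(hdim, n)), np.zeros(n))
h = rng.normal(size=d)
feat = select_feature(h, params, deterministic=True)
```

The same logits feed both modes; only the selection rule changes between training and evaluation.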
Partial Observability
CRL-Token operates under partial observability: the policy sees only the residual stream at layer $\ell$, without access to the tokens that will be sampled. We mitigate this by using temperature-0 sampling, making the generation deterministic given the residual stream state.
Adaptive Feature Masking
A naive policy would collapse to repeatedly selecting the same high-reward feature. Adaptive Feature Masking (AFM) prevents this by dynamically managing which features are available for selection.
The mask starts small and progressively expands as the model generates tokens — features that naturally activate during generation become available for future policy selections:
- Policy computes logits and masks unavailable features by setting their logits to $-\infty$ before the softmax
- Steering is applied to the residual stream
- Natural SAE activations at the current token expand the mask for the next step
This creates a virtuous cycle: early steering may activate new features that become candidates for later steering, encouraging diverse feature discovery while building on the model’s own representations.
AFM is particularly important for multi-token tasks like GSM8K, where the reasoning chain spans hundreds of tokens and different features become relevant at different stages.
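The AFM loop can be sketched as follows; `masked_selection`, `expand_mask`, and the activation threshold are illustrative names and choices, not the paper's exact mechanism:

```python
import numpy as np

def masked_selection(logits, mask):
    """Mask unavailable features by setting their logits to -inf, then argmax."""
    masked = np.where(mask, logits, -np.inf)
    return int(np.argmax(masked))

def expand_mask(mask, sae_activations, threshold=0.0):
    """Features that naturally activate at the current token join the mask."""
    return mask | (sae_activations > threshold)

# Toy step with 5 features; only feature 0 is available initially.
mask = np.array([True, False, False, False, False])
logits = np.array([0.1, 2.0, 0.5, -1.0, 0.3])   # feature 1 scores best but is masked
print(masked_selection(logits, mask))            # -> 0
mask = expand_mask(mask, np.array([0.0, 0.7, 0.0, 0.0, 0.2]))
print(masked_selection(logits, mask))            # -> 1, once feature 1 unlocks
```

The second selection illustrates the virtuous cycle: a feature that activated naturally during generation becomes eligible and is immediately preferred by the policy.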
Branch Analysis: Where Steering Decides the Answer
This is where CRL’s interpretability capabilities become most compelling. Branch analysis identifies tokens where the choice of feature determines whether the model produces a correct or incorrect answer.
How Branch Analysis Works
For each generation, we track:
- The context (shared prefix before the branch point)
- The correct path — the feature that leads to a correct answer
- The incorrect path — the feature that leads to a wrong answer
This provides direct causal evidence that specific features are responsible for output quality, not just correlated with it.
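One way to log such branch points is a small record per branch. The schema and all field values below are hypothetical, chosen only to show the shape of the data:

```python
from dataclasses import dataclass

@dataclass
class BranchPoint:
    """One logged branch: a token position where feature choice flips correctness.

    Field names are illustrative, not the paper's exact logging schema.
    """
    token_index: int        # position of the branch in the generation
    context: str            # shared prefix before the branch point
    correct_feature: int    # SAE feature id whose path yields a correct answer
    incorrect_feature: int  # SAE feature id whose path yields a wrong answer

# Hypothetical example entry (feature ids and index are placeholders).
bp = BranchPoint(token_index=212,
                 context="<shared prefix of the generation>",
                 correct_feature=7708,
                 incorrect_feature=1234)
```

Aggregating these records across generations is what enables the per-feature success counts reported later.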
Layer-Specific Feature Semantics
Comparing features selected at Layer 10 vs Layer 20 reveals a striking pattern:
Layer 10 features capture concrete syntactic patterns: surface-level notation, entity tracking, and formatting. Layer 20 features encode abstract semantics: logical derivation structure, planning, and strategic reasoning.
Feature labels have limited predictive value. A feature labeled “emotional relationships” can solve speed calculations, while “mathematical notation” fails on some arithmetic. This indicates learned steering effects diverge from semantic interpretations derived from activation patterns alone.
Out of 8 analyzed branch pairs, Layer 10 features succeed on 3 cases (discrete counting, direct arithmetic), while Layer 20 features succeed on 5 cases (multi-step accumulation, variable reasoning).
Empirical Validation
To confirm that the learned policy selects meaningful features — not arbitrary ones — we compare CRL against baselines on five benchmarks. The goal is not performance gain per se, but to validate that the policy has learned a non-trivial feature selection strategy worth analyzing.
Learned vs. Random Selection
| Method | BBQ Ambig | HarmBench |
|---|---|---|
| No Steering (Base) | 60.17 | 41.46 |
| Random Feature | 58.36 | 45.35 |
| Random + AFM | 60.16 | 46.96 |
| Most-Active Feature | 59.94 | 48.03 |
| CRL (Ours) | 65.86 | 49.12 |
Random feature steering can actually hurt performance (−1.81 on BBQ Ambig), while selecting the most-active SAE feature barely helps. The learned policy discovers features whose causal effects differ from their activation magnitude — precisely the gap that static analysis cannot reveal.
Critic Trajectory Analysis
The learned critic (value function) reveals where in the generation process the model’s confidence changes — and whether task limitations stem from the policy or the value estimator.
Key Observations
For multi-token generation (GSM8K), critic trajectories reveal the temporal dynamics of steering:
- Correct samples: steadily rising values, diverging from the other groups around tokens 200–400
- Incorrect samples: flat or declining values throughout
- Corrected samples: value increases around the ~400-token mark, indicating interventions become more effective in later reasoning stages
- Misguided samples: initially high values that later drop; the steering seemed promising but ultimately led the model astray
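The four trajectory types could be bucketed with a simple heuristic like the one below; the thresholds and category logic are illustrative assumptions, not the paper's classifier:

```python
import numpy as np

def classify_trajectory(values):
    """Heuristic bucketing of a per-token critic value trajectory (illustrative).

    'correct'  : starts fairly high and keeps rising
    'corrected': late rise after a low start
    'misguided': early high values that drop
    'incorrect': flat or declining throughout
    """
    q = max(1, len(values) // 4)
    start, end = float(np.mean(values[:q])), float(np.mean(values[-q:]))
    if end > start + 0.1:
        return "corrected" if start < 0.5 else "correct"
    if start > end + 0.1:
        return "misguided"
    return "incorrect"

print(classify_trajectory(np.linspace(0.2, 0.9, 500)))   # -> corrected
print(classify_trajectory(np.full(500, 0.3)))            # -> incorrect
```

In practice one would calibrate the thresholds against labeled outcomes rather than fix them at 0.1 and 0.5 as done here.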
Feature Interpretability
CRL provides per-token intervention logs that reveal which features are selected and when. This enables a new form of mechanistic analysis.
Feature-Token Alignment
Corrected examples show clear semantic coherence: features activate on mathematical tokens, operators, and quantitative units — demonstrating lexical generalization beyond surface forms.
When Features Fail
Misguided examples reveal failure modes:
- Task-irrelevant features disrupting the reasoning chain
- Token-feature misalignment producing incoherent activations
- Feature interference causing the model to abandon its solution path
Feature Diversity
Feature diversity varies significantly across tasks, measured by entropy of the feature selection distribution:
| Task | Feature Diversity | Interpretation |
|---|---|---|
| GSM8K | 6.695 | Complex reasoning needs diverse features |
| MMLU-Pro | 5.476 | High complexity, broad engagement |
| XSTest | 4.757 | Intermediate diversity |
| HarmBench | 2.938 | Focused refusal uses concentrated features |
Complex reasoning tasks require broader feature engagement, while focused behavioral tasks benefit from concentrated feature activation.
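The diversity metric can be computed as the Shannon entropy of the empirical feature-selection distribution; the log base (bits here) is an assumption:

```python
import numpy as np

def selection_entropy(counts):
    """Shannon entropy (in bits; the paper's base is an assumption) of the
    empirical distribution over selected SAE features."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()           # normalize, dropping unselected features
    return float(-(p * np.log2(p)).sum())

# A concentrated policy (HarmBench-like) vs a uniform, diverse one.
print(selection_entropy([90, 5, 5]))   # low entropy: focused selection
print(selection_entropy([1] * 100))    # high: log2(100) ≈ 6.64 bits
```

A policy that always picks the same feature scores 0, while one spreading evenly over $n$ features scores $\log_2 n$, which is the sense in which GSM8K's value reflects broader feature engagement than HarmBench's.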
Layer-Wise Analysis
Different layers encode different types of information, and CRL’s layer-wise training reveals this structure.
Residual Stream Norms
Residual stream norms increase with depth, which directly impacts the optimal steering coefficient:
- Early layers (L5–L10): smallest norms, so only small coefficients are tolerated
- Middle layers (L15–L20): intermediate norms and coefficients
- Late layers (L22–L25): largest norms, supporting the strongest interventions

Large coefficients degrade early-layer performance, while late layers tolerate, and in fact require, stronger interventions.
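One simple way to realize this depth-dependent scaling is to tie the coefficient to the layer's activation norm. This normalization scheme is an illustrative assumption, not the paper's recipe:

```python
import numpy as np

def scaled_alpha(h_layer, base_alpha=1.0, ref_norm=1.0):
    """Scale the steering coefficient with the residual-stream norm at this
    layer, so the intervention stays proportionally sized at every depth.

    base_alpha and ref_norm are hypothetical tuning knobs.
    """
    return base_alpha * np.linalg.norm(h_layer) / ref_norm

shallow = np.full(8, 0.5)   # small-norm early-layer activation (toy values)
deep = np.full(8, 5.0)      # large-norm late-layer activation (toy values)
print(scaled_alpha(shallow), scaled_alpha(deep))  # deeper layer -> larger alpha
```

Under this scheme the same relative perturbation is applied everywhere, matching the observation that late layers need absolutely larger coefficients.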
Discussion
What CRL Reveals as an Analysis Tool
CRL provides three diagnostic capabilities that complement static feature analysis:
1. Branch tracking locates critical decision points — tokens where the choice of a single SAE feature determines whether the output is correct or incorrect. This provides causal evidence for feature importance, going beyond correlational activation analysis.
2. Critic trajectories separate policy from value-estimation limitations. When the critic shows a bottleneck (MMLU), we know the value function is the limiting factor, not the policy’s feature selection. When the critic shows clear discrimination (BBQ), the policy-critic coordination is effective.
3. Layer comparison reveals a syntactic-to-semantic gradient: early layers encode concrete patterns (notation, formatting), while later layers capture abstract structure (logical derivation, strategic planning).
Limitations
- Layer selection is post-hoc: optimal layers are identified via validation, not predicted a priori
- Feature labels are unreliable: SAE feature descriptions often diverge from their actual steering effects
- Single-feature constraint: selecting exactly one feature per token maximizes interpretability but may leave performance on the table
- Computational cost: while modest (8–9% overhead), CRL requires PPO training for each task-layer combination
Conclusion
Control Reinforcement Learning transforms SAE feature steering from a static analysis tool into a dynamic, token-level intervention framework with built-in interpretability. By training a policy to select features at each generation step, CRL produces mechanistic analysis tools:
- Branch tracking locates critical decision points where feature choice determines output correctness
- Critic trajectories separate policy limitations from value-estimation errors
- Layer comparison reveals syntactic-to-semantic feature gradients across network depth
Validated across knowledge, bias, reasoning, and safety benchmarks, CRL maintains single-feature interpretability throughout. The framework establishes learned feature steering as a practical mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.