Abstract
Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate — not which ones actually change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. The learned policy identifies features that change model outputs when amplified, going beyond static activation analysis.
Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; and layer-wise comparison reveals syntactic features in early layers and semantic features in later layers.
On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves consistent improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
The Problem: Static Feature Analysis Falls Short
Sparse autoencoders have emerged as a powerful tool for decomposing the internal representations of language models. By training an encoder-decoder pair with a sparsity penalty, SAEs recover interpretable features from the residual stream — components that correspond to concepts like “mathematical reasoning” or “safety refusal patterns.”
But there’s a fundamental limitation: static analysis tells you which features are present, not which ones matter. Knowing that Feature 7708 (“mathematical operations”) activates on a math problem doesn’t tell you whether amplifying that feature would improve the model’s answer, make it worse, or have no effect at all.
This gap between activation analysis and causal intervention is where CRL enters. Instead of cataloging which features fire, we ask: which features, when amplified, actually change the model’s output?
The key distinction: Traditional SAE analysis is like reading a map — you can see the terrain. CRL is like walking through it — you discover which paths actually lead somewhere different.
Current approaches to SAE-based control face limitations:
- Contrastive methods require carefully curated positive/negative datasets for each feature
- Activation patching requires extensive storage and computation for exhaustive feature probing
- Manual coefficient tuning doesn’t generalize across layers or tasks
CRL addresses these by framing feature selection as a learned decision at every token, using reinforcement learning to discover which features are worth amplifying.
How CRL Works
Formulating Steering as an MDP
CRL treats the control of transformer representations as a Markov Decision Process over SAE features:
The steering intervention modifies the residual stream as

$$\tilde{h}_t = h_t + \alpha\, a_t^\top W_{\text{dec}}$$

where $h_t$ is the residual stream activation at token $t$, $a_t$ is the binary feature selection vector, $\alpha$ is the steering coefficient, and $W_{\text{dec}}$ is the SAE decoder matrix.
The MDP components are:
- State $s_t$: the residual stream activation $h_t$ at the intervention layer
- Action $a_t$: a sparse binary vector selecting SAE features (typically a single feature, $\|a_t\|_0 = 1$)
- Transition: deterministic, governed by the transformer’s forward pass with the steering intervention
- Reward: task-specific binary signal (correct/incorrect output)
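Concretely, the additive intervention defined above can be sketched in a few lines. The function name, toy dimensions, and decoder values below are illustrative, not the paper's implementation:

```python
import numpy as np

def steer(h_t, a_t, alpha, W_dec):
    """Apply the CRL steering intervention to one residual-stream vector.

    h_t   : (d_model,) residual stream activation at the intervention layer
    a_t   : (n_features,) binary feature selection vector (typically one-hot)
    alpha : scalar steering coefficient
    W_dec : (n_features, d_model) SAE decoder matrix
    """
    # Add the selected decoder direction(s), scaled by alpha.
    return h_t + alpha * (a_t @ W_dec)

# Toy example: 4 SAE features over a 3-dimensional residual stream.
W_dec = np.eye(4, 3)                  # hypothetical decoder directions
h = np.zeros(3)
a = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot: select feature 1
h_steered = steer(h, a, alpha=2.0, W_dec=W_dec)
print(h_steered)                      # feature 1's direction, scaled by alpha
```

Because the action is one-hot, the intervention reduces to adding a single decoder row, which is what keeps each step interpretable as "feature $i$ was amplified here."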
The Policy and Critic Networks
The policy network is a 2-layer MLP with Tanh activation that maps residual stream activations to feature selection logits: $z_t = W_2\,\tanh(W_1 h_t + b_1) + b_2$.
During training, actions are sampled from a categorical distribution over softmax probabilities for unbiased policy gradients. During evaluation, we use deterministic argmax for reproducibility.
The critic network estimates the state value function, enabling advantage estimation for PPO training. Crucially, the critic also serves as an analysis tool — its learned value estimates reveal where the policy succeeds and fails.
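A minimal sketch of the policy head and its two action modes, using plain NumPy and hypothetical dimensions in place of the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_logits(h, W1, b1, W2, b2):
    # 2-layer MLP with Tanh, mapping activations to feature-selection logits.
    return np.tanh(h @ W1 + b1) @ W2 + b2

def select_feature(h, params, deterministic=False):
    logits = policy_logits(h, *params)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if deterministic:
        # Evaluation: reproducible argmax over softmax probabilities.
        return int(np.argmax(probs))
    # Training: sample from the categorical distribution (unbiased gradients).
    return int(rng.choice(len(probs), p=probs))

# Hypothetical sizes: d_model=8, hidden=16, n_features=32.
d, hdim, n = 8, 16, 32
params = (rng.normal(size=(d, hdim)), np.zeros(hdim),
          rng.normal(size=(hdim, n)), np.zeros(n))
h = rng.normal(size=d)
feat = select_feature(h, params, deterministic=True)
```

The same logits feed both modes; only the selection rule changes between training and evaluation.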
Partial Observability
CRL-Token operates under partial observability: the policy sees only the residual stream at layer $\ell$, without access to the tokens that will be sampled. We mitigate this by using temperature-0 sampling, making the generation deterministic given the residual stream state.
Adaptive Feature Masking
A naive policy would collapse to repeatedly selecting the same high-reward feature. Adaptive Feature Masking (AFM) prevents this by dynamically managing which features are available for selection.
The mask starts small and progressively expands as the model generates tokens — features that naturally activate during generation become available for future policy selections:
- Policy computes logits and masks unavailable features by setting their logits to $-\infty$ before the softmax
- Steering is applied to the residual stream
- Natural SAE activations at the current token expand the mask for the next step
This creates a virtuous cycle: early steering may activate new features that become candidates for later steering, encouraging diverse feature discovery while building on the model’s own representations.
AFM is particularly important for multi-token tasks like GSM8K, where the reasoning chain spans hundreds of tokens and different features become relevant at different stages.
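The AFM loop can be sketched as follows; `masked_selection`, `expand_mask`, and the activation threshold are illustrative names and choices, not the paper's exact mechanism:

```python
import numpy as np

def masked_selection(logits, mask):
    """Mask unavailable features by setting their logits to -inf, then argmax."""
    masked = np.where(mask, logits, -np.inf)
    return int(np.argmax(masked))

def expand_mask(mask, sae_activations, threshold=0.0):
    """Features that naturally activate at the current token join the mask."""
    return mask | (sae_activations > threshold)

# Toy step with 5 features; only feature 0 is available initially.
mask = np.array([True, False, False, False, False])
logits = np.array([0.1, 2.0, 0.5, -1.0, 0.3])   # feature 1 scores best but is masked
print(masked_selection(logits, mask))            # -> 0
mask = expand_mask(mask, np.array([0.0, 0.7, 0.0, 0.0, 0.2]))
print(masked_selection(logits, mask))            # -> 1, once feature 1 unlocks
```

The second selection illustrates the virtuous cycle: a feature that activated naturally during generation becomes eligible and is immediately preferred by the policy.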
Branch Analysis: Where Steering Decides the Answer
This is where CRL’s interpretability capabilities become most compelling. Branch analysis identifies tokens where the choice of feature determines whether the model produces a correct or incorrect answer.
How Branch Analysis Works
For each generation, we track:
- The context (shared prefix before the branch point)
- The correct path — the feature that leads to a correct answer
- The incorrect path — the feature that leads to a wrong answer
This provides direct causal evidence that specific features are responsible for output quality, not just correlated with it.
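One way to log such branch points is a small record per branch. The schema and all field values below are hypothetical, chosen only to show the shape of the data:

```python
from dataclasses import dataclass

@dataclass
class BranchPoint:
    """One logged branch: a token position where feature choice flips correctness.

    Field names are illustrative, not the paper's exact logging schema.
    """
    token_index: int        # position of the branch in the generation
    context: str            # shared prefix before the branch point
    correct_feature: int    # SAE feature id whose path yields a correct answer
    incorrect_feature: int  # SAE feature id whose path yields a wrong answer

# Hypothetical example entry (feature ids and index are placeholders).
bp = BranchPoint(token_index=212,
                 context="<shared prefix of the generation>",
                 correct_feature=7708,
                 incorrect_feature=1234)
```

Aggregating these records across generations is what enables the per-feature success counts reported later.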
Layer-Specific Feature Semantics
Comparing features selected at Layer 10 vs Layer 20 reveals a striking pattern:
Layer 10 features capture concrete syntactic patterns: surface-level notation, entity tracking, and formatting. Layer 20 features encode abstract semantics: logical derivation structure, planning, and strategic reasoning.
Feature labels have limited predictive value. A feature labeled “emotional relationships” can solve speed calculations, while “mathematical notation” fails on some arithmetic. This indicates learned steering effects diverge from semantic interpretations derived from activation patterns alone.
Out of 8 analyzed branch pairs, Layer 10 features succeed on 3 cases (discrete counting, direct arithmetic), while Layer 20 features succeed on 5 cases (multi-step accumulation, variable reasoning).
Empirical Validation
To confirm that the learned policy selects meaningful features — not arbitrary ones — we compare CRL against baselines on five benchmarks. The goal is not performance gain per se, but to validate that the policy has learned a non-trivial feature selection strategy worth analyzing.
Learned vs. Random Selection
| Method | BBQ Ambig | HarmBench |
|---|---|---|
| No Steering (Base) | 60.17 | 41.46 |
| Random Feature | 58.36 | 45.35 |
| Random + AFM | 60.16 | 46.96 |
| Most-Active Feature | 59.94 | 48.03 |
| CRL (Ours) | 65.86 | 49.12 |
Random feature steering can actually hurt performance (−1.81 on BBQ Ambig), while selecting the most-active SAE feature barely helps. The learned policy discovers features whose causal effects differ from their activation magnitude — precisely the gap that static analysis cannot reveal.
Critic Trajectory Analysis
The learned critic (value function) reveals where in the generation process the model’s confidence changes — and whether task limitations stem from the policy or the value estimator.
Key Observations
For multi-token generation (GSM8K), critic trajectories reveal the temporal dynamics of steering:
- Correct samples: steadily rising values, diverging from the other groups around tokens 200–400
- Incorrect samples: flat or declining values throughout
- Corrected samples: value increases around the ~400-token mark, indicating interventions become more effective in later reasoning stages
- Misguided samples: initially high values that later drop; the steering seemed promising but ultimately led the model astray
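The four trajectory types could be bucketed with a simple heuristic like the one below; the thresholds and category logic are illustrative assumptions, not the paper's classifier:

```python
import numpy as np

def classify_trajectory(values):
    """Heuristic bucketing of a per-token critic value trajectory (illustrative).

    'correct'  : starts fairly high and keeps rising
    'corrected': late rise after a low start
    'misguided': early high values that drop
    'incorrect': flat or declining throughout
    """
    q = max(1, len(values) // 4)
    start, end = float(np.mean(values[:q])), float(np.mean(values[-q:]))
    if end > start + 0.1:
        return "corrected" if start < 0.5 else "correct"
    if start > end + 0.1:
        return "misguided"
    return "incorrect"

print(classify_trajectory(np.linspace(0.2, 0.9, 500)))   # -> corrected
print(classify_trajectory(np.full(500, 0.3)))            # -> incorrect
```

In practice one would calibrate the thresholds against labeled outcomes rather than fix them at 0.1 and 0.5 as done here.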
Feature Interpretability
CRL provides per-token intervention logs that reveal which features are selected and when. This enables a new form of mechanistic analysis.
Feature-Token Alignment
Corrected examples show clear semantic coherence: features activate on mathematical tokens, operators, and quantitative units — demonstrating lexical generalization beyond surface forms.
When Features Fail
Misguided examples reveal failure modes:
- Task-irrelevant features disrupting the reasoning chain
- Token-feature misalignment producing incoherent activations
- Feature interference causing the model to abandon its solution path
Feature Diversity
Feature diversity varies significantly across tasks, measured by entropy of the feature selection distribution:
| Task | Feature Diversity | Interpretation |
|---|---|---|
| GSM8K | 6.695 | Complex reasoning needs diverse features |
| MMLU-Pro | 5.476 | High complexity, broad engagement |
| XSTest | 4.757 | Intermediate diversity |
| HarmBench | 2.938 | Focused refusal uses concentrated features |
Complex reasoning tasks require broader feature engagement, while focused behavioral tasks benefit from concentrated feature activation.
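The diversity metric can be computed as the Shannon entropy of the empirical feature-selection distribution; the log base (bits here) is an assumption:

```python
import numpy as np

def selection_entropy(counts):
    """Shannon entropy (in bits; the paper's base is an assumption) of the
    empirical distribution over selected SAE features."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()           # normalize, dropping unselected features
    return float(-(p * np.log2(p)).sum())

# A concentrated policy (HarmBench-like) vs a uniform, diverse one.
print(selection_entropy([90, 5, 5]))   # low entropy: focused selection
print(selection_entropy([1] * 100))    # high: log2(100) ≈ 6.64 bits
```

A policy that always picks the same feature scores 0, while one spreading evenly over $n$ features scores $\log_2 n$, which is the sense in which GSM8K's value reflects broader feature engagement than HarmBench's.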
Layer-Wise Analysis
Different layers encode different types of information, and CRL’s layer-wise training reveals this structure.
Residual Stream Norms
Residual stream norms increase with depth, which directly impacts the optimal steering coefficient:
- Early layers (L5–L10): smallest norms, so only small coefficients are tolerated
- Middle layers (L15–L20): intermediate norms and coefficients
- Late layers (L22–L25): largest norms, supporting the strongest interventions

Large coefficients degrade early-layer performance, while late layers tolerate, and in fact require, stronger interventions.
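One simple way to realize this depth-dependent scaling is to tie the coefficient to the layer's activation norm. This normalization scheme is an illustrative assumption, not the paper's recipe:

```python
import numpy as np

def scaled_alpha(h_layer, base_alpha=1.0, ref_norm=1.0):
    """Scale the steering coefficient with the residual-stream norm at this
    layer, so the intervention stays proportionally sized at every depth.

    base_alpha and ref_norm are hypothetical tuning knobs.
    """
    return base_alpha * np.linalg.norm(h_layer) / ref_norm

shallow = np.full(8, 0.5)   # small-norm early-layer activation (toy values)
deep = np.full(8, 5.0)      # large-norm late-layer activation (toy values)
print(scaled_alpha(shallow), scaled_alpha(deep))  # deeper layer -> larger alpha
```

Under this scheme the same relative perturbation is applied everywhere, matching the observation that late layers need absolutely larger coefficients.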
Discussion
What CRL Reveals as an Analysis Tool
CRL provides three diagnostic capabilities that complement static feature analysis:
1. Branch tracking locates critical decision points — tokens where the choice of a single SAE feature determines whether the output is correct or incorrect. This provides causal evidence for feature importance, going beyond correlational activation analysis.
2. Critic trajectories separate policy from value-estimation limitations. When the critic shows a bottleneck (MMLU), we know the value function is the limiting factor, not the policy’s feature selection. When the critic shows clear discrimination (BBQ), the policy-critic coordination is effective.
3. Layer comparison reveals a syntactic-to-semantic gradient: early layers encode concrete patterns (notation, formatting), while later layers capture abstract structure (logical derivation, strategic planning).
Limitations
- Layer selection is post-hoc: optimal layers are identified via validation, not predicted a priori
- Feature labels are unreliable: SAE feature descriptions often diverge from their actual steering effects
- Single-feature constraint: selecting exactly one feature per token maximizes interpretability but may leave performance on the table
- Computational cost: while modest (8–9% overhead), CRL requires PPO training for each task-layer combination
Conclusion
Control Reinforcement Learning transforms SAE feature steering from a static analysis tool into a dynamic, token-level intervention framework with built-in interpretability. By training a policy to select features at each generation step, CRL produces mechanistic analysis tools:
- Branch tracking locates critical decision points where feature choice determines output correctness
- Critic trajectories separate policy limitations from value-estimation errors
- Layer comparison reveals syntactic-to-semantic feature gradients across network depth
Validated across knowledge, bias, reasoning, and safety benchmarks, CRL maintains single-feature interpretability throughout. The framework establishes learned feature steering as a practical mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.