Research articles and technical writing.
Interactive exploration of how language models encode confidence — intrinsic dimension analysis, layer evolution, activation steering, and cross-architecture transfer.
Interactive exploration of Control Reinforcement Learning — learned per-token feature steering with branch analysis, critic trajectories, and layer-wise interpretability.
Interactive exploration of correlation-guided feature selection for controllable language model behavior using Sparse Autoencoders.
Investigating how dataset composition affects sparse autoencoder feature matching and density patterns.
Understanding in-context learning by reversing transformer representations, exploring phase changes and feature dimensionality.
Exploring how the superposition hypothesis in neural networks relates to steering language models using sparse autoencoders.