Research articles and technical writing.
Interactive exploration of how language models encode confidence — intrinsic dimension analysis, layer evolution, activation steering, and cross-architecture transfer.
Interactive exploration of Control Reinforcement Learning — learned per-token feature steering with branch analysis, critic trajectories, and layer-wise interpretability.
Interactive exploration of correlation-guided feature selection for controllable language model behavior using Sparse Autoencoders.
Investigating how dataset composition affects sparse autoencoder feature matching and density patterns.
Understanding in-context learning by reversing transformer representations, exploring phase changes and feature dimensionality.
Exploring how the superposition hypothesis in neural networks relates to steering language models using sparse autoencoders.