Publications

Research papers and technical writing.

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

ICML 2026 · Article
Steering Vectors · Mechanistic Interpretability · LLM Safety

Automata from Agent Traces: Failure and Next-Step Prediction

ICML 2026 AIWILD
Agent · LLM Safety

Tool Calling is Linearly Readable and Steerable in Language Models

ICML 2026 FAGEN Oral
Mechanistic Interpretability · Steering Vectors · Agent

AgentRoom: Concurrent Multi-Agent Coding in a CRDT-Backed Shared Workspace

ICML 2026 FAGEN
Agent

PaaT: Probe as a Tool for Proprioceptive Language Agents

ICML 2026 FAGEN
Agent · Mechanistic Interpretability

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

ICLR 2026 Trustworthy AI · Article
Mechanistic Interpretability · Sparse Autoencoders · Steering Vectors · LLM Safety

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

ICLR 2026 Reliable Autonomy · Article
LLM Safety · Mechanistic Interpretability

AgentGraph: Trace-to-Graph Platform for Interactive Analysis and Robustness Testing in Agentic AI Systems

AAAI 2026 Demo
Agent · LLM Safety

FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Datasets Dependency

ACL 2025 SRW
Sparse Autoencoders · Mechanistic Interpretability

LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

ICML 2025 TAIG
LLM Safety

RTSum: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization

NAACL 2024 Demo
Summarization