Publications

Research papers and technical writing.

OptimismBench: Forecasting Bias and the Alignment Effect in Language Model Judgment

ICML 2026 Forecasting · Article

Forecasting Optimism Bias Alignment LLM Safety

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

ICML 2026 · Article

Steering Vectors Mechanistic Interpretability LLM Safety

Automata from Agent Traces: Failure and Next-Step Prediction

ICML 2026 AIWILD

Agent LLM Safety

Tool Calling is Linearly Readable and Steerable in Language Models

ICML 2026 FAGEN Oral

Mechanistic Interpretability Steering Vectors Agent

AgentRoom: Concurrent Multi-Agent Coding in a CRDT-Backed Shared Workspace

ICML 2026 FAGEN · Article

Agent

PaaT: Probe as a Tool for Proprioceptive Language Agents

ICML 2026 FAGEN · Article

Agent Mechanistic Interpretability

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

ICLR 2026 Trustworthy AI · Article

Mechanistic Interpretability Sparse Autoencoders Steering Vectors LLM Safety

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

ICLR 2026 Reliable Autonomy · Article

LLM Safety Mechanistic Interpretability

AgentGraph: Trace-to-Graph Platform for Interactive Analysis and Robustness Testing in Agentic AI Systems

AAAI 2026 Demo

Agent LLM Safety

FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Datasets Dependency

ACL 2025 SRW

Sparse Autoencoders Mechanistic Interpretability

LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

ICML 2025 TAIG

LLM Safety

RTSum: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization

NAACL 2024 Demo

Summarization