Publications

Research papers and technical writing.

ControlRL: Concept Bottleneck Sparse Autoencoders for Scalable Concept Representation Learning

ICLR 2026 Trustworthy AI · Article
Mechanistic Interpretability · Sparse Autoencoders · Steering Vectors · LLM Safety
Confidence Manifold: Revealing LLM Internal States via Topological Inconsistency Analysis

ICLR 2026 Reliable Autonomy · Article
LLM Safety · Mechanistic Interpretability
CorrSteer: Steering LLMs via Correlation-based Corrections

ICLR 2026 RE-Align · Article
Steering Vectors · Mechanistic Interpretability · LLM Safety
AgentGraph: Trace-to-Graph Platform for Interactive Analysis and Robustness Testing in Agentic AI Systems

AAAI 2026 Demo
Agent · LLM Safety
FaithfulSAE: Evaluating Faithfulness of Sparse Autoencoders in Large Language Models

ACL 2025 SRW
Sparse Autoencoders · Mechanistic Interpretability
LibVulnWatch: Automated Vulnerability Monitoring Using LLMs for Open-Source Libraries

ICML 2025 TAIG
LLM Safety
RTSum: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization

NAACL 2024 Demo
Summarization