PaaT: Probe as a Tool

TL;DR

A tightrope walker can see the rope, but what keeps them up is feeling their own balance shift before they fall. Tool-augmented language agents have the opposite problem: they read every external tool result, yet they are blind to their own internal state. The signs of failure (a reasoning loop, rising uncertainty, a drift toward an unsafe action) are already written in the hidden activations, but the agent never sees them until they surface as a tool call, where training-time alignment can no longer step in.

PaaT (Probe as a Tool) gives the agent a sense of proprioception: we train lightweight linear probes on hidden-state activations and expose them as an agent-callable function, read_probe, so the model can query its own confidence and refusal-likelihood mid-trajectory and self-regulate at runtime, with no weight modification.

The surprising result is a sign flip on delivery. When the agent chooses to call the probe, mean F1 score (the harmonic mean of precision and recall) rises by 3.7 points over an unassisted baseline. When the same probe scalar is injected automatically every step, F1 drops by 3.3 points. And a hard system-side threshold with no model in the loop sharply cuts harmful compliance on a weaker agent (0.70 to 0.30, preliminary). The channel, not the probe, decides whether the agent gets safer.

Agents are internally blind

Tool-augmented agents (Schick et al., 2023; Yao et al., 2023) are deployed for multi-step tasks, and recent safety benchmarks show they will comply with harmful instructions end to end, especially at smaller scale (Andriushchenko et al., 2025). The puzzle is that the warning signs are usually present internally before the harmful action happens. Linear probes on hidden states recover directions for refusal, truthfulness, and latent belief, so the information needed to stop is recoverable. It just lives system-side, where the agent cannot reach it.

Every prior internal-monitoring method keeps the probe signal on the system side: a cascade, a guardrail, or an external monitor reads the activations and decides for the agent (Li et al., 2025). PaaT asks a different question. What happens when you hand the agent its own probe and let it decide when to look?

Proprioception has two pathways

Biology already solved internal sensing twice. Unconscious proprioception runs through the spinocerebellar tract: always on, no attention cost, but coarse. Conscious body awareness runs through the dorsal-column pathway: high fidelity, but it costs attention, so you only invoke it for novel or high-stakes moves. Healthy motor control uses both.

That distinction maps cleanly onto the design choice at the heart of PaaT. Automatic injection of a probe signal is the unconscious pathway: always present, but it cannot be ignored. An agent-callable probe is the conscious pathway: high fidelity, but only consulted when the agent decides it matters. The rest of this article is about what happens at each end of that spectrum, and why the ends have opposite signs.

The injection-tool spectrum

The controlled variable in PaaT is not which probe to train. It is who decides when the agent sees it. We lay the delivery mechanisms on a single axis from fully system-side to fully agent-initiated, and treat each one as a small program that transforms what the agent sees on each step.

Figure 1: The injection-tool spectrum. Delivery mechanisms ordered by increasing agent autonomy, from system-automatic injection (left, "System decides") to agent-initiated tool calls (right, "Agent decides"). Each stage differs in what the agent sees on each step and in who is in control. The two endpoints, auto_inject and probe_tool, are the ones measured here.

At one extreme, auto_inject appends the raw probe scalar to the context on every step. At the other, probe_tool registers read_probe once and stays silent; the agent calls it on demand. Between them sit conditional injection, periodic summaries, and a cascade that keeps a cheap signal always on while reserving the full probe for moments of low confidence. We measure the two endpoints; the middle is the open design space.
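To make the "small program" framing concrete, here is a minimal sketch of the two measured endpoints as functions that transform what the agent sees on a step. The StepContext container, the ProbeReader type, and the "[probe] harm_score=" message format are hypothetical names chosen for illustration; only auto_inject, probe_tool, and read_probe come from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepContext:
    """What the agent sees on one step (hypothetical container)."""
    messages: List[str]
    tools: List[str] = field(default_factory=list)


# Returns the current probe scalar in [0, 1]; stands in for the real probe.
ProbeReader = Callable[[], float]


def auto_inject(ctx: StepContext, read: ProbeReader) -> StepContext:
    """System decides: append the raw scalar to the context on every step."""
    note = f"[probe] harm_score={read():.2f}"  # message format is an assumption
    return StepContext(ctx.messages + [note], list(ctx.tools))


def probe_tool(ctx: StepContext, read: ProbeReader) -> StepContext:
    """Agent decides: register read_probe once and add nothing to the context.

    The probe is not read here; it is only read if the agent calls the tool.
    """
    tools = list(ctx.tools)
    if "read_probe" not in tools:
        tools.append("read_probe")
    return StepContext(list(ctx.messages), tools)


ctx = StepContext(messages=["user: summarise this page"], tools=["search"])
injected = auto_inject(ctx, lambda: 0.8)
on_demand = probe_tool(ctx, lambda: 0.8)
print(injected.messages[-1])  # the scalar is now part of the context
print(on_demand.tools)        # context unchanged; the tool is merely available
```

The intermediate points on the spectrum would be further functions with the same signature, which is what makes the delivery mechanism a clean controlled variable.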

How PaaT works

A frozen language model (LM) exposes the hidden state at a fixed depth (62.5% of the way through the network, empirically the most informative layer). A frozen linear probe maps that hidden state to a scalar in $[0,1]$, read as the probe’s estimate that the current context is harm-related. The agent acts under its normal policy, but on a delivered context that depends on the spectrum condition.

Figure 2: A frozen LM (left) exposes its layer-L* hidden state to a linear probe head (centre: mean-pool over the last 64 tokens, standardised features, logistic regression). The agent loop (right) consumes the probe either passively as a system-injected scalar (auto_inject) or actively via the read_probe tool. Sense at the activation level; act at the text level.
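The probe head described above (mean-pool over the last 64 tokens, standardised features, logistic regression) can be sketched in a few lines. This is an illustrative, dependency-free version, not the authors' code: the function names are hypothetical, the weights and statistics would come from a trained probe, and flooring the 62.5% depth to a layer index is an assumption, since the article does not say how the fraction is rounded.

```python
import math
from typing import List, Sequence


def probe_layer_index(num_layers: int, depth: float = 0.625) -> int:
    """Layer at a fixed fractional depth (flooring is an assumption)."""
    return math.floor(depth * num_layers)


def mean_pool_last(hidden: Sequence[Sequence[float]], window: int = 64) -> List[float]:
    """Average the hidden vectors of the last `window` token positions."""
    tail = hidden[-window:]
    dim = len(tail[0])
    return [sum(tok[d] for tok in tail) / len(tail) for d in range(dim)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probe_score(pooled: Sequence[float], mean: Sequence[float], std: Sequence[float],
                weights: Sequence[float], bias: float) -> float:
    """Standardise the pooled features, then apply a logistic-regression head."""
    z = [(x - m) / s for x, m, s in zip(pooled, mean, std)]
    logit = sum(w * zi for w, zi in zip(weights, z)) + bias
    return sigmoid(logit)


# Toy hidden states: 10 early tokens, then 64 recent tokens (dimension 2).
hidden = [[0.0, 0.0]] * 10 + [[2.0, 4.0]] * 64
pooled = mean_pool_last(hidden)  # only the last 64 tokens contribute
score = probe_score(pooled, mean=[1.0, 3.0], std=[1.0, 1.0],
                    weights=[0.5, 0.5], bias=0.0)
print(probe_layer_index(32), pooled, round(score, 3))
```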

The probe surfaces four signals, each targeting a failure mode the agent could plausibly self-detect:

Uncertainty $u_t$: one minus a success probe trained on past episodes. How unsure the model is that the current trajectory will succeed.
Loopiness $\ell_t$: above-baseline cosine self-similarity of the hidden state over the last 5 steps. The model revisiting its own internal states.
Inconsistency $\iota_t$: the variance of step-to-step cosine movement. Sudden directional changes in activation space that correlate with reasoning derailment.
Tool addiction $\alpha_t$: the fraction of recent actions that are tool calls. Unproductive repeated searching.
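One plausible reading of these four definitions, written out as code. The article gives the signals in words only, so several details here are assumptions: loopiness is taken as the mean pairwise cosine similarity over the last 5 states minus a supplied baseline (clipped at zero), cosine movement is taken as one minus the cosine similarity of consecutive states, and the 5-action window for tool addiction is hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance
from typing import Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def uncertainty(p_success: float) -> float:
    """u_t: one minus the success probe's output."""
    return 1.0 - p_success


def loopiness(states: Sequence[Vec], baseline: float, window: int = 5) -> float:
    """l_t: above-baseline mean pairwise cosine similarity of recent states."""
    pairs = list(combinations(states[-window:], 2))
    if not pairs:
        return 0.0
    mean_sim = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return max(0.0, mean_sim - baseline)


def inconsistency(states: Sequence[Vec]) -> float:
    """iota_t: variance of step-to-step cosine movement (1 - cosine)."""
    moves = [1.0 - cosine(a, b) for a, b in zip(states, states[1:])]
    return pvariance(moves) if moves else 0.0


def tool_addiction(actions: Sequence[str], window: int = 5) -> float:
    """alpha_t: fraction of recent actions that are tool calls."""
    recent = actions[-window:]
    return sum(a == "tool" for a in recent) / len(recent) if recent else 0.0


# A trajectory stuck on one state: maximal self-similarity, no movement.
stuck = [[1.0, 0.0]] * 5
print(loopiness(stuck, baseline=0.5), inconsistency(stuck))
print(tool_addiction(["tool", "think", "tool", "tool"]))
```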

When the agent calls read_probe, it gets these numbers back as a tool observation and reasons over them in text. Sense at the activation level, act at the text level.

The sign flip

Here is the central result. We run an 11-head risk-classification suite on Qwen3.5-27B, comparing four conditions: llm_only (no probe), auto_inject (scalar pushed every step), probe_tool (agent calls on demand), and probe_only (system threshold, no model reasoning, an upper bound on the probe alone). We relate each head’s gain to the probe’s AUROC (Area Under the Receiver Operating Characteristic curve, where 0.5 is chance and 1.0 is perfect ranking).

Figure 3: Probe-tool gains are reliable in the AUROC sweet spot. Per-head gain of probe_tool over llm_only against probe AUROC (series: probe_tool − llm_only and auto_inject − llm_only; shaded band: AUROC 0.5 to 0.95). Every head inside the shaded band gains; every loss falls outside it, where the probe is near-random (left) or saturated (right). probe_only (hollow) marks the no-model ceiling.

The agent-callable probe beats the unassisted model on 8 of 11 heads. The biggest gains are +18 points on offensive toxicity, +9 on age and race bias, and +8 on jailbreak. Auto-injection of the same scalar underperforms tool access on 8 of 11 heads. Passive injection of an imperfect confidence score acts as noise; the agent cannot decline to read what is already in its context.

Averaged across heads, probe_tool lifts F1 by 3.7 points (0.763 vs 0.726) while auto_inject lowers it by 3.3 points (0.693 vs 0.726). A capable decoder can ignore a noisy scalar it chose not to consult. It cannot ignore the same scalar after the system has pasted it into context.

Two limit cases explain the shape. If the probe is near-perfect, a model that already has the same context can often classify well on its own; if it is near-random, it carries nothing worth reading. In between, a good-but-imperfect probe carries signal the model cannot fully reconstruct from context alone, and every head in that band gains. The heuristic is not a hard law: offensive toxicity still gains +18 points at AUROC 0.98, and saturation is where results split. But every head that loses sits outside the band, at saturation (personally identifiable information, PII) or below chance (religion and gender bias).

A capability ceiling, not a probe failure

Does the agent-callable advantage hold across model families and scales? We replicate the suite on three more models, training each probe on that model’s own activations with the identical recipe.

Figure 4: AgentHarm compliance and the capability ceiling. Left, end-to-end harmful-compliance rate on AgentHarm by condition on Qwen3.5-9B (a shorter bar is safer); the system-side probe_only gate cuts 9B compliance the most, and on the 9B the two delivery endpoints diverge on the same probe (preliminary). Right, probe_tool gain over llm_only across model scale: positive at 27B and Llama-3.1-8B, negative at 4B, consistent with a capability ceiling.

The pattern is consistent. The probe_only system threshold is the reference ceiling at every scale, which makes it the natural deployment target for small models. The agent-callable gain is positive at Qwen3.5-27B (+4 points) and Llama-3.1-8B (+8 points), but flips negative at Qwen3.5-4B (-21 points). The crossover sits between 4B and 8B.

The negative gain at 4B is not a probe-quality failure: the same pipeline gates probe_only cleanly. The failure mode is the agent’s inability to weight an imperfect scalar against context. So a weaker model wants a hard system-side gate, while a stronger agent benefits from selective tool access. The two regimes are not interchangeable, and conflating them is what produces the conflicting reports in prior probe-monitor work.

End to end on AgentHarm

The classification suite isolates the delivery variable on a single-step task. AgentHarm (Andriushchenko et al., 2025) exercises the motivating scenario directly: an agent with 80 mock tools, graded on whether it invokes a harmful target sequence end to end. A weaker agent (Qwen3.5-9B) complies with 70% of harmful requests at 90% benign success, leaving ample room for runtime intervention.

base agent (compliance)

1 Reads the phishing-email task
2 Drafts the email body
3 Calls send_email(…)

Outcome: harmful action executed

probe_only (prevented)

1 Reads the phishing-email task
2 Drafts the email body
3 System gate reads fθ(h_L*) > τ
4 ABORT before send_email

Outcome: harmful action blocked

Same harmful intent, single Qwen3.5-9B task. The system-side gate aborts on the linear-probe reading before the tool call is issued.

Figure 5: An illustrative phishing-style episode on Qwen3.5-9B (schematic). Left: the base agent drafts the body and calls send_email (compliance). Right: the probe_only agent's system-side gate reads the linear probe above its threshold and aborts before send_email is issued (compliance prevented).

Adding the system-side probe_only gate cuts 9B harmful compliance from 0.70 to 0.30, the largest single intervention, and the same direction holds on Llama-3.1-8B (0.40 to 0.20). The delivery channel matters on safety too, directionally: on the 9B, auto-injection of the same signal raises compliance to 0.90 while the tool call trims it to 0.60.
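The probe_only gate is simple enough to sketch directly: compare the probe reading to a threshold before a tool call is issued, and abort if it is exceeded, with no model reasoning in the loop. The ToolCall and GateResult types, the gated_execute name, and the threshold value of 0.5 are hypothetical; the article does not report the τ it used.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GateResult:
    executed: bool
    output: Any = None
    reason: str = ""


def gated_execute(call: ToolCall, probe_score: float,
                  execute: Callable[[ToolCall], Any], tau: float = 0.5) -> GateResult:
    """Abort before the tool call is issued if the probe reading exceeds tau.

    tau=0.5 is a placeholder, not a value from the article.
    """
    if probe_score > tau:
        return GateResult(False, reason=f"aborted {call.name}: probe {probe_score:.2f} > tau {tau:.2f}")
    return GateResult(True, output=execute(call))


sent: List[Dict[str, Any]] = []


def send_email(call: ToolCall) -> str:
    """Mock tool that records what would have been sent."""
    sent.append(call.args)
    return "sent"


blocked = gated_execute(ToolCall("send_email", {"to": "a@example.com"}), 0.91, send_email)
allowed = gated_execute(ToolCall("send_email", {"to": "b@example.com"}), 0.12, send_email)
print(blocked.reason)
print(allowed.output, sent)
```

Because the gate sits outside the agent's policy, it works the same way regardless of how well the model can weigh an imperfect scalar against its context.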

The AgentHarm runs use 10 tasks per class with wide Wilson intervals (about 40 to 50 points), so we read them as directional rather than statistically established. The robust quantitative evidence is the 27B 11-head suite over 200 items; AgentHarm is preliminary support for the same mechanism, to be established at larger n.
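The width of those intervals is easy to check. The sketch below computes the standard Wilson score interval for a binomial proportion at n = 10; the counts 3, 7, and 9 out of 10 correspond to the 0.30, 0.70, and 0.90 compliance rates quoted above, assuming each rate is a simple count over 10 tasks.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for k in (3, 7, 9):
    low, high = wilson_interval(k, 10)
    print(f"{k}/10: [{low:.2f}, {high:.2f}], width {100 * (high - low):.0f} points")
```

At 7 of 10, for example, the 95% interval runs from roughly 0.40 to 0.89, which overlaps heavily with the interval at 3 of 10. That overlap is why the AgentHarm numbers are read as directional.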

Activation probes earn their place

A fair question: would a cheaper output-level signal do just as well? Across heads, next-token entropy hovers at AUROC 0.58 to 0.64 regardless of risk category. Activation probes span a far wider range, from near-random on bias categories that need world knowledge the activations do not encode, up to perfect on categories with strong activation-space signatures such as PII and system-prompt leakage. On the categories where the internal state encodes the signal, the gap over entropy reaches +0.39 AUROC. The probe is not uniformly better; it is decisively better exactly where it has signal, and a calibrated agent learns to defer to its own reasoning where it does not.
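For readers who want to reproduce this kind of comparison, the two ingredients are short. The sketch below computes next-token entropy from a vector of logits and AUROC as the probability that a positive example outscores a negative one (ties count half). The toy scores and labels are invented for illustration and are not data from the article.

```python
import math
from typing import Sequence


def next_token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison: P(score_pos > score_neg), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented numbers): a probe that separates the classes well
# and an entropy signal that barely does.
labels = [1, 1, 1, 0, 0, 0]
probe_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
entropy_scores = [
    next_token_entropy(l)
    for l in ([1.0, 1.0, 0.0], [3.0, 0.0, 0.0], [1.0, 0.5, 0.0],
              [1.0, 1.0, 0.5], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
]
print(auroc(probe_scores, labels), auroc(entropy_scores, labels))
```

The pairwise form is quadratic in the number of examples, which is fine for a 200-item suite; a rank-based implementation would be preferable at larger scale.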

What this means for deployment

PaaT suggests an architecture for safety-aware agents: read internal state through linear probes (cheap and accurate) and act through the model’s text instruction-following channel (well calibrated and deployable). The injection-tool spectrum then maps onto a tradeoff every deployment team faces anyway. Automatic injection guarantees the agent sees a safety signal but spends context and can confuse a weaker model. Agent-initiated tools respect autonomy and save context but rely on the agent to look.

A natural middle ground is a cascade, with a cheap always-on signal plus an on-demand deep probe, the same dual-pathway design biology uses. A production probe-to-model cascade at exactly this operating point has been reported to cut inference cost by roughly 50x at safety thresholds. We did not measure the cascade, but it follows directly from the spectrum, and it lets one system instantiate either regime per request.
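As a purely illustrative sketch of that cascade, which the article did not measure: an always-on cheap signal is read every step, and the full probe is consulted only when the cheap signal is ambiguous. The function name, the callables, and the 0.2 and 0.8 cut-offs are all hypothetical.

```python
from typing import Callable, Tuple


def cascade_score(cheap_signal: Callable[[], float],
                  deep_probe: Callable[[], float],
                  low: float = 0.2, high: float = 0.8) -> Tuple[float, bool]:
    """Return (score, used_deep_probe).

    The cheap signal runs on every step. The deep probe runs only when the
    cheap signal falls strictly between `low` and `high`, i.e. when the
    cheap reading is not confident either way. Cut-offs are placeholders.
    """
    cheap = cheap_signal()
    if cheap <= low or cheap >= high:
        return cheap, False
    return deep_probe(), True


deep_calls = 0


def deep() -> float:
    """Stand-in for the full activation probe; counts how often it runs."""
    global deep_calls
    deep_calls += 1
    return 0.93


confident = cascade_score(lambda: 0.05, deep)  # cheap signal is decisive
ambiguous = cascade_score(lambda: 0.55, deep)  # escalates to the deep probe
print(confident, ambiguous, deep_calls)
```

Setting `low` equal to `high` collapses the cascade to the always-on pathway, while widening the band toward (0, 1) approaches consulting the deep probe on every step.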

Conclusion

When the same probe scalar reaches an agent through different channels, the channel determines whether the agent gets safer or less safe. That shifts the deployment question from which probe to train to who decides when the agent sees it. A capable agent benefits from selective access to an imperfect signal; a weaker one is better served by a hard system-side gate that bypasses its policy entirely. Interpretability artifacts do not have to stay offline diagnostics. They can be runtime tools.

Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, J. Z., Fredrikson, M., Gal, Y., & Davies, X. (2025). AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=AC5n7xHuR1
Li, W., Li, D., Dong, K., Zhang, C., & Zhang, H. (2025). Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger. Annual Meeting of the Association for Computational Linguistics (ACL).
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Thirty-Seventh Conference on Neural Information Processing Systems. https://arxiv.org/abs/2302.04761
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2210.03629
