Beyond the Black Box: Interpretability of Agentic AI Tool Use
Pith reviewed 2026-05-11 01:12 UTC · model grok-4.3
The pith
A toolkit of sparse autoencoders and linear probes can identify the internal features that drive tool-use decisions inside AI agents before they act.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training sparse autoencoders on activations from multi-step tool-use trajectories and fitting linear probes to predict tool necessity and consequence, the method locates specific layers and sparse features tied to tool decisions; ablating those features changes the model's subsequent tool behavior, showing they carry functional information about when and whether to act.
What carries the argument
Sparse Autoencoders that break down model activations into a small set of active features, used together with linear probes that read those features to forecast tool calls and their stakes.
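A minimal sketch of that pipeline, assuming a pretrained SAE encoder (W_enc, b_enc), cached pre-action activations from one layer, and binary tool-necessity labels; the shapes, random placeholder tensors, and function names are illustrative, not the paper's code.

# Decompose cached activations with a sparse autoencoder, then fit a linear
# probe on the sparse features to predict whether the next step needs a tool.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

d_model, d_sae = 4096, 16384            # hypothetical residual and dictionary widths
W_enc = torch.randn(d_model, d_sae)     # stand-in for trained SAE encoder weights
b_enc = torch.zeros(d_sae)

def sae_encode(acts: torch.Tensor) -> torch.Tensor:
    # ReLU dictionary encoding: (n_steps, d_model) -> sparse (n_steps, d_sae)
    return torch.relu(acts @ W_enc + b_enc)

acts = torch.randn(2000, d_model)               # placeholder for cached pre-action activations
labels = torch.randint(0, 2, (2000,)).numpy()   # placeholder for tool-necessity labels

features = sae_encode(acts).numpy()
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("tool-necessity probe accuracy:", probe.score(X_test, y_test))

A second probe trained the same way against a consequence label would supply the "stakes" signal; in the paper's setup both targets would come from labeled Nemotron trajectory steps rather than random placeholders.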
If this is right
- Agents can be monitored for tool decisions at the activation level before any output is generated.
- Ablation experiments can rank which internal features actually control whether a tool is invoked.
- Repeating the same SAE-and-probe workflow on different model families recovers comparable tool-decision features and probes, so the approach is not tied to one architecture.
- Early detection of high-stakes tool actions becomes possible inside long-horizon runs where one mistake alters later steps.
- Internal observability supplements external evaluation by surfacing why a tool failure occurred rather than only scoring its outcome.
Where Pith is reading between the lines
- Real-time systems could read the same probes during deployment to pause or reroute an agent when internal signals indicate an unnecessary or risky tool call (a minimal gating sketch follows this list).
- The approach could be extended to other agent behaviors such as planning steps or memory updates by training new probes on the same SAE features.
- In safety reviews, the identified features might serve as targets for editing or constraining models to reduce unintended tool use.
- Similar decomposition techniques might apply to non-tool agent actions, revealing whether the same layers handle different kinds of decisions.
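To make the first of those extrapolations concrete, here is a minimal deployment-time gate, assuming probes with a scikit-learn-style predict_proba interface; the thresholds, the action labels, and the escalate-versus-block policy are assumptions, not anything the paper specifies.

# Pre-action gate: score current SAE features with the trained probes and
# decide whether to proceed, skip, or escalate the pending tool call.
import numpy as np

NECESSITY_THRESHOLD = 0.3   # below this, the tool call looks unnecessary (assumed cutoff)
RISK_THRESHOLD = 0.8        # above this, the action looks high-stakes (assumed cutoff)

def gate_tool_call(sae_features: np.ndarray, necessity_probe, consequence_probe) -> str:
    p_needed = necessity_probe.predict_proba(sae_features[None, :])[0, 1]
    p_high_stakes = consequence_probe.predict_proba(sae_features[None, :])[0, 1]
    if p_needed < NECESSITY_THRESHOLD:
        return "skip"        # internal signal says the tool is not required
    if p_high_stakes > RISK_THRESHOLD:
        return "escalate"    # pause for review before a consequential action
    return "proceed"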
Load-bearing premise
The sparse features recovered by the autoencoders and the outputs of the linear probes reflect genuine causal drivers of tool-use choices inside the model rather than incidental patterns in the training data.
What would settle it
If ablating the top-ranked SAE features leaves the model's tool-calling rate and error patterns unchanged on new trajectories, or if the probes lose predictive accuracy when applied to models or tasks outside the training distribution.
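A sketch of what that falsification test could look like, assuming SAE weights (W_enc, b_enc, W_dec, b_dec), a hook point that returns the layer's residual tensor directly (real transformer blocks often return tuples), and helper routines for running trajectories; every name here is illustrative rather than the paper's implementation.

# Zero the top-ranked SAE features at the identified layer during generation,
# then compare tool-calling behaviour with and without the ablation.
import torch

def make_ablation_hook(W_enc, b_enc, W_dec, b_dec, ablate_idx):
    def hook(module, inputs, output):
        f = torch.relu(output @ W_enc + b_enc)   # SAE features at this layer
        f[..., ablate_idx] = 0.0                 # knock out the candidate features
        return f @ W_dec + b_dec                 # patch in the edited reconstruction
    return hook

# Hypothetical usage (model, LAYER, run_trajectories, tool_call_rate are assumed):
# handle = model.layers[LAYER].register_forward_hook(
#     make_ablation_hook(W_enc, b_enc, W_dec, b_dec, top_tool_features))
# ablated_rate = tool_call_rate(run_trajectories(model, held_out_tasks))
# handle.remove()
# If ablated_rate and the error pattern match the unablated baseline on new
# trajectories, the features are not load-bearing for the tool decision.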
Original abstract
AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how consequential the next tool action is likely to be. By decomposing activations into sparse features, it identifies the internal layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can reshape the rest of the agentic interaction. More broadly, the paper shows how mechanistic interpretability can support practical internal observability for monitoring tool calls and risk in agent systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a mechanistic interpretability toolkit for AI agents that employs Sparse Autoencoders (SAEs) to decompose model activations into sparse features and linear probes to infer tool-use needs and consequence before each action. It identifies layers and features associated with tool decisions from multi-step trajectories in the NVIDIA Nemotron function-calling dataset, tests functional importance via feature ablation, and extends the workflow to GPT-OSS 20B and Gemma 3 27B models. The stated goal is to supply internal pre-action visibility into agent failures that external prompts, evaluations, and logs cannot diagnose, particularly in long-horizon settings.
Significance. If the empirical claims hold with proper validation, the work would meaningfully advance practical interpretability for agentic systems by demonstrating how SAEs and probes can surface internal signals tied to tool selection. This addresses a genuine gap between external observability and mechanistic understanding, with potential value for safety monitoring in enterprise workflows where early tool errors cascade.
Major comments (3)
- Abstract: The abstract describes the toolkit and training workflow but supplies no quantitative results, ablation outcomes, or validation metrics, so the data cannot be checked against the stated claims about identifying layers and features most associated with tool decisions.
- Feature ablation procedure: The claim that ablation confirms functional importance of the extracted SAE features requires (a) intervention during the full forward pass at the correct layer, (b) comparison against ablation of matched random or non-tool features, and (c) verification that SAE reconstruction error itself does not alter downstream behavior. None of these controls are described, leaving open the possibility that observed output changes reflect spurious correlations rather than causal mechanistic roles; an illustrative shape for controls (b) and (c) is sketched after this list.
- Cross-model evaluation: Application of the same workflow to GPT-OSS 20B and Gemma 3 27B is reported without any feature alignment metrics, probe transfer performance, or cross-model consistency checks. This omission prevents assessment of whether the identified features generalize or remain model-specific artifacts.
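For concreteness, one possible shape for controls (b) and (c), assuming hypothetical helpers run_with_ablation and tool_call_rate supplied by the evaluation harness; this is a sketch of the comparison, not the authors' procedure.

# Compare ablation of the candidate tool features against (b) a matched random
# baseline and (c) a reconstruction-only baseline with no features removed.
import numpy as np

rng = np.random.default_rng(0)

def ablation_controls(top_tool_features, n_features_total, trajectories,
                      run_with_ablation, tool_call_rate):
    non_tool = [i for i in range(n_features_total) if i not in set(top_tool_features)]
    random_features = rng.choice(non_tool, size=len(top_tool_features), replace=False)
    return {
        "tool_features":  tool_call_rate(run_with_ablation(list(top_tool_features), trajectories)),
        "random_matched": tool_call_rate(run_with_ablation(list(random_features), trajectories)),
        "recon_only":     tool_call_rate(run_with_ablation([], trajectories)),
    }

# A causal story predicts a large behavioural shift for "tool_features" and
# near-baseline rates for the two controls.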
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important areas for improving clarity, rigor, and completeness. We address each major comment point by point below, with commitments to revisions that strengthen the empirical claims without misrepresenting the current work.
Point-by-point responses
- Referee: Abstract: The abstract describes the toolkit and training workflow but supplies no quantitative results, ablation outcomes, or validation metrics, so the data cannot be checked against the stated claims about identifying layers and features most associated with tool decisions.
Authors: We agree that the abstract would be strengthened by including key quantitative results to allow immediate evaluation of the claims. In the revised version, we will incorporate specific metrics such as probe accuracy for tool-need inference on the Nemotron dataset, the percentage reduction in tool-selection errors from feature ablation, and the primary layers identified as most associated with tool decisions. These additions will be concise and will not exceed abstract length limits while enabling readers to assess the findings directly. revision: yes
- Referee: Feature ablation procedure: The claim that ablation confirms functional importance of the extracted SAE features requires (a) intervention during the full forward pass at the correct layer, (b) comparison against ablation of matched random or non-tool features, and (c) verification that SAE reconstruction error itself does not alter downstream behavior. None of these controls are described, leaving open the possibility that observed output changes reflect spurious correlations rather than causal mechanistic roles.
Authors: The referee correctly notes that stronger causal evidence requires these specific controls. Our current description of feature ablation demonstrates output changes but does not detail the full set of baselines and checks. We will revise the methods and results sections to explicitly describe: (a) interventions performed at the identified layer during the complete forward pass, (b) direct comparisons against ablation of randomly selected and non-tool-related features as controls, and (c) verification that SAE reconstruction error alone does not drive behavioral changes (via comparison to full reconstruction baselines). These additions will rule out spurious correlations and support the functional importance claims. revision: yes
- Referee: Cross-model evaluation: Application of the same workflow to GPT-OSS 20B and Gemma 3 27B is reported without any feature alignment metrics, probe transfer performance, or cross-model consistency checks. This omission prevents assessment of whether the identified features generalize or remain model-specific artifacts.
Authors: We acknowledge that the cross-model results are presented at a workflow level without quantitative generalization metrics. To address this, we will expand the relevant results subsection with feature alignment metrics (such as cosine similarity between top SAE features across models), probe transfer performance when applying probes trained on Nemotron trajectories to the other models, and consistency checks on layer and feature identification. These additions will clarify the degree of generalization versus model-specific effects; one illustrative form such a consistency check could take is sketched below. revision: yes
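One illustrative form for that consistency check: compare how each model's top SAE features activate over the same evaluation steps, so that features living in different hidden sizes stay comparable; the metric and the assumption of a shared evaluation set are editorial, not the paper's reported method.

# Cosine similarity between feature activation profiles on a shared eval set.
import numpy as np

def activation_profile_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    # feats_*: (n_steps, n_top_features) SAE activations for each model on the
    # same evaluation steps. Returns an (n_top_a, n_top_b) cosine-similarity matrix.
    a = feats_a / (np.linalg.norm(feats_a, axis=0, keepdims=True) + 1e-8)
    b = feats_b / (np.linalg.norm(feats_b, axis=0, keepdims=True) + 1e-8)
    return a.T @ b

# A high best-match similarity for each tool-related feature in one model would
# suggest a counterpart in the other; uniformly low values would point toward
# model-specific artifacts.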
Circularity Check
No significant circularity detected
Full rationale
The paper describes an empirical mechanistic interpretability pipeline that trains SAEs and linear probes on an external dataset (NVIDIA Nemotron trajectories) and applies feature ablation to test associations with tool-use decisions. No equations, derivations, or self-referential definitions appear in the abstract or described workflow. The central claims rest on standard SAE reconstruction and probe training followed by intervention tests, without reducing any 'prediction' to a fitted parameter by construction or relying on load-bearing self-citations whose content is unverified. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.