
arxiv: 2605.25310 · v1 · pith:3EBDVL6Unew · submitted 2026-05-25 · 💻 cs.CL

Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

Pith reviewed 2026-06-29 23:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · tool calling · residual stream · dependency graph · structural probing · activation patching

The pith

The dependency structure among an LLM agent's tool calls is linearly decodable from its residual stream activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simple linear probe can extract the directed graph of tool-call dependencies from internal model activations while the agent is running. This decoding works above random-label and position-only controls and survives tests that swap specific values while keeping the dependency pattern intact. The same signal appears in several multi-hop tool-use settings but fades when call order by itself predicts the graph. Activation patching experiments indicate the information is actively carried forward through layers rather than merely copied from the input.

Core claim

A low-capacity edge probe on the residual stream of Qwen3-32B recovers the tool-call dependency graph at accuracy well above random-label and positional baselines. A counterfactual contrast between corrupting values (structure preserved) and perturbing structure (values preserved) indicates that the probe tracks abstract topology rather than identifier values. The non-positional signal appears in three other interactive benchmarks and disappears in single-shot planning, where call order alone suffices. Per-layer activation patching shifts the probe's output at a later, non-patched boundary, which suggests the representation is propagated forward, although the realised tool call itself does not change.

What carries the argument

A low-capacity edge probe on residual-stream activations that classifies directed dependencies between tool calls.
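As a rough illustration of what such an edge probe could look like, the sketch below fits a logistic-regression classifier on features for ordered call pairs. Everything specific here is an assumption made for illustration: the feature layout (one vector per pair, e.g. two residual-stream vectors concatenated), the synthetic data, the hyperparameters, and the names `fit_edge_probe` and `edge_logits`. The paper's actual feature construction and training recipe are not given on this page.

```python
import numpy as np

def fit_edge_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Fit a logistic-regression probe by plain gradient descent.

    X: (n_pairs, d) features for ordered call pairs (hypothetical layout).
    y: (n_pairs,) binary labels, 1 if call i's output feeds call j.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted edge probability
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # L2-regularised gradient step
        b -= lr * np.mean(p - y)
    return w, b

def edge_logits(X, w, b):
    """Linear readout: positive logit means 'edge present'."""
    return X @ w + b

# Synthetic stand-in for residual-stream features: the edge label is
# encoded along one fixed direction, plus isotropic Gaussian noise.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=2 * d_model)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 2 * d_model)) + 2.0 * np.outer(2 * y - 1, direction)

# Train on the first 300 pairs, evaluate on the remaining 100.
w, b = fit_edge_probe(X[:300], y[:300])
accuracy = np.mean((edge_logits(X[300:], w, b) > 0) == y[300:])
print(f"held-out accuracy: {accuracy:.2f}")
```

The probe has one weight per feature dimension plus a bias, which is the sense in which it is low-capacity: it can only succeed if the edge label is close to linearly separable in the features.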

Load-bearing premise

The counterfactual contrast between value corruption and structural perturbation, together with the non-substring oracle, fully isolates abstract topology from all other correlated features in the residual stream.

What would settle it

A dataset in which the probe still succeeds after structural edges are randomly rewired while value tokens remain unchanged, or fails after value tokens are changed while edges stay fixed.
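The two manipulations in this test can be made concrete with a toy trajectory. The sketch below is illustrative only: the call format, the exact-match oracle in `dependency_edges`, and the helpers `corrupt_values` and `rewire_argument` are hypothetical and are not the paper's oracle or perturbation procedure.

```python
def dependency_edges(calls):
    """Edge (i, j) iff call i's output value is passed as an argument of a later call j."""
    edges = set()
    for j, later in enumerate(calls):
        for i in range(j):
            if calls[i]["output"] in later["args"].values():
                edges.add((i, j))
    return edges

def corrupt_values(calls, mapping):
    """Rename identifier values consistently (mapping must be one-to-one).

    Every use of a value is renamed together, so the edge set is unchanged.
    """
    return [
        {
            "name": c["name"],
            "args": {k: mapping.get(v, v) for k, v in c["args"].items()},
            "output": mapping.get(c["output"], c["output"]),
        }
        for c in calls
    ]

def rewire_argument(calls, j, arg, new_source):
    """Make argument `arg` of call j read from call `new_source` instead.

    Only values already present in the trajectory are used; the edge set changes.
    """
    out = [dict(c, args=dict(c["args"])) for c in calls]
    out[j]["args"][arg] = calls[new_source]["output"]
    return out

calls = [
    {"name": "find_user", "args": {"email": "a@example.com"}, "output": "user_17"},
    {"name": "list_orders", "args": {"user_id": "user_17"}, "output": "order_42"},
    {"name": "get_order", "args": {"order_id": "order_42"}, "output": "item_9"},
]

clean = dependency_edges(calls)
value_corrupted = corrupt_values(calls, {"user_17": "user_99", "order_42": "order_07"})
structure_perturbed = rewire_argument(calls, 2, "order_id", 0)

print(clean)                                  # edges of the clean chain
print(dependency_edges(value_corrupted))      # same edges, different values
print(dependency_edges(structure_perturbed))  # different edges, same value vocabulary
```

A probe that tracks topology should keep its predictions under the first manipulation and change them under the second; the settling dataset described above is one where the opposite happens.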

Figures

Figures reproduced from arXiv: 2605.25310 by Dimitar Kazakov, Tianda Sun.

Figure 1. The tool-call dependency graph is linearly decodable from an LLM agent’s residual stream. Top: tool-using LLM agents call functions sequentially, and earlier outputs supply arguments to later calls, inducing a latent dependency DAG (an edge i→j iff call i’s output supplies an argument of call j); a low-capacity logistic probe reads each edge from the frozen residual stream. Bottom: the edge becomes linear…
Figure 2. The probe recovers the transitive closure of the dependency DAG: multi-hop provenance edges that are never adjacent in the trajectory. A representative trajectory: solid blue = direct oracle edges (call i’s output feeds call i+1); dashed green = transitive-only edges (i ⇝ k with no direct i→k) that the residual-stream probe linearly recovers at AUROC 0.986, higher than the 0.869 direct-edge AUROC (§4.2.1…
Figure 3. Multi-hop (transitive) provenance is decoded more strongly than direct edges, and the two co-emerge in the early-mid stack. Per-trajectory LOGO AUROC of the direct-edge probe (blue) and the transitive-only probe (green) across seven sampled layers of Qwen3-32B: both saturate by layer 14 (∼22% depth); the transitive signal rises ∼2× faster over L0–L14 (+0.15 vs +0.07 AUROC) and stays ∼+0.12 AUROC above …
Figure 4. The non-positional contribution falls monotonically as call order becomes more position-predictable. ∆resid−pos (transitive task) vs. each benchmark’s position-only baseline; bars are 95% paired-bootstrap CIs. Spearman ρ = −0.80 over n = 4 is descriptive only (p ≈ 0.33); the dashed line is a visual guide, not a fitted law.
Figure 5. Per-trajectory edge-set symmetric differ…
Figure 7. Per-layer edge-probe AUROC: the dependency signal emerges by layer 14 (∼22% depth) and persists to the final layer, far above the random-label control. Qwen3-32B (blue) against the Hewitt-Liang random-label control band (±2σ, light red) and the strongest of 500 permutations CTRLmax=0.565 (dashed red); the embedding layer is sign-flipped under LOGO cross-validation. Inset: 500-perm control distribution w…
Figure 8. Activation patching: the dependency representation is load-bearing for the downstream readout at every layer ≥ 14. Per-layer signed patch effect ∆patch (left, blue) and the fraction of 80 minimal pairs shifting toward the donor’s oracle (right, grey bars); bootstrap CI excludes zero at every layer ≥ 14, rising to 97.5% of pairs at layer 57, and the embedding layer (L0) is exactly zero, a clean negative …
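The Figure 4 caption flags its Spearman correlation (ρ = −0.80, n = 4, p ≈ 0.33) as descriptive only. A small enumeration shows why four points cannot support more than that. The sketch assumes no tied ranks and a two-sided exact permutation test, which may not be how the authors computed their p-value; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def spearman_rho(rank_x, rank_y):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return Fraction(n * (n * n - 1) - 6 * d2, n * (n * n - 1))

def exact_two_sided_p(observed_rho, n):
    """Fraction of all n! rank orderings with |rho| at least as large as observed."""
    base = tuple(range(1, n + 1))
    rhos = [spearman_rho(base, perm) for perm in permutations(base)]
    extreme = sum(1 for r in rhos if abs(r) >= abs(observed_rho))
    return Fraction(extreme, len(rhos))

p = exact_two_sided_p(Fraction(-4, 5), 4)
print(p)  # 1/3
```

With n = 4 there are only 24 possible orderings, and 8 of them give |ρ| ≥ 0.80, so the smallest two-sided p-value a correlation this strong can reach is 1/3.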
Original abstract

Tool-using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain-of-thought text, not an agent's run-time call graph. A low-capacity edge probe on the residual stream of Qwen3-32B decodes the tool-call dependency graph well above both a Hewitt-Liang random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent's runtime tool-call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.
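The directed dependency graph described in the abstract, and the transitive-only edges mentioned in the Figure 2 caption, can be illustrated on a four-call trajectory. This is a sketch under stated assumptions: calls are indexed in execution order, every direct edge points forward (i < j), and the function name and example edges are made up for illustration.

```python
def transitive_closure(n, edges):
    """All pairs (i, k) joined by a directed path, for calls indexed 0..n-1.

    Assumes every edge (i, j) points forward in call order (i < j), which
    holds when a call can only consume outputs of earlier calls.
    """
    assert all(i < j for i, j in edges), "edges must point forward in call order"
    reach = {i: set() for i in range(n)}
    for i, j in edges:
        reach[i].add(j)
    # Process sources from last to first so each successor's reach set
    # is already complete when it is merged in.
    for i in reversed(range(n)):
        for j in list(reach[i]):
            reach[i] |= reach[j]
    return {(i, k) for i in range(n) for k in reach[i]}

# Call 0 feeds call 1; call 1 feeds calls 2 and 3.
direct = {(0, 1), (1, 2), (1, 3)}
closure = transitive_closure(4, direct)
transitive_only = closure - direct

print(sorted(transitive_only))  # [(0, 2), (0, 3)]
```

The transitive-only pairs are the multi-hop provenance edges: call 0 never supplies an argument to call 2 or 3 directly, yet both depend on it through call 1.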

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that a low-capacity linear edge probe on the residual stream of Qwen3-32B (and another model family) decodes the directed tool-call dependency graph in LLM agent trajectories above Hewitt-Liang random-label and positional baselines. Counterfactual contrasts (value corruption vs. structural perturbation) plus a non-substring oracle indicate the decoded signal tracks abstract topology rather than identifier values; the non-positional component replicates on three further multi-hop benchmarks, attenuates when call order alone suffices as a proxy, and vanishes in single-shot planning. Per-layer activation patching shifts the probe at a later boundary, suggesting the representation propagates rather than being a passive readout.

Significance. If the controls isolate abstract topology, the result would be the first structural probe of runtime tool-call dependency graphs in agents. Credit is due for the suite of controls (random-label, positional, value corruption, non-substring oracle), replication across benchmarks, and the activation-patching experiment; these elements go beyond simple probe accuracy and address several obvious surface-feature confounds.

major comments (2)
  1. [counterfactual contrast experiments] The section describing the counterfactual contrast between value corruption and structural perturbation: the claim that this isolates abstract topology from all correlated residual features (call-order statistics, argument-length distributions, identifier co-occurrence, execution timing) is load-bearing for the central claim, yet the manuscript provides no ablation demonstrating that the chosen perturbations exhaustively remove those confounds while leaving only topology.
  2. [activation patching results] The section on activation patching: the reported shift at a later, non-patched boundary is presented as evidence of propagation, but without quantitative comparison to a null patching baseline or explicit measurement of whether the realised tool call itself changes, it is unclear whether the patching result supports representation of the dependency graph or merely a downstream readout.
minor comments (3)
  1. [methods] The methods section should report the exact probe architecture, hidden dimension, and training details (including how edges are encoded as binary targets) so that capacity and label construction can be directly assessed.
  2. [replication experiments] The table or figure presenting replication results on the three additional benchmarks should include per-benchmark probe accuracies, control baselines, and effect sizes for direct comparison with the primary Qwen3-32B results.
  3. [abstract and §1] Clarify in the abstract and introduction whether the reported probe accuracies are macro-averaged over edges or micro-averaged, and whether they are computed on held-out trajectories or held-out positions within trajectories.
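The distinction raised in minor comment 3, held-out trajectories versus held-out positions within trajectories, comes down to how evaluation folds are formed. The sketch below shows a leave-one-group-out split keyed on trajectory ID. Reading the figure captions' "LOGO" as leave-one-group-out with trajectories as groups is an assumption, and `leave_one_group_out` and the toy IDs are illustrative.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group.

    groups[k] is the trajectory ID of the k-th call pair, so all pairs from
    the held-out trajectory land in the test fold together.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    for g, test_idx in by_group.items():
        train_idx = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train_idx, test_idx

# Six call pairs drawn from three trajectories.
pair_trajectory = ["t0", "t0", "t0", "t1", "t1", "t2"]
folds = list(leave_one_group_out(pair_trajectory))

for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Splitting at random over pairs instead would let pairs from one trajectory appear on both sides of the split, which can inflate probe scores through trajectory-specific features.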

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the suite of controls in our work. We address the two major comments point by point below.

  1. Referee: [counterfactual contrast experiments] The section describing the counterfactual contrast between value corruption and structural perturbation: the claim that this isolates abstract topology from all correlated residual features (call-order statistics, argument-length distributions, identifier co-occurrence, execution timing) is load-bearing for the central claim, yet the manuscript provides no ablation demonstrating that the chosen perturbations exhaustively remove those confounds while leaving only topology.

    Authors: The counterfactual design contrasts value corruption, which disrupts identifier-specific information while preserving the dependency structure, against structural perturbation, which alters the graph topology while retaining call values and order statistics. This contrast, combined with the non-substring oracle, is intended to isolate topology from the listed confounds. We do not claim the perturbations are exhaustive of every possible residual feature, but the differential effect supports the abstract topology interpretation. We can add further discussion of the perturbation design in a revision to clarify this. revision: partial

  2. Referee: [activation patching results] The section on activation patching: the reported shift at a later, non-patched boundary is presented as evidence of propagation, but without quantitative comparison to a null patching baseline or explicit measurement of whether the realised tool call itself changes, it is unclear whether the patching result supports representation of the dependency graph or merely a downstream readout.

    Authors: The manuscript explicitly notes that the realised tool call does not change under patching, which addresses the concern about behavioral impact. The observed shift in probe accuracy at the later boundary, relative to the patched layer, is presented as evidence of propagation through the network. We agree that a quantitative null baseline (e.g., random activation patching) would strengthen the result and will incorporate such a comparison in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probe results rest on held-out evaluation and controls


The paper reports an empirical linear probe trained on residual-stream activations to decode tool-call dependency edges. Performance is measured on held-out trajectories against random-label (Hewitt-Liang) and positional baselines, with additional counterfactual contrasts (value corruption vs. structural perturbation) and a non-substring oracle. No equation or claim reduces a reported result to its own fitted parameters by construction; the probe weights are not renamed as a prediction, and no self-citation supplies a uniqueness theorem or ansatz that the present work then treats as external. The central claim therefore remains a statistical decoding result rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the standard mechanistic-interpretability assumption that linear probes can extract meaningful information about internal representations when above strong baselines. No new entities are postulated. The probe weights constitute fitted parameters whose values are not reported.

free parameters (1)
  • linear probe weights
    Weights of the low-capacity edge probe are fitted to residual-stream activations to predict dependency edges.
axioms (1)
  • domain assumption: Linear probes on residual streams can recover structural information when performance exceeds random-label and positional baselines
    Invoked throughout the probe experiments and controls described in the abstract.
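The comparison against a random-label baseline that this assumption relies on can be sketched as a permutation control: refit the same probe on shuffled labels many times and compare the real probe's held-out AUROC with the resulting distribution. This is one simple variant on synthetic data, not the paper's protocol. The ridge least-squares probe stands in for the logistic probe, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def fit_probe(X, y, ridge=1.0):
    """Ridge-regularised least-squares probe on +/-1 targets, with a bias column."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (2.0 * y - 1.0))

def probe_scores(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def random_label_control(X_tr, y_tr, X_te, y_te, n_perm, rng):
    """Held-out AUROC of probes trained on shuffled training labels."""
    results = []
    for _ in range(n_perm):
        w = fit_probe(X_tr, rng.permutation(y_tr))
        results.append(auroc(probe_scores(X_te, w), y_te))
    return np.array(results)

# Synthetic features with the label encoded along one direction.
rng = np.random.default_rng(0)
d = 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, d)) + 2.0 * np.outer(2 * y - 1, direction)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

real = auroc(probe_scores(X_te, fit_probe(X_tr, y_tr)), y_te)
control = random_label_control(X_tr, y_tr, X_te, y_te, n_perm=50, rng=rng)
print(f"real AUROC {real:.3f}, control mean {control.mean():.3f}, control max {control.max():.3f}")
```

Comparing against the strongest control run, as the Figure 7 caption does with its maximum over 500 permutations, is a stricter bar than comparing against the control mean.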

pith-pipeline@v0.9.1-grok · 5735 in / 1428 out tokens · 37584 ms · 2026-06-29T23:08:01.766510+00:00 · methodology


Reference graph

Works this paper leans on

7 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Polar probe linearly decodes semantic structures from LLMs

    Are you still on track!? catching LLM task drift with activations. arXiv preprint. Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint. Pablo J. Diego-Simón and 1 others. 2026. Polar probe linear...

  2. [2]

    In Proceedings of NAACL-HLT (Short Papers)

    The geometry of numerical reasoning: Language models compare numeric properties in linear subspaces. In Proceedings of NAACL-HLT (Short Papers). Jiahai Feng, Stuart Russell, and Jacob Steinhardt. 2025. Monitoring latent world states in language models with propositional probes. In International Conference on Learning Representations (ICLR). Spotlight. D...

  3. [3]

    In International Conference on Learning Representations (ICLR)

    Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations (ICLR). Weijiang Li, Yilin Zhu, Rajarshi Das, and Parijat Dube

  4. [4]

    In International Conference on Learning Representations (ICLR)

    Do LLMs build spatial world models? evidence from grid-world maze tasks. In International Conference on Learning Representations (ICLR). Erik Nordby, Tasha Pais, and Aviel Parrack. 2026. Linear probe accuracy scales with model size and benefits from multi-layer ensembling. arXiv preprint. Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie ...

  5. [5]

    From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs

    From chains to DAGs: Probing the graph structure of reasoning in LLMs. arXiv preprint arXiv:2601.17593. A Implementation Details. Model and inference. Qwen3-32B (Qwen Team,

  6. [6]

    Trajectory generation: do_sample=True, temperature 0.6, top-p 0.95, top-k 20, max_new_tokens=4096, max turns

    (HuggingFace revision prefix 9216db5) served in bf16 on one NVIDIA GH200 120 GB GPU per worker, with a decode-time hook capturing the residual stream at each transformer block for every assistant token. Trajectory generation: do_sample=True, temperature 0.6, top-p 0.95, top-k 20, max_new_tokens=4096, max turns

  7. [7]

    yusuf_rossi_9620

    Random seed 42 across Python, NumPy, PyTorch, and scikit-learn; identical seeds across clean and corrupted runs. The τ-bench commit hash is recorded with the release artefacts. Trajectory selection. We collect 120 retail task instances; after filtering trajectories with <2 tool calls, 105 remain. The pair space comprises 1,129 ordered (i, j) pairs with i...