arxiv: 2605.30686 · v1 · pith:C3N3ATFAnew · submitted 2026-05-29 · 💻 cs.CR · cs.AI · cs.LG

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

Pith reviewed 2026-06-28 22:28 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords: indirect prompt injection, ReAct agents, tool calling, injection depth, attack success rate, prompt sanitization, large language models

The pith

Injection depth is the dominant factor determining indirect prompt injection success in tool-calling ReAct agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three variables that could affect indirect prompt injection in ReAct agents: the position of the malicious payload among tool responses, the rhetorical style of the payload, and the number of turns the agent may take. Controlled trials across 20 scenarios and two models show that success rates decline sharply as the payload appears later in the sequence for GPT-4o-mini and remain zero at all depths for Claude Haiku. Framing produces a wide but non-significant swing in success at the first position, while permitted turn count has no measurable effect. The results point to early tool-response sanitization as an efficient way to address most observed risk.

Core claim

Injection depth determines attack success rate in tool-calling ReAct agents: for GPT-4o-mini, success drops from 60 percent at depth 1 to zero at depths 4 and 5, while Claude Haiku shows zero success at all depths due to conservative tool use and instruction resistance. Framing changes success from 25 to 75 percent at depth 1 without statistical significance, and turn caps of 3, 5, or 7 leave success unchanged. Sanitizing only the first tool observation would capture 67 percent of the injection successes measured across the studies.

What carries the argument

Injection depth within the sequence of tool observations returned to the agent, which governs whether the model encounters the payload before task completion.
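
The role of depth can be made concrete with the three-way outcome decomposition described in the Figure 4 caption (injection followed, encountered but resisted, not reached). The sketch below is a minimal illustration, not the paper's harness: the `Trial` record and its field names are hypothetical, and it assumes that a payload placed at depth d is seen only if the agent makes at least d tool calls.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FOLLOWED = "injection followed"
    RESISTED = "injection encountered but resisted"
    NOT_REACHED = "injection not reached"


@dataclass(frozen=True)
class Trial:
    # Hypothetical record; field names are illustrative, not from the paper.
    depth: int                    # 1-based position of the poisoned tool observation
    tool_calls_made: int          # tool calls the agent issued before terminating
    target_action_executed: bool  # whether the attacker's target action ran


def classify(trial: Trial) -> Outcome:
    # Assumption: the payload at depth d is only read if the agent makes >= d tool calls.
    if trial.tool_calls_made < trial.depth:
        return Outcome.NOT_REACHED
    if trial.target_action_executed:
        return Outcome.FOLLOWED
    return Outcome.RESISTED


def attack_success_rate(trials: list[Trial]) -> float:
    """Fraction of trials in which the injected instruction was followed."""
    followed = sum(classify(t) is Outcome.FOLLOWED for t in trials)
    return followed / len(trials)
```

Under this framing, an agent that finishes its task in two tool calls can never count as a success for a depth-4 payload, which is the "not reached" share of the depth effect.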

If this is right

  • Sanitizing the first tool response would address two-thirds of the measured injection successes.
  • The number of turns permitted to the agent does not change injection success rate.
  • Payload framing can produce large swings in success at shallow depths even if those swings lack statistical significance at the sample size used.
  • Certain models exhibit complete resistance to the tested injections through a combination of limited tool calls and refusal to follow injected instructions.
  • The depth effect combines model resistance at early positions with agents finishing their tasks before reaching later payloads.
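
A minimal sketch of what first-observation sanitization could look like. This summary does not describe a concrete sanitizer, so everything here is an assumption: the `sanitize` heuristic (dropping lines that match a few instruction-like patterns), the pattern list, and the `FirstObservationSanitizer` wrapper are hypothetical and only show the control flow of filtering tool observation 1 while passing later observations through unchanged.

```python
import re
from typing import Callable

# Hypothetical, deliberately naive patterns; a real filter would need far more care.
SUSPICIOUS = re.compile(
    r"(ignore (all|any|previous) instructions|you must now|system override)",
    re.IGNORECASE,
)


def sanitize(observation: str) -> str:
    """Drop lines that look like embedded instructions (illustrative heuristic only)."""
    kept = [line for line in observation.splitlines() if not SUSPICIOUS.search(line)]
    return "\n".join(kept)


class FirstObservationSanitizer:
    """Wraps tool calls and sanitizes only the first observation in an episode."""

    def __init__(self) -> None:
        self.observations_seen = 0

    def wrap(self, tool: Callable[[str], str]) -> Callable[[str], str]:
        def wrapped(query: str) -> str:
            raw = tool(query)
            self.observations_seen += 1
            # Only depth 1 is filtered; deeper observations pass through untouched.
            return sanitize(raw) if self.observations_seen == 1 else raw

        return wrapped
```

One wrapper instance would be shared by all tools in an episode so that the counter tracks observation order rather than per-tool call counts. The trade-off is the one the bullets describe: filtering cost is paid once per episode, while payloads at depth 2 or later are left to the model's own resistance.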

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ReAct agents that complete tasks with fewer tool calls would encounter fewer opportunities for depth-dependent injection.
  • Testing additional models could determine whether the complete resistance observed in one model appears more widely.
  • Layering first-response sanitization with model selection could reduce risk without requiring changes to every tool output.
  • Deployments where tool responses vary in length might shift the depth at which success reliably drops to zero.

Load-bearing premise

The twenty scenarios spanning five attack categories represent real-world tool-calling ReAct deployments and the observed patterns generalize beyond the two tested models.

What would settle it

Measuring attack success rates at depths 4 or 5 that are comparable to or higher than those at depth 1 across a broader set of scenarios or models would show that depth is not dominant.

Figures

Figures reproduced from arXiv: 2605.30686 by Mohammadreza Rashidi.

Figure 1. An LLM-integrated ReAct agent under normal execution (top, …
Figure 2. Four-study design matrix. Each study varies exactly one independent variable while holding all others constant; all share the same LangGraph ReAct …
Figure 3. Experiment architecture shared across all four studies. The ReAct agent (GPT-4o-mini or Claude Haiku, temperature 0.0) processes up to …
Figure 4. Study 1 results (GPT-4o-mini, N = 20 per depth, cap = 5 turns). (A) ASR by injection depth with 95% Wilson CI; the bracket marks the only significant adjacent-depth transition (p = 0.023). (B) Outcome decomposition: blue = injection followed (target action executed); amber = injection encountered but resisted; gray = injection not reached (agent terminated before turn d).
Figure 5. Study 1: Pairwise Fisher odds ratios (log scale) with 95% Woolf CI.
Figure 6. Study 1: ASR heatmap by category and injection depth. Darker cells …
Figure 7. Study 2: ASR by depth, GPT-4o-mini versus Claude Haiku. Error …
Figure 8. Study 2: Miss mechanism by depth for Claude Haiku. A large …
Figure 9. Study 3: ASR by framing style. Error bars show 95% Wilson CI.
Figure 10. Heatmap of framing effects per category, titled "ASR by Category and Framing (Dept…". Table IX lists the raw counts.

Table IX. Study 3: successes by framing and category (N = 4 per cell)

Framing     CAL  EMA  FIL  PRI  DEL
AUTHORITY     3    3    2    2    2
NEUTRAL       0    1    1    1    2
HELPFUL       2    1    3    2    3
PERSONA       3    3    4    2    3

Heatmap cell values (ASR; rows are categories, columns are framings)

Category  AUTHORITY  NEUTRAL  HELPFUL  PERSONA
CAL             75%       0%      50%      75%
EMA             75%      25%      25%      75%
FIL             50%      25%      75%     100%
PRI             50%      25%      50%      50%
DEL             50%      50%      75%      75%
Figure 11. Study 4: ASR by depth and turn cap. All three cap conditions produce …
Figure 12. Study 4: Miss mechanism for cap=3 (left) and cap=7 (right).
Figure 13. Resistance mechanism decomposition. Principled rejection (left) occurs when safety training causes the agent to explicitly decline the injected …
Figure 14. Semantic distance as a hidden confound. FILE …
Figure 15. Framing taxonomy for indirect prompt injection payloads. The four tested conditions (top row) map to rhetorical categories identified by Liu et al. [4], …
Figure 16. Three relaxations of the passive fixed-depth attacker model produce computable upper bounds on attack success rate. R1 (oracle framing) achieves …

Original abstract

ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed injection position under fixed conditions, leaving three risk dimensions unexplored: where in the tool sequence the payload appears (injection depth), what rhetorical register it uses (framing), and how many turns the agent is permitted (turn cap). We conduct four controlled studies on 20 scenarios spanning five attack categories, totalling 460 trials against GPT-4o-mini and Claude Haiku at a combined API cost under 0.36 USD. Study 1 shows that ASR against GPT-4o-mini decays from 60% at depth 1 to 0% at depths 4 and 5 (Cramer's V = 0.58, p < 0.001; restricted to within-sequence depths 1-3: V = 0.47, p = 0.0013), driven by model resistance at depth 1 and task completion before payload encounter at deeper positions. Study 2 replicates the depth experiment on Claude Haiku, which achieves 0% ASR at every depth through a combination of conservative tool invocation and genuine instruction resistance. Study 3 shows that framing modulates ASR between 25% (neutral) and 75% (persona) at depth 1, a 50-percentage-point range that does not reach statistical significance at N = 20 per condition. Study 4 confirms that ASR is stable across turn caps of 3, 5, and 7, indicating the turn budget is not a risk factor in this setting. Our results establish injection depth as the dominant variable and show that sanitising only the first tool observation captures 67% of measured injection successes.
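
To make "injection depth" concrete, the sketch below builds a scripted tool stub that returns benign observations except at the d-th call, where the payload is appended. It is an illustration under assumptions, not the paper's LangGraph setup: `make_poisoned_tool`, the benign responses, and the payload string are all hypothetical.

```python
from typing import Callable, Sequence


def make_poisoned_tool(
    benign: Sequence[str], payload: str, depth: int
) -> Callable[[], str]:
    """Return a zero-argument tool whose depth-th call (1-based) carries the payload."""
    if not 1 <= depth <= len(benign):
        raise ValueError("depth must index one of the benign observations")
    calls = 0

    def tool() -> str:
        nonlocal calls
        if calls >= len(benign):
            raise RuntimeError("no more scripted observations")
        observation = benign[calls]
        calls += 1
        # The payload is appended only to the observation at the chosen depth.
        return f"{observation}\n{payload}" if calls == depth else observation

    return tool


# Example: the payload sits in the second of three observations (depth 2).
tool = make_poisoned_tool(["event list", "file index", "contact card"], "PAYLOAD", depth=2)
observations = [tool() for _ in range(3)]
```

An agent that stops after its first call to this stub never reads the payload, which corresponds to the "task completion before payload encounter" mechanism the abstract gives for deeper positions.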

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports four controlled empirical studies (460 trials total) on indirect prompt injection in ReAct tool-calling agents using 20 scenarios across five attack categories. It tests GPT-4o-mini and Claude Haiku and finds that attack success rate (ASR) decays sharply with injection depth for GPT-4o-mini (60% at depth 1 to 0% at depths 4-5; Cramer's V=0.58, p<0.001), that Claude Haiku shows 0% ASR at all depths, that payload framing produces a 50-point ASR range (25-75%) that is non-significant at N=20 per condition, and that turn budget (3/5/7) has no effect. The central claims are that depth is the dominant risk factor and that sanitizing only the first tool observation captures 67% of observed injection successes.

Significance. If the depth-dominance pattern and first-observation sanitization efficacy generalize, the work supplies actionable empirical guidance for hardening ReAct agents against indirect prompt injection at modest experimental cost. The use of 460 trials, explicit statistical tests (Cramer's V, p-values), and within-sequence depth controls are methodological strengths that allow direct comparison of risk dimensions.

major comments (3)
  1. [Abstract, Study 1, Study 2] Abstract and Studies 1-2: The claim that 'injection depth [is] the dominant variable' and that first-observation sanitization captures 67% of successes is derived entirely from GPT-4o-mini data; Claude Haiku contributes zero successes at every depth. Because the decay pattern (60% at depth 1, 0% at 4-5) and the 67% figure therefore reflect only one of the two tested models, the dominance conclusion does not yet rest on evidence that the effect persists across model families.
  2. [Abstract, Study 1] Abstract and Study 1: The 20 scenarios spanning five attack categories are treated as sufficient to establish depth as dominant, yet no analysis or sampling justification is supplied showing that the distribution of tool-observation lengths, framing styles, or task-completion timings in these scenarios matches production ReAct deployments. Without such evidence the generalization from the observed decay to 'dominant variable' status remains untested.
  3. [Study 3] Study 3: The reported 50-percentage-point framing range (neutral 25% to persona 75%) at depth 1 is presented as a risk dimension, but the text itself states the difference does not reach significance at N=20 per condition. This non-significance should be reflected in the relative weighting of framing versus depth in the overall conclusions.
minor comments (2)
  1. [Abstract] Abstract reports point estimates for ASR without error bars, confidence intervals, or per-condition sample sizes, reducing interpretability of the 60%, 25%, and 75% figures.
  2. Full prompt templates, tool-return formatting, and exact scenario descriptions are not provided, limiting reproducibility of the 460 trials.
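
For the intervals the first minor comment asks for, the Wilson score interval (the method named in the figure captions) can be computed directly. The sketch below assumes 12 successes out of 20 for the 60% depth-1 figure and 0 out of 20 for depths 4 and 5, which follows from N = 20 per depth in the Figure 4 caption; the function name `wilson_interval` is ours.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(12, 20)   # depth 1: 60% of N = 20
print(f"depth 1:   [{low:.3f}, {high:.3f}]")

low0, high0 = wilson_interval(0, 20)  # depths 4 and 5: 0% of N = 20
print(f"depth 4/5: [{low0:.3f}, {high0:.3f}]")
```

Under these assumptions the depth-1 interval runs from roughly 0.39 to 0.78, and a 0-of-20 result still leaves an upper bound of roughly 0.16, which is why per-condition sample sizes matter for reading the point estimates.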

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback identifying areas where our claims require qualification. We address each major comment below and indicate the revisions that will be incorporated.

  1. Referee: [Abstract, Study 1, Study 2] Abstract and Studies 1-2: The claim that 'injection depth [is] the dominant variable' and that first-observation sanitization captures 67% of successes is derived entirely from GPT-4o-mini data; Claude Haiku contributes zero successes at every depth. Because the decay pattern (60% at depth 1, 0% at 4-5) and the 67% figure therefore reflect only one of the two tested models, the dominance conclusion does not yet rest on evidence that the effect persists across model families.

    Authors: The referee is correct that both the depth-decay pattern and the 67% first-observation figure derive exclusively from the GPT-4o-mini experiments. Claude Haiku produced zero injection successes at all depths. We will revise the abstract, the Study 1 and Study 2 sections, and the concluding paragraph to state explicitly that the observed dominance of injection depth and the associated sanitization efficacy apply to GPT-4o-mini, while Claude Haiku exhibits resistance independent of depth. This qualification removes the implication of cross-model generality.

    Revision: yes

  2. Referee: [Abstract, Study 1] Abstract and Study 1: The 20 scenarios spanning five attack categories are treated as sufficient to establish depth as dominant, yet no analysis or sampling justification is supplied showing that the distribution of tool-observation lengths, framing styles, or task-completion timings in these scenarios matches production ReAct deployments. Without such evidence the generalization from the observed decay to 'dominant variable' status remains untested.

    Authors: We agree that the manuscript supplies no sampling justification or distributional analysis demonstrating that the 20 scenarios are representative of production ReAct deployments. The scenarios were constructed to cover five attack categories under controlled conditions. We will revise the abstract and Study 1 to present the depth findings as specific to the tested scenario set rather than asserting that depth is the dominant variable in general. A full matching analysis against production logs lies outside the scope of the present work.

    Revision: partial

  3. Referee: [Study 3] Study 3: The reported 50-percentage-point framing range (neutral 25% to persona 75%) at depth 1 is presented as a risk dimension, but the text itself states the difference does not reach significance at N=20 per condition. This non-significance should be reflected in the relative weighting of framing versus depth in the overall conclusions.

    Authors: The manuscript already notes the non-significance of the framing effect at N=20. We nevertheless accept that presenting framing as a distinct risk dimension without stronger qualification may over-weight it relative to depth. We will revise Study 3 and the conclusions to emphasize that the observed framing range is suggestive but statistically non-significant, thereby subordinating it to the depth results obtained with GPT-4o-mini.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of experimental outcomes

The paper reports results from 460 controlled trials across 20 scenarios and two models, with all claims (depth as dominant variable, 67% capture by first-observation sanitization, framing effects, turn-cap stability) derived directly from measured ASR percentages, statistical tests (Cramer's V, p-values), and observed model behaviors. No equations, fitted parameters, derivations, or self-citations are used to generate or reduce any result; the studies are self-contained against the stated experimental conditions without reducing outputs to prior quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions of empirical security research: that the selected scenarios and models are appropriate proxies for real deployments and that conventional statistical tests apply to the categorical success data.

axioms (1)
  • [standard math] Standard statistical measures such as Cramer's V and p-values are appropriate for assessing associations in categorical attack-success data.
    Invoked in Study 1 for depth effects and Study 3 for framing effects.
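
As a consistency check on how the reported effect sizes relate to the p-values: Cramer's V for an r × c table is sqrt(chi2 / (n · min(r − 1, c − 1))), so the chi-square statistic can be recovered from V and n. The sketch below assumes a 2 × k success/failure-by-depth table with N = 20 trials per depth (n = 100 for depths 1 to 5, n = 60 for depths 1 to 3) and assumes the p-values come from a chi-square test of independence; neither is stated in this summary. It uses the closed-form chi-square upper tail that holds for even degrees of freedom.

```python
import math


def chi2_from_cramers_v(v: float, n: int, rows: int, cols: int) -> float:
    """Invert V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return v**2 * n * min(rows - 1, cols - 1)


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here requires positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


# Assumed table shapes: 2 outcomes x 5 depths (n = 100) and 2 x 3 depths (n = 60).
chi2_full = chi2_from_cramers_v(0.58, n=100, rows=2, cols=5)
p_full = chi2_sf_even_df(chi2_full, df=4)

chi2_restricted = chi2_from_cramers_v(0.47, n=60, rows=2, cols=3)
p_restricted = chi2_sf_even_df(chi2_restricted, df=2)

print(f"full:       chi2={chi2_full:.2f}, p={p_full:.2e}")
print(f"restricted: chi2={chi2_restricted:.2f}, p={p_restricted:.4f}")
```

Under these assumptions the restricted case gives p of about 0.0013 and the full case falls well below 0.001, both in line with the values quoted in the abstract.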

pith-pipeline@v0.9.1-grok · 5918 in / 1281 out tokens · 31443 ms · 2026-06-28T22:28:30.182915+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations, 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

  2. [2] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,” arXiv preprint arXiv:2302.12173, 2023. [Online]. Available: https://arxiv.org/abs/2302.12173

  3. [3] F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” in ML Safety Workshop, NeurIPS 2022, 2022. [Online]. Available: https://arxiv.org/abs/2211.09527

  4. [4] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu, “Prompt injection attack against LLM-integrated applications,” arXiv preprint arXiv:2306.05499, 2023. [Online]. Available: https://arxiv.org/abs/2306.05499

  5. [5] Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024. [Online]. Available: https://arxiv.org/abs/2403.02691

  6. [6] LangChain Inc., “LangGraph: Build resilient language agents as graphs,”

  7. [7] [Online]. Available: https://github.com/langchain-ai/langgraph

  8. [8] S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “Struq: Defending against prompt injection with structured queries,” arXiv preprint arXiv:2402.06363, 2024. [Online]. Available: https://arxiv.org/abs/2402.06363

  9. [9] K. Zhu, X. Yang, J. Wang, W. Guo, and W. Y. Wang, “MELON: Provable defense against indirect prompt injection attacks in AI agents,” in International Conference on Machine Learning, 2025. [Online]. Available: https://arxiv.org/abs/2502.05174

  10. [11]

  11. [12] C. Wang, F. Zhang, J. Zhang, Z. Zhang, Y. Wang, L. Huang, J. Gao, Z. Chen, and W. Y. B. Lim, “ICON: Indirect prompt injection defense for agents based on inference-time correction,” arXiv preprint arXiv:2602.20708, 2025. [Online]. Available: https://arxiv.org/abs/2602.20708

  12. [14] SafeAgent: A Runtime Protection Architecture for Agentic Systems. [Online]. Available: https://arxiv.org/abs/2604.17562

  13. [16] [Online]. Available: https://arxiv.org/abs/2601.04795

  14. [17] OpenAI, “GPT-4o mini: Advancing cost-efficient intelligence,” 2024. [Online]. Available: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  15. [18] Anthropic, “Claude haiku 4.5 (claude-haiku-4-5-20251001),”

  16. [19] [Online]. Available: https://www.anthropic.com/claude/haiku

  17. [20] E. B. Wilson, “Probable inference, the law of succession, and statistical inference,” Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, 1927