pith. machine review for the scientific record. sign in

arxiv: 2604.10134 · v1 · submitted 2026-04-11 · 💻 cs.CR

Recognition: unknown

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:14 UTC · model grok-4.3

classification 💻 cs.CR
keywords indirect prompt injectionLLM agentsdefense frameworkplanning-based verificationcontext isolationtool use securityattack success rate
0
0 comments X

The pith

PlanGuard stops indirect prompt injection attacks on LLM agents by checking actions against a user-only plan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents that call external tools can be hijacked when attackers hide instructions inside the content the agent retrieves. Most existing defenses filter inputs before the model processes them, but they leave the agent's actual behavior unmonitored. PlanGuard adds an isolated planner that builds a reference list of allowed actions using only the original user instructions, then applies layered checks during execution. Hard constraints block any unauthorized tool calls while an intent verifier decides whether parameter differences are innocent formatting or malicious changes. On the InjecAgent benchmark this reduces successful attacks from 72.8 percent to zero while producing only 1.49 percent false positives, and the method works without retraining any model.

Core claim

PlanGuard is a training-free defense framework built on context isolation. An isolated Planner generates a reference set of valid actions derived solely from the user's instructions. A Hierarchical Verification Mechanism first applies strict hard constraints to prevent unauthorized tool invocations and then uses an Intent Verifier to determine whether any observed parameter deviations represent benign formatting variances or malicious hijacking attempts.

What carries the argument

Isolated Planner that produces a reference set of valid actions from user instructions alone, paired with Hierarchical Verification Mechanism for runtime consistency checks.

If this is right

  • Agents can safely use external tools and process retrieved content without successful hijacking.
  • No model training or fine-tuning is required for the defense to function.
  • Attack success rate falls to zero on the InjecAgent benchmark with a false positive rate of 1.49 percent.
  • The approach remains effective across different underlying language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Runtime behavior verification can serve as a useful second layer alongside input filtering for agent security.
  • The same planning-and-check structure might address other forms of context poisoning in autonomous systems.
  • Integrating dynamic replanning when user goals are complex could further reduce false blocks.

Load-bearing premise

The planner can generate a complete and accurate reference set of valid actions from user instructions alone, and the verifier can reliably separate benign parameter formatting from malicious hijacking.

What would settle it

An attack that causes the agent to invoke an unauthorized tool or alter parameters in a way the hierarchical checks accept as valid, or a legitimate user request that the planner misses and therefore blocks.

Figures

Figures reproduced from arXiv: 2604.10134 by Guangyu Gong, Zizhuang Deng.

Figure 1
Figure 1. Figure 1: Overview of the PlanGuard architecture. The framework decouples the instruction processing into two paths: an Isolated Planner for generating a [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance comparison between Vanilla Agent and PlanGuard on [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Large Language Model (LLM) agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre-processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training-free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations, and subsequently employs an Intent Verifier to validate whether parameter deviations are benign formatting variances or malicious hijacking. Experiments on the InjecAgent benchmark demonstrate that PlanGuard effectively neutralizes these attacks, reducing the Attack Success Rate (ASR) from 72.8% to 0%, while maintaining an acceptable False Positive Rate of 1.49%. Furthermore, our method is model-agnostic and highly compatible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PlanGuard, a training-free defense framework for LLM agents against Indirect Prompt Injection (IPI). It relies on Context Isolation via an isolated Planner that generates a reference set of valid actions solely from user instructions, combined with a Hierarchical Verification Mechanism that applies hard constraints to block unauthorized tool calls and then uses an Intent Verifier to classify parameter deviations as benign or malicious. Experiments on the InjecAgent benchmark are reported to reduce Attack Success Rate (ASR) from 72.8% to 0% while keeping False Positive Rate (FPR) at 1.49%, with the method claimed to be model-agnostic and compatible with existing agents.

Significance. If the results hold under broader conditions, the work would be significant for agent security by introducing runtime consistency verification rather than relying solely on input pre-processing. The training-free design and explicit separation of planning from execution are practical strengths that could enable deployment without retraining. The reported perfect ASR reduction on InjecAgent, if reproducible, would represent a strong empirical outcome for this threat model.

major comments (3)
  1. [Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.
  2. [Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.
  3. [Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.
minor comments (2)
  1. [Abstract] The abstract and method descriptions would benefit from explicit pseudocode or a diagram clarifying the data flow between the Isolated Planner, hard constraints, and Intent Verifier.
  2. [Introduction] Notation for 'valid actions' and 'parameter deviations' is introduced without a formal definition or example in the early sections, which could improve clarity for readers unfamiliar with agent tool-calling formats.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and proposing revisions to enhance the paper's rigor and completeness.

read point-by-point responses
  1. Referee: [Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.

    Authors: We agree that an analysis of the Isolated Planner's completeness would provide valuable context for the reported results. Although the InjecAgent benchmark consists of tasks where user instructions are sufficiently clear to allow complete planning, we will revise the manuscript to include quantitative metrics on planner success rate, coverage for multi-step instructions, and discussion of potential failure modes. This addition will better substantiate the conditions under which the 0% ASR is achieved. revision: yes

  2. Referee: [Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.

    Authors: The referee raises a valid point regarding the lack of targeted evaluation for the Intent Verifier. The hard constraints form the primary defense against unauthorized tools, while the verifier handles nuanced parameter cases. To strengthen this, we will add ablation studies and targeted experiments in the revised version that evaluate the Intent Verifier's performance against disguised and syntactically similar attacks, including cases designed to mimic benign formatting. These experiments will help confirm the robustness of the 0% ASR claim. revision: yes

  3. Referee: [Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.

    Authors: We acknowledge the absence of explicit ablations in the current manuscript. The hierarchical design ensures that hard constraints block most attacks, with the verifier as a secondary check, which contributes to the low FPR. However, to address the concern about benchmark-specific artifacts, we will incorporate ablation studies isolating the hard constraints and Intent Verifier, as well as tests for edge cases like planner isolation failures and mimicking parameter deviations. This will demonstrate the method's effectiveness more comprehensively. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmark is self-contained

full rationale

The paper presents a training-free defense framework (isolated Planner generating reference actions from user instructions alone, followed by hard-constraint then Intent-Verifier stages) whose central performance claims are direct experimental measurements on the external InjecAgent benchmark (ASR reduced from 72.8% to 0%, FPR 1.49%). No equations, fitted parameters, self-citation load-bearing premises, or ansatzes appear in the provided text that would make any result equivalent to its inputs by construction. The framework is described as model-agnostic and compatible with prior methods rather than derived from them. This is the normal case of an applied system paper whose validity rests on external falsifiable testing rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework rests on the domain assumption that user instructions alone suffice to define all legitimate actions and that deviations can be classified without access to retrieved content.

axioms (1)
  • domain assumption User instructions alone contain enough information to enumerate a complete reference set of valid actions.
    Invoked by the Context Isolation principle and the design of the isolated Planner.
invented entities (2)
  • Isolated Planner no independent evidence
    purpose: Generate reference valid actions exclusively from user instructions.
    New architectural component introduced to enforce context isolation.
  • Intent Verifier no independent evidence
    purpose: Distinguish benign parameter deviations from malicious hijacking.
    New verification submodule within the Hierarchical Verification Mechanism.

pith-pipeline@v0.9.0 · 5496 in / 1211 out tokens · 38450 ms · 2026-05-10T16:14:04.698455+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

Reference graph

Works this paper leans on

21 extracted references · 11 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  2. [2]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,”Advances in Neural Infor- mation Processing Systems, vol. 36, pp. 68 539–68 551, 2023

  3. [3]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations, 2022

  4. [4]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausmanet al., “Do as i can, not as i say: Grounding language in robotic affordances,”arXiv preprint arXiv:2204.01691, 2022

  5. [5]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models,

    C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y . Su, “Llm-planner: Few-shot grounded planning for embodied agents with large language models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2998–3009

  6. [6]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” 2023. [Online]. Available: https://arxiv.org/abs/2302.12173

  7. [8]

    Ignore previous prompt: Attack techniques for language models,

    F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” 2022. [Online]. Available: https://arxiv.org/abs/2211. 09527

  8. [9]

    Prompt Injection attack against LLM-integrated Applications

    Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, L. Y . Zhang, and Y . Liu, “Prompt injection attack against llm-integrated applications,” 2025. [Online]. Available: https://arxiv.org/abs/2306.05499

  9. [10]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents,” 2024. [Online]. Available: https://arxiv.org/abs/2403.02691

  10. [11]

    Fine-tuned deberta-v3 for prompt injection detection,

    ProtectAI.com, “Fine-tuned deberta-v3 for prompt injection detection,” 2023. [Online]. Available: https://huggingface.co/ProtectAI/ deberta-v3-base-prompt-injection

  11. [12]

    Detecting Language Model Attacks with Perplexity

    G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,”arXiv preprint arXiv:2308.14132, 2023

  12. [13]

    Secalign: Defending against prompt injection with preference optimization,

    S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wag- ner, and C. Guo, “Secalign: Defending against prompt injection with preference optimization,” inProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 2833–2847

  13. [14]

    {StruQ}: Defending against prompt injection with structured queries,

    S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “{StruQ}: Defending against prompt injection with structured queries,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2383–2400

  14. [15]

    Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,

    T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen, “Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,” inProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, 2023, pp. 431–445

  15. [16]

    AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    H. Wang, C. M. Poskitt, and J. Sun, “Agentspec: Customizable run- time enforcement for safe and reliable llm agents,”arXiv preprint arXiv:2503.18666, 2025

  16. [17]

    Mitigating indirect prompt injection via instruction-following intent analysis,

    M. Kang, C. Xiang, S. Kariyappa, C. Xiao, B. Li, and E. Suh, “Mitigating indirect prompt injection via instruction-following intent analysis,”arXiv preprint arXiv:2512.00966, 2025

  17. [18]

    Jailbroken: How does llm safety training fail?

    A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?”Advances in Neural Information Processing Sys- tems, vol. 36, pp. 80 079–80 110, 2023

  18. [19]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,”arXiv preprint arXiv:2309.00614, 2023

  19. [20]

    Optimization-based prompt injection attack to llm-as-a-judge,

    J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based prompt injection attack to llm-as-a-judge,” 2025. [Online]. Available: https://arxiv.org/abs/2403.17710

  20. [21]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freireet al., “Open problems and fundamental limitations of reinforcement learning from human feedback,”arXiv preprint arXiv:2307.15217, 2023

  21. [22]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Chenget al., “Sleeper agents: Training deceptive llms that persist through safety training,” arXiv preprint arXiv:2401.05566, 2024