arxiv: 2604.10134 · v1 · submitted 2026-04-11 · 💻 cs.CR

Recognition: unknown

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

Guangyu Gong , Zizhuang Deng

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:14 UTC · model grok-4.3

classification 💻 cs.CR

keywords indirect prompt injectionLLM agentsdefense frameworkplanning-based verificationcontext isolationtool use securityattack success rate

0 comments

The pith

PlanGuard stops indirect prompt injection attacks on LLM agents by checking actions against a user-only plan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents that call external tools can be hijacked when attackers hide instructions inside the content the agent retrieves. Most existing defenses filter inputs before the model processes them, but they leave the agent's actual behavior unmonitored. PlanGuard adds an isolated planner that builds a reference list of allowed actions using only the original user instructions, then applies layered checks during execution. Hard constraints block any unauthorized tool calls while an intent verifier decides whether parameter differences are innocent formatting or malicious changes. On the InjecAgent benchmark this reduces successful attacks from 72.8 percent to zero while producing only 1.49 percent false positives, and the method works without retraining any model.

Core claim

PlanGuard is a training-free defense framework built on context isolation. An isolated Planner generates a reference set of valid actions derived solely from the user's instructions. A Hierarchical Verification Mechanism first applies strict hard constraints to prevent unauthorized tool invocations and then uses an Intent Verifier to determine whether any observed parameter deviations represent benign formatting variances or malicious hijacking attempts.

What carries the argument

Isolated Planner that produces a reference set of valid actions from user instructions alone, paired with Hierarchical Verification Mechanism for runtime consistency checks.

If this is right

Agents can safely use external tools and process retrieved content without successful hijacking.
No model training or fine-tuning is required for the defense to function.
Attack success rate falls to zero on the InjecAgent benchmark with a false positive rate of 1.49 percent.
The approach remains effective across different underlying language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Runtime behavior verification can serve as a useful second layer alongside input filtering for agent security.
The same planning-and-check structure might address other forms of context poisoning in autonomous systems.
Integrating dynamic replanning when user goals are complex could further reduce false blocks.

Load-bearing premise

The planner can generate a complete and accurate reference set of valid actions from user instructions alone, and the verifier can reliably separate benign parameter formatting from malicious hijacking.

What would settle it

An attack that causes the agent to invoke an unauthorized tool or alter parameters in a way the hierarchical checks accept as valid, or a legitimate user request that the planner misses and therefore blocks.

Figures

Figures reproduced from arXiv: 2604.10134 by Guangyu Gong, Zizhuang Deng.

**Figure 1.** Figure 1: Overview of the PlanGuard architecture. The framework decouples the instruction processing into two paths: an Isolated Planner for generating a [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Performance comparison between Vanilla Agent and PlanGuard on [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Large Language Model (LLM) agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre-processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training-free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations, and subsequently employs an Intent Verifier to validate whether parameter deviations are benign formatting variances or malicious hijacking. Experiments on the InjecAgent benchmark demonstrate that PlanGuard effectively neutralizes these attacks, reducing the Attack Success Rate (ASR) from 72.8% to 0%, while maintaining an acceptable False Positive Rate of 1.49%. Furthermore, our method is model-agnostic and highly compatible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PlanGuard isolates planning from external content and adds a two-stage check to block indirect injections, cutting ASR to zero on one benchmark while leaving the verifier's reliability largely untested.

read the letter

PlanGuard's main move is to keep a planner completely separate so it only sees the original user instructions and builds a reference list of valid actions. Anything the agent tries later gets checked first against hard rules that reject unauthorized tools, then against an LLM intent verifier that decides whether parameter shifts are harmless formatting or a hijack. This is different from the pre-processing defenses that dominate the literature because it watches behavior instead of just cleaning inputs upfront. The reported drop from 72.8% to 0% attack success rate with 1.49% false positives on InjecAgent is the clearest result, and the training-free, model-agnostic setup makes it simple to add to existing agent code. That combination is the part worth paying attention to if you work on tool-using agents. The soft spots sit in the verification layer. The paper gives no ablations on cases where the planner's reference set is incomplete or where the intent verifier has to judge parameter changes that look like normal formatting but are actually malicious. Without those tests the zero ASR could be tied to the specific attack patterns in the benchmark rather than a general property. The abstract also skips implementation details on how the isolated planner handles ambiguous instructions or how the verifier prompt is written, so reproducing the exact numbers would take extra work. This is for engineers and researchers who need a practical starting point for securing LLM agents against retrieved-content attacks. A reader who already runs agents with tools will see a concrete pattern they can prototype quickly. It deserves a serious referee because the problem is real, the method is straightforward, and the initial numbers are strong enough to justify deeper scrutiny. Send it for review and ask specifically for ablations on the verifier and tests against disguised parameter attacks.

Referee Report

3 major / 2 minor

Summary. The paper proposes PlanGuard, a training-free defense framework for LLM agents against Indirect Prompt Injection (IPI). It relies on Context Isolation via an isolated Planner that generates a reference set of valid actions solely from user instructions, combined with a Hierarchical Verification Mechanism that applies hard constraints to block unauthorized tool calls and then uses an Intent Verifier to classify parameter deviations as benign or malicious. Experiments on the InjecAgent benchmark are reported to reduce Attack Success Rate (ASR) from 72.8% to 0% while keeping False Positive Rate (FPR) at 1.49%, with the method claimed to be model-agnostic and compatible with existing agents.

Significance. If the results hold under broader conditions, the work would be significant for agent security by introducing runtime consistency verification rather than relying solely on input pre-processing. The training-free design and explicit separation of planning from execution are practical strengths that could enable deployment without retraining. The reported perfect ASR reduction on InjecAgent, if reproducible, would represent a strong empirical outcome for this threat model.

major comments (3)

[Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.
[Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.
[Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.

minor comments (2)

[Abstract] The abstract and method descriptions would benefit from explicit pseudocode or a diagram clarifying the data flow between the Isolated Planner, hard constraints, and Intent Verifier.
[Introduction] Notation for 'valid actions' and 'parameter deviations' is introduced without a formal definition or example in the early sections, which could improve clarity for readers unfamiliar with agent tool-calling formats.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and proposing revisions to enhance the paper's rigor and completeness.

read point-by-point responses

Referee: [Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.

Authors: We agree that an analysis of the Isolated Planner's completeness would provide valuable context for the reported results. Although the InjecAgent benchmark consists of tasks where user instructions are sufficiently clear to allow complete planning, we will revise the manuscript to include quantitative metrics on planner success rate, coverage for multi-step instructions, and discussion of potential failure modes. This addition will better substantiate the conditions under which the 0% ASR is achieved. revision: yes
Referee: [Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.

Authors: The referee raises a valid point regarding the lack of targeted evaluation for the Intent Verifier. The hard constraints form the primary defense against unauthorized tools, while the verifier handles nuanced parameter cases. To strengthen this, we will add ablation studies and targeted experiments in the revised version that evaluate the Intent Verifier's performance against disguised and syntactically similar attacks, including cases designed to mimic benign formatting. These experiments will help confirm the robustness of the 0% ASR claim. revision: yes
Referee: [Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.

Authors: We acknowledge the absence of explicit ablations in the current manuscript. The hierarchical design ensures that hard constraints block most attacks, with the verifier as a secondary check, which contributes to the low FPR. However, to address the concern about benchmark-specific artifacts, we will incorporate ablation studies isolating the hard constraints and Intent Verifier, as well as tests for edge cases like planner isolation failures and mimicking parameter deviations. This will demonstrate the method's effectiveness more comprehensively. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmark is self-contained

full rationale

The paper presents a training-free defense framework (isolated Planner generating reference actions from user instructions alone, followed by hard-constraint then Intent-Verifier stages) whose central performance claims are direct experimental measurements on the external InjecAgent benchmark (ASR reduced from 72.8% to 0%, FPR 1.49%). No equations, fitted parameters, self-citation load-bearing premises, or ansatzes appear in the provided text that would make any result equivalent to its inputs by construction. The framework is described as model-agnostic and compatible with prior methods rather than derived from them. This is the normal case of an applied system paper whose validity rests on external falsifiable testing rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework rests on the domain assumption that user instructions alone suffice to define all legitimate actions and that deviations can be classified without access to retrieved content.

axioms (1)

domain assumption User instructions alone contain enough information to enumerate a complete reference set of valid actions.
Invoked by the Context Isolation principle and the design of the isolated Planner.

invented entities (2)

Isolated Planner no independent evidence
purpose: Generate reference valid actions exclusively from user instructions.
New architectural component introduced to enforce context isolation.
Intent Verifier no independent evidence
purpose: Distinguish benign parameter deviations from malicious hijacking.
New verification submodule within the Hierarchical Verification Mechanism.

pith-pipeline@v0.9.0 · 5496 in / 1211 out tokens · 38450 ms · 2026-05-10T16:14:04.698455+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

Reference graph

Works this paper leans on

21 extracted references · 11 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901
[2]

Toolformer: Language models can teach themselves to use tools,

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,”Advances in Neural Infor- mation Processing Systems, vol. 36, pp. 68 539–68 551, 2023

2023
[3]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations, 2022

2022
[4]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausmanet al., “Do as i can, not as i say: Grounding language in robotic affordances,”arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review arXiv 2022
[5]

Llm-planner: Few-shot grounded planning for embodied agents with large language models,

C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y . Su, “Llm-planner: Few-shot grounded planning for embodied agents with large language models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2998–3009

2023
[6]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” 2023. [Online]. Available: https://arxiv.org/abs/2302.12173

work page internal anchor Pith review arXiv 2023
[8]

Ignore previous prompt: Attack techniques for language models,

F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” 2022. [Online]. Available: https://arxiv.org/abs/2211. 09527

2022
[9]

Prompt Injection attack against LLM-integrated Applications

Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, L. Y . Zhang, and Y . Liu, “Prompt injection attack against llm-integrated applications,” 2025. [Online]. Available: https://arxiv.org/abs/2306.05499

work page internal anchor Pith review arXiv 2025
[10]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents,” 2024. [Online]. Available: https://arxiv.org/abs/2403.02691

work page internal anchor Pith review arXiv 2024
[11]

Fine-tuned deberta-v3 for prompt injection detection,

ProtectAI.com, “Fine-tuned deberta-v3 for prompt injection detection,” 2023. [Online]. Available: https://huggingface.co/ProtectAI/ deberta-v3-base-prompt-injection

2023
[12]

Detecting Language Model Attacks with Perplexity

G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,”arXiv preprint arXiv:2308.14132, 2023

work page arXiv 2023
[13]

Secalign: Defending against prompt injection with preference optimization,

S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wag- ner, and C. Guo, “Secalign: Defending against prompt injection with preference optimization,” inProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 2833–2847

2025
[14]

{StruQ}: Defending against prompt injection with structured queries,

S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “{StruQ}: Defending against prompt injection with structured queries,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2383–2400

2025
[15]

Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,

T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen, “Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,” inProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, 2023, pp. 431–445

2023
[16]

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

H. Wang, C. M. Poskitt, and J. Sun, “Agentspec: Customizable run- time enforcement for safe and reliable llm agents,”arXiv preprint arXiv:2503.18666, 2025

work page internal anchor Pith review arXiv 2025
[17]

Mitigating indirect prompt injection via instruction-following intent analysis,

M. Kang, C. Xiang, S. Kariyappa, C. Xiao, B. Li, and E. Suh, “Mitigating indirect prompt injection via instruction-following intent analysis,”arXiv preprint arXiv:2512.00966, 2025

work page arXiv 2025
[18]

Jailbroken: How does llm safety training fail?

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?”Advances in Neural Information Processing Sys- tems, vol. 36, pp. 80 079–80 110, 2023

2023
[19]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,”arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review arXiv 2023
[20]

Optimization-based prompt injection attack to llm-as-a-judge,

J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based prompt injection attack to llm-as-a-judge,” 2025. [Online]. Available: https://arxiv.org/abs/2403.17710

work page arXiv 2025
[21]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freireet al., “Open problems and fundamental limitations of reinforcement learning from human feedback,”arXiv preprint arXiv:2307.15217, 2023

work page internal anchor Pith review arXiv 2023
[22]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Chenget al., “Sleeper agents: Training deceptive llms that persist through safety training,” arXiv preprint arXiv:2401.05566, 2024

work page internal anchor Pith review arXiv 2024