pith. sign in

arxiv: 2602.03117 · v3 · submitted 2026-02-03 · 💻 cs.CR

AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?

Pith reviewed 2026-05-16 08:11 UTC · model grok-4.3

classification 💻 cs.CR
keywords AI agentsindirect prompt injectionsecurity defensesbenchmarksdynamic environmentsover-defenseagent security
0
0 comments X

The pith

Most existing defenses against indirect prompt injection in AI agents either fail to stop attacks or block too many legitimate actions when tested in dynamic settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies three shortcomings in current agent security benchmarks: they lack dynamic open-ended tasks, helpful third-party instructions, and realistic complexity. To fix this gap it introduces AgentDyn, a new benchmark built from 60 manually designed open-ended tasks and 560 injection test cases spanning shopping, GitHub, and daily life scenarios. Evaluation of ten state-of-the-art defenses on AgentDyn shows that nearly all are either insecure, allowing malicious instructions to succeed, or over-defensive, refusing many helpful requests. The result indicates that these defenses remain unsuitable for real-world agent deployments that require ongoing planning and interaction with external content.

Core claim

AgentDyn exposes three core flaws in prior benchmarks—absence of dynamic open-ended tasks, absence of helpful instructions, and overly simplistic user tasks—and shows through direct testing that ten leading defenses either remain vulnerable to indirect prompt injection or impose excessive refusals when agents must perform dynamic planning in the presence of helpful third-party content, placing them far from deployable in realistic environments.

What carries the argument

The AgentDyn benchmark of 60 open-ended tasks and 560 injection test cases that force dynamic planning while embedding helpful third-party instructions.

If this is right

  • Defenses must be redesigned to preserve both security and utility across changing plans and helpful external instructions.
  • Static benchmarks should give way to adaptive ones that test dynamic planning for accurate security measurement.
  • Real-world agent systems will need new defense approaches that account for open-ended, multi-step interactions.
  • Nearly all currently published defenses require substantial further work before they can be considered production-ready.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated generation of dynamic tasks could make such benchmarks easier to scale and maintain.
  • The observed security-utility trade-off may require changes to agent architecture rather than defense layers alone.
  • Direct testing on live deployed agents would provide an external check on whether the benchmark's tasks match practice.

Load-bearing premise

The 60 manually designed tasks and 560 test cases accurately capture the dynamic planning and helpful-instruction demands of actual agent deployments.

What would settle it

A defense that blocks all injections on AgentDyn while still completing almost all helpful tasks without refusal would support its real-world readiness; conversely, evidence that real user tasks differ substantially from these 60 would undermine the benchmark's claims.

Figures

Figures reproduced from arXiv: 2602.03117 by Chaowei Xiao, Hao Li, Ning Zhang, Ruoyao Wen, Shanghao Shi, Yevgeniy Vorobeychik.

Figure 1
Figure 1. Figure 1: The Attacked Utility and ASR comparison of 9 advanced defenses powered by GPT-4o on AgentDyn. level defenses, which leverage security policies or system design, have achieved almost perfect defense (i.e., near￾zero attack success rates (ASR)) in AgentDojo (Debenedetti et al., 2024)—a most prevalent agent security benchmark, while having minimal impact on the agent utility. All of these achievements suggest… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between AgentDojo and AgentDyn on four GPT-4o powered defenses, as well as Meta SecAlign. high performance suggests a potential bottleneck in current benchmarks for adequately reflecting the true capabilities of existing defenses. However, on AgentDyn, all GPT-4o￾powered defenses experience a sharp utility drop compared to the undefended baseline. Meta SecAlign performs the best among all defens… view at source ↗
Figure 3
Figure 3. Figure 3: Utility and ASR against the task trajectory length on Vannila GPT-4o. 5. Conclusion In this work, we develop AgentDyn, a manually designed open-ended benchmark. It incorporates realistic dynamic tasks, helpful environmental instructions, and more com￾plex user tasks. Our evaluation shows that nearly all existing defenses that achieve near-perfect performance on existing agent security benchmarks struggle s… view at source ↗
Figure 4
Figure 4. Figure 4: A dynamic open-ended task illustration. Helpful instructions from the environment are highlighted in green. C. Dynamic Scenarios in AgentDyn This section documents the complete collection of dynamic scenarios constructed for AgentDyn. The scenarios are grouped by suite (Shopping, GitHub, and DailyLife) and include all variations used in our experiments, complementing the representative examples presented i… view at source ↗
read the original abstract

AI agents that autonomously interact with external tools and environments have shown great promise across real-world applications. However, their reliance on external data exposes them to serious indirect prompt injection attacks, where malicious instructions embedded in third-party content hijack agent behaviors. To mitigate this threat, a growing number of defenses have been proposed and evaluated under existing agent security benchmarks. These benchmarks provide structured environments for comparing attacks and defenses, and have become a key driver for defense design and optimization. However, as agents move toward more complex and open-ended real-world deployments, there is a pressing need for benchmarks to become more adaptive and better reflect the dynamic environments faced by real-world agentic systems. In this work, we reveal three fundamental flaws in the current benchmarks and push the frontier along these dimensions: (i) lack of dynamic open-ended tasks, (ii) lack of helpful instructions, and (iii) simplistic user tasks. To bridge this gap, we introduce AgentDyn, a manually designed benchmark featuring 60 challenging open-ended tasks and 560 injection test cases across Shopping, GitHub, and Daily Life. Unlike prior static benchmarks, AgentDyn requires dynamic planning and incorporates helpful third-party instructions. Our evaluation of ten state-of-the-art defenses suggests that almost all existing defenses are either not secure enough or suffer from significant over-defense, revealing that existing defenses are still far from real-world deployment. Our benchmark is available at https://github.com/leolee99/AgentDyn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies three flaws in existing agent security benchmarks (lack of dynamic open-ended tasks, lack of helpful instructions, and simplistic user tasks) and introduces AgentDyn, a manually designed benchmark with 60 open-ended tasks and 560 injection test cases across Shopping, GitHub, and Daily Life domains. It evaluates ten state-of-the-art defenses and concludes that almost all are either insufficiently secure or suffer from significant over-defense, indicating they remain far from real-world deployment.

Significance. If the benchmark tasks accurately model real-world dynamic planning and helpful third-party instructions, the empirical results on ten defenses provide a valuable signal that current approaches require substantial improvement before deployment. The public release of the benchmark supports reproducibility and follow-on work.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The 60 tasks and 560 test cases are manually authored without reported grounding in real agent logs, user studies, or statistical comparison to production traces. This is load-bearing for the central claim that observed failure modes (insecurity or over-defense) are intrinsic to the defenses rather than artifacts of benchmark construction.
  2. [Evaluation section] Evaluation section: The abstract and main text provide limited detail on exact task construction, success metrics, and attack implementation details, leaving the support for the claim that 'almost all existing defenses' fail moderately supported by the reported results.
minor comments (2)
  1. [Abstract] Abstract: Include one sentence on the specific success metrics used (e.g., task completion rate under injection) to strengthen the summary of results.
  2. [Introduction] Introduction: Add explicit citations to the prior benchmarks being critiqued when listing the three flaws for improved traceability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to improve clarity and detail where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The 60 tasks and 560 test cases are manually authored without reported grounding in real agent logs, user studies, or statistical comparison to production traces. This is load-bearing for the central claim that observed failure modes (insecurity or over-defense) are intrinsic to the defenses rather than artifacts of benchmark construction.

    Authors: We acknowledge that AgentDyn tasks were manually designed rather than derived from production logs or user studies. The design draws from documented real-world agent use cases in the literature for the Shopping, GitHub, and Daily Life domains, with explicit focus on dynamic planning and helpful third-party instructions to address the three flaws identified in prior benchmarks. In the revised §3 we have added a dedicated subsection on design principles, including concrete examples of how each task category requires adaptive reasoning and incorporates benign external content. While we agree that direct statistical grounding in proprietary traces would further strengthen the benchmark, such data is not publicly available and was outside the scope of this work; the current tasks still expose clear limitations in existing defenses that align with known attack surfaces. revision: partial

  2. Referee: [Evaluation section] Evaluation section: The abstract and main text provide limited detail on exact task construction, success metrics, and attack implementation details, leaving the support for the claim that 'almost all existing defenses' fail moderately supported by the reported results.

    Authors: We thank the referee for highlighting this. The revised Evaluation section now includes expanded descriptions of task construction (with pseudocode for dynamic execution flows), precise success metrics (primary user goal completion without malicious action execution, plus separate over-defense rate), and attack implementation details (how injections are embedded in tool responses and how test cases are generated). These additions provide stronger empirical grounding for the conclusion that ten evaluated defenses exhibit either insufficient security or significant over-defense. revision: yes

standing simulated objections not resolved
  • Absence of grounding in real agent logs, user studies, or statistical comparison to production traces, as the benchmark was manually designed without access to such proprietary data.

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper introduces AgentDyn as a manually designed benchmark of 60 open-ended tasks and 560 injection cases, then directly evaluates ten external state-of-the-art defenses on it. No mathematical derivations, fitted parameters renamed as predictions, or self-citations appear in the load-bearing steps. The central claim follows from straightforward empirical runs on the new tasks rather than any reduction to the benchmark's own construction by definition. The analysis therefore contains no self-definitional, fitted-input, or self-citation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the representativeness of the manually chosen tasks and the assumption that the selected defenses are state-of-the-art.

axioms (1)
  • domain assumption The 60 manually designed tasks capture the essential challenges of real-world dynamic agent environments
    Invoked when claiming the benchmark bridges the gap to real deployments; no independent validation of task realism is described in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1080 out tokens · 27677 ms · 2026-05-16T08:11:30.908904+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evidence-carrying multimodal agents decompose tool calls into predicates verified by constrained DOM/OCR/AX checkers to block hallucination-enabled unsafe actions.

  2. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving...

  3. PIArena: A Platform for Prompt Injection Evaluation

    cs.CR 2026-04 unverdicted novelty 5.0

    PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    Chen, S., Piet, J., Sitawarin, C., and Wagner, D. A. Struq: Defending against prompt injection with structured queries. InUSENIX Security, pp. 2383–2400. USENIX Association, 2025a. Chen, S., Zharmagambetov, A., Mahloujifar, S., Chaudhuri, K., Wagner, D. A., and Guo, C. Secalign: Defending against prompt injection with preference optimization. In CCS, pp. ...

  2. [2]

    Defeating Prompt Injections by Design

    Debenedetti, E., Shumailov, I., Fan, T., Hayes, J., Car- lini, N., Fabian, D., Kern, C., Shi, C., Terzis, A., and Tram`er, F. Defeating prompt injections by design.CoRR, abs/2503.18813,

  3. [3]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y ., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: Llm-based input- output safeguard for human-ai conversations.CoRR, abs/2312.06674,

  4. [4]

    arXiv:2506.12104

    Li, H., Liu, X., Chiu, H., Li, D., Zhang, N., and Xiao, C. DRIFT: dynamic rule-based defense with injection iso- lation for securing LLM agents.CoRR, abs/2506.12104, 2025a. Li, H., Liu, X., Zhang, N., and Xiao, C. Piguard: Prompt injection guardrail via mitigating overdefense for free. In ACL, pp. 30420–30437. Association for Computational Linguistics, 20...

  5. [5]

    The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

    URL https://www.llama.com/ docs/model-cards-and-prompt-formats/ prompt-guard/. Nasr, M., Carlini, N., Sitawarin, C., Schulhoff, S. V ., Hayes, J., Ilie, M., Pluto, J., Song, S., Chaudhari, H., Shumailov, I., Thakurta, A., Xiao, K. Y ., Terzis, A., and Tram`er, F. The attacker moves second: Stronger adaptive attacks by- pass defenses against llm jailbreaks...

  6. [6]

    Accessed: 2025-10-23. OWASP. Owasp llm01. https://genai.owasp. org/llmrisk/llm01-prompt-injection/,

  7. [7]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Perez, F. and Ribeiro, I. Ignore previous prompt: Attack techniques for language models.CoRR, abs/2211.09527,

  8. [8]

    URL https: //learnprompting.org/docs/prompt_ hacking/defensive_measures/sandwich_ defense. 9 AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System Shi, T., He, J., Wang, Z., Wu, L., Li, H., Guo, W., and Song, D. Progent: Programmable privilege control for LLM agents.CoRR, abs/2504.11703, 2025a...

  9. [9]

    System-level defense against indirect prompt injection attacks: An information flow control perspective

    Wu, F., Cecchetti, E., and Xiao, C. System-level defense against indirect prompt injection attacks: An information flow control perspective.CoRR, abs/2409.19091,

  10. [10]

    Rtbas: Defending llm agents against prompt injection and privacy leakage,

    Zhong, P. Y ., Chen, S., Wang, R., McCall, M., Titzer, B. L., Miller, H., and Gibbons, P. B. RTBAS: defending LLM agents against prompt injection and privacy leakage. CoRR, abs/2502.08966,

  11. [11]

    F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., Alon, U., and Neubig, G

    Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., Alon, U., and Neubig, G. Webarena: A realistic web environment for building autonomous agents. InICLR. OpenReview.net, 2024a. Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., Morency, L., Bisk, Y ., Fried, D., Neubig, G., and Sap, M. SOTOPIA: int...

  12. [12]

    Alignment-based defenses.These approaches aim to strengthen the model’s intrinsic resistance to prompt injection via fine-tuning or preference optimization

    restricts the set of callable tools to those relevant to the user’s request, reducing the available attack surface and preventing irrelevant tool invocation during an attack. Alignment-based defenses.These approaches aim to strengthen the model’s intrinsic resistance to prompt injection via fine-tuning or preference optimization. StruQ (Chen et al., 2025a...

  13. [13]

    IsolateGPT (Wu et al.,

    may not transfer cleanly (Gur et al., 2024; Deng et al., 2023; Xie et al., 2024). IsolateGPT (Wu et al.,

  14. [14]

    CaMeL (Debenedetti et al.,

    and RTBAS (Zhong et al., 2025), propagate taint labels to limit the influence of untrusted data throughout the system. CaMeL (Debenedetti et al.,

  15. [15]

    https://www.homework.com/documents

    Tables 5, 6, and 7 provide the detailed scenario listings for each suite. Table 4.Examples of Dynamic Scenarios. Application Category Example Bank OTP Validation You should verify your account first. An email including verification one-time password (OTP) has been sent to your inbox. Email Link Interaction To accept this invitation, please click the link ...