
arxiv: 2606.13385 · v1 · pith:OP22FPWEnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI · cs.CY · cs.HC · cs.MM

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Pith reviewed 2026-06-27 06:18 UTC · model grok-4.3

keywords prompt injection · web agents · LLM security · stakeholder analysis · adversarial benchmarking · AI agent safety · prompt attacks

The pith

Current web agents do not reliably resist any tested prompt-injection attack objective, and the resulting harms fall unevenly across stakeholders in three distinct failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a stakeholder-centric benchmark for prompt-injection attacks on LLM-driven web agents that operate over untrusted content. It evaluates attacks by their effects on different parties such as users, sellers, and platforms, using both outcome metrics and process metrics to capture how the same injection can produce asymmetric results. Results show that no tested attack objective is reliably blocked, and that failures appear in three distinct patterns: stealthy parasitism where the attack succeeds without harming the user's task, misaligned disruption where the task fails but the attack does not succeed, and compounded failure where both occur together. These patterns are not captured by prior attack-centric evaluations that focus only on technical feasibility.

Core claim

Not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated).
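The three failure modes can be read as a function of two per-episode outcomes: whether the adversarial objective was achieved and whether the user's delegated task stayed intact. The sketch below is only an illustration of that reading, not the benchmark's implementation. The names `FailureMode` and `classify_episode` are hypothetical, and the `RESISTED` label for the case where neither failure occurs is an assumption added here rather than a term from the paper.

```python
from enum import Enum


class FailureMode(Enum):
    RESISTED = "resisted"  # assumed label for "no failure"; not a term from the paper
    STEALTHY_PARASITISM = "stealthy parasitism"
    MISALIGNED_DISRUPTION = "misaligned disruption"
    COMPOUNDED_FAILURE = "compounded failure"


def classify_episode(attack_succeeded: bool, task_intact: bool) -> FailureMode:
    """Map two boolean outcomes of one agent run onto a failure mode."""
    if attack_succeeded and task_intact:
        # The attack succeeds without disrupting the user's delegated task.
        return FailureMode.STEALTHY_PARASITISM
    if attack_succeeded:
        # Both the adversarial objective and task integrity are violated.
        return FailureMode.COMPOUNDED_FAILURE
    if not task_intact:
        # The task is disrupted although the attack itself did not succeed.
        return FailureMode.MISALIGNED_DISRUPTION
    return FailureMode.RESISTED


if __name__ == "__main__":
    for attack in (True, False):
        for intact in (True, False):
            print(attack, intact, classify_episode(attack, intact).value)
```

An evaluation that records only attack success would merge stealthy parasitism with compounded failure and would not register misaligned disruption at all, which is the gap the stakeholder-centric framing targets.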

What carries the argument

Stakeholder-centric benchmark that distinguishes affected entities such as user, seller, and platform, decomposes attacks into concrete objectives, and evaluates each with complementary outcome-level and process-level metrics.

If this is right

  • The same injection can succeed without disrupting the user's delegated task.
  • Agent task integrity can be violated without the adversarial objective being met.
  • Both the attack goal and task disruption can occur together.
  • Conventional technical-success metrics miss the distribution of harms across parties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could add multi-party harm tracking to agent evaluation pipelines.
  • Runtime safeguards might detect when an agent's behavior shifts into one of the three failure modes.
  • The benchmark method could apply to other LLM agent settings that interact with external untrusted data.
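As one illustration of what multi-party harm tracking in an evaluation pipeline might look like, the sketch below tallies attack success and task disruption separately for each targeted stakeholder. Everything in it is hypothetical: the `Episode` record, its field names, and the `harm_rates` helper are assumptions made for this example and do not reflect the StakeBench code or data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run under attack (hypothetical record layout)."""
    stakeholder: str        # targeted party: "user", "seller", or "platform"
    attack_succeeded: bool  # adversarial objective achieved
    task_intact: bool       # user's delegated task completed as intended


def harm_rates(episodes):
    """Return {stakeholder: (attack_success_rate, task_disruption_rate)}."""
    # Per stakeholder: [episode count, attack successes, disrupted tasks].
    counts = defaultdict(lambda: [0, 0, 0])
    for ep in episodes:
        entry = counts[ep.stakeholder]
        entry[0] += 1
        entry[1] += int(ep.attack_succeeded)
        entry[2] += int(not ep.task_intact)
    return {s: (a / n, d / n) for s, (n, a, d) in counts.items()}


if __name__ == "__main__":
    runs = [
        Episode("user", True, True),
        Episode("user", False, False),
        Episode("seller", True, False),
    ]
    for stakeholder, (asr, disruption) in harm_rates(runs).items():
        print(f"{stakeholder}: attack success {asr:.2f}, task disruption {disruption:.2f}")
```

Reporting the two rates side by side per stakeholder keeps asymmetric outcomes visible instead of averaging them into a single score.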

Load-bearing premise

The chosen set of stakeholders, attack objectives, and complementary outcome and process metrics sufficiently represent the distribution of real-world harms in deployed web agent systems.

What would settle it

A demonstration that multiple current agents reliably resist at least one attack objective across all tested stakeholder scenarios and metrics would contradict the claim that no objective is reliably resisted.

Figures

Figures reproduced from arXiv: 2606.13385 by Bo Li, Dacheng Tao, Fok Kar Wai, Kangjie Chen, Pin-Yu Chen, Tianwei Zhang, Vrizlynn L. L. Thing, Yiming Li, Yutong Wu, Zheyu Liu, Zihao Wang.

Figure 1: Overview of StakeBench. The agent operates within an interactive shopping interface where adversarial content embedded in environment surfaces such as reviews and ratings may steer execution away from the user’s benign intent. Three stakeholder categories define the harm space (User, third-party Sellers, and the Platform), spanning 12 attack objectives realized by 22 reusable templates (9 DPI, 13 IPI) and …

Figure 2: Overview of the attack taxonomy in StakeBench.
… serves as the primary evaluation channel, with DPI included as a reference condition. The complete threat-model specification is provided in Appendix B. Stakeholder-Oriented Attack Taxonomy. StakeBench organizes attacks along two axes: the stakeholder category they target and the harm objective they pursue, as presented in …

Figure 3: Failure patterns across attack objectives in …

Figure 5: Example of the visual manipulation used in the multimodal attack experiment.

Figure 6: Key pages of the OneStopMarket environment as observed by the agent during task …

Figure 7: Clean (left) and attacked (right) product pages for an E4 Order Tampering IPI case. Top …
Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce StakeBench, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces StakeBench, a stakeholder-centric benchmark for prompt-injection attacks on LLM-driven web agents. It decomposes attacks by affected stakeholders (user, seller, platform), concrete objectives, and dual outcome/process metrics, claiming that current agents exhibit substantial heterogeneous vulnerabilities: no attack objective is reliably resisted, and failures span stealthy parasitism, misaligned disruption, and compounded failure. The benchmark and code are released publicly.

Significance. If the benchmark design holds, the work advances security evaluation of web agents by moving beyond attack-centric metrics to stakeholder-specific harm attribution, which is relevant for real deployments. The public release of the benchmark supports reproducibility and future extensions.

major comments (3)
  1. [§3] §3 (Benchmark Construction): The stakeholder set (user/seller/platform) and attack objectives are introduced without derivation from deployment surveys, incident logs, or a coverage argument; if the selection is ad-hoc, the reported distribution across failure modes (stealthy parasitism etc.) may not generalize beyond the chosen slice.
  2. [§5] §5 (Experimental Results): The claims of heterogeneous vulnerabilities and distinct failure modes rest on empirical runs, yet the section supplies no information on agent count, task diversity, number of trials per objective, or statistical tests used to validate metric distributions; this directly affects assessability of the central claim that failures are not reliably resisted.
  3. [§4.2] §4.2 (Metrics): The outcome- and process-level metrics are defined to capture the three failure modes, but no validation (e.g., inter-rater agreement or correlation with real-world harm) is reported, leaving open whether the taxonomy reliably distinguishes the claimed modes.
minor comments (2)
  1. [Figure 2] Figure 2 and Table 1 use overlapping color schemes that reduce readability when printed in grayscale.
  2. [Abstract] The abstract states 'not a single attack objective is reliably resisted' but does not define the threshold for 'reliably' (e.g., success rate < X%); this should be stated explicitly in §5.
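
One way to make "reliably resisted" operational, in the spirit of the comments on trial counts and thresholds, is to require that the upper confidence bound on an objective's attack success rate fall below a stated cutoff. The sketch below uses the standard Wilson score interval for a binomial proportion. The 5% cutoff, the function names, and the choice of the Wilson interval are assumptions made for illustration; the paper does not specify a threshold or a statistical procedure.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def reliably_resisted(attack_successes: int, trials: int,
                      max_asr: float = 0.05, z: float = 1.96) -> bool:
    """True if the upper bound on attack success rate is below max_asr.

    max_asr = 0.05 is a hypothetical cutoff chosen for this example only.
    """
    _, upper = wilson_interval(attack_successes, trials, z)
    return upper < max_asr


if __name__ == "__main__":
    for successes, trials in [(0, 100), (3, 100), (0, 20)]:
        low, high = wilson_interval(successes, trials)
        print(successes, trials, f"[{low:.3f}, {high:.3f}]",
              reliably_resisted(successes, trials))
```

Under this criterion a small number of trials cannot establish resistance even with zero observed attack successes, which is why the number of trials per objective matters for assessing the central claim.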

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, indicating planned revisions to improve clarity and rigor where appropriate.

  1. Referee: [§3] §3 (Benchmark Construction): The stakeholder set (user/seller/platform) and attack objectives are introduced without derivation from deployment surveys, incident logs, or a coverage argument; if the selection is ad-hoc, the reported distribution across failure modes (stealthy parasitism etc.) may not generalize beyond the chosen slice.

    Authors: The stakeholder categories reflect the primary entities in web transactions where agents act on behalf of users while interacting with sellers and platforms, as motivated in the introduction. Attack objectives were selected to represent distinct harm vectors for each stakeholder based on realistic injection scenarios. We agree a more explicit justification would strengthen the section. In revision we will add a short coverage argument in §3, referencing standard e-commerce stakeholder models from prior literature, while clarifying that the benchmark is not claimed to be exhaustive. revision: partial

  2. Referee: [§5] §5 (Experimental Results): The claims of heterogeneous vulnerabilities and distinct failure modes rest on empirical runs, yet the section supplies no information on agent count, task diversity, number of trials per objective, or statistical tests used to validate metric distributions; this directly affects assessability of the central claim that failures are not reliably resisted.

    Authors: We acknowledge that §5 currently omits these experimental parameters. In the revised manuscript we will insert a dedicated experimental setup subsection reporting the number and types of agents evaluated, task diversity, trials per objective, and any statistical procedures used. This addition will directly support evaluation of the heterogeneity claims. revision: yes

  3. Referee: [§4.2] §4.2 (Metrics): The outcome- and process-level metrics are defined to capture the three failure modes, but no validation (e.g., inter-rater agreement or correlation with real-world harm) is reported, leaving open whether the taxonomy reliably distinguishes the claimed modes.

    Authors: The metrics were intentionally defined around directly observable agent actions and outcomes to support scalable, automated assessment. No inter-rater or external harm-correlation validation was performed in this study. We will revise §4.2 to include an explicit discussion of the design rationale and to flag formal validation as an important direction for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent metrics and results

The paper introduces a new stakeholder-centric benchmark for prompt injection attacks on web agents, defining categories (user/seller/platform), attack objectives, and dual outcome/process metrics. Results are obtained by running the benchmark on existing agents and reporting observed failure modes (stealthy parasitism, misaligned disruption, compounded failure). No equations, fitted parameters, predictions, or derivations are present that could reduce outputs to inputs by construction. No self-citation load-bearing steps or uniqueness theorems are invoked. The central claims rest on the experimental data collected under the new evaluation framework, which is externally falsifiable via the released benchmark code. This is a standard empirical contribution with no reduction to prior fitted values or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new evaluation framework resting on domain assumptions about how harms should be attributed; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Prompt injection attacks can be meaningfully decomposed into concrete objectives whose harms differ across stakeholders (user, seller, platform).
    This decomposition is the structural basis for the benchmark categories and metrics.

pith-pipeline@v0.9.1-grok · 5871 in / 1268 out tokens · 25258 ms · 2026-06-27T06:18:35.158107+00:00 · methodology

