Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Bo Li; Dacheng Tao; Fok Kar Wai; Kangjie Chen; Pin-Yu Chen; Tianwei Zhang; Vrizlynn L. L. Thing; Yiming Li; Yutong Wu; Zheyu Liu; Zihao Wang

arxiv: 2606.13385 · v1 · pith:OP22FPWEnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI · cs.CY · cs.HC · cs.MM

Pith reviewed 2026-06-27 06:18 UTC · model grok-4.3

keywords: prompt injection · web agents · LLM security · stakeholder analysis · adversarial benchmarking · AI agent safety · prompt attacks

The pith

Current web agents do not reliably resist any tested prompt-injection attack objective, and the resulting harms fall unevenly across stakeholders in distinct failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a stakeholder-centric benchmark for prompt-injection attacks on LLM-driven web agents that operate over untrusted content. It evaluates attacks by their effects on different parties such as users, sellers, and platforms, using both outcome metrics and process metrics to capture how the same injection can produce asymmetric results. Results show that no tested attack objective is reliably blocked, and that failures appear in three distinct patterns: stealthy parasitism where the attack succeeds without harming the user's task, misaligned disruption where the task fails but the attack does not succeed, and compounded failure where both occur together. These patterns are not captured by prior attack-centric evaluations that focus only on technical feasibility.

Core claim

Not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated).
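
The three modes named above can be read as a partition over two observations per attacked episode. The following is a minimal sketch of that reading, assuming each episode reduces to two booleans (whether the adversarial objective was achieved, and whether the user's delegated task was completed). The paper's actual outcome- and process-level metrics may be richer than this; the `FailureMode` enum, the `classify_episode` function, and the `RESISTED` label are hypothetical names introduced here for illustration.

```python
from enum import Enum


class FailureMode(Enum):
    # RESISTED is a label added for this sketch; the other three are the paper's terms.
    RESISTED = "resisted"
    STEALTHY_PARASITISM = "stealthy parasitism"
    MISALIGNED_DISRUPTION = "misaligned disruption"
    COMPOUNDED_FAILURE = "compounded failure"


def classify_episode(attack_succeeded: bool, task_completed: bool) -> FailureMode:
    """Map one attacked episode to a failure mode (assumes boolean outcomes)."""
    if attack_succeeded and task_completed:
        # Attack succeeds without disrupting the user's delegated task.
        return FailureMode.STEALTHY_PARASITISM
    if attack_succeeded:
        # Adversarial objective met and task integrity violated.
        return FailureMode.COMPOUNDED_FAILURE
    if not task_completed:
        # Task disrupted even though the attack goal was not met.
        return FailureMode.MISALIGNED_DISRUPTION
    return FailureMode.RESISTED


if __name__ == "__main__":
    for attack, task in [(True, True), (True, False), (False, False), (False, True)]:
        print(attack, task, classify_episode(attack, task).value)
```

A metric that records only `attack_succeeded` collapses the first two modes into one and cannot see the third at all, which is the gap the stakeholder-centric framing points at.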

What carries the argument

Stakeholder-centric benchmark that distinguishes affected entities such as user, seller, and platform, decomposes attacks into concrete objectives, and evaluates each with complementary outcome-level and process-level metrics.

If this is right

The same injection can succeed without disrupting the user's delegated task.
Agent task integrity can be violated without the adversarial objective being met.
Both the attack goal and task disruption can occur together.
Conventional technical-success metrics miss the distribution of harms across parties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could add multi-party harm tracking to agent evaluation pipelines.
Runtime safeguards might detect when an agent's behavior shifts into one of the three failure modes.
The benchmark method could apply to other LLM agent settings that interact with external untrusted data.
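
As a rough illustration of the first extension, multi-party harm tracking could start as a per-stakeholder tally kept alongside an existing evaluation loop. This sketch is not taken from the paper or its released code: the `tally_harm` function, the episode tuple layout, and the counter keys are all assumptions, and only the three stakeholder names come from the source.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Stakeholder categories named in the paper; everything else here is hypothetical.
STAKEHOLDERS = ("user", "seller", "platform")


def tally_harm(episodes: Iterable[Tuple[str, bool]]) -> Dict[str, Counter]:
    """Count episodes and successful attacks per targeted stakeholder.

    Each episode is assumed to be (targeted_stakeholder, attack_succeeded).
    """
    ledger: Dict[str, Counter] = defaultdict(Counter)
    for stakeholder, attack_succeeded in episodes:
        if stakeholder not in STAKEHOLDERS:
            raise ValueError(f"unknown stakeholder: {stakeholder}")
        ledger[stakeholder]["episodes"] += 1
        if attack_succeeded:
            ledger[stakeholder]["harmed"] += 1
    return dict(ledger)


if __name__ == "__main__":
    # Invented toy episodes, for illustration only.
    toy = [("user", True), ("user", False), ("seller", True)]
    for name, counts in tally_harm(toy).items():
        print(name, dict(counts))
```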

Load-bearing premise

The chosen set of stakeholders, attack objectives, and complementary outcome and process metrics sufficiently represent the distribution of real-world harms in deployed web agent systems.

What would settle it

Demonstration of at least one attack objective that multiple current agents resist across all tested stakeholder scenarios and metrics would contradict the claim of universal non-resistance.
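
The falsification condition can be phrased as a search over per-agent results. Below is a minimal sketch under stated assumptions: the nested-dictionary layout, the `resisted_objectives` function, and the 5% default threshold are hypothetical, since the page does not say how the paper stores results or what rate counts as "reliably" resisted.

```python
from typing import Dict, List

# Hypothetical layout: results[agent][objective] is a list of booleans,
# one per trial, True when the attack objective was achieved.
Results = Dict[str, Dict[str, List[bool]]]


def resisted_objectives(results: Results, max_success_rate: float = 0.05) -> List[str]:
    """Return objectives whose attack success rate is at or below the
    threshold for every agent. The threshold value is an assumption."""
    if not results:
        return []
    # Only objectives that every agent was tested on can qualify.
    shared = set.intersection(*(set(per_agent) for per_agent in results.values()))
    resisted = []
    for objective in sorted(shared):
        rates = []
        for per_agent in results.values():
            trials = per_agent[objective]
            if not trials:
                break  # no data for this agent, so the objective cannot qualify
            rates.append(sum(trials) / len(trials))
        else:
            if max(rates) <= max_success_rate:
                resisted.append(objective)
    return resisted


if __name__ == "__main__":
    # Invented toy data, not results from the paper.
    toy = {
        "agent_1": {"obj_a": [False] * 40, "obj_b": [True, False, False, False]},
        "agent_2": {"obj_a": [False] * 39 + [True], "obj_b": [False] * 4},
    }
    print(resisted_objectives(toy))
```

A non-empty return value on real benchmark output, across all stakeholder scenarios, would be the kind of evidence that contradicts the claim.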

Figures

Figures reproduced from arXiv: 2606.13385 by Bo Li, Dacheng Tao, Fok Kar Wai, Kangjie Chen, Pin-Yu Chen, Tianwei Zhang, Vrizlynn L. L. Thing, Yiming Li, Yutong Wu, Zheyu Liu, Zihao Wang.

**Figure 1.** Overview of StakeBench. The agent operates within an interactive shopping interface where adversarial content embedded in environment surfaces such as reviews and ratings may steer execution away from the user’s benign intent. Three stakeholder categories define the harm space (User, third-party Sellers, and the Platform), spanning 12 attack objectives realized by 22 reusable templates (9 DPI, 13 IPI) and …

**Figure 2.** Overview of the attack taxonomy in StakeBench. … serves as the primary evaluation channel, with DPI included as a reference condition. The complete threat-model specification is provided in Appendix B. Stakeholder-Oriented Attack Taxonomy. StakeBench organizes attacks along two axes: the stakeholder category they target and the harm objective they pursue, as presented in …

**Figure 3.** Failure patterns across attack objectives in …

**Figure 5.** Example of the visual manipulation used in the multimodal attack experiment.

**Figure 6.** Key pages of the OneStopMarket environment as observed by the agent during task …

**Figure 7.** Clean (left) and attacked (right) product pages for an E4 Order Tampering IPI case. Top …

Original abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce StakeBench, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes prompt injection around stakeholder harms and three failure modes, but the methodological details are missing, so the claims remain hard to judge.

This paper's main point is that prompt injection in web agents needs to be evaluated by who gets hurt and in what way, not just whether the attack succeeds. It introduces a benchmark that splits stakeholders into user, seller, and platform, ties attacks to concrete objectives for each, and tracks both outcome and process metrics.

The new piece is the three-mode taxonomy: stealthy parasitism (attack works but user task stays intact), misaligned disruption (task breaks but attack goal fails), and compounded failure (both happen). The results claim no objective is reliably blocked and that these modes appear across agents. That framing is absent from earlier attack-centric benchmarks, and it does surface a real gap—standard tests would miss cases where an agent completes the user's request while still causing harm elsewhere.

The open-sourced benchmark is a practical step that others can extend. The work also correctly notes that the same injection can produce asymmetric costs depending on the target stakeholder.

The soft spot is the missing experimental backbone. The abstract states the patterns but gives no agent count, task set, statistical checks, or validation of the metrics. The stakeholder and objective choices also lack any derivation from logs or surveys, so the representativeness concern stands: the reported heterogeneity could be an artifact of the slice chosen rather than a general property. Without those details the central claim cannot be assessed.

This is for people working on LLM agent security who want to think about real deployment costs. It deserves peer review because the question is worth asking and the benchmark could be refined, even if the current evidence is too thin to stand on its own.

Referee Report

3 major / 2 minor

Summary. The paper introduces StakeBench, a stakeholder-centric benchmark for prompt-injection attacks on LLM-driven web agents. It decomposes attacks by affected stakeholders (user, seller, platform), concrete objectives, and dual outcome/process metrics, claiming that current agents exhibit substantial heterogeneous vulnerabilities: no attack objective is reliably resisted, and failures span stealthy parasitism, misaligned disruption, and compounded failure. The benchmark and code are released publicly.

Significance. If the benchmark design holds, the work advances security evaluation of web agents by moving beyond attack-centric metrics to stakeholder-specific harm attribution, which is relevant for real deployments. The public release of the benchmark supports reproducibility and future extensions.

major comments (3)

[§3] §3 (Benchmark Construction): The stakeholder set (user/seller/platform) and attack objectives are introduced without derivation from deployment surveys, incident logs, or a coverage argument; if the selection is ad-hoc, the reported distribution across failure modes (stealthy parasitism etc.) may not generalize beyond the chosen slice.
[§5] §5 (Experimental Results): The claims of heterogeneous vulnerabilities and distinct failure modes rest on empirical runs, yet the section supplies no information on agent count, task diversity, number of trials per objective, or statistical tests used to validate metric distributions; this directly affects assessability of the central claim that failures are not reliably resisted.
[§4.2] §4.2 (Metrics): The outcome- and process-level metrics are defined to capture the three failure modes, but no validation (e.g., inter-rater agreement or correlation with real-world harm) is reported, leaving open whether the taxonomy reliably distinguishes the claimed modes.
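
One form the validation requested for §4.2 could take is an inter-rater agreement check on failure-mode labels. The sketch below computes Cohen's kappa for two annotators in plain Python. It is an illustration of the standard statistic, not a procedure the paper reports; the `cohens_kappa` function name and the label strings are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same episodes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of episodes where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented labels for illustration only.
    rater_1 = ["parasitism", "disruption", "compounded", "parasitism"]
    rater_2 = ["parasitism", "disruption", "parasitism", "parasitism"]
    print(round(cohens_kappa(rater_1, rater_2), 3))
```

A kappa near 1 would suggest the mode definitions are applied consistently; a value near 0 would suggest agreement no better than chance.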

minor comments (2)

[Figure 2] Figure 2 and Table 1 use overlapping color schemes that reduce readability when printed in grayscale.
[Abstract] The abstract states 'not a single attack objective is reliably resisted' but does not define the threshold for 'reliably' (e.g., success rate < X%); this should be stated explicitly in §5.
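
To make the threshold comment concrete, one option is to call an objective "reliably resisted" only when the upper bound of a confidence interval on the attack success rate falls below a stated value. The sketch below uses the Wilson score interval. The 5% threshold, the 95% confidence level (z = 1.96), and both function names are assumptions chosen for illustration, not values from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def reliably_resisted(successes: int, trials: int, threshold: float = 0.05) -> bool:
    """True when the interval's upper bound is below the (assumed) threshold."""
    _, upper = wilson_interval(successes, trials)
    return upper < threshold


if __name__ == "__main__":
    print(reliably_resisted(0, 20))   # few trials: upper bound stays above 5%
    print(reliably_resisted(0, 200))  # more trials: upper bound drops below 5%
```

Under this definition, zero successful attacks in 20 trials does not qualify while zero in 200 does, which shows why the trial count per objective matters for the headline claim.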

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, indicating planned revisions to improve clarity and rigor where appropriate.

Referee: [§3] §3 (Benchmark Construction): The stakeholder set (user/seller/platform) and attack objectives are introduced without derivation from deployment surveys, incident logs, or a coverage argument; if the selection is ad-hoc, the reported distribution across failure modes (stealthy parasitism etc.) may not generalize beyond the chosen slice.

Authors: The stakeholder categories reflect the primary entities in web transactions where agents act on behalf of users while interacting with sellers and platforms, as motivated in the introduction. Attack objectives were selected to represent distinct harm vectors for each stakeholder based on realistic injection scenarios. We agree a more explicit justification would strengthen the section. In revision we will add a short coverage argument in §3, referencing standard e-commerce stakeholder models from prior literature, while clarifying that the benchmark is not claimed to be exhaustive.

revision: partial

Referee: [§5] §5 (Experimental Results): The claims of heterogeneous vulnerabilities and distinct failure modes rest on empirical runs, yet the section supplies no information on agent count, task diversity, number of trials per objective, or statistical tests used to validate metric distributions; this directly affects assessability of the central claim that failures are not reliably resisted.

Authors: We acknowledge that §5 currently omits these experimental parameters. In the revised manuscript we will insert a dedicated experimental setup subsection reporting the number and types of agents evaluated, task diversity, trials per objective, and any statistical procedures used. This addition will directly support evaluation of the heterogeneity claims.

revision: yes

Referee: [§4.2] §4.2 (Metrics): The outcome- and process-level metrics are defined to capture the three failure modes, but no validation (e.g., inter-rater agreement or correlation with real-world harm) is reported, leaving open whether the taxonomy reliably distinguishes the claimed modes.

Authors: The metrics were intentionally defined around directly observable agent actions and outcomes to support scalable, automated assessment. No inter-rater or external harm-correlation validation was performed in this study. We will revise §4.2 to include an explicit discussion of the design rationale and to flag formal validation as an important direction for future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent metrics and results

The paper introduces a new stakeholder-centric benchmark for prompt injection attacks on web agents, defining categories (user/seller/platform), attack objectives, and dual outcome/process metrics. Results are obtained by running the benchmark on existing agents and reporting observed failure modes (stealthy parasitism, misaligned disruption, compounded failure). No equations, fitted parameters, predictions, or derivations are present that could reduce outputs to inputs by construction. No self-citation load-bearing steps or uniqueness theorems are invoked. The central claims rest on the experimental data collected under the new evaluation framework, which is externally falsifiable via the released benchmark code. This is a standard empirical contribution with no reduction to prior fitted values or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces a new evaluation framework resting on domain assumptions about how harms should be attributed; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Prompt injection attacks can be meaningfully decomposed into concrete objectives whose harms differ across stakeholders (user, seller, platform). This decomposition is the structural basis for the benchmark categories and metrics.


[1] [1]

S. Yao, H. Chen, J. Yang, K. Narasimhan, Webshop: Towards scalable real-world web interaction with grounded language agents, in: NeurIPS, 2022

2022

[2] [2]

X. Deng, Y . Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, Y . Su, Mind2web: Towards a generalist agent for the web, in: NeurIPS, 2023

2023

[3] [3]

L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, et al., A survey on large language model based autonomous agents, Frontiers of Computer Science 18 (6) (2024) 186345

2024

[4] [4]

Zheng, B

B. Zheng, B. Gou, J. Kil, H. Sun, Y . Su, Gpt-4v(ision) is a generalist web agent, if grounded, in: ICML, 2024

2024

[5] [5]

J. Y . Koh, R. Lo, L. Jang, V . Duvvur, M. Lim, P.-Y . Huang, G. Neubig, S. Zhou, R. Salakhutdinov, D. Fried, Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2024

2024

[6] [6]

T. Fang, H. Zhang, Z. Zhang, K. Ma, W. Yu, H. Mi, D. Yu, Webevolver: Enhancing web agent self- improvement with co-evolving world model, in: EMNLP, 2025

2025

[7] [7]

Ignore Previous Prompt: Attack Techniques For Language Models

F. Perez, I. Ribeiro, Ignore previous prompt: Attack techniques for language models, arXiv preprint arXiv:2211.09527 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: How does llm safety training fail?, in: NeurIPS, 2023

2023

[9] [9]

X. Wang, J. Bloch, Z. Shao, Y . Hu, S. Zhou, N. Z. Gong, Webinject: Prompt injection attack to web agents, in: EMNLP, 2025

2025

[10] [10]

J. Yi, Y . Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, F. Wu, Benchmarking and defending against indirect prompt injection attacks on large language models, in: ACM SIGKDD, 2025

2025

[11] [11]

Schmotz, S

D. Schmotz, S. Abdelnabi, M. Andriushchenko, Agent skills enable a new class of realistic and trivially simple prompt injections, arXiv preprint arXiv:2510.26328 (2025)

work page arXiv 2025

[12] [12]

A. Li, Y . Zhou, V . C. Raghuram, T. Goldstein, M. Goldblum, Commercial llm agents are already vulnerable to simple yet dangerous attacks, arXiv preprint arXiv:2502.08586 (2025)

work page arXiv 2025

[13] [13]

Y . Ruan, H. Dong, A. Wang, S. Pitis, Y . Zhou, J. Ba, Y . Dubois, C. J. Maddison, T. Hashimoto, Identifying the risks of lm agents with an lm-emulated sandbox, arXiv preprint arXiv:2309.15817 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Simple prompt injection attacks can leak personal data observed by llm agents during task execution,

M. Alizadeh, Z. Samei, D. Stetsenko, F. Gilardi, Simple prompt injection attacks can leak personal data observed by llm agents during task execution, arXiv preprint arXiv:2506.01055 (2025)

work page arXiv 2025

[15] [15]

C. Chen, Z. Zhang, I. Khalilov, B. Guo, S. A. Gebreegziabher, Y . Ye, Z. Xiao, Y . Yao, T. Li, T. J.-J. Li, Toward a human-centered evaluation framework for trustworthy llm-powered gui agents, arXiv preprint arXiv:2504.17934 (2025)

work page arXiv 2025

[16] [16]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, Y . Zhang, Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents, arXiv preprint arXiv:2410.02644 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Y . Liu, Y . Jia, R. Geng, J. Jia, N. Z. Gong, Formalizing and benchmarking prompt injection attacks and defenses, in: USENIX Security, 2024

2024

[18] [18]

I. Levy, B. Wiesel, S. Marreed, A. Oved, A. Yaeli, S. Shlomov, St-webagentbench: A benchmark for evaluating safety and trustworthiness in web agents, arXiv preprint arXiv:2410.06703 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

H. Li, R. Wen, S. Shi, N. Zhang, C. Xiao, Agentdyn: A dynamic open-ended benchmark for evaluating prompt injection attacks of real-world agent security system, arXiv preprint arXiv:2602.03117 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Y . Kaya, A. Landerer, S. Pletinckx, M. Zimmermann, C. Kruegel, G. Vigna, When ai meets the web: Prompt injection risks in third-party ai chatbot plugins, arXiv preprint arXiv:2511.05797 (2025)

work page arXiv 2025

[21] [21]

Y . Lyu, X. Zhang, L. Yan, M. de Rijke, Z. Ren, X. Chen, Deepshop: A benchmark for deep research shopping agents, arXiv preprint arXiv:2506.02839 (2025)

work page arXiv 2025

[22] [22]

J. Wang, K. Xiao, Q. Sun, H. Zhao, T. Luo, J. D. Zhang, X. Zeng, Shoppingbench: A real-world intent- grounded shopping benchmark for llm-based agents, in: AAAI, 2026. 11

2026

[23] [23]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y . Cao, React: Synergizing reasoning and acting in language models, arXiv preprint arXiv:2210.03629 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[24] [24]

Y . Song, F. F. Xu, S. Zhou, G. Neubig, Beyond browsing: Api-based web agents, in: Findings of the Association for Computational Linguistics: ACL 2025, 2025

2025

[25] [25]

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

I. Evtimov, A. Zharmagambetov, A. Grattafiori, C. Guo, K. Chaudhuri, Wasp: Benchmarking web agent security against prompt injection attacks, arXiv preprint arXiv:2504.18575 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

I. Gur, H. Furuta, A. Huang, M. Safdari, Y . Matsuo, D. Eck, A. Faust, A real-world webagent with planning, long context understanding, and program synthesis, arXiv preprint arXiv:2307.12856 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27]

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, D. Yu, Webvoyager: Building an end-to-end web agent with large multimodal models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

[28]

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, in: ACM workshop, 2023

[29]

J. Shi, Z. Yuan, G. Tie, P. Zhou, N. Z. Gong, L. Sun, Prompt injection attack to tool selection in llm agents, arXiv preprint arXiv:2504.19793 (2025)

[30]

P. Wang, X. Li, C. Xiang, J. Zhang, Y. Li, L. Zhang, X. Wang, Y. Tian, The landscape of prompt injection threats in llm agents: From taxonomy to analysis, arXiv preprint arXiv:2602.10453 (2026)

[31]

A. Zharmagambetov, C. Guo, I. Evtimov, M. Pavlova, R. Salakhutdinov, K. Chaudhuri, Agentdam: Privacy leakage evaluation for autonomous web agents, arXiv preprint arXiv:2503.09780 (2025)

[32]

T. Kuntz, A. Duzan, H. Zhao, F. Croce, Z. Kolter, N. Flammarion, M. Andriushchenko, Os-harm: A benchmark for measuring safety of computer use agents, arXiv preprint arXiv:2506.14866 (2025)

[33]

Q. Zhan, Z. Liang, Z. Ying, D. Kang, Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents, in: Findings of the Association for Computational Linguistics: ACL 2024, 2024

[34]

E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, F. Tramèr, Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents, in: NeurIPS, 2024

[35]

G. Syros, E. Rose, B. Grinstead, C. Kerschbaumer, W. Robertson, C. Nita-Rotaru, A. Oprea, Muzzle: Adaptive agentic red-teaming of web agents against indirect prompt injection attacks, arXiv preprint arXiv:2602.09222 (2026)

[36]

Nanobrowser Team, Nanobrowser: Open-source chrome extension for ai-powered web automation, https://github.com/nanobrowser/nanobrowser, version 0.1.13 (2025)

[37]

Browser Use Team, Browser use: Make websites accessible for ai agents, https://github.com/browser-use/browser-use, version 0.12.3 (2025)

[38]

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al., Openai gpt-5 system card, arXiv preprint arXiv:2601.03267 (2025)

[39]

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al., Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, arXiv preprint arXiv:2507.06261 (2025)

[40]

C. H. Wu, R. Shah, J. Y. Koh, R. Salakhutdinov, D. Fried, A. Raghunathan, Dissecting adversarial robustness of multimodal lm agents, arXiv preprint arXiv:2406.12814 (2024)

[41]

X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, P. Mittal, Visual adversarial examples jailbreak aligned large language models, in: AAAI, 2024

[42]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al., Webarena: A realistic web environment for building autonomous agents, arXiv preprint arXiv:2307.13854 (2023)

A Operational Definitions for Benchmark Comparison

To clarify the comparison presented in Table 5, we define each evaluation axis operationally ...