WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

· 2026 · cs.CR · arXiv 2604.12284

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Web agents powered by vision-language models (VLMs) enable autonomous interaction with web environments by perceiving and acting on both visual and textual webpage content to accomplish user-specified tasks. However, they are highly vulnerable to prompt injection attacks, where adversarial instructions embedded in HTML or rendered screenshots can manipulate agent behavior and lead to harmful outcomes such as information leakage. Existing defenses, including system prompt defenses and direct fine-tuning of agents, have shown limited effectiveness. To address this issue, we propose a defense framework in which a web agent operates in parallel with a dedicated guard agent, decoupling prompt injection detection from the agent's own reasoning. Building on this framework, we introduce WebAgentGuard, a reasoning-driven, multimodal guard model for prompt injection detection. We construct a synthetic multimodal dataset using GPT-5 spanning 164 topics and 230 visual and UI design styles, and train the model via reasoning-intensive supervised fine-tuning followed by reinforcement learning. Experiments across multiple benchmarks show that WebAgentGuard consistently outperforms strong baselines while preserving agent utility, without introducing additional latency.

representative citing papers

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

cs.CR · 2026-05-14 · unverdicted · novelty 5.0

WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.

citing papers explorer

Showing 1 of 1 citing paper.

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections cs.CR · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

fields

years

verdicts

representative citing papers

citing papers explorer