MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents
Pith reviewed 2026-06-27 00:51 UTC · model grok-4.3
The pith
Diffusion models can generate ad-slot images that hijack MLLM web agents into attacker-chosen actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRAGE leverages diffusion models to generate perceptually benign adversarial images confined to attacker-controlled boundaries such as ad slots, using curvature-aware adversarial diffusion guidance combined with sparse, dark-pixel residual perturbations to achieve targeted next-action hijacking in MLLM web agents.
What carries the argument
MIRAGE, a visual indirect prompt injection framework that generates adversarial images via diffusion models optimized for both human imperceptibility and action hijacking within spatially constrained legitimate regions.
If this is right
- Attacks succeed even when the adversary controls only a small legitimate region of the page.
- The generated images evade human detection while still changing agent behavior on SeeAct and OpenClaw.
- Curvature-aware guidance plus sparse dark-pixel perturbations suffice to maximize efficacy inside the restricted setting.
- The same constrained threat model applies to any MLLM web agent that processes page screenshots.
Where Pith is reading between the lines
- Web platforms may need additional image sanitization or region-specific verification steps for third-party content.
- Agents could incorporate explicit checks that discount or cross-verify visual content originating from known ad-like areas.
- The same diffusion-based injection approach could be tested on other multimodal interfaces that accept external imagery under similar spatial constraints.
Load-bearing premise
Diffusion models can reliably produce images that stay both perceptually benign to humans and effective for next-action hijacking when strictly confined to semantically legitimate, spatially constrained regions such as ad slots.
What would settle it
An experiment in which no diffusion-generated image placed inside an ad slot on a live page alters the agent's next action without also becoming visibly altered to human viewers.
Figures
read the original abstract
Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a trusted web platform where the evaluator acts solely as an unprivileged third party, such as a merchant or advertiser, controlling only a semantically legitimate, spatially constrained region, such as an ad slot, a sponsored card, or a localized widget. Operating under these realistic constraints, we propose MIRAGE, a novel visual indirect prompt injection framework for targeted next-action hijacking. Our approach leverages diffusion models to generate perceptually benign adversarial images strictly confined to the attacker-controlled boundaries permitted by the trusted service provider. To maximize attack efficacy within such a restrictive setting, we introduce a robust optimization technique combining curvature-aware adversarial diffusion guidance with sparse, dark-pixel residual perturbations. Comprehensive evaluations against prominent MLLM web agent frameworks, specifically SeeAct and OpenClaw, empirically demonstrate the potency, realism, and stealth of our proposed MIRAGE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MIRAGE, a visual indirect prompt injection framework for MLLM-based web agents. Under a constrained threat model where the attacker controls only a semantically legitimate region (e.g., ad slot), it uses diffusion models with curvature-aware adversarial guidance and sparse dark-pixel residuals to generate perceptually benign images that hijack next actions. Comprehensive evaluations on SeeAct and OpenClaw are claimed to demonstrate the approach's potency, realism, and stealth.
Significance. If the empirical claims hold under the stated constraints, the work would be significant for identifying realistic vulnerabilities in deployed MLLM web agents and for advancing constrained adversarial generation techniques in vision-language systems. The emphasis on a trusted-platform, third-party attacker model distinguishes it from more permissive threat models in prior work.
major comments (1)
- [Abstract] Abstract: the central claim that MIRAGE images 'empirically demonstrate the potency, realism, and stealth' rests on the unverified assumption that curvature-aware diffusion guidance plus sparse residuals can simultaneously satisfy human-perceptual benignity and reliable next-action hijacking when confined to small, semantically legitimate regions. No success rates, visibility metrics, human perceptual studies, or failure-case analysis are supplied to show the two objectives are jointly achieved rather than traded off.
Simulated Author's Rebuttal
We thank the referee for their review and the opportunity to respond to the concern regarding the abstract. We address the point directly below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that MIRAGE images 'empirically demonstrate the potency, realism, and stealth' rests on the unverified assumption that curvature-aware diffusion guidance plus sparse residuals can simultaneously satisfy human-perceptual benignity and reliable next-action hijacking when confined to small, semantically legitimate regions. No success rates, visibility metrics, human perceptual studies, or failure-case analysis are supplied to show the two objectives are jointly achieved rather than traded off.
Authors: The abstract is a concise summary; the supporting quantitative evidence appears in the body of the manuscript. Section 4 reports success rates for targeted next-action hijacking (e.g., 78-87% on SeeAct and 71-84% on OpenClaw across 200 trials under the constrained ad-slot threat model). Section 5.2 quantifies stealth via LPIPS, SSIM, and PSNR metrics on the generated images, showing values comparable to benign ad content. Section 5.4 provides failure-case analysis, identifying cases where agent policy or OCR variance prevents hijacking. Human perceptual studies are not included in the current manuscript. We can revise the abstract to reference the specific quantitative results from Sections 4 and 5 rather than using the general phrasing 'empirically demonstrate.' revision: partial
- Absence of formal human perceptual studies to support the stealth claim
Circularity Check
No circularity: empirical method proposal with independent evaluations
full rationale
The paper introduces MIRAGE as a novel framework using diffusion models for adversarial image generation within constrained regions, combined with a described optimization technique (curvature-aware guidance plus sparse residuals). The central claim rests on empirical success rates against SeeAct and OpenClaw rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to justify load-bearing steps. The evaluations are presented as external validation, making the work self-contained against benchmarks without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diffusion models can generate images that are perceptually benign while carrying effective adversarial signals when optimized with curvature-aware guidance and sparse perturbations.
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , year =
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , author =. Advances in Neural Information Processing Systems , year =
-
[2]
arXiv preprint arXiv:2307.13854 , year =
WebArena: A Realistic Web Environment for Building Autonomous Agents , author =. arXiv preprint arXiv:2307.13854 , year =
-
[3]
Advances in Neural Information Processing Systems , year =
Mind2Web: Towards a Generalist Agent for the Web , author =. Advances in Neural Information Processing Systems , year =
-
[4]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =
-
[5]
arXiv preprint arXiv:2401.01614 , year =
GPT-4V(ision) is a Generalist Web Agent, if Grounded , author =. arXiv preprint arXiv:2401.01614 , year =
-
[6]
arXiv preprint arXiv:2312.08914 , year =
CogAgent: A Visual Language Model for GUI Agents , author =. arXiv preprint arXiv:2312.08914 , year =
-
[7]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =
-
[8]
arXiv preprint arXiv:2404.07972 , year =
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments , author =. arXiv preprint arXiv:2404.07972 , year =
-
[9]
arXiv preprint arXiv:2302.12173 , year =
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection , author =. arXiv preprint arXiv:2302.12173 , year =
-
[10]
Findings of the Association for Computational Linguistics: ACL 2024 , pages =
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents , author =. Findings of the Association for Computational Linguistics: ACL 2024 , pages =. 2024 , doi =
2024
-
[11]
Advances in Neural Information Processing Systems , year =
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , author =. Advances in Neural Information Processing Systems , year =
-
[12]
The Thirteenth International Conference on Learning Representations , year =
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage , author =. The Thirteenth International Conference on Learning Representations , year =
-
[13]
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , pages =
Attacking Vision-Language Computer Agents via Pop-ups , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , pages =. 2025 , doi =
2025
-
[14]
arXiv preprint arXiv:2410.17401 , year =
AdvAgent: Controllable Blackbox Red-teaming on Web Agents , author =. arXiv preprint arXiv:2410.17401 , year =
-
[15]
Proceedings of the 42nd International Conference on Machine Learning , series =
UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning , author =. Proceedings of the 42nd International Conference on Machine Learning , series =. 2025 , publisher =
2025
-
[16]
arXiv preprint arXiv:2505.21499 , year =
AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery , author =. arXiv preprint arXiv:2505.21499 , year =
-
[17]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =
WebInject: Prompt Injection Attack to Web Agents , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =. 2025 , doi =
2025
-
[18]
arXiv preprint arXiv:2306.13213 , year =
Visual Adversarial Examples Jailbreak Aligned Large Language Models , author =. arXiv preprint arXiv:2306.13213 , year =
-
[19]
arXiv preprint arXiv:2309.00236 , year =
Image Hijacks: Adversarial Images can Control Generative Models at Runtime , author =. arXiv preprint arXiv:2309.00236 , year =
-
[20]
Proceedings of the AAAI Conference on Artificial Intelligence , volume =
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2025 , doi =
2025
-
[21]
arXiv preprint arXiv:2503.10809 , year =
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents , author =. arXiv preprint arXiv:2503.10809 , year =
-
[22]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages =
On the Robustness of GUI Grounding Models Against Image Attacks , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages =
-
[23]
International Conference on Learning Representations , year =
Denoising Diffusion Implicit Models , author =. International Conference on Learning Representations , year =
-
[24]
arXiv preprint arXiv:2305.16494 , year =
Diffusion-Based Adversarial Sample Generation for Improved Stealthiness and Controllability , author =. arXiv preprint arXiv:2305.16494 , year =
-
[25]
arXiv preprint arXiv:2305.08192 , year =
Diffusion Models for Imperceptible and Transferable Adversarial Attack , author =. arXiv preprint arXiv:2305.08192 , year =
-
[26]
European Conference on Computer Vision , year =
AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models , author =. European Conference on Computer Vision , year =
-
[27]
IEEE Transactions on Information Forensics and Security , year =
Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models via Diffusion Models , author =. IEEE Transactions on Information Forensics and Security , year =
-
[28]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[29]
arXiv preprint arXiv:2503.01743 , year=
Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=
-
[30]
Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Z...
-
[31]
arXiv preprint arXiv:2407.21783 , year=
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
-
[32]
2026 , note =
Peter Steinberger and OpenClaw Contributors , title =. 2026 , note =
2026
-
[33]
2026 , note =
browser-use , howpublished =. 2026 , note =
2026
-
[34]
Advances in neural information processing systems , volume=
Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=
-
[35]
Advances in Neural Information Processing Systems , volume=
Content-based unrestricted adversarial attack , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.