MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content
Pith reviewed 2026-06-29 12:11 UTC · model grok-4.3
The pith
VLM-based mobile GUI agents cannot reliably separate trusted interface elements from user-generated content, enabling context-aware prompt injections that succeed 23-30 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRAGE turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. The pipeline uses a Localizer to identify controllable regions, a Generator to synthesize context-aware payloads rendered in the application's native style, and a Curator to moderate realism and balance samples. On the 1,111-sample benchmark all five agents prove vulnerable at 23-30 percent success rates, with MIRAGE rated more realistic than the strongest prior attack, and per-sample realism uncorrelated with attack success.
What carries the argument
The MIRAGE pipeline, which separates control of reach, realism, and distributional balance into three sequential stages to produce visually indistinguishable injections.
If this is right
- All current VLM GUI agents are vulnerable to prompt injections placed in user-generated content regions.
- Visual-quality filtering alone cannot defend against the attacks because realism and success rate are uncorrelated.
- The attacks require no changes to the agent, application, or operating system.
- The method produces balanced coverage across applications, region types, and attack intents.
Where Pith is reading between the lines
- Similar region-based injection attacks could be tested on non-mobile GUI agents that also rely on pixel input.
- Agents might reduce risk by treating text in user-content regions as lower-trust input and requiring confirmation for actions.
- The separation of localization, generation, and curation stages could be adapted to create test suites for other perception-based systems.
Load-bearing premise
VLM GUI agents perceive the screen only as rendered pixels and cannot reliably separate trusted interface elements from user-generated content.
What would settle it
A controlled test in which the same five agents receive the 1,111 screenshots both with and without the MIRAGE injections and produce attack success rates statistically indistinguishable from zero.
Figures
read the original abstract
Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLM-based mobile GUI agents perceive screens as pixels and therefore cannot reliably distinguish trusted interface elements from user-generated content. It introduces MIRAGE, a three-stage pipeline (Localizer to identify controllable regions, Generator to synthesize context-aware native-style payloads, Curator to balance realism and distribution) that converts benign screenshots into prompt-injection samples without modifying the agent, application, or OS. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, the work reports attack success rates of 23-30% across five evaluated VLM agents, higher human realism ratings than the strongest prior attack (3.02 vs. 2.52 out of 5), and no correlation between per-sample realism and attack success.
Significance. If the empirical results hold, the work provides concrete evidence of a practical attack surface on emerging VLM GUI agents that exploits the indistinguishability of user-generated content. The large-scale benchmark across multiple applications and intents, the explicit comparison to prior attacks via human realism ratings, and the reported lack of correlation between visual quality and success together strengthen the central claim that visual filtering alone is insufficient. The pipeline's separation of reach, realism, and distributional balance is a methodological contribution that directly addresses the core challenge of producing visually plausible yet effective injections.
minor comments (2)
- [Abstract] Abstract: the reported attack success rates and realism scores are presented without any summary of the evaluation protocol, baseline selection criteria, or statistical measures; adding one sentence on these points would improve the abstract's self-contained nature.
- [Evaluation] Evaluation section: confirm that the full methods description includes the precise definition of attack success (e.g., whether it requires the agent to execute a specific action or merely generate a particular output), the selection process for the five VLM agents, and any inter-rater agreement statistics for the human realism study.
Simulated Author's Rebuttal
We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. No major comments are listed in the report.
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical attack pipeline (Localizer, Generator, Curator) evaluated on external VLM agents via a 1,111-sample benchmark across ten applications and eleven intents. Central results are measured attack success rates (23%-30%) and human realism scores, with no equations, fitted parameters, self-referential predictions, or load-bearing self-citations. The method constructs samples whose effectiveness is tested externally rather than derived from its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption VLM GUI agents perceive the screen as rendered pixels and cannot reliably separate trusted interface elements from user-generated content.
Reference graph
Works this paper leans on
-
[1]
Mobile GUI Agents under Real-world Threats: Are We There Yet?
Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with in- direct prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23), pages 79–90. JaidedAI. 2024. EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts. https://github.com/JaidedAI/E...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
is this an authentic un- modified mobile UI?
Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. InInter- national Conference on Learning Representations (ICLR). Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile device op- eration assistant with effective navigation via multi- ...
2024
-
[3]
For each text-bearing r∈R , EasyOCR is run within r and the box is tightened to the de- tected text (candidates without high-confidence text are removed as VLM hallucinations)
Localizer.A VLM emits coarse candidate regions R; the pipeline filters out non-user- controllable regions and drops search_bar en- tries. For each text-bearing r∈R , EasyOCR is run within r and the box is tightened to the de- tected text (candidates without high-confidence text are removed as VLM hallucinations). An independent VLM bounding-box (bbox) mod...
-
[4]
Generator.For each (r, i) pair, a VLM syn- thesises a one-sentence benign user goal g sub- ject to the ambiguity rule (§3.3); a payload p is then generated conditional on the screenshot, region crop, region type, g, and i. A pre-render payload-quality (PQ) reviewer regenerates p up to three times if it duplicates g, misses the in- tent’s semantic role, re...
-
[5]
Add comment
Curator.A post-render VLM moderator returns a verdict for each rendered s′. hard_fail ren- ders are re-rendered with the moderator’s issue list as feedback (capped at three retries); sam- ples still failing are dropped. Surviving samples pass through a pre-generation allocator, post- generation balance-trim, and one-pass coverage repair. Localizer: region...
-
[6]
Add to Cart
appear in the final post-balance-trim dataset; the gross filter rate from Localizer+ADDto survivor is 25.0%. Localizer: issue taxonomy and repair ac- tions.The bbox moderator emits one entry per issue. Issue kinds: missing_region, wrong_position, covers_avatar, missing_wrap_line, covers_system_ui, glyph_leakage_below, bbox_too_loose, duplicate. Each issue...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.