MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Gelei Deng; Ji Jie; Leo Yu Zhang; Lida Zhao; Ruoqi Guo; Yiheng Xiong; Yi Liu; Ying Zhang; Yuekang Li; Yuxiao Lu

arXiv: 2605.28116 · v1 · pith:6ACXHPBEnew · submitted 2026-05-27 · 💻 cs.CR · cs.AI · cs.CL

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

Pith reviewed 2026-06-29 12:11 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL

keywords prompt injection · GUI agents · vision-language models · mobile security · adversarial examples · user-generated content · context-aware attacks · VLM vulnerabilities

The pith

VLM-based mobile GUI agents cannot reliably separate trusted interface elements from user-generated content, enabling context-aware prompt injections that succeed 23-30 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MIRAGE, a three-stage pipeline that converts ordinary mobile app screenshots into prompt-injection attacks by inserting attacker text only into regions where users normally add content. It evaluates the method on 1,111 samples across ten applications and eleven attack intents, finding every one of the five tested VLM agents vulnerable. The injected examples score higher on human realism judgments than earlier attacks, yet visual quality does not predict whether an attack succeeds. This shows that pixel-based perception alone leaves agents open to instructions that appear as ordinary user input.

Core claim

MIRAGE turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. The pipeline uses a Localizer to identify controllable regions, a Generator to synthesize context-aware payloads rendered in the application's native style, and a Curator to moderate realism and balance samples. On the 1,111-sample benchmark all five agents prove vulnerable at 23-30 percent success rates, with MIRAGE rated more realistic than the strongest prior attack, and per-sample realism uncorrelated with attack success.
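
The three-stage flow described above can be sketched as a chain of filters and generators. This is a minimal illustration, not the authors' implementation: the `Region` and `Sample` types, the `localize`, `generate`, and `curate` functions, the region-type and intent strings, and the stand-in callables are all hypothetical. In the paper these roles are played by VLM-backed components, and the Curator also balances the dataset, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Region:
    kind: str                         # hypothetical region type label
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
    user_controllable: bool           # could an ordinary user write here?


@dataclass(frozen=True)
class Sample:
    app: str
    region: Region
    intent: str
    payload: str


def localize(candidates: Iterable[Region]) -> List[Region]:
    """Stage 1 (reach): keep only regions a user could fill with content."""
    return [r for r in candidates if r.user_controllable]


def generate(app: str, regions: Iterable[Region], intents: Iterable[str],
             write_payload: Callable[[Region, str], str]) -> List[Sample]:
    """Stage 2 (realism): one payload per (region, intent) pair."""
    intents = list(intents)
    return [Sample(app, r, i, write_payload(r, i))
            for r in regions for i in intents]


def curate(samples: Iterable[Sample],
           accept: Callable[[Sample], bool]) -> List[Sample]:
    """Stage 3 (moderation): drop samples the moderator rejects."""
    return [s for s in samples if accept(s)]


# Toy run with stand-in callables where the paper uses VLM components.
candidates = [
    Region("comment", (40, 900, 1040, 980), True),
    Region("nav_button", (0, 0, 120, 120), False),
]
regions = localize(candidates)
samples = generate("shop_app", regions, ["intent_a", "intent_b"],
                   lambda r, i: f"[{i}] text styled for a {r.kind}")
kept = curate(samples, lambda s: len(s.payload) < 80)
print(len(regions), len(samples), len(kept))
```

Keeping each stage behind its own function mirrors the paper's stated design choice: reach, realism, and moderation can be tuned or replaced independently.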

What carries the argument

The MIRAGE pipeline, which separates control of reach, realism, and distributional balance into three sequential stages to produce visually indistinguishable injections.

If this is right

All five evaluated VLM GUI agents, and plausibly other agents that rely on pixel-only perception, are vulnerable to prompt injections placed in user-generated content regions.
Visual-quality filtering alone cannot defend against the attacks because realism and success rate are uncorrelated.
The attacks require no changes to the agent, application, or operating system.
The method produces balanced coverage across applications, region types, and attack intents.
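
The claim that realism and success are uncorrelated can be checked with an ordinary Pearson correlation between a per-sample realism score and a 0/1 success flag (the point-biserial case). The sketch below is one assumed way to run that check, not the paper's analysis; the function name and the toy numbers are made up for illustration and are not results from the benchmark.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Illustrative toy data only: realism on a 1-5 scale, success as 0/1.
realism = [1, 2, 3, 4, 5]
success = [1, 0, 0, 0, 1]
r = pearson(realism, success)
print(round(r, 6))
```

When r sits near zero, a filter that keeps only low-realism or only high-realism samples removes successful and failed attacks in roughly equal proportion, which is why such a filter would not be expected to lower the success rate.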

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar region-based injection attacks could be tested on non-mobile GUI agents that also rely on pixel input.
Agents might reduce risk by treating text in user-content regions as lower-trust input and requiring confirmation for actions.
The separation of localization, generation, and curation stages could be adapted to create test suites for other perception-based systems.
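
The lower-trust idea in the second point could look like the gate below. This is a hypothetical sketch of the editorial suggestion, not something the paper proposes or evaluates: `Source`, `ProposedAction`, and `gate` are invented names, and the sketch assumes the agent already has some external signal telling it which screen region motivated an action, which is exactly what pixel-only perception does not provide.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    APP_UI = "app_ui"              # chrome drawn by the application itself
    USER_CONTENT = "user_content"  # comments, reviews, messages, and so on


@dataclass(frozen=True)
class ProposedAction:
    name: str
    motivated_by: Source  # assumed provenance label, supplied from outside


def gate(action: ProposedAction) -> str:
    """Return 'execute' or 'confirm' based on where the instruction came from."""
    if action.motivated_by is Source.USER_CONTENT:
        return "confirm"   # lower-trust input: ask the user first
    return "execute"


print(gate(ProposedAction("tap_checkout", Source.APP_UI)))
print(gate(ProposedAction("send_message", Source.USER_CONTENT)))
```

The hard part in practice is producing the `motivated_by` label reliably; the gate itself is trivial once that provenance exists.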

Load-bearing premise

VLM GUI agents perceive the screen only as rendered pixels and cannot reliably separate trusted interface elements from user-generated content.

What would settle it

A controlled test in which the same five agents receive the 1,111 screenshots both with and without the MIRAGE injections, and the injected condition yields attack success rates statistically indistinguishable from zero, would refute the claim.
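
One way to make "statistically indistinguishable from zero" concrete is to put a confidence interval around each condition's success rate. The sketch below uses the Wilson score interval, which is a choice made here and not a method named by the paper. The sample size of 1,111 comes from the benchmark, but the success counts are hypothetical placeholders chosen for illustration.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials
                    + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


N = 1111                               # benchmark size reported in the paper
injected = wilson_interval(278, N)     # hypothetical count, not a paper result
clean = wilson_interval(0, N)          # hypothetical clean-screenshot control
print(injected, clean)
```

If the interval for the injected condition overlapped the interval for the clean control, the attack effect would not be distinguishable at this sample size; clearly separated intervals would support the paper's claim.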

Figures

Figures reproduced from arXiv: 2605.28116 by Gelei Deng, Ji Jie, Leo Yu Zhang, Lida Zhao, Ruoqi Guo, Yiheng Xiong, Yi Liu, Ying Zhang, Yuekang Li, Yuxiao Lu.

**Figure 2.** End-to-end pipeline of …

**Figure 3.** Example annotated overlay shown to the Cura…

**Figure 4.** LLM-judge realism (mean of four criteria, …

**Figure 5.** Per-(model, application) ASR (%). Rows are …

**Figure 6.** CLIP t-distributed stochastic neighbor em…

Original abstract

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MIRAGE shows VLM GUI agents can be fooled by styled text placed in ordinary user-content areas at 23-30% success, using a three-stage pipeline that improves realism over prior attacks.

The core finding is that current VLM-based mobile GUI agents treat the screen as pixels and therefore fall for prompt injections that look like normal user-generated content. MIRAGE splits the job into a Localizer that finds controllable regions, a Generator that makes context-fitting text in the app's native style, and a Curator that keeps the samples balanced and realistic. That separation is the main technical step beyond earlier injection work.
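
The balancing half of the Curator's job can be pictured as capping how many samples any one (application, region type, intent) cell contributes. The sketch below is an assumption about what such a trim could look like, not the paper's procedure: `balance_trim`, the per-cell cap, and the toy rows are all hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def balance_trim(samples: Iterable[T], cell: Callable[[T], Hashable],
                 cap: int) -> List[T]:
    """Keep at most `cap` samples per cell, preserving input order."""
    if cap < 0:
        raise ValueError("cap must be non-negative")
    seen: Counter = Counter()
    kept: List[T] = []
    for sample in samples:
        key = cell(sample)
        if seen[key] < cap:
            seen[key] += 1
            kept.append(sample)
    return kept


# Toy rows: (application, region type, attack intent), all invented labels.
rows = [
    ("app_a", "comment", "intent_1"),
    ("app_a", "comment", "intent_1"),
    ("app_a", "comment", "intent_1"),
    ("app_b", "review", "intent_2"),
]
trimmed = balance_trim(rows, cell=lambda row: row, cap=2)
print(len(trimmed))
```

A cap like this keeps an over-represented cell from dominating aggregate success rates, at the cost of discarding otherwise valid samples.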

The 1,111-sample test across ten apps and eleven intents is a reasonable scale, and the result that all five agents are vulnerable at similar rates is straightforward. The higher human realism score (3.02 vs 2.52) and the lack of correlation between realism and success are useful observations; they suggest simple visual filters won't close the gap.

The main limitation visible from the abstract is that the evaluation protocol, exact agent prompts, and statistical details are not spelled out, so the numbers are hard to reproduce without the full methods. If the paper supplies those, the empirical claim holds up; if not, the numbers stay hard to trust. No circularity or self-referential fitting appears in the reported results.

This is for people working on agent security or VLM robustness rather than core theory. It deserves a serious referee because the attack surface is practical and the benchmark gives a concrete starting point, even if the write-up needs tightening on reproducibility.

Referee Report

0 major / 2 minor

Summary. The paper claims that VLM-based mobile GUI agents perceive screens as pixels and therefore cannot reliably distinguish trusted interface elements from user-generated content. It introduces MIRAGE, a three-stage pipeline (Localizer to identify controllable regions, Generator to synthesize context-aware native-style payloads, Curator to balance realism and distribution) that converts benign screenshots into prompt-injection samples without modifying the agent, application, or OS. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, the work reports attack success rates of 23-30% across five evaluated VLM agents, higher human realism ratings than the strongest prior attack (3.02 vs. 2.52 out of 5), and no correlation between per-sample realism and attack success.

Significance. If the empirical results hold, the work provides concrete evidence of a practical attack surface on emerging VLM GUI agents that exploits the indistinguishability of user-generated content. The large-scale benchmark across multiple applications and intents, the explicit comparison to prior attacks via human realism ratings, and the reported lack of correlation between visual quality and success together strengthen the central claim that visual filtering alone is insufficient. The pipeline's separation of reach, realism, and distributional balance is a methodological contribution that directly addresses the core challenge of producing visually plausible yet effective injections.

minor comments (2)

[Abstract] The reported attack success rates and realism scores are presented without any summary of the evaluation protocol, baseline selection criteria, or statistical measures; adding one sentence on these points would make the abstract more self-contained.
[Evaluation] Confirm that the full methods description includes the precise definition of attack success (e.g., whether it requires the agent to execute a specific action or merely generate a particular output), the selection process for the five VLM agents, and any inter-rater agreement statistics for the human realism study.
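
For the inter-rater agreement requested in the second comment, Cohen's kappa is one common statistic for two raters. The sketch below is an illustration of that statistic, not a description of what the paper reports; the function name and the toy ratings are made up. Plain kappa treats the 1-5 ratings as unordered labels, so a weighted variant may suit an ordinal scale better.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equal-length, non-empty rating sequences")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Toy ratings on a 1-5 realism scale, invented for illustration.
print(cohens_kappa([3, 3, 4, 2, 5, 3], [3, 3, 4, 2, 5, 3]))
print(cohens_kappa([1, 1, 2, 2], [1, 2, 1, 2]))
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the raters agree no more often than their label frequencies alone would predict.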

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. No major comments are listed in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical attack pipeline (Localizer, Generator, Curator) evaluated on external VLM agents via a 1,111-sample benchmark across ten applications and eleven intents. Central results are measured attack success rates (23%-30%) and human realism scores, with no equations, fitted parameters, self-referential predictions, or load-bearing self-citations. The method constructs samples whose effectiveness is tested externally rather than derived from its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that pixel-only perception prevents agents from distinguishing trusted UI from user content; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: VLM GUI agents perceive the screen as rendered pixels and cannot reliably separate trusted interface elements from user-generated content.
Opening premise of the abstract that enables the attack surface.

pith-pipeline@v0.9.1-grok · 5817 in / 1331 out tokens · 34009 ms · 2026-06-29T12:11:42.737170+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work page · 1 internal anchor

[1] Mobile GUI Agents under Real-world Threats: Are We There Yet?

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23), pages 79–90. JaidedAI. 2024. EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts. https://github.com/JaidedAI/E...

arXiv · 2024

[2] is this an authentic unmodified mobile UI?

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In International Conference on Learning Representations (ICLR). Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi- ...

2024

[3] For each text-bearing r∈R, EasyOCR is run within r and the box is tightened to the detected text (candidates without high-confidence text are removed as VLM hallucinations)

Localizer. A VLM emits coarse candidate regions R; the pipeline filters out non-user-controllable regions and drops search_bar entries. For each text-bearing r∈R, EasyOCR is run within r and the box is tightened to the detected text (candidates without high-confidence text are removed as VLM hallucinations). An independent VLM bounding-box (bbox) mod...

[4] Generator. For each (r, i) pair, a VLM synthesises a one-sentence benign user goal g subject to the ambiguity rule (§3.3); a payload p is then generated conditional on the screenshot, region crop, region type, g, and i. A pre-render payload-quality (PQ) reviewer regenerates p up to three times if it duplicates g, misses the intent’s semantic role, re...

[5] Add comment

Curator. A post-render VLM moderator returns a verdict for each rendered s′. hard_fail renders are re-rendered with the moderator’s issue list as feedback (capped at three retries); samples still failing are dropped. Surviving samples pass through a pre-generation allocator, post-generation balance-trim, and one-pass coverage repair. Localizer: region...

[6] Add to Cart

appear in the final post-balance-trim dataset; the gross filter rate from Localizer+ADD to survivor is 25.0%. Localizer: issue taxonomy and repair actions. The bbox moderator emits one entry per issue. Issue kinds: missing_region, wrong_position, covers_avatar, missing_wrap_line, covers_system_ui, glyph_leakage_below, bbox_too_loose, duplicate. Each issue...

2026
