Mobile GUI Agents under Real-world Threats: Are We There Yet?
Pith reviewed 2026-05-19 06:54 UTC · model grok-4.3
The pith
Mobile GUI agents are misled by third-party content such as ads and user posts, at average rates of 42.0 percent in dynamic environments and 36.1 percent in static ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing a scalable app content instrumentation framework, the authors enable reproducible insertion of third-party content into commercial applications and demonstrate that every examined GUI agent suffers significant performance loss, reaching an average misleading rate of 42.0 percent in dynamic environments and 36.1 percent in static environments.
What carries the argument
Scalable app content instrumentation framework that performs flexible, targeted modifications inside existing applications to create realistic third-party content states.
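As a rough illustration of what a targeted, in-place content modification could look like, the sketch below swaps the text of one element in a toy view hierarchy and checks that the surrounding structure is unchanged. The dictionary layout, the node ids, and the helper names `inject_content` and `skeleton` are all hypothetical; the paper's actual framework and its API are not described on this page.

```python
import copy


def inject_content(tree, target_id, new_text):
    """Return a copy of `tree` with the text of node `target_id` replaced.

    Hypothetical helper: the real framework hooks into running apps,
    whereas this sketch only edits a plain dictionary.
    """
    modified = copy.deepcopy(tree)  # leave the original screen untouched
    stack = [modified]
    hits = 0
    while stack:
        node = stack.pop()
        if node.get("id") == target_id:
            node["text"] = new_text
            hits += 1
        stack.extend(node.get("children", []))
    if hits == 0:
        raise KeyError(f"no node with id {target_id!r}")
    return modified


def skeleton(node):
    """Structure of a tree with the text stripped out."""
    return (
        node.get("id"),
        node.get("class"),
        [skeleton(child) for child in node.get("children", [])],
    )


# Toy screen state (illustrative only, not taken from the benchmark).
screen = {
    "id": "root",
    "class": "FrameLayout",
    "children": [
        {"id": "inbox_title", "class": "TextView", "text": "Inbox"},
        {"id": "mail_0_subject", "class": "TextView", "text": "Weekly report"},
    ],
}

attacked = inject_content(
    screen, "mail_0_subject", "URGENT: tap here to verify your account"
)
```

Comparing `skeleton(screen)` with `skeleton(attacked)` is one cheap way to confirm that only the content, and not the element hierarchy, was changed.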
Load-bearing premise
The content modifications inside the apps produce screen states that match genuine third-party material without adding detectable artifacts or changing app behavior in ways that would not happen outside the test setup.
What would settle it
Agents completing the 122 dynamic tasks and the 3,000 static scenarios at success rates comparable to clean benchmarks would show that the third-party content does not cause the reported degradation.
Original abstract
Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and there are already several commercial agents released and used by early adopters. However, are we really ready for GUI agents integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation is missing to examine whether the agents can maintain their performance under real-world threats. Specifically, unlike existing common benchmarks that are based on simple static app contents (they have to do so to ensure environment consistency between different tests), real-world apps are filled with contents from untrustworthy third parties, such as advertisement emails and user-generated posts and media. ... To this end, we introduce a scalable app content instrumentation framework to enable flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that all examined agents can be significantly degraded due to third-party contents, with an average misleading rate of 42.0% and 36.1% in dynamic and static environments respectively. The framework and benchmark have been released at https://agenthazard.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that current mobile GUI agents, both open-source and commercial, are vulnerable to degradation from real-world third-party contents (ads, user-generated posts, media) in apps. It introduces a scalable instrumentation framework for targeted content modifications in existing applications, constructs a benchmark with 122 reproducible dynamic tasks and over 3,000 static GUI scenarios from commercial apps, and reports average misleading rates of 42.0% (dynamic) and 36.1% (static) across evaluated agents, arguing that pre-deployment validation under such threats is missing from standard benchmarks.
Significance. If the results hold under validated conditions, the work provides a timely empirical measurement of robustness gaps in LLM-powered GUI agents, which is significant for the security and deployment of autonomous device-control systems. The released framework and benchmark at agenthazard.github.io represent a concrete contribution that enables future reproducible studies on agent resilience to untrusted content, moving beyond static benchmarks to more realistic threat models.
major comments (2)
- [Section 3] Section 3 (instrumentation framework): The headline misleading rates (42.0% dynamic, 36.1% static) rest on the unvalidated assumption that the framework's content modifications faithfully emulate genuine third-party material without introducing layout, visual, or semantic artifacts that would not occur in the wild. No quantitative validation (e.g., visual similarity metrics, element-hierarchy distribution comparisons, or user studies) is reported to confirm that modified screens match the distribution of naturally occurring third-party content in commercial apps; this directly affects whether the measured degradation reflects authentic threats.
- [Section 4] Section 4 / experimental protocol: The abstract and results report concrete misleading rates on 122 tasks and >3,000 scenarios, but the manuscript does not detail data-exclusion rules, task selection criteria, or statistical tests (e.g., significance levels or confidence intervals on the rates). Without these, it is not possible to verify whether post-hoc choices influenced the reported averages or whether the degradation is statistically robust across agents.
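One visual similarity metric of the kind the first major comment asks for is the structural similarity index (SSIM). The sketch below is a deliberately simplified, single-window version computed over whole grayscale images given as flat lists of pixel intensities; standard SSIM instead averages the same statistic over local windows. The function name `global_ssim` and the sample pixel values are illustrative assumptions, not anything reported by the authors.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    `x` and `y` are flat sequences of pixel intensities. This is a
    simplification: the usual SSIM is a mean over local windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard stabilising constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 "screens": the modified one differs in a single pixel.
original_screen = [10, 50, 200, 120]
modified_screen = [10, 50, 0, 120]

same_score = global_ssim(original_screen, original_screen)
diff_score = global_ssim(original_screen, modified_screen)
```

Identical images score 1, and the score drops as the modified screen diverges from the original, so a distribution of such scores over many screens would give one quantitative view of how far instrumented screens sit from untouched ones.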
minor comments (2)
- [Section 3] The description of the static dataset construction could clarify how the >3,000 scenarios were sampled from commercial apps to avoid selection bias toward particularly vulnerable states.
- [Section 4] A figure or table presenting per-agent misleading rates would benefit from error bars or per-category breakdowns (e.g., ad vs. user-generated content) to improve interpretability.
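A per-category breakdown with error bars, as suggested in the last minor comment, could be produced along the following lines. The sketch groups per-scenario outcomes by content category and attaches a Wilson score interval to each rate. The record format, the category labels, and the sample data are made up for illustration and do not come from the paper.

```python
import math
from collections import defaultdict


def wilson_interval(misled, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = misled / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def rates_by_category(records):
    """Map category -> (misleading rate, Wilson interval).

    `records` is an iterable of (category, was_misled) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [misled, total]
    for category, was_misled in records:
        counts[category][0] += int(was_misled)
        counts[category][1] += 1
    return {
        category: (misled / total, wilson_interval(misled, total))
        for category, (misled, total) in counts.items()
    }


# Synthetic outcomes, for illustration only.
records = [
    ("ad", True), ("ad", True), ("ad", True), ("ad", False),
    ("user_post", True), ("user_post", False),
    ("user_post", False), ("user_post", False),
]
summary = rates_by_category(records)
```

The Wilson interval is used here because it behaves sensibly for small per-category counts; other interval choices would work equally well for plotting error bars.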
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our paper. We address each of the major comments below and outline the revisions we intend to make to strengthen the manuscript.
-
Referee: [Section 3] Section 3 (instrumentation framework): The headline misleading rates (42.0% dynamic, 36.1% static) rest on the unvalidated assumption that the framework's content modifications faithfully emulate genuine third-party material without introducing layout, visual, or semantic artifacts that would not occur in the wild. No quantitative validation (e.g., visual similarity metrics, element-hierarchy distribution comparisons, or user studies) is reported to confirm that modified screens match the distribution of naturally occurring third-party content in commercial apps; this directly affects whether the measured degradation reflects authentic threats.
Authors: We appreciate the referee highlighting the importance of validating the realism of our content modifications. The instrumentation framework is designed to perform in-place replacements of third-party content elements (such as ad banners or user posts) within the original app layouts, thereby preserving the structural and visual integrity by construction. However, we acknowledge that the manuscript does not provide quantitative comparisons to naturally occurring content. In the revised version, we will add a validation subsection in Section 3, including visual similarity metrics (e.g., SSIM and LPIPS) between modified and unmodified screens, as well as comparisons of UI element distributions and a qualitative analysis of semantic fidelity using examples from real apps. revision: yes
-
Referee: [Section 4] Section 4 / experimental protocol: The abstract and results report concrete misleading rates on 122 tasks and >3,000 scenarios, but the manuscript does not detail data-exclusion rules, task selection criteria, or statistical tests (e.g., significance levels or confidence intervals on the rates). Without these, it is not possible to verify whether post-hoc choices influenced the reported averages or whether the degradation is statistically robust across agents.
Authors: We agree that additional details on the experimental protocol are necessary for full reproducibility and to demonstrate statistical robustness. The 122 dynamic tasks were selected to represent a diverse set of common user interactions across popular commercial apps, with reproducibility ensured through fixed starting states and scripted actions. For the static dataset, scenarios were sampled from over 3,000 GUI states collected from real app executions. In the revision, we will expand Section 4 to explicitly describe the task selection criteria, any data exclusion rules applied (such as filtering out incomplete recordings), and include statistical analyses, including 95% confidence intervals computed via bootstrapping on the misleading rates for each agent. revision: yes
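The bootstrapped 95% confidence intervals promised in the second response can be sketched as a percentile bootstrap over per-episode outcomes. The function name `bootstrap_ci`, the resample count, and the synthetic outcome vector below are assumptions for illustration; the authors' actual procedure is not specified.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a misleading rate.

    `outcomes` is a list of 0/1 flags, one per episode (1 = agent misled).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample episodes with replacement and record each resample's rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Synthetic data: 100 episodes, 40 of them misled (not the paper's results).
outcomes = [1] * 40 + [0] * 60
point_estimate = sum(outcomes) / len(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)
```

Running this once per agent would give an interval to report next to each agent's misleading rate.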
Circularity Check
No circularity: empirical measurement study with direct observation of agent behavior
The paper conducts an empirical evaluation by introducing a content instrumentation framework, constructing dynamic and static test suites from commercial apps, and directly measuring misleading rates from agent actions against ground-truth task success. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The reported rates (42.0% dynamic, 36.1% static) are computed outcomes of the experiments rather than reductions to inputs by construction. The work is self-contained as a measurement campaign without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based GUI agents interpret screen content and follow natural-language instructions without built-in provenance checks for third-party elements.