pith. sign in

arxiv: 2605.27761 · v1 · pith:CU6ENH3Vnew · submitted 2026-05-26 · 💻 cs.CV · cs.SE

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Pith reviewed 2026-06-29 17:49 UTC · model grok-4.3

classification 💻 cs.CV cs.SE
keywords AndroidDailyGRADEGUI agentsmobile benchmarksclosed-source applicationsvisual trajectory evaluationautomatic assessmentdaily tasks
0
0 comments X

The pith

AndroidDaily benchmark enables verifiable evaluation of GUI agents on 94 closed-source Android apps via visual guidelines alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create a verifiable way to test mobile GUI agents on real closed-source Android applications that do not reveal their internal states. It does this by releasing AndroidDaily with 350 tasks spanning 94 apps and developing GRADE, an evaluator that judges performance using only visible screen information and predefined guidelines. A sympathetic reader would care because previous benchmarks relied on simulated or open apps, leaving practical performance unknown. GRADE reaches 87.37 percent agreement with people, and the best tested model succeeds on 62 percent of tasks.

Core claim

The central claim is that by defining tasks through external observable guidelines in three categories—operational obligations, output quality, and negative constraints—GRADE can automatically and reliably evaluate whether an agent's sequence of screen actions completes a task on closed-source apps, matching human judgment at 87.37 percent while revealing that leading models achieve at most 62 percent success.

What carries the argument

GRADE, a process-aware evaluator that applies three-tiered external guidelines (operational obligations, output quality, negative constraints) to visual trajectories for step-level diagnostic judgments.

If this is right

  • Real-world closed-source apps become usable for benchmarking GUI agents in daily categories like shopping and transportation.
  • Evaluation of long-horizon mobile interactions becomes automatic without needing hidden states.
  • The 62 percent success rate identifies a measurable gap in current model capabilities for practical workflows.
  • Step-level diagnostic judgments allow targeted identification of where agents fail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • GRADE could be adapted to provide feedback signals during agent training on similar tasks.
  • The benchmark structure may extend to other closed platforms such as iOS applications.
  • Collecting more tasks in underrepresented categories would test whether the evaluation method holds at larger scale.

Load-bearing premise

The three-tiered external guidelines suffice to judge task success accurately from visual trajectories without access to any internal application state.

What would settle it

A calculation showing systematic disagreement between GRADE and humans on tasks whose success depends on unobservable internal changes such as account balances or order confirmations.

Figures

Figures reproduced from arXiv: 2605.27761 by Daxin Jiang, Fang Xu, Haolong Yan, Hongbing Li, Jiahe Lv, Jia Wang, Junbo Qi, Kaijun Tan, Litao Liu, Osamu Yoshie, Xiangyu Zhang, Xin Huang, Yeqing Shen, Yifan Sui, Zheng Ge, Zhimin Fan, Ziyang Meng.

Figure 1
Figure 1. Figure 1: Relationship between per-step inference latency [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of AndroidDaily and GRADE. Left: the Task Construction Pipeline builds AndroidDaily with 350 realistic [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Benchmark statistics of AndroidDaily. (a) Top-20 apps by task frequency (English app names) over 350 tasks, led by [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The GRADE evaluation pipeline. The Evidence Layer replays the agent trajectory to assemble structured evidence. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of Latency-Induced Click Misalignment. (a) At the prediction state, the model accurately plans spatial [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Two more failure cases under realistic mobile evaluation. (Left) Repetitive Operation Loop: The agent becomes trapped [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37\% agreement with human evaluators. The strongest model reaches a 62.0\% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces AndroidDaily, a benchmark comprising 350 realistic daily-use tasks across 94 closed-source Android applications in categories including transportation, shopping, local services, entertainment, content creation, social media, and utilities. It proposes GRADE, a process-aware evaluator that uses a three-tiered system of observable external guidelines (operational obligations, output quality, and negative constraints) to generate step-level diagnostic judgments from visual trajectories alone, without access to internal application states. The central empirical results are that GRADE achieves 87.37% agreement with human evaluators and that the strongest tested model attains a 62.0% success rate, indicating a gap between current capabilities and practical mobile workflows.

Significance. If the results hold, this work would be significant for GUI agent and mobile automation research by addressing the longstanding limitation that prior benchmarks rely on simulated environments or open-source apps with accessible internal states. The empirical human-agreement validation of GRADE supplies direct evidence that external visual criteria can support verifiable evaluation on opaque real-world apps, which could enable more realistic and scalable benchmarking. The scale (350 tasks, 94 apps) and focus on daily-use scenarios add practical value if the evaluation protocol is fully documented.

major comments (2)
  1. [Abstract] Abstract: The abstract reports concrete numbers (87.37% agreement, 62% success) but supplies no details on task curation, human rater protocol, statistical tests, or inter-rater reliability. These omissions are load-bearing for the central claim that the three-tiered guidelines suffice to determine task success from visual trajectories alone.
  2. [GRADE description] The premise that the three-tiered external guidelines are sufficient to determine task success from visual trajectories alone (without internal state) is stated as the foundation of GRADE, yet the manuscript provides no concrete examples of guideline application, disagreement resolution, or edge cases across the 350 tasks. This makes it difficult to assess whether the 87.37% agreement generalizes beyond the reported sample.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity of our claims regarding GRADE's evaluation protocol. We address each major comment below and commit to revisions where the manuscript can be improved without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract reports concrete numbers (87.37% agreement, 62% success) but supplies no details on task curation, human rater protocol, statistical tests, or inter-rater reliability. These omissions are load-bearing for the central claim that the three-tiered guidelines suffice to determine task success from visual trajectories alone.

    Authors: We agree that the abstract's brevity omits these details, which are load-bearing for the central claim. The full manuscript details task curation in Section 3, the human rater protocol and inter-rater reliability metrics (including Fleiss' kappa) in Section 4.2, and statistical significance tests in Section 4.3. In the revision, we will expand the abstract with one additional sentence summarizing the human validation protocol and evaluation methodology while remaining within length constraints. revision: yes

  2. Referee: [GRADE description] The premise that the three-tiered external guidelines are sufficient to determine task success from visual trajectories alone (without internal state) is stated as the foundation of GRADE, yet the manuscript provides no concrete examples of guideline application, disagreement resolution, or edge cases across the 350 tasks. This makes it difficult to assess whether the 87.37% agreement generalizes beyond the reported sample.

    Authors: We acknowledge that the absence of concrete examples limits the reader's ability to evaluate generalization. The manuscript reports aggregate agreement but does not illustrate individual guideline applications or disagreement cases. In the revised version, we will add a new subsection (or appendix) with 4-5 representative examples drawn from the 350 tasks, covering guideline application, resolution of disagreements between GRADE and humans, and edge cases such as partial completions or ambiguous visual outcomes. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a benchmark (AndroidDaily) and an evaluator (GRADE) based on observable external guidelines, then reports direct empirical measurements: 87.37% agreement between GRADE and human evaluators on the same visual trajectories, plus a 62.0% success rate for the strongest model. No equations, fitted parameters, derivations, or self-citation chains appear in the load-bearing claims. The central premise that the three-tiered guidelines suffice for verifiable judgments is tested by the agreement metric itself rather than assumed or reduced to prior inputs by construction. The work is therefore self-contained empirical benchmark construction with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that external visual guidelines can substitute for internal state access; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption External visual guidelines (operational obligations, output quality, negative constraints) are sufficient to determine task success from screen trajectories alone
    This premise underpins GRADE and is invoked when the abstract states that the evaluator works without hidden internal states.
invented entities (1)
  • GRADE evaluator no independent evidence
    purpose: Automatic step-level diagnostic evaluation in closed-source environments
    New process-aware evaluator introduced by the paper; no independent falsifiable evidence outside the reported human agreement is provided.

pith-pipeline@v0.9.1-grok · 5802 in / 1353 out tokens · 42769 ms · 2026-06-29T17:49:09.943632+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 32 canonical work pages · 11 internal anchors

  1. [1]

    Reyna Abhyankar, Qi Qi, and Yiying Zhang. 2025. Osworld-human: Bench- marking the efficiency of computer-use agents.arXiv preprint arXiv:2506.16042 (2025)

  2. [2]

    Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. 2024. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164(2024)

  3. [3]

    2026.System Card: Claude Opus 4.6

    Anthropic. 2026.System Card: Claude Opus 4.6. Technical Report. Anthropic. https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd. pdf

  4. [4]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  5. [5]

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. 2023. Introducing our Multimodal Models. https://www.adept.ai/blog/fuyu-8b

  6. [6]

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al

  7. [7]

    Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264(2024)

  8. [8]

    ByteDance Seed. 2026. Seed1.8 Model Card: Towards Generalized Real-World Agency.arXiv preprint arXiv:2603.20633(2026)

  9. [9]

    2026.Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

    ByteDance Seed. 2026.Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity. Technical Report. ByteDance. https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/ seed2/0214/Seed2.0%20Model%20Card.pdf

  10. [10]

    Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, et al . 2024. Spa-bench: A comprehensive benchmark for smartphone agent evaluation. InNeurIPS 2024 Workshop on Open-World Agents

  11. [11]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9313–9332

  12. [12]

    Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados

    Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, De- han Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados. 2025. The BrowserGym Ec...

  13. [13]

    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. InProceedings of the 30th annual ACM symposium on user interface software and technology. 845–854

  14. [14]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems36 (2023), 28091–28114

  15. [15]

    Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. 2026. UI-Venus-1.5 Technical Report.arXiv preprint arXiv:2602.09082(2026)

  16. [16]

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. 2023. Assist- gui: Task-oriented desktop graphical user interface automation.arXiv preprint arXiv:2312.13108(2023)

  17. [17]

    2025.Gemini 3 Flash

    Google DeepMind. 2025.Gemini 3 Flash. Technical Report. Google DeepMind. https://deepmind.google/models/gemini/flash/

  18. [18]

    2025.Gemini 3 Pro Model Card

    Google DeepMind. 2025.Gemini 3 Pro Model Card. Technical Report. Google DeepMind. https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf

  19. [19]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al . 2024. A survey on llm-as-a-judge.The Innovation(2024)

  20. [20]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. Webvoyager: Building an end-to-end web agent with large multimodal models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6864–6890

  21. [21]

    GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.arXiv preprint arXiv:2507.01006(2025)

  22. [22]

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14281–14290

  23. [23]

    Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Srinivas Sunkara, Victor Cărbune, Jason Lin, Maria Wang, Yun Zhu, and Jindong Chen. 2025. Screenqa: Large- scale question-answer pairs over mobile app screenshots. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

  24. [24]

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. 2024. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. InEuropean Conference on Computer Vision. Springer, 161–178

  25. [25]

    Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2024. Critiquellm: To- wards an informative critique generation model for evaluation of large language model generation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

  26. [26]

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. 2024. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 881–905

  27. [27]

    Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, et al. 2025. Mo- bileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments.arXiv preprint arXiv:2512.19432(2025)

  28. [28]

    Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, and Ruocheng Guo. 2026. Tool- PRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents.arXiv preprint arXiv:2601.12294(2026)

  29. [29]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579(2024)

  30. [30]

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. 2025. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia. 8778–8786

  31. [31]

    Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020. Widget captioning: Generating natural language description for mobile user AndroidDaily interface elements. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). 5495–5510

  32. [32]

    Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, et al. 2026. GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents. arXiv preprint arXiv:2603.15039(2026)

  33. [33]

    Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. Droidbot: a lightweight ui-guided test input generator for android. In2017 IEEE/ACM 39th international conference on software engineering companion (ICSE-C). IEEE, 23–26

  34. [34]

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. InThe twelfth international conference on learning representations

  35. [35]

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang

  36. [36]

    Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

    Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration. InInternational Conference on Learning Representations (ICLR). https://arxiv.org/abs/1802.08802

  37. [37]

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. 2025. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326 (2025)

  38. [38]

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al . 2024. Androidworld: A dynamic benchmarking envi- ronment for autonomous agents.arXiv preprint arXiv:2405.14573(2024)

  39. [39]

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lilli- crap. 2023. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems36 (2023), 59708–59728

  40. [40]

    Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. 2025. Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923(2025)

  41. [41]

    Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al . 2024. Cradle: Em- powering foundation agents towards general computer control.arXiv preprint arXiv:2403.03186(2024)

  42. [42]

    Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. 2024. Omniparser: A unified framework for text spotting key information extraction and table recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15641–15653

  43. [43]

    Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2words: Automatic mobile ui summarization with multimodal learning. InThe 34th Annual ACM Symposium on User Interface Software and Technology. 498–510

  44. [44]

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems37 (2024), 2686–2710

  45. [45]

    Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. 2025. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478(2025)

  46. [46]

    Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. InProceedings of the 30th annual international conference on Mobile computing and networking. 543–557

  47. [47]

    Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. Os-copilot: Towards generalist computer agents with self-improvement.arXiv preprint arXiv:2402.07456(2024)

  48. [48]

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al . 2024. OS- ATLAS: A Foundation Action Model for Generalist GUI Agents.arXiv preprint arXiv:2410.23218(2024)

  49. [49]

    Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, et al. 2025. Agentprm: Process reward models for llm agents via step-wise promise and progress.arXiv preprint arXiv:2511.08325(2025)

  50. [50]

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. Os- world: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems37 (2024), 52040–52094

  51. [51]

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. 2026. Mobile-Agent-v3. 5: Multi-platform Fundamental GUI Agents.arXiv preprint arXiv:2602.16855(2026)

  52. [52]

    Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, and Guohao Li. 2024. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. arXiv:2407.01511 [cs.AI] https://arxiv.org/ abs/2407.01511

  53. [53]

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. 2026. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876(2026)

  54. [54]

    Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, et al. 2025. GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2512.02423(2025)

  55. [55]

    Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. 2025. Step-gui technical report. arXiv preprint arXiv:2512.15431(2025)

  56. [56]

    Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, et al . 2025. Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539 (2025)

  57. [57]

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. 2025. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144(2025)

  58. [58]

    Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-ui: Grounded mobile ui understanding with multimodal llms. InEuropean Conference on Computer Vision. Springer, 240–255

  59. [59]

    Zhuosheng Zhan and Aston Zhang. 2023. You Only Look at Screens: Multimodal Chain-of-Action Agents.arXiv preprint arXiv:2309.11436(2023)

  60. [60]

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. 2025. Ufo: A ui-focused agent for windows os interaction. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)...

  61. [61]

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2025. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–20

  62. [62]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623

  63. [63]

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al . 2025. MAI-UI Technical Report: Real-World Centric Foundation GUI Agents.arXiv preprint arXiv:2512.22047(2025)

  64. [64]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2023. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854(2023)

  65. [65]

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631(2023)