
arxiv: 2605.10347 · v2 · pith:26HI5JTCnew · submitted 2026-05-11 · 💻 cs.AI · cs.CL

How Mobile World Model Guides GUI Agents?

Pith reviewed 2026-05-25 05:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords mobile world models, GUI agents, renderable code, multimodal supervision, out-of-distribution execution, trajectory generation, action prediction

The pith

Renderable code reconstruction in mobile world models achieves high in-distribution fidelity and provides effective multimodal supervision for GUI agent data construction, while text-based feedback is more robust for online out-of-distribution execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains mobile world models in four modalities (delta text, full text, diffusion images, and renderable code), then tests which ones best guide GUI agents on benchmarks and downstream tasks. It finds that renderable code reconstructs in-distribution screens faithfully enough to create useful training examples, whereas text feedback is more robust to novel situations during live execution. Trajectories generated by these models can be fed into agent training to raise end-to-end success rates, even though the trajectories themselves diverge from the original data distribution. The work also shows that using a world model to double-check actions after the fact adds little value for agents that already act with high confidence.

Core claim

The authors filter and annotate mobile world-model data and train models across delta text, full text, diffusion-based images, and renderable code; the resulting models reach state-of-the-art results on MobileWorldBench and Code2WorldBench. Downstream tests on AITZ, AndroidControl, and AndroidWorld show that renderable code reconstruction achieves high in-distribution fidelity and supplies effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution execution. World-model-generated trajectories supply transferable interaction experience that improves agents' end-to-end task performance, although the data do not preserve the original distribution.

What carries the argument

Mobile world models in four modalities (delta text, full text, diffusion images, renderable code) that predict future states and serve GUI agents as prior perception, training supervision, or post-hoc verification.

Load-bearing premise

The three downstream evaluation environments and the filtered training data are representative enough to support general claims about which modality works best for arbitrary mobile GUI agents and long-horizon tasks.

What would settle it

A controlled test on a new mobile environment outside the three evaluated ones in which text-based feedback loses its OOD advantage or generated trajectories cease to raise end-to-end agent performance.

Figures

Figures reproduced from arXiv: 2605.10347 by Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang.

Figure 1: Overview of empirical results across prediction formats, test-time guidance, and …
Figure 2: Comparison between text-based and image-based world models for GUI state prediction.
Figure 3: Comparison of different world-modeling paradigms across four generation settings.
Figure 4: Four data filtering methods during our data construction. Graph-level transition deduplication. To reduce repeated state transitions, we merge transition triples with similar start states, similar next states, and the same action type. Following the node-merging strategy in Mobile3M [13], for two triples $\tau_n = (s_n, a_n, s_{n+1})$ and $\tau_r = (s_r, a_r, s_{r+1})$, we treat them as duplicates if $D(s_n, s_r) > 0.95$, $D(s_{n+1},$ …
Figure 5: Overall SR on AndroidWorld under two agent frameworks.
Figure 6: Overall success rate on AndroidWorld with M3A agents. Bars report overall SR, and labels …
Figure 7: Entropy statistics and entropy-conditioned behavior
Figure 8: Test-time scaling trends on AITZ (ID), AndroidControl (ID), and GUI-Odyssey (OOD).
Figure 9: AndroidControl performance during training with World-Model imagination
Figure 10: Analysis of agent fine-tuned on World-Model trajectories.
Figure 11: Six-dimensional radar plots for offline task navigation. Each subplot corresponds to one …
Figure 12: Entropy-range analysis for GUI world-model feedback. Left: accuracy trends across …
Figure 13: AITZ downstream task case study with HTML-based world-model feedback.
Figure 14: AndroidControl downstream task case study with delta-text world-model feedback.
Figure 15: AndroidControl downstream task case study with text-based world-model feedback.
Figure 16: False-negative penalty caused by world model hallucination.
Figure 17: Failure of HTML-based world models in simulating dynamic text input.
Figure 18: High repetition rate of candidate actions in small-scale models.
Figure 19: Diffusion Image case studies on Mobile GUI state prediction, cases 1–2.
Figure 20: Diffusion Image case studies on Mobile GUI state prediction, cases 3–4.
Figure 21: Code2Image Case Study 1: Action is Click on the search bar at the top of the screen to search for the arts. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.
Figure 22: Code2Image Case Study 2: Action is Click on the Moon tab at the bottom left corner of the screen to view the details.
… interaction logic, producing a systematic centering bias in the collected trajectories that degrades click accuracy after fine-tuning.
Figure 23: Code2Image Case Study 3: Action is Click on save story from the options. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.
Figure 24: Code2Image Case Study 4: Action is Swipe up to view the romeo and juliet file. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.
Figure 25: Code2Image Case Study 5: Action is Swipe up to view more details.
Figure 26: Code2Image Case Study 6: Input text is Literature art.
Figure 27: Bad Case studies of the 40-step denoising process used by diffusion-based world models
Figure 28: Qualitative comparison between real Android screens and corresponding world-model …
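
The graph-level transition deduplication described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity function D is stood in for by a hypothetical Jaccard similarity over sets of UI element strings, the `Transition` type is invented for the example, and the next-state threshold (truncated in the caption) is exposed as a parameter whose 0.95 default is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Transition:
    # Hypothetical state representation: a set of UI element strings.
    state: FrozenSet[str]
    action_type: str
    next_state: FrozenSet[str]


def similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity, used here only as a stand-in for D."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(t: Transition, r: Transition,
                 start_thr: float = 0.95,
                 next_thr: float = 0.95) -> bool:
    """Duplicate if same action type and both state pairs are similar.

    start_thr follows the 0.95 value in the caption; next_thr is an
    assumption because the caption is truncated at that point.
    """
    return (t.action_type == r.action_type
            and similarity(t.state, r.state) > start_thr
            and similarity(t.next_state, r.next_state) > next_thr)


def deduplicate(transitions: List[Transition],
                start_thr: float = 0.95,
                next_thr: float = 0.95) -> List[Transition]:
    """Keep the first transition of every group of near-duplicates."""
    kept: List[Transition] = []
    for t in transitions:
        if not any(is_duplicate(t, k, start_thr, next_thr) for k in kept):
            kept.append(t)
    return kept
```

The greedy pass compares each transition against everything already kept, which is quadratic in the worst case; a real pipeline over a large graph would likely bucket by action type or use an approximate nearest-neighbour index first.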
Original abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
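
The third finding turns on action entropy. As a rough illustration, one can estimate entropy from the empirical distribution over sampled candidate actions and call a world-model verifier only when the agent is uncertain. The abstract does not define how entropy is computed, so the estimator below is an assumption, and `should_verify` and its 0.5-nat threshold are hypothetical names and values chosen for the example.

```python
import math
from collections import Counter
from typing import Sequence


def action_entropy(sampled_actions: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical action distribution."""
    if not sampled_actions:
        raise ValueError("need at least one sampled action")
    counts = Counter(sampled_actions)
    total = len(sampled_actions)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def should_verify(sampled_actions: Sequence[str],
                  threshold: float = 0.5) -> bool:
    """Hypothetical gate: verify only when entropy exceeds the threshold.

    The threshold value is an illustrative assumption, not a number
    taken from the paper.
    """
    return action_entropy(sampled_actions) > threshold
```

Under this reading, an agent that proposes the same action in every sample has zero entropy, so post-hoc reflection has no alternative candidate to switch to, which is consistent with the limited gains reported for overconfident agents.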

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper filters and annotates mobile GUI data to train world models in four modalities (delta text, full text, diffusion images, renderable code). These achieve SoTA on MobileWorldBench and Code2WorldBench. Downstream evaluations on AITZ, AndroidControl, and AndroidWorld yield three findings: renderable code provides high in-distribution fidelity and effective multimodal supervision while text feedback is more robust for online OOD execution; generated trajectories improve end-to-end agent performance without preserving the original distribution; and world models are more effective as prior perception or training supervision than as post-hoc verifiers, especially for overconfident low-entropy agents.

Significance. If the empirical modality comparisons hold after proper controls, the work offers concrete guidance on representation choices for mobile world models and demonstrates that generated trajectories can transfer interaction experience. The multi-modality training and SOTA benchmark results are strengths; the downstream utility findings could inform design of long-horizon GUI agents if the in-distribution vs. OOD distinction is rigorously established.

major comments (2)
  1. [Downstream evaluation on AITZ, AndroidControl, and AndroidWorld] Downstream evaluation section: no explicit metric (action-sequence divergence, visual embedding distance, task-horizon statistics, or similar) is reported to quantify distribution shift between the filtered training trajectories and the three evaluation environments (AITZ, AndroidControl, AndroidWorld). This quantification is load-bearing for the central claim that text-based feedback is more robust specifically for online OOD execution versus renderable code for in-distribution fidelity.
  2. [Findings on generated trajectories] Findings paragraph and associated tables/figures: the reported improvements from generated trajectories on end-to-end task performance are presented without controls for multiple comparisons or statistical significance testing across the modality variants and environments, weakening the second finding.
minor comments (2)
  1. [Methods] Notation for the four modalities (delta text, full text, diffusion-based images, renderable code) should be introduced with consistent abbreviations in the methods section for clarity in later comparisons.
  2. [Benchmark results] The abstract states 'SoTA performance' on the two benchmarks; the main text should include the exact prior baselines and margins for each modality to allow direct verification.
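
Major comment 1 asks for an explicit distribution-shift metric. A minimal sketch of two candidates, using only the standard library, is given here: Levenshtein edit distance between action-type sequences and cosine distance between embedding vectors. The function names are illustrative, and the embeddings are plain lists of floats; which visual encoder would produce them is left open because the source does not specify one.

```python
import math
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)
```

Averaging either quantity over matched pairs of training and evaluation trajectories would give one number per environment, which is the kind of evidence the comment says the ID versus OOD claim needs.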

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important aspects of rigor in the downstream evaluation and statistical presentation. We address each below.

Point-by-point responses
  1. Referee: [Downstream evaluation on AITZ, AndroidControl, and AndroidWorld] Downstream evaluation section: no explicit metric (action-sequence divergence, visual embedding distance, task-horizon statistics, or similar) is reported to quantify distribution shift between the filtered training trajectories and the three evaluation environments (AITZ, AndroidControl, AndroidWorld). This quantification is load-bearing for the central claim that text-based feedback is more robust specifically for online OOD execution versus renderable code for in-distribution fidelity.

    Authors: We agree that explicit quantification of distribution shift is necessary to rigorously support the ID versus OOD distinction in our modality findings. In the revision we will add metrics including action-sequence edit distance and cosine distances in a shared visual embedding space between the filtered training trajectories and each of the three evaluation environments. revision: yes

  2. Referee: [Findings on generated trajectories] Findings paragraph and associated tables/figures: the reported improvements from generated trajectories on end-to-end task performance are presented without controls for multiple comparisons or statistical significance testing across the modality variants and environments, weakening the second finding.

    Authors: We acknowledge the absence of formal statistical controls. In the revised manuscript we will include paired statistical tests (with multiple-comparison correction) across modality variants and environments and report the resulting p-values alongside the performance numbers. revision: yes
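
The second response promises paired tests with multiple-comparison correction without naming a procedure. One possible shape for that analysis, offered as a sketch and not as the authors' planned method, is an exact paired sign-flip permutation test followed by Holm correction. Both function names are illustrative, and exact enumeration is only practical for a small number of paired tasks (roughly 20 or fewer).

```python
import itertools
from typing import Dict, Sequence


def paired_sign_flip_pvalue(baseline: Sequence[float],
                            treated: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have the same length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs))
    hits = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            hits += 1
    return hits / 2 ** len(diffs)


def holm_adjust(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Holm step-down adjusted p-values, keyed by comparison name."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted: Dict[str, float] = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # Multiply by the number of hypotheses not yet rejected and
        # enforce monotonicity across the sorted sequence.
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[name] = running_max
    return adjusted
```

Per-task success indicators (0 or 1) for a baseline agent and a world-model-trained agent would be passed as the two paired sequences, with one p-value per modality and environment fed into `holm_adjust`.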

Circularity Check

0 steps flagged

No circularity: empirical evaluation chain is self-contained against external benchmarks

full rationale

The paper filters and annotates data, trains four modality-specific world models, reports SoTA on MobileWorldBench and Code2WorldBench, then measures downstream effects on the independent environments AITZ, AndroidControl, and AndroidWorld. No equations, fitted parameters, or self-citations are shown to reduce any reported gain (in-distribution fidelity, OOD robustness, or end-to-end improvement) to a quantity defined by the paper's own inputs. The modality ranking and trajectory-transfer claims rest on observable benchmark outcomes rather than definitional or self-referential reductions, satisfying the default expectation of an empirical study with low circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard supervised training of vision-language and diffusion models plus the assumption that the chosen mobile datasets are representative.

pith-pipeline@v0.9.0 · 5806 in / 1179 out tokens · 21743 ms · 2026-05-25T05:57:58.113939+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  3. [3]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  5. [5]

    Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl

    Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, et al. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl. arXiv preprint arXiv:2602.22190, 2026

  6. [6]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

  7. [7]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024

  8. [8]

    Is your llm secretly a world model of the internet? model-based planning for web agents

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559, 2024

  9. [9]

    Mobiledreamer: Generative sketch world model for gui agent

    Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent. arXiv preprint arXiv:2601.04035, 2026

  10. [10]

    Vimo: A generative visual gui world model for app agents

    Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025

  11. [11]

    Generative visual code mobile world models

    Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, and Jamin Shin. Generative visual code mobile world models. arXiv preprint arXiv:2602.01576, 2026

  12. [12]

    Code2world: A gui world model via renderable code generation

    Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026

  13. [13]

    Mobilevlm: A vision-language model for better intra-and inter-ui understanding

    Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Liujian Liujianfeng, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. Mobilevlm: A vision-language model for better intra-and inter-ui understanding. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10231–10251, 2024

  14. [14]

    Androidinthewild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  15. [15]

    Amex: Android multi-annotation expo dataset for mobile gui agents

    Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2138–2156, 2025

  16. [16]

    Android in the zoo: Chain-of-action-thought for gui agents

    Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12016–12031, 2024

  17. [17]

    On the effects of data scale on ui control agents

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents. Advances in Neural Information Processing Systems, 37:92130–92154, 2024

  18. [18]

    Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

  19. [19]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  20. [20]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  21. [21]

    Mobileworldbench: Towards semantic world modeling for mobile agents

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Mobileworldbench: Towards semantic world modeling for mobile agents. arXiv preprint arXiv:2512.14014, 2025

  22. [22]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  23. [23]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  24. [24]

    Mai-ui technical report: Real-world centric foundation gui agents

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047, 2025

  25. [25]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

  26. [26]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  27. [27]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  28. [28]

    Gorilla: Large language model connected with massive apis

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37:126544–126565, 2024

  29. [29]

    Falcon-ui: Understanding gui before following user instructions

    Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions. arXiv preprint arXiv:2412.09362, 2024

  30. [30]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  31. [31]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  32. [32]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733, 2025

  33. [33]

    GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025

  34. [34]

    Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. arXiv preprint arXiv:2505.15810, 2025

  35. [35]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025

  36. [36]

    Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

  37. [37]

    Webworld: A large-scale world model for web agent training

    Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. Webworld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026

  38. [38]

    Mobile-bench: An evaluation benchmark for llm-based mobile agents

    Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, et al. Mobile-bench: An evaluation benchmark for llm-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831, 2024

  39. [39]

    Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, and Bo An. Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks. In The Fourteenth International Conference on Learning Representations, 2026

  40. [40]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023

  41. [41]

    Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions

    Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, and Rui Yan. Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 883–893, 2025

  42. [42]

    Mobileipl: Enhancing mobile agents thinking process via iterative preference learning

    Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mobileipl: Enhancing mobile agents thinking process via iterative preference learning. arXiv preprint arXiv:2505.12299, 2025

  43. [43]

    Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning

    Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, et al. Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning. arXiv preprint arXiv:2602.24142, 2026

  44. [44]

    Llm-based agents for tool learning: A survey

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025

  45. [45]

    Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, and Bo An. Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents. arXiv preprint arXiv:2505.11891, 2025

  46. [46]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025

  47. [47]

    Step: Success-rate-aware trajectory-efficient policy optimization

    Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. Step: Success-rate-aware trajectory-efficient policy optimization. arXiv preprint arXiv:2511.13091, 2025

  48. [48]

    Computer-using world model

    Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. arXiv preprint arXiv:2602.17365, 2026

  49. [49]

    R^3: Replay, reflection, and ranking rewards for llm reinforcement learning

    Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, and Peng Han. R^3: Replay, reflection, and ranking rewards for llm reinforcement learning. arXiv preprint arXiv:2601.19620, 2026

  50. [50]

    Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy

    Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9828–9862, 2024

  51. [51]

    From word to world: Can large language models be implicit text-based world models?

    Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, and Heng Ji. From word to world: Can large language models be implicit text-based world models? arXiv preprint arXiv:2512.18832, 2025

    A Full Related Work

    A.1 Mobile GUI Agents

    The evolution of mobile GUI agents marks a paradigm shift from rule-...

    1. click: A tap on a button, icon, or link. Result: Page navigation, popup opens, toggle switches, or focus change.

    2. long_press: A sustained touch. Result: Context menu appears or item selection mode triggers.

    3. scroll: The content shifts vertically or horizontally. (New content appears, old content moves off-screen.)

    4. input_text: Text appears in an input field (without an explicit enter press).

    5. open_app: The screen transitions from a launcher/home screen to a specific app interface.

    6. navigate_home: Returns to the device home screen/launcher.

    7. navigate_back: Returns to the previous screen (reverse navigation).

    8. wait: No significant visual change, or a loading spinner continues spinning.

    9. none: The transition is hallucinated, broken, illogical...
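As an illustration only, the nine action labels listed above could be held in a small lookup table so that a predicted label is checked before it is used. The sketch below is in Python. The names `ACTION_TYPES` and `parse_action` are hypothetical and do not come from the paper. The descriptions are copied or shortened from the list, and the truncated `none` entry is left truncated.

```python
# Hypothetical lookup table: label -> expected screen transition.
# Labels follow the list above; this is not the authors' code.
ACTION_TYPES = {
    "click": "Page navigation, popup opens, toggle switches, or focus change",
    "long_press": "Context menu appears or item selection mode triggers",
    "scroll": "New content appears, old content moves off-screen",
    "input_text": "Text appears in an input field (no explicit enter press)",
    "open_app": "Launcher/home screen transitions to a specific app interface",
    "navigate_home": "Returns to the device home screen/launcher",
    "navigate_back": "Returns to the previous screen (reverse navigation)",
    "wait": "No significant visual change, or a loading spinner keeps spinning",
    "none": "The transition is hallucinated, broken, illogical...",
}


def parse_action(label: str) -> str:
    """Normalize a predicted label and reject anything outside the list."""
    key = label.strip().lower()
    if key not in ACTION_TYPES:
        raise ValueError(f"unknown action label: {label!r}")
    return key


if __name__ == "__main__":
    action = parse_action(" Click ")
    print(action, "->", ACTION_TYPES[action])
```

Rejecting unknown labels with an exception, rather than mapping them to `none`, is a design choice of this sketch: `none` in the list describes a broken transition, not an unparseable label.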

    Goal progress first: does this action move the task toward completion at this specific step?

    Prediction reliability second: is the predicted next page trustworthy enough to support that progress judgment?

    Treat textual realism as evidence quality, not the objective. If realism is high but progress is weak, do not mark valid.

    Now provide your judgment on the selected action in JSON format. Your response must include:

    • Reason: Explain primarily...

  58. [58]

    Therefore, participant risks, risk disclosure, and IRB approval are not applicable

    Institutional review board (IRB) approvals or equivalent for research with human subjects

    Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...