pith. machine review for the scientific record.

arxiv: 2605.10347 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

How Mobile World Model Guides GUI Agents?

Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:23 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords mobile GUI agents · world models · GUI navigation · trajectory generation · self-reflection · vision-language models · mobile interfaces

The pith

World models improve mobile GUI agent performance as training supervision but show limited value in post-hoc self-reflection for overconfident agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how world models guide mobile GUI agents by predicting future states in different forms. The authors train world models that output delta text, full text, diffusion-generated images, and renderable code, and these models reach state-of-the-art performance on prediction benchmarks. They then test how the models help both in training agents and in test-time verification. The results show that world-model-generated trajectories provide useful training experience even without matching the real data distribution, yet self-reflection helps little when agents have low action entropy. Overall, world models work best as sources of prior perception and training supervision rather than as universal verifiers.

Core claim

By training world models in four modalities and evaluating on AITZ, AndroidControl, and AndroidWorld, we find that renderable code achieves high fidelity for multimodal supervision while text feedback is robust for out-of-distribution execution. World-model-generated trajectories transfer interaction experience to improve end-to-end task performance despite not preserving original distributions. For overconfident agents with low action entropy, posterior self-reflection offers limited gains, indicating world models are more effective as prior perception or training supervision than as post-hoc verifiers.

What carries the argument

Multimodal world models that generate future text, images, or code representations of GUI states to provide supervision and reflection signals for agents.

If this is right

  • Renderable code reconstruction supports effective data construction through high in-distribution fidelity.
  • Text-based world model feedback proves more robust for online out-of-distribution execution.
  • Generated trajectories supply transferable interaction experience that boosts agents' end-to-end performance.
  • Posterior self-reflection yields limited improvements for agents exhibiting low action entropy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar world model approaches could be adapted to desktop or web GUI agents beyond mobile platforms.
  • Integrating world models early in agent training pipelines may yield better results than using them only for verification.
  • Future work could explore combining multiple modalities to balance fidelity and robustness in agent guidance.

Load-bearing premise

The evaluations on the selected benchmarks and agents isolate the effects of world models without major influences from data filtering or specific benchmark designs.

What would settle it

A clear falsifier would be if agents trained without world model trajectories perform equally well on the downstream tasks, or if self-reflection significantly improves performance even for low-entropy agents.

Figures

Figures reproduced from arXiv: 2605.10347 by Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang.

Figure 1: Overview of empirical results across prediction formats, test-time guidance, and …
Figure 2: Comparison between text-based and image-based world models for GUI state prediction.
Figure 3: Comparison of different world-modeling paradigms across four generation settings.
Figure 4: Four data filtering methods during our data construction. Graph-level transition deduplication: to reduce repeated state transitions, transition triples with similar start states, similar next states, and the same action type are merged. Following the node-merging strategy in Mobile3M [13], two triples τn = (sn, an, sn+1) and τr = (sr, ar, sr+1) are treated as duplicates if D(sn, sr) > 0.95, D(sn+1, … A minimal sketch of this rule follows the figure list.
Figure 5: Overall SR on AndroidWorld under two agent frameworks.
Figure 6: Overall success rate on AndroidWorld with M3A agents. Bars report overall SR, and labels …
Figure 7: Entropy statistics and entropy-conditioned behavior.
Figure 8: Test-time scaling trends on AITZ (ID), AndroidControl (ID), and GUI-Odyssey (OOD).
Figure 9: AndroidControl performance during training with World-Model imagination.
Figure 10: Analysis of agent fine-tuned on World-Model trajectories.
Figure 11: Six-dimensional radar plots for offline task navigation. Each subplot corresponds to one …
Figure 12: Entropy-range analysis for GUI world-model feedback. Left: accuracy trends across …
Figure 13: AITZ downstream task case study with HTML-based world-model feedback.
Figure 14: AndroidControl downstream task case study with delta-text world-model feedback.
Figure 15: AndroidControl downstream task case study with text-based world-model feedback.
Figure 16: False-negative penalty caused by world model hallucination.
Figure 17: Failure of HTML-based world models in simulating dynamic text input.
Figure 18: High repetition rate of candidate actions in small-scale models.
Figure 19: Diffusion Image case studies on Mobile GUI state prediction, cases 1–2.
Figure 20: Diffusion Image case studies on Mobile GUI state prediction, cases 3–4.
Figure 21: Code2Image Case Study 1: Action is "Click on the search bar at the top of the screen to search for the arts" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 22: Code2Image Case Study 2: Action is "Click on the Moon tab at the bottom left corner of the screen to view the details".
Figure 23: Code2Image Case Study 3: Action is "Click on save story from the options" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 24: Code2Image Case Study 4: Action is "Swipe up to view the romeo and juliet file" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 25: Code2Image Case Study 5: Action is "Swipe up to view more details".
Figure 26: Code2Image Case Study 6: Input text is "Literature art".
Figure 27: Bad case studies of the 40-step denoising process used by diffusion-based world models.
Figure 28: Qualitative comparison between real Android screens and corresponding world-model …
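
To make the deduplication rule summarized in Figure 4 concrete, the sketch below implements a greedy version of it under stated assumptions: similarity() stands in for the paper's state-similarity function D (whose definition is not reproduced here), the 0.95 threshold and the same-action-type condition come from the caption, and the condition on next states is assumed to mirror the one on start states because the caption is truncated. This is an illustrative sketch, not the authors' pipeline.

```python
def deduplicate_transitions(triples, similarity, threshold=0.95):
    """Greedy graph-level transition deduplication.

    `triples` is a list of (state, action, next_state) tuples and
    `similarity(a, b)` returns a state-similarity score in [0, 1]
    (standing in for the paper's D). Two triples are treated as
    duplicates when their start states and next states are both
    similar above `threshold` and the action types match; only the
    first representative of each duplicate group is kept.
    """
    kept = []
    for s, a, s_next in triples:
        is_duplicate = any(
            a == a_kept
            and similarity(s, s_kept) > threshold
            and similarity(s_next, s_next_kept) > threshold
            for s_kept, a_kept, s_next_kept in kept
        )
        if not is_duplicate:
            kept.append((s, a, s_next))
    return kept
```
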
read the original abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper trains mobile world models in four modalities (delta text, full text, diffusion-based images, renderable code) after filtering and annotating data. It claims SoTA results on MobileWorldBench and Code2WorldBench, and reports three findings from downstream evaluations on AITZ, AndroidControl, and AndroidWorld: renderable code offers high in-distribution fidelity while text is more robust for OOD; generated trajectories supply transferable experience that improves agent performance (despite distribution shift); and posterior self-reflection yields limited gains for low-entropy overconfident agents, implying world models are better used as priors or training supervision than universal verifiers.

Significance. If the empirical claims are supported by detailed metrics, ablations, and controls, the work would provide useful guidance on modality selection for GUI world models and clarify when generated trajectories can safely augment agent training versus when they introduce confounding shifts. The three findings, if isolated from filtering artifacts, could influence practical choices between prior supervision and post-hoc verification in mobile agents.

major comments (3)
  1. [Abstract] Abstract: the assertion of SoTA performance on MobileWorldBench and Code2WorldBench supplies no quantitative metrics, baselines, training details, or error analysis, rendering the central empirical claim unverifiable from the provided text.
  2. [Abstract] Abstract (second finding): the claim that world-model-generated trajectories improve end-to-end performance on AITZ, AndroidControl, and AndroidWorld rests on evaluations that do not ablate the data filtering and annotation step while holding it fixed; without this control, gains cannot be attributed to the world models rather than curation, especially given the explicit admission that the generated data do not preserve the original distribution.
  3. [Abstract] Abstract (third finding): the statement that posterior self-reflection provides limited gains for overconfident low-entropy agents lacks any description of how action entropy is computed, how agent strengths are quantified, or the precise experimental setup isolating this effect from other variables.
minor comments (2)
  1. [Abstract] Abstract: 'delta text' is used without definition or illustrative example, leaving its distinction from full text unclear.
  2. [Abstract] Abstract: the three findings are presented without any supporting numbers, tables, or figure references, which reduces the reader's ability to assess their magnitude and robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the abstract requires additional quantitative support and methodological clarity to make the empirical claims verifiable. We will revise the abstract accordingly in the next version of the manuscript while preserving the core findings. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of SoTA performance on MobileWorldBench and Code2WorldBench supplies no quantitative metrics, baselines, training details, or error analysis, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract should be self-contained. The main paper (Sections 4.1 and 4.2) contains the full quantitative results, including accuracy metrics, baseline comparisons, training hyperparameters, and error analysis for both benchmarks. We will revise the abstract to report the key performance numbers and briefly note the training setup and error analysis, ensuring the SoTA claim is directly verifiable from the abstract text. revision: yes

  2. Referee: [Abstract] Abstract (second finding): the claim that world-model-generated trajectories improve end-to-end performance on AITZ, AndroidControl, and AndroidWorld rests on evaluations that do not ablate the data filtering and annotation step while holding it fixed; without this control, gains cannot be attributed to the world models rather than curation, especially given the explicit admission that the generated data do not preserve the original distribution.

    Authors: We acknowledge the need for a tighter control. The current experiments compare agents trained with versus without world-model trajectories under the same overall pipeline, and we explicitly note the distribution shift. To isolate the world-model contribution from curation effects, we will add a controlled ablation in the revised manuscript that holds the filtering and annotation protocol fixed while varying only the source of the trajectories (real filtered data vs. generated data). This will strengthen attribution of the observed gains. revision: yes

  3. Referee: [Abstract] Abstract (third finding): the statement that posterior self-reflection provides limited gains for overconfident low-entropy agents lacks any description of how action entropy is computed, how agent strengths are quantified, or the precise experimental setup isolating this effect from other variables.

    Authors: We agree that the abstract would benefit from these details. In the full paper, action entropy is the Shannon entropy of the policy's action distribution, agent strength is measured by baseline task success rate without guidance, and the setup compares low-entropy versus higher-entropy agents with and without self-reflection on identical task sets. We will add a concise description of the entropy computation, strength quantification, and experimental isolation to the revised abstract, with a pointer to the detailed protocol in Section 5. revision: yes
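
For concreteness, here is a minimal sketch of the entropy measure described in the response above, assuming the action distribution is estimated from repeated samples of the policy at a single state; the function name and the frequency-based estimate are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def action_entropy(sampled_actions):
    """Shannon entropy (in nats) of an empirical action distribution.

    `sampled_actions` is a list of discrete action labels drawn from
    the policy for the same state, e.g. ["click(12)", "click(12)",
    "scroll_down"]. A value near zero indicates an overconfident,
    low-entropy agent.
    """
    counts = Counter(sampled_actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: an agent that repeats one action 9 times out of 10
print(action_entropy(["click(12)"] * 9 + ["scroll_down"]))  # ≈ 0.33 nats
```
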

Circularity Check

0 steps flagged

No circularity: purely empirical data pipeline and benchmark evaluation

full rationale

The paper presents an empirical workflow of filtering/annotating mobile GUI data, training four modality-specific world models, and measuring downstream task performance on AITZ, AndroidControl, and AndroidWorld. No equations, derivations, or first-principles claims appear that reduce to author-defined quantities by construction. Performance numbers are obtained from standard training and evaluation loops; the three findings are direct observations from those runs rather than self-referential predictions. Self-citations, if present, are not load-bearing for any central result. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard supervised training of vision-language and diffusion models plus the assumption that the chosen benchmarks measure real-world utility; no new free parameters, axioms, or invented entities are introduced beyond those already standard in the field.

axioms (1)
  • standard math: Standard assumptions of supervised learning and diffusion-model training hold for the mobile GUI data.
    Invoked implicitly when claiming SoTA performance after training on filtered and annotated data.

pith-pipeline@v0.9.0 · 5575 in / 1259 out tokens · 40149 ms · 2026-05-12T04:23:32.673720+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025

  3. [3]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  5. [5]

    Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl

    Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, et al. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

  6. [6]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

  7. [7]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

  8. [8]

    Is your LLM secretly a world model of the internet? Model-based planning for web agents

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

  9. [9]

    Mobiledreamer: Generative sketch world model for gui agent

    Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent. arXiv preprint arXiv:2601.04035, 2026

  10. [10]

    Vimo: A generative visual gui world model for app agents

    Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025

  11. [11]

    Generative visual code mobile world models

    Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, and Jamin Shin. Generative visual code mobile world models.arXiv preprint arXiv:2602.01576, 2026

  12. [12]

    Code2world: A gui world model via renderable code generation

    Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026

  13. [13]

    Mobilevlm: A vision-language model for better intra-and inter-ui understanding

    Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Liujian Liujianfeng, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. Mobilevlm: A vision-language model for better intra- and inter-ui understanding. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10231–10251, 2024

  14. [14]

    Androidinthewild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  15. [15]

    Amex: Android multi-annotation expo dataset for mobile gui agents

    Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2138–2156, 2025

  16. [16]

    Android in the zoo: Chain-of-action-thought for gui agents

    Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12016–12031, 2024

  17. [17]

    On the effects of data scale on ui control agents

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

  18. [18]

    Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

  19. [19]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  20. [20]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  21. [21]

    Mobileworldbench: Towards semantic world modeling for mobile agents

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Mobileworldbench: Towards semantic world modeling for mobile agents.arXiv preprint arXiv:2512.14014, 2025

  22. [22]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  23. [23]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  24. [24]

    Mai-ui technical report: Real-world centric foundation gui agents

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

  25. [25]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

  26. [26]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  27. [27]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

  28. [28]

    Gorilla: Large language model connected with massive apis

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

  29. [29]

    Falcon-ui: Understanding gui before following user instructions

    Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

  30. [30]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  31. [31]

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  32. [32]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

  33. [33]

    Gui-r1: A generalist r1-style vision-language action model for gui agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  34. [34]

    Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

  35. [35]

    Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

  36. [36]

    Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

  37. [37]

    Webworld: A large-scale world model for web agent training

    Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. Webworld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026

  38. [38]

    Mobile-bench: An evaluation benchmark for llm-based mobile agents

    Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, et al. Mobile-bench: An evaluation benchmark for llm-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831, 2024

  39. [39]

    Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, and Bo An. Sman-bench: A cross-system benchmark for mobile agents under single- and multi-path, ambiguous, and noisy tasks. In The Fourteenth International Conference on Learning Representations, 2026

  40. [40]

    Cogagent: A visual language model for gui agents, 2023

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023

  41. [41]

    Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions

    Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, and Rui Yan. Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 883–893, 2025

  42. [42]

    Mobileipl: Enhancing mobile agents thinking process via iterative preference learning

    Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mobileipl: Enhancing mobile agents thinking process via iterative preference learning.arXiv preprint arXiv:2505.12299, 2025

  43. [43]

    Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning

    Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, et al. Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning.arXiv preprint arXiv:2602.24142, 2026

  44. [44]

    Llm-based agents for tool learning: A survey

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025

  45. [45]

    Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, and Bo An. Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents. arXiv preprint arXiv:2505.11891, 2025

  46. [46]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

  47. [47]

    Step: Success-rate-aware trajectory-efficient policy optimization

    Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. Step: Success-rate-aware trajectory-efficient policy optimization. arXiv preprint arXiv:2511.13091, 2025

  48. [48]

    Computer-using world model

    Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model.arXiv preprint arXiv:2602.17365, 2026

  49. [49]

    R^3: Replay, reflection, and ranking rewards for llm reinforcement learning

    Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, and Peng Han. R^3: Replay, reflection, and ranking rewards for llm reinforcement learning. arXiv preprint arXiv:2601.19620, 2026

  50. [50]

    Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy

    Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9828–9862, 2024

  51. [51]

    From word to world: Can large language models be implicit text-based world models?

    Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, and Heng Ji. From word to world: Can large language models be implicit text-based world models? arXiv preprint arXiv:2512.18832, 2025
