pith. machine review for the scientific record.

arxiv: 2605.10347 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

How Mobile World Model Guides GUI Agents?

Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:23 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords mobile GUI agents · world models · GUI navigation · trajectory generation · self-reflection · vision-language models · mobile interfaces

The pith

World models improve mobile GUI agent performance as training supervision but show limited value in post-hoc self-reflection for overconfident agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how world models guide mobile GUI agents by predicting future states in different forms. The authors train world models that output delta text, full text, diffusion-generated images, and renderable code, and these models reach state-of-the-art performance on prediction benchmarks. They then test how the models help both in training agents and in test-time verification. The results show that world-model-generated trajectories provide useful training experience even without matching the real data distribution, yet self-reflection helps little when agents have low action entropy. Overall, world models work best as sources of prior perception and training supervision rather than as universal verifiers.

Core claim

By training world models in four modalities and evaluating on AITZ, AndroidControl, and AndroidWorld, we find that renderable code achieves high fidelity for multimodal supervision while text feedback is robust for out-of-distribution execution. World-model-generated trajectories transfer interaction experience to improve end-to-end task performance despite not preserving original distributions. For overconfident agents with low action entropy, posterior self-reflection offers limited gains, indicating world models are more effective as prior perception or training supervision than as post-hoc verifiers.

What carries the argument

Multimodal world models that generate future text, images, or code representations of GUI states to provide supervision and reflection signals for agents.

If this is right

  • Renderable code reconstruction supports effective data construction through high in-distribution fidelity.
  • Text-based world model feedback proves more robust for online out-of-distribution execution.
  • Generated trajectories supply transferable interaction experience that boosts agents' end-to-end performance.
  • Posterior self-reflection yields limited improvements for agents exhibiting low action entropy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar world model approaches could be adapted to desktop or web GUI agents beyond mobile platforms.
  • Integrating world models early in agent training pipelines may yield better results than using them only for verification.
  • Future work could explore combining multiple modalities to balance fidelity and robustness in agent guidance.

Load-bearing premise

The evaluations on the selected benchmarks and agents isolate the effects of world models without major influences from data filtering or specific benchmark designs.

What would settle it

A clear falsifier would be if agents trained without world model trajectories perform equally well on the downstream tasks, or if self-reflection significantly improves performance even for low-entropy agents.

Figures

Figures reproduced from arXiv: 2605.10347 by Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang.

Figure 1: Overview of empirical results across prediction formats, test-time guidance, and …
Figure 2: Comparison between text-based and image-based world models for GUI state prediction.
Figure 3: Comparison of different world-modeling paradigms across four generation settings.
Figure 4: Four data filtering methods during our data construction. Graph-level transition deduplication: to reduce repeated state transitions, transition triples with similar start states, similar next states, and the same action type are merged. Following the node-merging strategy in Mobile3M [13], two triples τn = (sn, an, sn+1) and τr = (sr, ar, sr+1) are treated as duplicates if D(sn, sr) > 0.95, D(sn+1, … A minimal sketch of this rule follows the figure list.
Figure 5: Overall SR on AndroidWorld under two agent frameworks.
Figure 6: Overall success rate on AndroidWorld with M3A agents. Bars report overall SR, and labels …
Figure 7: Entropy statistics and entropy-conditioned behavior.
Figure 8: Test-time scaling trends on AITZ (ID), AndroidControl (ID), and GUI-Odyssey (OOD).
Figure 9: AndroidControl performance during training with World-Model imagination.
Figure 10: Analysis of agent fine-tuned on World-Model trajectories.
Figure 11: Six-dimensional radar plots for offline task navigation. Each subplot corresponds to one …
Figure 12: Entropy-range analysis for GUI world-model feedback. Left: accuracy trends across …
Figure 13: AITZ downstream task case study with HTML-based world-model feedback.
Figure 14: AndroidControl downstream task case study with delta-text world-model feedback.
Figure 15: AndroidControl downstream task case study with text-based world-model feedback.
Figure 16: False-negative penalty caused by world model hallucination.
Figure 17: Failure of HTML-based world models in simulating dynamic text input.
Figure 18: High repetition rate of candidate actions in small-scale models.
Figure 19: Diffusion Image case studies on Mobile GUI state prediction, cases 1–2.
Figure 20: Diffusion Image case studies on Mobile GUI state prediction, cases 3–4.
Figure 21: Code2Image Case Study 1: Action is "Click on the search bar at the top of the screen to search for the arts" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 22: Code2Image Case Study 2: Action is "Click on the Moon tab at the bottom left corner of the screen to view the details".
Figure 23: Code2Image Case Study 3: Action is "Click on save story from the options" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 24: Code2Image Case Study 4: Action is "Swipe up to view the romeo and juliet file" (panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5).
Figure 25: Code2Image Case Study 5: Action is "Swipe up to view more details".
Figure 26: Code2Image Case Study 6: Input text is "Literature art".
Figure 27: Bad case studies of the 40-step denoising process used by diffusion-based world models.
Figure 28: Qualitative comparison between real Android screens and corresponding world-model …
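
To make the deduplication rule summarized in Figure 4 concrete, the sketch below implements a greedy version of it under stated assumptions: similarity() stands in for the paper's state-similarity function D (whose definition is not reproduced here), the 0.95 threshold and the same-action-type condition come from the caption, and the condition on next states is assumed to mirror the one on start states because the caption is truncated. This is an illustrative sketch, not the authors' pipeline.

```python
def deduplicate_transitions(triples, similarity, threshold=0.95):
    """Greedy graph-level transition deduplication.

    `triples` is a list of (state, action, next_state) tuples and
    `similarity(a, b)` returns a state-similarity score in [0, 1]
    (standing in for the paper's D). Two triples are treated as
    duplicates when their start states and next states are both
    similar above `threshold` and the action types match; only the
    first representative of each duplicate group is kept.
    """
    kept = []
    for s, a, s_next in triples:
        is_duplicate = any(
            a == a_kept
            and similarity(s, s_kept) > threshold
            and similarity(s_next, s_next_kept) > threshold
            for s_kept, a_kept, s_next_kept in kept
        )
        if not is_duplicate:
            kept.append((s, a, s_next))
    return kept
```
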
read the original abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper trains mobile world models in four modalities (delta text, full text, diffusion-based images, renderable code) after filtering and annotating data. It claims SoTA results on MobileWorldBench and Code2WorldBench, and reports three findings from downstream evaluations on AITZ, AndroidControl, and AndroidWorld: renderable code offers high in-distribution fidelity while text is more robust for OOD; generated trajectories supply transferable experience that improves agent performance (despite distribution shift); and posterior self-reflection yields limited gains for low-entropy overconfident agents, implying world models are better used as priors or training supervision than universal verifiers.

Significance. If the empirical claims are supported by detailed metrics, ablations, and controls, the work would provide useful guidance on modality selection for GUI world models and clarify when generated trajectories can safely augment agent training versus when they introduce confounding shifts. The three findings, if isolated from filtering artifacts, could influence practical choices between prior supervision and post-hoc verification in mobile agents.

major comments (3)
  1. [Abstract] Abstract: the assertion of SoTA performance on MobileWorldBench and Code2WorldBench supplies no quantitative metrics, baselines, training details, or error analysis, rendering the central empirical claim unverifiable from the provided text.
  2. [Abstract] Abstract (second finding): the claim that world-model-generated trajectories improve end-to-end performance on AITZ, AndroidControl, and AndroidWorld rests on evaluations that do not ablate the data filtering and annotation step while holding it fixed; without this control, gains cannot be attributed to the world models rather than curation, especially given the explicit admission that the generated data do not preserve the original distribution.
  3. [Abstract] Abstract (third finding): the statement that posterior self-reflection provides limited gains for overconfident low-entropy agents lacks any description of how action entropy is computed, how agent strengths are quantified, or the precise experimental setup isolating this effect from other variables.
minor comments (2)
  1. [Abstract] Abstract: 'delta text' is used without definition or illustrative example, leaving its distinction from full text unclear.
  2. [Abstract] Abstract: the three findings are presented without any supporting numbers, tables, or figure references, which reduces the reader's ability to assess their magnitude and robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the abstract requires additional quantitative support and methodological clarity to make the empirical claims verifiable. We will revise the abstract accordingly in the next version of the manuscript while preserving the core findings. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of SoTA performance on MobileWorldBench and Code2WorldBench supplies no quantitative metrics, baselines, training details, or error analysis, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract should be self-contained. The main paper (Sections 4.1 and 4.2) contains the full quantitative results, including accuracy metrics, baseline comparisons, training hyperparameters, and error analysis for both benchmarks. We will revise the abstract to report the key performance numbers and briefly note the training setup and error analysis, ensuring the SoTA claim is directly verifiable from the abstract text. revision: yes

  2. Referee: [Abstract] Abstract (second finding): the claim that world-model-generated trajectories improve end-to-end performance on AITZ, AndroidControl, and AndroidWorld rests on evaluations that do not ablate the data filtering and annotation step while holding it fixed; without this control, gains cannot be attributed to the world models rather than curation, especially given the explicit admission that the generated data do not preserve the original distribution.

    Authors: We acknowledge the need for a tighter control. The current experiments compare agents trained with versus without world-model trajectories under the same overall pipeline, and we explicitly note the distribution shift. To isolate the world-model contribution from curation effects, we will add a controlled ablation in the revised manuscript that holds the filtering and annotation protocol fixed while varying only the source of the trajectories (real filtered data vs. generated data). This will strengthen attribution of the observed gains. revision: yes

  3. Referee: [Abstract] Abstract (third finding): the statement that posterior self-reflection provides limited gains for overconfident low-entropy agents lacks any description of how action entropy is computed, how agent strengths are quantified, or the precise experimental setup isolating this effect from other variables.

    Authors: We agree that the abstract would benefit from these details. In the full paper, action entropy is the Shannon entropy of the policy's action distribution, agent strength is measured by baseline task success rate without guidance, and the setup compares low-entropy versus higher-entropy agents with and without self-reflection on identical task sets. We will add a concise description of the entropy computation, strength quantification, and experimental isolation to the revised abstract, with a pointer to the detailed protocol in Section 5. revision: yes
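
For concreteness, here is a minimal sketch of the entropy measure described in the response above, assuming the action distribution is estimated from repeated samples of the policy at a single state; the function name and the frequency-based estimate are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def action_entropy(sampled_actions):
    """Shannon entropy (in nats) of an empirical action distribution.

    `sampled_actions` is a list of discrete action labels drawn from
    the policy for the same state, e.g. ["click(12)", "click(12)",
    "scroll_down"]. A value near zero indicates an overconfident,
    low-entropy agent.
    """
    counts = Counter(sampled_actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: an agent that repeats one action 9 times out of 10
print(action_entropy(["click(12)"] * 9 + ["scroll_down"]))  # ≈ 0.33 nats
```
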

Circularity Check

0 steps flagged

No circularity: purely empirical data pipeline and benchmark evaluation

full rationale

The paper presents an empirical workflow of filtering/annotating mobile GUI data, training four modality-specific world models, and measuring downstream task performance on AITZ, AndroidControl, and AndroidWorld. No equations, derivations, or first-principles claims appear that reduce to author-defined quantities by construction. Performance numbers are obtained from standard training and evaluation loops; the three findings are direct observations from those runs rather than self-referential predictions. Self-citations, if present, are not load-bearing for any central result. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard supervised training of vision-language and diffusion models plus the assumption that the chosen benchmarks measure real-world utility; no new free parameters, axioms, or invented entities are introduced beyond those already standard in the field.

axioms (1)
  • standard math: Standard assumptions of supervised learning and diffusion-model training hold for the mobile GUI data.
    Invoked implicitly when claiming SoTA performance after training on filtered and annotated data.

pith-pipeline@v0.9.0 · 5575 in / 1259 out tokens · 40149 ms · 2026-05-12T04:23:32.673720+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025

  3. [3]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  5. [5]

    Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl

    Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, et al. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

  6. [6]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

  7. [7]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

  8. [8]

    Is your LLM secretly a world model of the internet? Model-based planning for web agents

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

  9. [9]

    Mobiledreamer: Generative sketch world model for gui agent

    Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent. arXiv preprint arXiv:2601.04035, 2026

  10. [10]

    Vimo: A generative visual gui world model for app agents

    Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025

  11. [11]

    Generative visual code mobile world models

    Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, and Jamin Shin. Generative visual code mobile world models.arXiv preprint arXiv:2602.01576, 2026

  12. [12]

    Code2world: A gui world model via renderable code generation

    Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026

  13. [13]

    Mobilevlm: A vision-language model for better intra-and inter-ui understanding

    Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Liujian Liujianfeng, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. Mobilevlm: A vision-language model for better intra- and inter-ui understanding. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10231–10251, 2024

  14. [14]

    Androidinthewild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  15. [15]

    Amex: Android multi-annotation expo dataset for mobile gui agents

    Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2138–2156, 2025

  16. [16]

    Android in the zoo: Chain-of-action-thought for gui agents

    Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12016–12031, 2024

  17. [17]

    On the effects of data scale on ui control agents

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

  18. [18]

    Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

  19. [19]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  20. [20]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  21. [21]

    Mobileworldbench: Towards semantic world modeling for mobile agents

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Mobileworldbench: Towards semantic world modeling for mobile agents.arXiv preprint arXiv:2512.14014, 2025

  22. [22]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  23. [23]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  24. [24]

    Mai-ui technical report: Real-world centric foundation gui agents

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

  25. [25]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

  26. [26]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  27. [27]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

  28. [28]

    Gorilla: Large language model connected with massive apis

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

  29. [29]

    Falcon-ui: Understanding gui before following user instructions

    Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

  30. [30]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  31. [31]

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  32. [32]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

  33. [33]

    Gui-r1: A generalist r1-style vision-language action model for gui agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  34. [34]

    Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

  35. [35]

    Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

  36. [36]

    Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

  37. [37]

    Webworld: A large-scale world model for web agent training

    Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. Webworld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026

  38. [38]

    Mobile-bench: An evaluation benchmark for llm-based mobile agents

    Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, et al. Mobile-bench: An evaluation benchmark for llm-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831, 2024

  39. [39]

    Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, and Bo An. Sman-bench: A cross-system benchmark for mobile agents under single- and multi-path, ambiguous, and noisy tasks. In The Fourteenth International Conference on Learning Representations, 2026

  40. [40]

    Cogagent: A visual language model for gui agents, 2023

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023

  41. [41]

    Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions

    Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, and Rui Yan. Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 883–893, 2025

  42. [42]

    Mobileipl: Enhancing mobile agents thinking process via iterative preference learning

    Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mobileipl: Enhancing mobile agents thinking process via iterative preference learning.arXiv preprint arXiv:2505.12299, 2025

  43. [43]

    Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning

    Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, et al. Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning.arXiv preprint arXiv:2602.24142, 2026

  44. [44]

    Llm-based agents for tool learning: A survey

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025

  45. [45]

    Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents

    Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, and Bo An. Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents. arXiv preprint arXiv:2505.11891, 2025

  46. [46]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

  47. [47]

    Step: Success-rate-aware trajectory-efficient policy optimization

    Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. Step: Success-rate-aware trajectory-efficient policy optimization. arXiv preprint arXiv:2511.13091, 2025

  48. [48]

    Computer-using world model

    Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model.arXiv preprint arXiv:2602.17365, 2026

  49. [49]

    R^3: Replay, reflection, and ranking rewards for llm reinforcement learning

    Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, and Peng Han. R^3: Replay, reflection, and ranking rewards for llm reinforcement learning. arXiv preprint arXiv:2601.19620, 2026

  50. [50]

    Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy

    Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9828–9862, 2024

  51. [51]

    From word to world: Can large language models be implicit text-based world models?

    Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, and Heng Ji. From word to world: Can large language models be implicit text-based world models? arXiv preprint arXiv:2512.18832, 2025
