How Mobile World Model Guides GUI Agents?

Bo An; Heng Qu; Jian Luan; Jiaxing Li; Kun Huang; Pengzhi Gao; Weikai Xu; Wei Liu; Xiaolin Hu; Yuhan Chen

arxiv: 2605.10347 · v2 · pith:26HI5JTCnew · submitted 2026-05-11 · 💻 cs.AI · cs.CL


Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An


Pith reviewed 2026-05-25 05:57 UTC · model grok-4.3


keywords mobile world models · GUI agents · renderable code · multimodal supervision · out-of-distribution execution · trajectory generation · action prediction


The pith

Renderable code reconstruction in mobile world models achieves high in-distribution fidelity and provides effective multimodal supervision for GUI agent data construction, while text-based feedback is more robust for online out-of-distribution execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains mobile world models in four modalities (delta text, full text, diffusion images, and renderable code), then tests which ones best guide GUI agents on benchmarks and downstream tasks. It establishes that renderable code matches real data closely enough to create useful training examples, whereas text feedback handles novel situations during live execution. Generated trajectories from these models can be fed into agent training to raise end-to-end success rates even though the trajectories themselves diverge from the original data distribution. The work also shows that using a world model to double-check actions after the fact adds little value for agents that already act with high confidence.

Core claim

The authors filter and annotate mobile world-model data and train models across delta text, full text, diffusion-based images, and renderable code; these models reach state-of-the-art results on MobileWorldBench and Code2WorldBench. Downstream tests on AITZ, AndroidControl, and AndroidWorld show that renderable code reconstruction achieves high in-distribution fidelity and supplies effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution execution. World-model-generated trajectories supply transferable interaction experience that improves agents' end-to-end task performance, although the data do not preserve the original distribution.

What carries the argument

Mobile world models in four modalities (delta text, full text, diffusion images, renderable code) that predict future states to supply prior perception, training supervision, or post-hoc verification for GUI agents.

Load-bearing premise

The three downstream evaluation environments and the filtered training data are representative enough to support general claims about which modality works best for arbitrary mobile GUI agents and long-horizon tasks.

What would settle it

A controlled test on a new mobile environment outside the three evaluated ones in which text-based feedback loses its OOD advantage or generated trajectories cease to raise end-to-end agent performance.

Figures

Figures reproduced from arXiv: 2605.10347 by Bo An, Heng Qu, Jian Luan, Jiaxing Li, Kun Huang, Pengzhi Gao, Weikai Xu, Wei Liu, Xiaolin Hu, Yuhan Chen, Yunren Feng, Yuxuan Liu, Zhizheng Jiang.

**Figure 2.** Comparison between text-based and image-based world models for GUI state prediction.

**Figure 3.** Comparison of different world-modeling paradigms across four generation settings.

**Figure 4.** Four data filtering methods during our data construction.

Graph-level transition deduplication. To reduce repeated state transitions, we merge transition triples with similar start states, similar next states, and the same action type. Following the node-merging strategy in Mobile3M [13], for two triples $\tau_n = (s_n, a_n, s_{n+1})$ and $\tau_r = (s_r, a_r, s_{r+1})$, we treat them as duplicates if $D(s_n, s_r) > 0.95$, $D(s_{n+1}, \ldots$

**Figure 5.** Overall SR on AndroidWorld under two agent frameworks.

**Figure 6.** Overall success rate on AndroidWorld with M3A agents. Bars report overall SR, and labels …

**Figure 7.** Entropy statistics and entropy-conditioned behavior

**Figure 8.** Test-time scaling trends on AITZ (ID), AndroidControl (ID), and GUI-Odyssey (OOD).

**Figure 9.** AndroidControl performance during training with World-Model imagination

**Figure 10.** Analysis of agent fine-tuned on World-Model trajectories.

**Figure 11.** Six-dimensional radar plots for offline task navigation. Each subplot corresponds to one …

**Figure 12.** Entropy-range analysis for GUI world-model feedback. Left: accuracy trends across …

**Figure 13.** AITZ downstream task case study with HTML-based world-model feedback.

**Figure 14.** AndroidControl downstream task case study with delta-text world-model feedback.

**Figure 15.** AndroidControl downstream task case study with text-based world-model feedback.

**Figure 16.** False-negative penalty caused by world model hallucination.

**Figure 17.** Failure of HTML-based world models in simulating dynamic text input.

**Figure 18.** High repetition rate of candidate actions in small-scale models.

**Figure 19.** Diffusion Image case studies on Mobile GUI state prediction, cases 1–2.

**Figure 20.** Diffusion Image case studies on Mobile GUI state prediction, cases 3–4.

**Figure 21.** Code2Image Case Study 1: Action is Click on the search bar at the top of the screen to search for the arts. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.

**Figure 22.** Code2Image Case Study 2: Action is Click on the Moon tab at the bottom left corner of the screen to view the details.

… interaction logic, producing a systematic centering bias in the collected trajectories that degrades click accuracy after fine-tuning.

**Figure 23.** Code2Image Case Study 3: Action is Click on save story from the options. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.

**Figure 24.** Code2Image Case Study 4: Action is Swipe up to view the romeo and juliet file. Panels: Input Image, GroundTruth, Mobileworldmodel-8B, Code2World, GPT-5.5.

**Figure 25.** Code2Image Case Study 5: Action is Swipe up to view more details.

**Figure 26.** Code2Image Case Study 6: Input text is Literature art.

**Figure 27.** Bad Case studies of the 40-step denoising process used by diffusion-based world models

**Figure 28.** Qualitative comparison between real Android screens and corresponding world-model …
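The text attached to the Figure 4 caption describes graph-level transition deduplication: two triples count as duplicates when their start states are similar, their next states are similar, and the action type matches. A minimal Python sketch of that rule follows, under stated assumptions. The `Transition` class, the set-of-labels state encoding, and the `jaccard` similarity are hypothetical stand-ins for the paper's state representation and its function D, which this page does not specify. The page gives 0.95 as the start-state threshold but cuts off before the next-state condition, so `next_threshold` is a required argument rather than a guessed default.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Hypothetical state encoding: the set of UI element labels on a screen.
State = FrozenSet[str]


@dataclass(frozen=True)
class Transition:
    """One (state, action, next state) triple; field names are illustrative."""
    start: State
    action_type: str
    next_state: State


def jaccard(a: State, b: State) -> float:
    """Stand-in similarity in [0, 1]; NOT the paper's D, which is unspecified here."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(
    t1: Transition,
    t2: Transition,
    sim: Callable[[State, State], float],
    next_threshold: float,
    start_threshold: float = 0.95,  # value quoted in the Figure 4 text
) -> bool:
    """Same action type, similar start states, similar next states."""
    return (
        t1.action_type == t2.action_type
        and sim(t1.start, t2.start) > start_threshold
        and sim(t1.next_state, t2.next_state) > next_threshold
    )


def deduplicate(
    transitions: List[Transition],
    sim: Callable[[State, State], float],
    next_threshold: float,
    start_threshold: float = 0.95,
) -> List[Transition]:
    """Greedy pass: keep a triple only if it duplicates nothing kept so far."""
    kept: List[Transition] = []
    for t in transitions:
        if not any(
            is_duplicate(t, k, sim, next_threshold, start_threshold) for k in kept
        ):
            kept.append(t)
    return kept
```

The greedy pass is quadratic in the number of triples. That is fine for illustration, but a corpus-scale pipeline would likely need bucketing or approximate nearest-neighbour search first.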

Original abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
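The third finding turns on action entropy: an agent that puts nearly all probability on one candidate action has low entropy and, per the abstract, gains little from posterior self-reflection. The sketch below computes Shannon entropy over a candidate-action distribution and uses it to decide whether a post-hoc world-model check is worth invoking. The function names and the gating threshold are illustrative assumptions, not values or procedures taken from the paper.

```python
import math
from typing import Sequence


def action_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a candidate-action distribution.

    Inputs are renormalized, so unnormalized scores that are non-negative
    and not all zero are accepted.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # the 0 * log(0) term is taken as 0
            entropy -= q * math.log(q)
    return entropy


def should_verify(probs: Sequence[float], threshold: float) -> bool:
    """Hypothetical gate: call the world-model verifier only when the agent
    is uncertain, i.e. its action entropy exceeds `threshold`."""
    return action_entropy(probs) > threshold


# Illustrative inputs: a near-deterministic agent and a maximally unsure one.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_verify(confident, threshold=0.5))
print(should_verify(uncertain, threshold=0.5))
```

A gate like this is one way to read the paper's conclusion. When entropy is already low, the verifier rarely gets a chance to change the chosen action, so the world model is spent better upstream, as prior perception or as training supervision.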

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs a clean empirical comparison of four world-model modalities for mobile GUI agents and surfaces three practical findings, but the OOD claim for text feedback rests on an unmeasured distribution shift.


The main takeaway is that renderable code gives strong in-distribution reconstruction and useful supervision for data construction, text feedback holds up better for online execution on the three downstream environments, and trajectories generated by these models improve end-to-end agent performance even though they do not match the original data distribution. The third result is that self-reflection adds little once an agent is already overconfident and low-entropy. These are the concrete points the authors want readers to take away from the modality tests on AITZ, AndroidControl, and AndroidWorld after training on their filtered data.

Referee Report

2 major / 2 minor

Summary. The paper filters and annotates mobile GUI data to train world models in four modalities (delta text, full text, diffusion images, renderable code). These achieve SoTA on MobileWorldBench and Code2WorldBench. Downstream evaluations on AITZ, AndroidControl, and AndroidWorld yield three findings: renderable code provides high in-distribution fidelity and effective multimodal supervision while text feedback is more robust for online OOD execution; generated trajectories improve end-to-end agent performance without preserving the original distribution; and world models are more effective as prior perception or training supervision than as post-hoc verifiers, especially for overconfident low-entropy agents.

Significance. If the empirical modality comparisons hold after proper controls, the work offers concrete guidance on representation choices for mobile world models and demonstrates that generated trajectories can transfer interaction experience. The multi-modality training and SOTA benchmark results are strengths; the downstream utility findings could inform design of long-horizon GUI agents if the in-distribution vs. OOD distinction is rigorously established.

major comments (2)

[Downstream evaluation on AITZ, AndroidControl, and AndroidWorld] Downstream evaluation section: no explicit metric (action-sequence divergence, visual embedding distance, task-horizon statistics, or similar) is reported to quantify distribution shift between the filtered training trajectories and the three evaluation environments (AITZ, AndroidControl, AndroidWorld). This quantification is load-bearing for the central claim that text-based feedback is more robust specifically for online OOD execution versus renderable code for in-distribution fidelity.
[Findings on generated trajectories] Findings paragraph and associated tables/figures: the reported improvements from generated trajectories on end-to-end task performance are presented without controls for multiple comparisons or statistical significance testing across the modality variants and environments, weakening the second finding.
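The first major comment asks for an explicit measure of distribution shift. Two simple candidates are the Levenshtein edit distance between action-type sequences, as one form of action-sequence divergence, and the cosine distance between embedding vectors. The sketch below implements both with the standard library only. How a trajectory is reduced to a sequence of action types, and which encoder produces the embeddings, are assumptions left open here; the example inputs are made up.

```python
import math
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action-type sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete x
                    curr[j - 1] + 1,     # insert y
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Made-up trajectories: one substitution plus one insertion apart.
train_actions = ["click", "type", "click"]
eval_actions = ["click", "swipe", "click", "back"]
print(edit_distance(train_actions, eval_actions))
```

Averaging such pairwise distances between training trajectories and each evaluation environment would give one number per environment. Those numbers could then be compared with the ID and OOD labels that the paper currently assigns without measurement.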

minor comments (2)

[Methods] Notation for the four modalities (delta text, full text, diffusion-based images, renderable code) should be introduced with consistent abbreviations in the methods section for clarity in later comparisons.
[Benchmark results] The abstract states 'SoTA performance' on the two benchmarks; the main text should include the exact prior baselines and margins for each modality to allow direct verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important aspects of rigor in the downstream evaluation and statistical presentation. We address each below.


Referee: [Downstream evaluation on AITZ, AndroidControl, and AndroidWorld] Downstream evaluation section: no explicit metric (action-sequence divergence, visual embedding distance, task-horizon statistics, or similar) is reported to quantify distribution shift between the filtered training trajectories and the three evaluation environments (AITZ, AndroidControl, AndroidWorld). This quantification is load-bearing for the central claim that text-based feedback is more robust specifically for online OOD execution versus renderable code for in-distribution fidelity.

Authors: We agree that explicit quantification of distribution shift is necessary to rigorously support the ID versus OOD distinction in our modality findings. In the revision we will add metrics including action-sequence edit distance and cosine distances in a shared visual embedding space between the filtered training trajectories and each of the three evaluation environments. Revision: yes.

Referee: [Findings on generated trajectories] Findings paragraph and associated tables/figures: the reported improvements from generated trajectories on end-to-end task performance are presented without controls for multiple comparisons or statistical significance testing across the modality variants and environments, weakening the second finding.

Authors: We acknowledge the absence of formal statistical controls. In the revised manuscript we will include paired statistical tests (with multiple-comparison correction) across modality variants and environments and report the resulting p-values alongside the performance numbers. Revision: yes.
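The multiple-comparison correction promised in the second response can take several forms. The Holm-Bonferroni step-down procedure is one common choice, and it is used here purely as an illustration, since the rebuttal does not name a method. The function name and the p-values in the usage lines are made up.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and a running maximum keeps the adjusted values
    monotone in the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        scaled = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values, e.g. one paired test per modality variant.
raw = [0.01, 0.04, 0.03]
alpha = 0.05
adjusted = holm_adjust(raw)
significant = [p <= alpha for p in adjusted]
print(adjusted, significant)
```

Holm controls the family-wise error rate and is never less powerful than plain Bonferroni. A false-discovery-rate method would be the usual alternative if the number of modality-by-environment comparisons grows large.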

Circularity Check

0 steps flagged

No circularity: empirical evaluation chain is self-contained against external benchmarks


The paper filters and annotates data, trains four modality-specific world models, reports SoTA on MobileWorldBench and Code2WorldBench, then measures downstream effects on the independent environments AITZ, AndroidControl, and AndroidWorld. No equations, fitted parameters, or self-citations are shown to reduce any reported gain (in-distribution fidelity, OOD robustness, or end-to-end improvement) to a quantity defined by the paper's own inputs. The modality ranking and trajectory-transfer claims rest on observable benchmark outcomes rather than definitional or self-referential reductions, satisfying the default expectation of an empirical study with low circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard supervised training of vision-language and diffusion models plus the assumption that the chosen mobile datasets are representative.



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 15 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, et al. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

work page internal anchor Pith review arXiv 2026
[6]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

work page arXiv 2024
[8]

Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

work page arXiv 2024
[9]

Mobiledreamer: Generative sketch world model for gui agent

Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent. arXiv preprint arXiv:2601.04035, 2026

work page arXiv 2026
[10]

Vimo: A generative visual gui world model for app agents

Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025

work page arXiv 2025
[11]

Generative visual code mobile world models.arXiv preprint arXiv:2602.01576, 2026

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, and Jamin Shin. Generative visual code mobile world models.arXiv preprint arXiv:2602.01576, 2026

work page internal anchor Pith review arXiv 2026
[12]

Code2world: A gui world model via renderable code generation

Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026

work page arXiv 2026
[13]

Mobilevlm: A vision-language model for better intra-and inter-ui understanding

Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Liujian Liujianfeng, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. Mobilevlm: A vision-language model for better intra-and inter-ui understanding. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 10231–10251, 2024

work page 2024
[14]

An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

work page 2023
[15]

Amex: Android multi-annotation expo dataset for mobile gui agents

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2138–2156, 2025. 13

work page 2025
[16]

Android in the zoo: Chain-of-action-thought for gui agents

Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 12016–12031, 2024

work page 2024
[17]

On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

work page 2024
[18]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

work page 2025
[19]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Mobileworldbench: Towards semantic world modeling for mobile agents.arXiv preprint arXiv:2512.14014, 2025

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Mobileworldbench: Towards semantic world modeling for mobile agents.arXiv preprint arXiv:2512.14014, 2025

work page arXiv 2025
[22]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

work page arXiv 2025
[25]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

work page arXiv 2026
[27]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

work page 2024
[28]

Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37: 126544–126565, 2024

work page 2024
[29]

Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

work page arXiv 2024
[30]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025. 14

work page 2025
[31]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

work page arXiv 2025
[33]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

work page arXiv 2025
[35]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

work page 2026
[37]

Webworld: A large-scale world model for web agent training

Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. Webworld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026

work page arXiv 2026
[38]

Mobile-bench: An evaluation benchmark for llm-based mobile agents

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, et al. Mobile-bench: An evaluation benchmark for llm-based mobile agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831, 2024

work page 2024
[39]

Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks

Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, and Bo An. Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[40]

Cogagent: A visual language model for gui agents, 2023

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023

work page 2023
[41] Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, and Rui Yan. Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 883–893, 2025.
[42] Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mobileipl: Enhancing mobile agents thinking process via iterative preference learning. arXiv preprint arXiv:2505.12299, 2025.
[43] Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, et al. Come: Empowering channel-of-mobile-experts with informative hybrid-capabilities reasoning. arXiv preprint arXiv:2602.24142, 2026.
[44] Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025.
[45] Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, and Bo An. Mobile-bench-v2: A more realistic and comprehensive benchmark for vlm-based mobile agents. arXiv preprint arXiv:2505.11891, 2025.
[46] Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025.
[47] Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. Step: Success-rate-aware trajectory-efficient policy optimization. arXiv preprint arXiv:2511.13091, 2025.
[48] Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. arXiv preprint arXiv:2602.17365, 2026.
[49] Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, and Peng Han. R^3: Replay, reflection, and ranking rewards for llm reinforcement learning. arXiv preprint arXiv:2601.19620, 2026.
[50] Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. Determlr: Augmenting llm-based logical reasoning from indeterminacy to determinacy. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9828–9862, 2024.
[51] Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, and Heng Ji. From word to world: Can large language models be implicit text-based world models? arXiv preprint arXiv:2512.18832, 2025.

A Full Related Work

A.1 Mobile GUI Agents

The evolution of mobile GUI agents marks a paradigm shift from rule-...
1. click: A tap on a button, icon, or link. Result: Page navigation, popup opens, toggle switches, or focus change.
2. long_press: A sustained touch. Result: Context menu appears or item selection mode triggers.
3. scroll: The content shifts vertically or horizontally. (New content appears, old content moves off-screen).
4. input_text: Text appears in an input field (without an explicit enter press).
5. open_app: The screen transitions from a launcher/home screen to a specific app interface.
6. navigate_home: Returns to the device home screen/launcher.
7. navigate_back: Returns to the previous screen (reverse navigation).
8. wait: No significant visual change, or a loading spinner continues spinning.
9. none: The transition is hallucinated, broken, illogical...
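The nine action labels listed above can be treated as a closed vocabulary when post-processing annotations. The following is a minimal sketch, not the authors' code: the names `TransitionAction` and `parse_inferred_action` are hypothetical, and mapping unrecognized labels to `none` is an assumption made for illustration.

```python
from enum import Enum


class TransitionAction(Enum):
    """The nine action labels from the list above (class name is hypothetical)."""

    CLICK = "click"
    LONG_PRESS = "long_press"
    SCROLL = "scroll"
    INPUT_TEXT = "input_text"
    OPEN_APP = "open_app"
    NAVIGATE_HOME = "navigate_home"
    NAVIGATE_BACK = "navigate_back"
    WAIT = "wait"
    NONE = "none"


def parse_inferred_action(label: str) -> TransitionAction:
    """Map a raw label string to a TransitionAction.

    Assumption: any label outside the nine listed values is treated as
    `none`; the source does not specify how unknown labels are handled.
    """
    normalized = label.strip().lower()
    try:
        return TransitionAction(normalized)
    except ValueError:
        return TransitionAction.NONE
```

Using an enum keeps downstream filtering code from silently accepting misspelled labels; a stricter variant could raise on unknown labels instead of falling back.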
Goal progress first: does this action move the task toward completion at this specific step?

Prediction reliability second: is the predicted next page trustworthy enough to support that progress judgment?

Treat textual realism as evidence quality, not the objective. If realism is high but progress is weak, do not mark valid. Now provide your judgment on the selected action in JSON format. Your response must include:
• Reason: Explain primarily...
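The priority ordering in the prompt above (goal progress first, prediction reliability second, realism only as supporting evidence) can be illustrated with a small decision function. This is a sketch under stated assumptions, not the paper's judge: the judge in the source is a prompted model, and the function name `is_valid`, its boolean inputs, and the `textual_realism` score are all hypothetical.

```python
def is_valid(advances_goal: bool, prediction_trustworthy: bool,
             textual_realism: float) -> bool:
    """Apply the two checks in the order the prompt gives them.

    textual_realism is a hypothetical score in [0, 1]. It is accepted here
    only to show that it never overrides the two checks: a realistic-looking
    prediction cannot rescue an action that makes weak progress.
    """
    if not 0.0 <= textual_realism <= 1.0:
        raise ValueError("textual_realism must lie in [0, 1]")
    # 1) Goal progress first.
    if not advances_goal:
        return False
    # 2) Prediction reliability second.
    if not prediction_trustworthy:
        return False
    return True
```

For example, `is_valid(False, True, 1.0)` returns `False`, which mirrors the rule that high realism with weak progress must not be marked valid.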
Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

Therefore, participant risks, risk disclosure, and IRB approval are not applicable.
