
arxiv: 2606.31410 · v1 · pith:767OSZZBnew · submitted 2026-06-30 · 💻 cs.AI

Xiaomi-GUI-0 Technical Report

Pith reviewed 2026-07-01 06:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agent · mobile agent · real-device training · multimodal model · reinforcement learning · error-driven flywheel · AndroidWorld · RealMobile

The pith

A real-device-dominant hybrid infrastructure lets a multimodal GUI agent reach 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld, with reported gains in execution stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Xiaomi-GUI-0 as a native multimodal GUI agent trained and evaluated inside a closed loop that uses physical phones as the primary environment. Standard benchmarks and simulations fail to capture the shifting states created by accounts, permission dialogs, payments, and risk controls, leaving a persistent gap between reported scores and actual usability. The authors address this by building multi-source data that covers head tasks, long-tail intents, and reflection and memory examples, then feeding failures back through an error-driven flywheel that produces corrected actions, reflective explanations, and recovery demonstrations. Training proceeds in three stages, from supervised fine-tuning through step-level reinforcement learning to agentic reinforcement learning, and yields 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld, along with reported (but, in the abstract, unquantified) gains in execution stability and abnormal-state recognition.

Core claim

Xiaomi-GUI-0 is a multimodal GUI agent whose defining feature is a real-device-dominant hybrid infrastructure that keeps physical phones as the primary execution environment while using sandboxes only for auxiliary support. This infrastructure ensures that data collection, rollout, and evaluation share a state distribution close to real deployment. The model is trained on multi-source trajectories augmented by an error-driven data flywheel that converts failure traces into corrected actions, reflective explanations, and recovery demonstrations, then refined through a progressive pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning.

What carries the argument

A real-device-dominant hybrid infrastructure that makes physical phones the primary execution environment and keeps sandboxes in auxiliary support, so that the training and evaluation distributions stay close to real deployment.

If this is right

  • Higher execution stability when the agent encounters permission dialogs, payment flows, and risk controls in live applications.
  • Continuous improvement loop in which real failure trajectories are automatically turned into reflective training data without additional human labeling.
  • Better recognition and recovery from abnormal states through the combination of reflection data and agentic reinforcement learning.
  • Gains observed on both public benchmarks and the in-house RealMobile set indicate that aligning execution distribution reduces the benchmark-to-reality gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid setup could be adapted to other platforms such as tablets or desktop environments if physical devices remain the primary source of state variation.
  • Reducing reliance on purely simulated environments may lower the cost of developing future GUI agents while increasing their robustness to live variability.
  • Incorporating user-specific account states during the data flywheel stage could further narrow the gap between training and personalized deployment.
  • The three-stage training progression may generalize to other agentic tasks where reflection and recovery are critical.

Load-bearing premise

The state distribution produced by physical phones as the main execution environment is close enough to live deployment that benchmark improvements will carry over to production use with varying accounts, permissions, and risk controls.

What would settle it

A substantial drop in task success rate when the agent is run on production devices that include diverse user accounts, active permission dialogs, and risk-control screens absent from the training distribution.

Original abstract

Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments. It describes a real-device-dominant hybrid infrastructure (physical phones primary, sandboxes auxiliary) for data collection, training, rollout, and evaluation; multi-source training data spanning head tasks, long-tail intents, and capability-enhancement data; an error-driven data flywheel that converts failure trajectories into corrected actions and reflective explanations; and a progressive three-stage training pipeline (supervised fine-tuning, step-level reinforcement learning, agentic reinforcement learning). The central empirical claim is that the resulting model achieves 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld while substantially improving execution stability and abnormal-state recognition.

Significance. If the performance claims hold under rigorous controls, the work would be significant for GUI agent research by demonstrating a closed-loop system whose execution distribution is intended to match real deployment more closely than offline or simulated benchmarks, potentially narrowing the persistent gap between benchmark scores and practical usability.

major comments (3)
  1. [Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.
  2. [Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.
  3. [Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.
minor comments (2)
  1. [Abstract] The term 'RealMobile' is introduced without an explicit definition or pointer to its construction details.
  2. [Abstract] The phrase 'substantially improving' is used without accompanying numbers or comparison tables.
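
Major comment 1 asks for error bars and episode counts. As a rough illustration of why the episode count matters, the sketch below computes a Wilson score interval for a 72.0% success rate at several evaluation sizes. The episode counts are hypothetical; the report states only the rates, and the function name `wilson_interval` is introduced here for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical episode counts: the report gives the success rate, not the sample size.
for n in (100, 500, 2000):
    low, high = wilson_interval(round(0.72 * n), n)
    print(f"n={n}: 72.0% success, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of episodes grows, which is why a bare percentage is hard to compare against baselines without the episode count.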

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where the manuscript will be revised to strengthen the evidence presented.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.

    Authors: We agree that the evaluation section requires additional supporting details. In the revised manuscript we will add comparisons against published baselines on AndroidWorld, report the number of evaluation episodes and task categories for both benchmarks, include error bars from repeated runs where available, describe exclusion criteria for RealMobile, and report statistical significance where the data permit. For the proprietary RealMobile benchmark we will provide summarized rather than exhaustive episode-level statistics. revision: partial

  2. Referee: [Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.

    Authors: The hybrid infrastructure is central to the work. We will revise the infrastructure section to include quantitative state-distribution statistics (e.g., frequency of permission dialogs and abnormal states) comparing the real-device-dominant setup against sandbox-only runs. While full KL divergence on high-dimensional GUI states is impractical, we will add an ablation that isolates the contribution of real-device data to the final performance where feasible. revision: yes

  3. Referee: [Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.

    Authors: We agree that quantitative metrics are needed. In the revision we will define and report concrete metrics for execution stability (e.g., recovery success rate across account-state variations) and abnormal-state recognition (e.g., precision of error detection), together with the exact measurement protocols used during evaluation. revision: yes
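
The second response proposes comparing the frequency of permission dialogs and abnormal states between the real-device-dominant setup and sandbox-only runs. One way such a comparison could be summarized, assuming each observed screen is given a coarse state label, is a Jensen-Shannon divergence over the label frequencies. This is a sketch under that assumption, not the authors' protocol; the label set and all counts below are hypothetical.

```python
import math

def to_distribution(counts: dict[str, int], categories: list[str]) -> list[float]:
    """Turn raw per-category counts into probabilities over a shared category list."""
    total = sum(counts.get(c, 0) for c in categories)
    if total == 0:
        raise ValueError("no observations")
    return [counts.get(c, 0) / total for c in categories]

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a: list[float], b: list[float]) -> float:
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 because b is the mixture.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical screen counts per state label; these are NOT figures from the report.
real_device = {"NORMAL": 700, "PERMISSION_REQUEST": 120, "LOGIN_REQUIRED": 90,
               "PAYMENT_AUTHENTICATION": 60, "CAPTCHA_VERIFICATION": 30}
sandbox_only = {"NORMAL": 950, "PERMISSION_REQUEST": 40, "LOGIN_REQUIRED": 10}

categories = sorted(set(real_device) | set(sandbox_only))
p = to_distribution(real_device, categories)
q = to_distribution(sandbox_only, categories)
print(f"JS divergence between setups: {js_divergence(p, q):.3f} bits")
```

Working on a small label set sidesteps the difficulty the authors raise about divergences on high-dimensional GUI states, at the cost of ignoring any variation that the labels do not capture.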

Circularity Check

0 steps flagged

No circularity; purely empirical system description with no derivations or fitted predictions

Full rationale

The paper is a technical report describing an empirical GUI agent system, hybrid infrastructure, multi-source data collection, error-driven flywheel, and three-stage training pipeline, followed by benchmark results (72.0% RealMobile, 78.9% AndroidWorld). No equations, first-principles derivations, parameter fitting, or predictions appear. No self-citations are load-bearing for any claimed result. The central claims rest on direct evaluation rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on a trained system; the abstract introduces no mathematical derivations, free parameters, axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5935 in / 1246 out tokens · 33975 ms · 2026-07-01T06:05:33.913188+00:00 · methodology



Agent prompt fragments

Fragments of what appears to be the agent's prompt, recovered from the page's reference extraction. Text that was truncated in the extraction is left truncated.

Inputs (earlier items not captured)
  • Current device type & foreground app
  • Current screenshot

# Available Tools
You MUST pick exactly one tool per step. Output the corresponding JSON string inside `<tool_call>`.
  • Tap: `{"name": "Tap", "position": [x, y], "times": 1}` (Tap at coordinate)
  • LongPress: `{"name": "LongPress", "position": [x, y]}` (Trigger contextual menus)
  • Swipe: `{"name": "Swipe", "start_position": [x1, y1], "end_position": [x2, y2]}` (Swipe to scroll/move. Swipe up to scroll down)
  • Type: `{"name": "Type", "position": [x, y], "text": "..."}` (Tap input box and type)
  • Search: `{"name": "Search", "position": [x, y], "text": "..."}` (Macro: tap -> clear -> type -> submit)
  • Open: `{"name": "Open", "app": "..."}` (Launch app via system)
  • Back: `{"name": "Back"}` (System-level back)
  • Home: `{"name": "Home"}` (Go to home screen)
  • Wait: `{"name": "Wait"}` (Wait for page loading/rendering)
  • Request: `{"name": "Request", "text": "..."}` (Ask user for clarification/confirmation)
  • Fail: `{"name": "Fail", "type": "...", "reason": "..."}` (Report failure. `<TYPE>` MUST be one of: LOGIN_REQUIRED, USE_GUIDANCE, CAPTCHA_VERIFICATION, RESULT_NOT_FOUND, BLUETOOTH_CONNECTION_REQUIRED, NETWORK_ERROR, PAYMENT_AUTHENTICATION, TASK_CANT_FULFILLED, REPEAT_OPERATION, PERMISSION_REQUEST, PASSWORD_REQUIRED, TAKEOVER_EXIT, TEMPORARY_TAKEOVER, MANUA...
  • Complete: `{"name": "Complete"}` (Confirm goal reached for non-Q&A tasks)
  • Speak: `{"name": "Speak", "text": "..."}` (Present final answer for Q&A tasks)

# Operational Constraints
  • Coordinate system: every `position` is a relative [x, y] in [0, 1] with 3-decimal precision. Top-left is (0, 0); bottom-right is (1, 1).
  • Dismiss unrelated pop-ups (ads, upgrade prompts, rating requests) by tapping their Close / Skip / X / "Later" button rather than calling Fail.
  • Loop breaker: if three consecutive steps cause no visible change, or the same action is repeating in a loop, self-correct (try Back or a different target). If self-correction fails, call Fail.

# Reasoning Framework (inside <think>)
Before emitting the action, reason inside `<think>...</think>` (omit steps if no new info):
  • [Observation]: Objectively describe the current App, page state, and key visible elements
  • [Reflection]: (Optional) Include ONLY if the current screen deviates from the previous plan's expectation. Explain what was expected vs. what is actually seen.
  • [Plan] / [Plan Update] / [Replan]: (Choose one). Output a 2-4 step path in a single line separated by `|`. Mark completed steps with `[done]` and the current step with `->`. Use [Replan] if the previous plan failed.
  • [Decision]: Deduce the exact action based on the Observation and the current `->` step in the Plan.
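
To make the tool format and the coordinate rule concrete, here is a minimal sketch of how an executor might read one step of model output. It assumes the JSON is wrapped in a matching `</tool_call>` closing tag and that "no visible change" can be approximated by comparing screenshot hashes; neither detail is stated in the prompt fragments. The helper names `parse_action`, `to_pixels`, and `stalled` are hypothetical.

```python
import json
import re

# Assumption: the action JSON sits between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Tool names as listed in the prompt fragments.
KNOWN_TOOLS = {"Tap", "LongPress", "Swipe", "Type", "Search", "Open", "Back",
               "Home", "Wait", "Request", "Fail", "Complete", "Speak"}
POSITION_KEYS = ("position", "start_position", "end_position")

def parse_action(model_output: str) -> dict:
    """Extract and validate the single tool call from one step of model output."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        raise ValueError("no <tool_call> block found")
    action = json.loads(match.group(1))
    if action.get("name") not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {action.get('name')!r}")
    for key in POSITION_KEYS:
        if key in action:
            x, y = action[key]
            if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
                raise ValueError(f"{key} must be relative coordinates in [0, 1]")
    return action

def to_pixels(position: list[float], width: int, height: int) -> tuple[int, int]:
    """Map a relative [x, y] (top-left origin) to pixel coordinates on the screen."""
    x, y = position
    # Clamp so that 1.0 maps to the last pixel rather than one past the edge.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)

def stalled(screen_hashes: list[str], steps: int = 3) -> bool:
    """True if the last `steps` actions left the screen hash unchanged."""
    recent = screen_hashes[-(steps + 1):]
    return len(recent) == steps + 1 and len(set(recent)) == 1

output = '<think>...</think>\n<tool_call>{"name": "Tap", "position": [0.5, 0.25], "times": 1}</tool_call>'
action = parse_action(output)
print(action["name"], to_pixels(action["position"], 1080, 2400))
```

Keeping positions relative until the last moment means the same action log can be replayed on devices with different resolutions; only `to_pixels` needs the screen size.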
