Xiaomi-GUI-0 Technical Report

Anan Du; Changqiao Wu; Cheng Tan; Chengzhen Duan; Cong Zou; Fazhan Liu; Haoyuan Sun; Heng Qu; Hui Liu; Jiahui Yang

arxiv: 2606.31410 · v1 · pith:767OSZZBnew · submitted 2026-06-30 · 💻 cs.AI

Wanxia Cao, Chengzhen Duan, Pei Fu, Pengzhi Gao, Niu Lian, Fazhan Liu, Hui Liu, Heng Qu, Qinzhuo Wu, Zhehao Yu, Tongbo Chen, Shiqi Cui, Anan Du, Shukai Jia, Yuanfa Li, Yike Liu, Wenchao Lu, Haoyuan Sun, Jiatong Sun, Cheng Tan, Yajie Wang, Changqiao Wu, Tao Xiong, Jiahui Yang, Yuxuan Yuan, Ruoceng Zhang, Shaojie Zhang, Jian Zhu, Jian Luan, Cong Zou

Pith reviewed 2026-07-01 06:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords: GUI agent, mobile agent, real-device training, multimodal model, reinforcement learning, error-driven flywheel, AndroidWorld, RealMobile

The pith

A real-device-dominant hybrid infrastructure lets a multimodal GUI agent reach 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld, with reported gains in execution stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Xiaomi-GUI-0 as a native multimodal GUI agent trained and evaluated inside a closed loop that uses physical phones as the primary environment. Standard benchmarks and simulations fail to capture the shifting states created by accounts, permission dialogs, payments, and risk controls, leaving a persistent gap between reported scores and actual usability. The authors address this by building multi-source data that includes head tasks, long-tail intents, and reflection examples, then feeding failures back through an error-driven flywheel that produces corrected actions and recovery demonstrations. Training proceeds in three stages from supervised fine-tuning through step-level and agentic reinforcement learning, producing 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld along with measurable gains in abnormal-state handling.

Core claim

Xiaomi-GUI-0 is a multimodal GUI agent whose defining feature is a real-device-dominant hybrid infrastructure that keeps physical phones as the primary execution environment while using sandboxes only for auxiliary support. The infrastructure is designed so that data collection, training, rollout, and evaluation share a state distribution close to real deployment. The model is trained on multi-source trajectories augmented by an error-driven data flywheel that converts failure traces into corrected actions, reflective explanations, and recovery demonstrations. It is then refined through a progressive pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning.
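To make the flywheel idea concrete, here is a minimal Python sketch of how a diagnosed failure trace could be expanded into the three kinds of training examples named above. It is an illustration only, not the authors' implementation: the `Step`, `Trajectory`, and `Example` types, the `mine_failures` function, the `correct` callback, and the `navigate_back` recovery string are all hypothetical, and the report does not say how corrections or reflections are actually produced.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    screen: str  # stand-in for a screenshot or UI dump
    action: str  # action the agent emitted on that screen


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    success: bool
    first_error: Optional[int] = None  # index of the first wrong step, if diagnosed


@dataclass
class Example:
    kind: str  # "corrected_action", "reflection", or "recovery"
    task: str
    screen: str
    target: str


def mine_failures(
    trajectories: List[Trajectory],
    correct: Callable[[str, Step], str],  # hypothetical source of corrected actions
) -> List[Example]:
    """Turn diagnosed failures into three kinds of training examples."""
    examples: List[Example] = []
    for traj in trajectories:
        if traj.success or traj.first_error is None:
            continue  # only diagnosed failures feed the flywheel
        bad = traj.steps[traj.first_error]
        fixed = correct(traj.task, bad)
        examples.append(Example("corrected_action", traj.task, bad.screen, fixed))
        examples.append(Example(
            "reflection", traj.task, bad.screen,
            f"Action '{bad.action}' did not advance the task; use '{fixed}' instead.",
        ))
        # Recovery: the screen reached after the mistake, paired with a way back.
        # The "navigate_back" target is a placeholder, not an action from the report.
        after = traj.first_error + 1
        if after < len(traj.steps):
            examples.append(Example(
                "recovery", traj.task, traj.steps[after].screen,
                f"navigate_back, then {fixed}",
            ))
    return examples
```

In this sketch the corrected action supervises the screen where the mistake happened, while the recovery example supervises the screen the mistake led to, so the model sees both how to avoid the error and how to get out of it.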

What carries the argument

A real-device-dominant hybrid infrastructure that makes physical phones the primary execution environment and keeps sandboxes in auxiliary support, so that training and evaluation distributions stay close to real deployment.

If this is right

Higher execution stability when the agent encounters permission dialogs, payment flows, and risk controls in live applications.
Continuous improvement loop in which real failure trajectories are automatically turned into reflective training data without additional human labeling.
Better recognition and recovery from abnormal states through the combination of reflection data and agentic reinforcement learning.
Gains observed on both public benchmarks and the in-house RealMobile set indicate that aligning execution distribution reduces the benchmark-to-reality gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid setup could be adapted to other platforms such as tablets or desktop environments if physical devices remain the primary source of state variation.
Reducing reliance on purely simulated environments may lower the cost of developing future GUI agents while increasing their robustness to live variability.
Incorporating user-specific account states during the data flywheel stage could further narrow the gap between training and personalized deployment.
The three-stage training progression may generalize to other agentic tasks where reflection and recovery are critical.

Load-bearing premise

The state distribution produced by physical phones as the main execution environment is close enough to live deployment that benchmark improvements will carry over to production use with varying accounts, permissions, and risk controls.

What would settle it

A substantial drop in task success rate when the agent is run on production devices that include diverse user accounts, active permission dialogs, and risk-control screens absent from the training distribution.

Original abstract

Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Xiaomi-GUI-0 is a real-device closed-loop GUI agent report that gives 72% and 78.9% success numbers on its benchmarks but supplies no baselines or distribution checks to support the stability claims.

This is a technical report on Xiaomi-GUI-0, a multimodal GUI agent trained and run primarily on physical mobile devices inside a closed loop. Physical phones handle the bulk of execution while sandboxes fill in as needed. The system uses an error-driven flywheel to convert failed trajectories into corrected actions, reflective explanations, and recovery examples, then trains in three stages: supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. The authors report 72.0% success on their in-house RealMobile benchmark and 78.9% on AndroidWorld, along with gains in execution stability and abnormal-state handling.

What stands out is the concrete engineering focus on keeping the training distribution close to real deployment by making real devices the primary environment. The multi-source data mix for head tasks, long-tail intents, and reflection capabilities, plus the flywheel mechanism, gives a practical recipe that others in the mobile GUI agent area could adapt.

The soft spots sit in the evidence for the central assumption. The report states the success rates without baselines, error bars, dataset sizes, or any quantitative comparison showing that the hybrid setup actually produces state frequencies for permission dialogs, account states, or risk controls that match production. No ablation isolates the real-device component, and no distribution statistics are given. The central concern therefore stands: the claim that benchmark gains will translate to real use rests on an unverified match between the training execution distribution and production.

This paper is for researchers already working on GUI agents who want details on a real-device training loop and pipeline. A reader in that subfield can extract the infrastructure and data-generation ideas even while treating the performance numbers as preliminary. The work shows clear thinking about the benchmark-to-deployment gap and honest engagement with the practical constraints, so it is coherent on its own terms.

I would send it to peer review. The system description and public-benchmark numbers are concrete enough to merit referee time, though the experimental controls and validation of the distribution assumption would need close attention.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments. It describes a real-device-dominant hybrid infrastructure (physical phones primary, sandboxes auxiliary) for data collection, training, rollout, and evaluation; multi-source training data spanning head tasks, long-tail intents, and capability-enhancement data; an error-driven data flywheel that converts failure trajectories into corrected actions and reflective explanations; and a progressive three-stage training pipeline (supervised fine-tuning, step-level reinforcement learning, agentic reinforcement learning). The central empirical claim is that the resulting model achieves 72.0% success on the in-house RealMobile benchmark and 78.9% on AndroidWorld while substantially improving execution stability and abnormal-state recognition.

Significance. If the performance claims hold under rigorous controls, the work would be significant for GUI agent research by demonstrating a closed-loop system whose execution distribution is intended to match real deployment more closely than offline or simulated benchmarks, potentially narrowing the persistent gap between benchmark scores and practical usability.

major comments (3)

[Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.
[Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.
[Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.
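The request for error bars and episode counts in the first comment can be illustrated with a short sketch. The Wilson score interval below is a standard way to put a confidence interval on a success rate; it is not taken from the manuscript, and the episode counts in the usage lines are hypothetical, since the report does not state how many episodes were run.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical episode counts: the same 72% rate is far less certain at n=100.
for n in (100, 1000):
    lo, hi = wilson_interval(round(0.72 * n), n)
    print(f"n={n}: 72% success, interval {lo:.3f} to {hi:.3f}")
```

The same headline rate gives a much wider interval at a hundred episodes than at a thousand, which is why the number of evaluation episodes matters for interpreting the reported 72.0% and 78.9%.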

minor comments (2)

[Abstract] The term 'RealMobile' is introduced without an explicit definition or pointer to its construction details.
[Abstract] The phrase 'substantially improving' is used without accompanying numbers or comparison tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where the manuscript will be revised to strengthen the evidence presented.

Referee: [Evaluation] Evaluation section: success rates of 72.0% on RealMobile and 78.9% on AndroidWorld are stated without baselines, error bars, dataset sizes, number of evaluation episodes, exclusion criteria, or statistical tests, rendering the central performance claim unsupported by evidence in the text.

Authors: We agree that the evaluation section requires additional supporting details. In the revised manuscript we will add comparisons against published baselines on AndroidWorld, report the number of evaluation episodes and task categories for both benchmarks, include error bars from repeated runs where available, describe exclusion criteria for RealMobile, and report statistical significance where the data permit. For the proprietary RealMobile benchmark we will provide summarized rather than exhaustive episode-level statistics. (Revision: partial.)

Referee: [Infrastructure] Infrastructure section: the claim that the real-device-dominant hybrid infrastructure produces a state distribution close enough to real deployment (including account states, permission dialogs, payment flows, and risk controls) is load-bearing for the reported gains but is asserted without quantitative validation such as distribution statistics, KL divergence, or ablation isolating the hybrid component.

Authors: The hybrid infrastructure is central to the work. We will revise the infrastructure section to include quantitative state-distribution statistics (e.g., frequency of permission dialogs and abnormal states) comparing the real-device-dominant setup against sandbox-only runs. While full KL divergence on high-dimensional GUI states is impractical, we will add an ablation that isolates the contribution of real-device data to the final performance where feasible. (Revision: yes.)

Referee: [Results] Results and abstract: no quantitative metrics or measurement protocol are supplied for the claimed improvements in 'execution stability' and 'abnormal-state recognition,' preventing assessment of whether these are meaningful or reproducible.

Authors: We agree that quantitative metrics are needed. In the revision we will define and report concrete metrics for execution stability (e.g., recovery success rate across account-state variations) and abnormal-state recognition (e.g., precision of error detection), together with the exact measurement protocols used during evaluation. (Revision: yes.)
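The statistics promised in these responses could take roughly the following shape. This is a sketch under stated assumptions, not the authors' protocol: it assumes every observed screen has already been given one coarse state label (the category names in the usage lines are hypothetical), compares label frequencies between two environments with the Jensen-Shannon divergence rather than a KL divergence over raw GUI states, and scores abnormal-state detection with ordinary precision and recall.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def frequencies(labels: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Relative frequency of each category among the observed state labels.

    Labels outside `categories` still count toward the total.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of abnormal-state flags against ground truth."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for screens seen on real devices versus in a sandbox.
categories = ["normal", "permission_dialog", "risk_control"]
device = ["normal", "permission_dialog", "normal", "risk_control"]
sandbox = ["normal", "normal", "normal", "permission_dialog"]
gap = js_divergence(frequencies(device, categories), frequencies(sandbox, categories))
print(f"JS divergence between environments: {gap:.3f} bits")
```

Working on coarse labels sidesteps the high-dimensional problem the authors raise, at the cost of depending on how the labels are defined, so any reported figure would need the labeling scheme published alongside it.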

Circularity Check

0 steps flagged

No circularity; purely empirical system description with no derivations or fitted predictions

The paper is a technical report describing an empirical GUI agent system, hybrid infrastructure, multi-source data collection, error-driven flywheel, and three-stage training pipeline, followed by benchmark results (72.0% RealMobile, 78.9% AndroidWorld). No equations, first-principles derivations, parameter fitting, or predictions appear. No self-citations are load-bearing for any claimed result. The central claims rest on direct evaluation rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on a trained system; the abstract introduces no mathematical derivations, free parameters, axioms, or new postulated entities.

Reference graph

Works this paper leans on

81 extracted references · 37 canonical work pages · 21 internal anchors

[1]

Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet

Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet. Model card addendum, 2024. URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf

2024
[2]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. Anthropic announcement, 2026. URL https://www.anthropic.com/news/ claude-opus-4-6

2026
[3]

Introducing claude opus 4.7

Anthropic. Introducing claude opus 4.7. Anthropic announcement, 2026. URL https://www.anthropic.com/news/ claude-opus-4-7

2026
[4]

Digirl: Training in- the-wild device-control agents with autonomous reinforcement learning

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and A viral Kumar. Digirl: Training in- the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024

2024
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Seed2.0 model card: Towards intelligence frontier for real-world complexity

ByteDance Seed. Seed2.0 model card: Towards intelligence frontier for real-world complexity. arXiv preprint arXiv:2603.11103, 2026. URL https://arxiv.org/abs/2603.11103

work page arXiv 2026
[7]

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, et al. Knowu-bench: Towards interactive, proactive, and personalized mobile agent evaluation. arXiv preprint arXiv:2604.08455 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Step: Success-rate-aware trajectory- eﬀicient policy optimization

Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. Step: Success-rate-aware trajectory- eﬀicient policy optimization. arXiv preprint arXiv:2511.13091 , 2025

work page arXiv 2025
[9]

Gui-shift: Enhancing vlm-based gui agents through self-supervised reinforcement learning

Longxi Gao, Li Zhang, Pengzhi Gao, Wei Liu, Jian Luan, and Mengwei Xu. Gui-shift: Enhancing vlm-based gui agents through self-supervised reinforcement learning. arXiv preprint arXiv:2505.12493 , 2025

work page arXiv 2025
[10]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. Model card, 2026. URL https://deepmind.google/models/ model-cards/gemini-3-1-pro/

2026
[11]

Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, et al. Mobile-r1: Towards interactive reinforcement learning for vlm-based mobile agent via task-level rewards. arXiv preprint arXiv:2506.20332 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

arXiv preprint arXiv:2508.10833 , year=

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833 , 2025. 26

work page arXiv 2025
[13]

Asyncflow: An asynchronous streaming rl framework for eﬀicient llm post-training, 2025

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. Asyncflow: An asynchronous streaming rl framework for eﬀicient llm post-training, 2025. URL https://arxiv.org/abs/2507.01663

work page arXiv 2025
[14]

Mo- bileipl: Enhancing mobile agents thinking process via iterative preference learning

Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mo- bileipl: Enhancing mobile agents thinking process via iterative preference learning. arXiv preprint arXiv:2505.12299 , 2025

work page arXiv 2025
[15]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. In Proceedings of the 33rd ACM International Conference on Multimedia , pages 8778–8786, 2025

2025
[17]

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, and Shu-Tao Xia. From verbatim to gist: Distilling pyramidal multimodal memory via semantic information bottleneck for long-horizon video agents. arXiv preprint arXiv:2603.01455 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, and Yuanchun Li. Simuwob: Simulating real-world mobile apps for fast and faithful gui agent benchmarking. arXiv preprint arXiv:2605.25160 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc

Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Chang- sheng Xu, Weiming Hu, et al. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc. arXiv preprint arXiv:2502.14282 , 2025

work page arXiv 2025
[20]

Autoglm: Autonomous foundation agents for guis

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. Autoglm: Autonomous foundation agents for guis. arXiv preprint arXiv:2411.00820 , 2024

work page arXiv 2024
[21]

Ui-r1: Enhancing eﬀicient action prediction of gui agents by reinforcement learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing eﬀicient action prediction of gui agents by reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, pages 17608–17616, 2026

2026
[22]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision- language action model for gui agents. arXiv preprint arXiv:2504.10458 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Addendum to openai o3 and o4-mini system card: Openai o3 operator

OpenAI. Addendum to openai o3 and o4-mini system card: Openai o3 operator. System card addendum, 2025. URL https://openai.com/index/o3-o4-mini-system-card-addendum-operator-o3/

2025
[24]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In International Conference on Learning Representations, volume 2024, pages 9695–9717, 2024

2024
[25]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents, 2025. URL https://arxiv. org/abs/2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, and Jian Luan. Scaling, benchmarking, and reasoning of vision-language agents for mobile gui navigation. arXiv preprint arXiv:2605.27134 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Androidworld: A dynamic benchmarking environment for au- tonomous agents

Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for au- tonomous agents. In International Conference on Learning Representations , volume 2025, pages 406–441, 2025

2025
[28]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems , 36:68539–68551, 2023

2023
[29]

Ui-tars-1.5

Seed. Ui-tars-1.5. ByteDance Seed Blog, 2025. URL https://seed-tars.com/1.5/

2025
[30]

Bytedance Seed. Seed1. 8 model card: Towards generalized real-world agency. arXiv preprint arXiv:2603.20633 , 2026. 27

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Hybridflow: A flexible and eﬀicient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and eﬀicient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems , pages 1279–1297, 2025

2025
[32]

Gui knowledge bench: Revealing the knowledge gap behind vlm failures in gui tasks

Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, and Qing Li. Gui knowledge bench: Revealing the knowledge gap behind vlm failures in gui tasks. arXiv preprint arXiv:2510.26098, 2025

work page arXiv 2025
[33]

arXiv preprint arXiv:2507.05720 , year=

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025

work page arXiv 2025
[34]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[35]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

arXiv preprint arXiv:2602.09082 , year=

Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, et al. Ui-venus-1.5 technical report. arXiv preprint arXiv:2602.09082 , 2026

work page arXiv 2026
[37]

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, et al. Cua-gym: Scaling verifiable training environments and tasks for computer-use agents. arXiv preprint arXiv:2605.25624 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems , 37:2686–2710, 2024

2024
[40]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Opencua: Open foundations for computer-use agents

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Wu, et al. Opencua: Open foundations for computer-use agents. Advances in Neural Information Processing Systems, 38:139756–139806, 2026

2026
[43]

Mmbench-gui: A unified hierarchical evaluation framework for multi-platform gui agents

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: A unified hierarchical evaluation framework for multi-platform gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6239–6248, 2026

[44] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026.

[45] Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, et al. Mobilegym: A verifiable and highly parallel simulation platform for mobile gui agent research. arXiv preprint arXiv:2605.26114, 2026.

[46] Qinzhuo Wu, Pengzhi Gao, Wei Liu, and Jian Luan. Backtrackagent: Enhancing gui agent with error detection and backtracking mechanism. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4250–4272, 2025.

[47] Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, and Jian Luan. Mobilebench-ol: A comprehensive chinese benchmark for evaluating mobile gui agents in real-world environment. arXiv preprint arXiv:2601.20335, 2026.

[48] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: Foundation action model for generalist gui agents. In International Conference on Learning Representations, volume 2025, pages 5090–5108, 2025.

[49] Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis. Advances in Neural Information Processing Systems, 38, 2026.

[50] Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, and Shengyu Zhang. Gui-pra: Process reward agent for gui tasks. arXiv preprint arXiv:2509.23263, 2025.

[51] Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026.

[52] Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, and Bo An. Sman-bench: A cross-system benchmark for mobile agents under single-and multi-path, ambiguous, and noisy tasks. In The Fourteenth International Conference on Learning Representations, 2026.

[53] Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report. arXiv preprint arXiv:2512.15431, 2025.

[54] Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation, 2025. URL https://arxiv.org/abs/2508.15144, 4:21–27, 2025.

[55] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476, 1:2, 2025.

[56] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

[57] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

[58] Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047, 2025.

Contributions and Acknowledgments

All contributors are listed in alphabetical order by their last names. Core Contr...

- Current device type & foreground app
- Current screenshot

# Available Tools

You MUST pick exactly one tool per step. Output the corresponding JSON string inside `<tool_call>`.

- Tap: `{"name": "Tap", "position": [x, y], "times": 1}` (Tap at coordinate)
- LongPress: `{"name": "LongPress", "position": [x, y]}` (Trigger contextual menus)
- Swipe: `{"name": "Swipe", "start_position": [x1, y1], "end_position": [x2, y2]}` (Swipe to scroll/move. Swipe up to scroll down)
- Type: `{"name": "Type", "position": [x, y], "text": "..."}` (Tap input box and type)
- Search: `{"name": "Search", "position": [x, y], "text": "..."}` (Macro: tap -> clear -> type -> submit)
- Open: `{"name": "Open", "app": "..."}` (Launch app via system)
- Back: `{"name": "Back"}` (System-level back)
- Home: `{"name": "Home"}` (Go to home screen)
- Wait: `{"name": "Wait"}` (Wait for page loading/rendering)
- Request: `{"name": "Request", "text": "..."}` (Ask user for clarification/confirmation)
- Fail: `{"name": "Fail", "type": "...", "reason": "..."}` (Report failure. `<TYPE>` MUST be one of: LOGIN_REQUIRED, USE_GUIDANCE, CAPTCHA_VERIFICATION, RESULT_NOT_FOUND, BLUETOOTH_CONNECTION_REQUIRED, NETWORK_ERROR, PAYMENT_AUTHENTICATION, TASK_CANT_FULFILLED, REPEAT_OPERATION, PERMISSION_REQUEST, PASSWORD_REQUIRED, TAKEOVER_EXIT, TEMPORARY_TAKEOVER, MANUA...
- Complete: `{"name": "Complete"}` (Confirm goal reached for non-Q&A tasks)
- Speak: `{"name": "Speak", "text": "..."}` (Present final answer for Q&A tasks)

# Operational Constraints

- Coordinate system: every `position` is a relative [x, y] in [0, 1] with 3-decimal precision. Top-left is (0, 0); bottom-right is (1, 1).
- Dismiss unrelated pop-ups (ads, upgrade prompts, rating requests) by tapping their Close / Skip / X / "Later" button rather than calling Fail.
- Loop breaker: if three consecutive steps cause no visible change, or the same action is repeating in a loop, self-correct (try Back or a different target). If self-correction fails, call Fail.

# Reasoning Framework (inside <think>)

Before emitting the action, reason inside `<think>...</think>` (omit steps if no new info):

- [Observation]: Objectively describe the current App, page state, and key visible elements.
- [Reflection]: (Optional) Include ONLY if the current screen deviates from the previous plan's expectation. Explain what was expected vs. what is actually seen.
- [Plan] / [Plan Update] / [Replan]: (Choose one). Output a 2-4 step path in a single line separated by `|`. Mark completed steps with `[done]` and the current step with `->`. Use [Replan] if the previous plan failed.
- [Decision]: Deduce the exact action based on the Observation and the current `->` step in the Plan.
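The tool schema and coordinate convention above can be illustrated with a small host-side sketch. This is not the authors' executor. It assumes the model closes the call with a `</tool_call>` tag (the prompt only names the opening tag), and it assumes relative (1, 1) maps to the last pixel of the screen. The names `parse_tool_call` and `to_pixels` are hypothetical helpers introduced here for illustration.

```python
import json
import re

# Tool names exactly as listed in the prompt above.
TOOLS = {
    "Tap", "LongPress", "Swipe", "Type", "Search", "Open", "Back",
    "Home", "Wait", "Request", "Fail", "Complete", "Speak",
}

# Keys that carry a relative [x, y] coordinate in the tool schema.
POSITION_KEYS = ("position", "start_position", "end_position")

# Assumption: the JSON string is wrapped as <tool_call>{...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_call(model_output: str) -> dict:
    """Extract and validate the single tool call in one model response."""
    matches = TOOL_CALL_RE.findall(model_output)
    if len(matches) != 1:
        # The prompt requires exactly one tool per step.
        raise ValueError(f"expected exactly one tool call, found {len(matches)}")
    action = json.loads(matches[0])
    if action.get("name") not in TOOLS:
        raise ValueError(f"unknown tool: {action.get('name')!r}")
    for key in POSITION_KEYS:
        if key in action:
            x, y = action[key]  # raises ValueError if not a 2-element pair
            if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
                raise ValueError(f"{key} out of [0, 1]: {action[key]}")
    return action


def to_pixels(position, width: int, height: int) -> tuple:
    """Map a relative [x, y] to device pixels.

    Assumption: (0, 0) is the top-left pixel and (1, 1) is the
    bottom-right pixel, i.e. (width - 1, height - 1).
    """
    x, y = position
    return round(x * (width - 1)), round(y * (height - 1))


if __name__ == "__main__":
    response = (
        "<think>[Observation]: Home screen is visible.</think>\n"
        '<tool_call>{"name": "Tap", "position": [0.5, 0.25], "times": 1}</tool_call>'
    )
    action = parse_tool_call(response)
    print(action["name"], to_pixels(action["position"], 1080, 2400))
```

Rejecting malformed or out-of-range calls before they reach the device keeps such failures separate from genuine UI errors, which matters if a controller also applies the loop-breaker rule.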