pith. machine review for the scientific record.

arxiv: 2605.12481 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Haiyang Xu, Jieping Ye, Jing Shao, Jingyi Yang, Kyle Qiao, Ming Yan, Xi Zhang, Xuanjing Huang, Xuhao Hu

Pith reviewed 2026-05-13 03:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords computer use agents · GUI-tool orchestration · trajectory scaling · reinforcement learning · hybrid action space · path selection · synthetic data · agent training

The pith

ToolCUA learns optimal selection between GUI actions and tool calls by scaling synthetic hybrid trajectories and applying staged reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Computer use agents often face uncertainty about whether to continue with low-level GUI operations such as clicks or switch to high-level tool calls such as file APIs. ToolCUA addresses this by first creating an Interleaved GUI-Tool Trajectory Scaling Pipeline that turns abundant static GUI trajectories into diverse hybrid ones using a synthesized tool library. It then applies Tool-Bootstrapped GUI RFT to strengthen decisions at switching points and finishes with Online Agentic RL guided by a Tool-Efficient Path Reward. A sympathetic reader would care because this shows hybrid action spaces can be mastered without manual engineering or expensive real trajectory collection, potentially making digital agents more reliable for everyday computer tasks.
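
For concreteness, the hybrid action space can be pictured as a trajectory that interleaves two step types. A minimal sketch in Python, with hypothetical record types and action names that do not come from the paper:

    from dataclasses import dataclass
    from typing import Literal, Union

    @dataclass
    class GuiAction:
        kind: Literal["click", "type", "scroll"]  # atomic GUI primitives
        target: str                               # element id or screen coordinate

    @dataclass
    class ToolCall:
        name: str   # e.g. a file-operation API from the synthesized tool library
        args: dict  # structured arguments instead of pixel-level interaction

    # One interleaved GUI-Tool trajectory: at each step the policy decides
    # whether to stay in the GUI or switch to a tool call (the "switching
    # points" that Tool-Bootstrapped GUI RFT targets).
    Step = Union[GuiAction, ToolCall]
    trajectory: list[Step] = [
        GuiAction("click", "file_manager_icon"),
        ToolCall("move_file", {"src": "report.txt", "dst": "archive/"}),
        GuiAction("click", "confirm_button"),
    ]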

Core claim

The paper claims that an end-to-end agent trained through a staged paradigm (repurposing static GUI trajectories to synthesize interleaved GUI-Tool data, followed by Tool-Bootstrapped GUI RFT and then Online Agentic RL with a reward that favors appropriate tool use and shorter paths) can learn to select more effective execution paths in a hybrid action space, achieving 46.85 percent accuracy on OSWorld-MCP, an approximately 66 percent relative improvement over the baseline and a 3.9 percent gain over GUI-only settings.
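
Taken together, the two headline figures pin down the unstated baseline: if 46.85 percent represents an approximately 66 percent relative improvement, the implied baseline accuracy is roughly 46.85 / 1.66 ≈ 28.2 percent. This is a back-of-envelope inference from the stated ratio, not a number reported in the text.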

What carries the argument

The staged training paradigm that combines the Interleaved GUI-Tool Trajectory Scaling Pipeline, Tool-Bootstrapped GUI RFT, and Online Agentic RL with a Tool-Efficient Path Reward.
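
The Tool-Efficient Path Reward is described only qualitatively (reward appropriate tool use, prefer shorter paths). One plausible shape, as a hedged sketch; the weights and helper names are invented here, and the paper's exact formulation may differ:

    def tool_efficient_path_reward(success: bool,
                                   n_steps: int,
                                   n_tool_calls: int,
                                   tools_appropriate: bool,
                                   w_success: float = 1.0,
                                   w_tool: float = 0.2,
                                   w_len: float = 0.01) -> float:
        """Hypothetical linear composition of the reward's stated ingredients."""
        r = w_success * float(success)
        # Bonus for tool calls on tasks annotated as tool-appropriate,
        # penalty for invoking tools where they are not warranted.
        r += w_tool * n_tool_calls * (1.0 if tools_appropriate else -1.0)
        # Shorter execution paths earn more: a per-step length penalty.
        r -= w_len * n_steps
        return r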

Load-bearing premise

The synthesized interleaved trajectories match the distribution of real user tasks closely enough that policies trained on them transfer without major distribution shift or bias from the synthesis process.

What would settle it

A large drop in accuracy when ToolCUA is evaluated on a set of real human-collected interleaved GUI-tool trajectories compared with its performance on the synthetic data would show the assumption does not hold.
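
Operationally this test is a single gap measurement. A sketch, where evaluate is a hypothetical callable returning task accuracy on a task set:

    def transfer_gap(evaluate, synthetic_tasks, real_tasks) -> float:
        """Relative accuracy drop from synthetic to real interleaved
        GUI-Tool tasks; a large positive gap would falsify the premise."""
        acc_syn = evaluate(synthetic_tasks)
        acc_real = evaluate(real_tasks)
        return (acc_syn - acc_real) / acc_syn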

Figures

Figures reproduced from arXiv: 2605.12481 by Haiyang Xu, Jieping Ye, Jing Shao, Jingyi Yang, Kyle Qiao, Ming Yan, Xi Zhang, Xuanjing Huang, Xuhao Hu.

Figure 1
Figure 1: (a) The advantage of Tool-augmented actions compared with pure GUI actions. (b) The performance of our ToolCUA compared with the baselines, agentic CUAs, and general models.
Figure 2
Figure 2: Current computer use agents suffer from optimal path confusion under GUI-Tool hybrid actions.
Figure 4
Figure 4: A synthetic GUI-Tool interleaved trajectory generated by our pipeline, which demonstrates strategic tool selection and seamless switching between atomic GUI actions and tool calls.
Figure 5
Figure 5: Results across tasks on OSWorld-MCP for different models, Gemini-3.1-Pro, Qwen3-VL-8B-Instruct (baseline), …
Figure 6
Figure 6: Online Agentic RL training dynamics of ToolCUA and two ablations.
Figure 7
Figure 7: Visualization of the synthesized tools in a projected action space, where each point corresponds to one tool node, colors denote the application taxonomy, and marker shapes denote granularity tiers.
read the original abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ToolCUA, an end-to-end Computer Use Agent for optimal orchestration between atomic GUI actions and high-level tool calls. It introduces an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes static GUI trajectories and synthesizes a tool library to generate diverse interleaved trajectories, followed by a staged training process consisting of warmup SFT, Tool-Bootstrapped GUI RFT, and Online Agentic RL guided by a Tool-Efficient Path Reward that penalizes inefficient paths. On the OSWorld-MCP benchmark, ToolCUA reports 46.85% accuracy (approximately 66% relative improvement over baseline and +3.9% over GUI-only), claiming a new SOTA among models of comparable scale and demonstrating the benefits of hybrid action spaces.

Significance. If the synthetic trajectories match real task distributions, the work provides a practical, scalable approach to training hybrid GUI-tool agents without expensive real tool-trajectory collection and shows that staged RL with path-efficiency rewards can improve switching decisions. The open-sourcing of the model and pipeline is a positive contribution for reproducibility in the CUA field.

major comments (3)
  1. [§3] §3 (Interleaved GUI-Tool Trajectory Scaling Pipeline): The pipeline is load-bearing for all transfer and SOTA claims, yet the manuscript provides no quantitative distributional validation (e.g., KS tests, EMD on action histograms, switching-point statistics, or tool-call frequency comparisons) against held-out real user trajectories. Without such checks, it remains possible that reported gains optimize for synthesis artifacts rather than genuine GUI-tool orchestration. A sketch of such checks follows this report.
  2. [§4] §4 (Experiments and OSWorld-MCP results): The headline numbers (46.85% accuracy, ~66% relative gain, +3.9% over GUI-only) are presented without ablations isolating the contribution of each training stage, without error bars or statistical significance, and without explicit confirmation that baselines use identical model scale and evaluation protocol. These omissions make it difficult to attribute gains specifically to the proposed orchestration method.
  3. [§4.2–4.3] §4.2–4.3 (Tool-Bootstrapped GUI RFT and Online Agentic RL): The Tool-Efficient Path Reward is defined externally in terms of tool usage and path length; the paper does not show that this reward correlates with human-judged task success or that the learned policy generalizes beyond the synthetic distribution, which is required to support the claim of “optimal GUI-Tool path selection.”
minor comments (2)
  1. [Abstract] Abstract: The exact baseline accuracy used for the “approximately 66%” relative improvement is not stated, forcing readers to consult tables to verify the claim.
  2. [§3.1] Notation and reproducibility: The precise mathematical form of the Tool-Efficient Path Reward and the details of the tool-library synthesis procedure should be given as numbered equations or pseudocode in the main text rather than only in appendices.
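
The distributional checks requested in major comment 1 are cheap to run once real trajectories exist. A minimal sketch using SciPy; the feature choices (per-type action frequencies, first switching point) are illustrative assumptions, not the paper's:

    import numpy as np
    from scipy.stats import ks_2samp, wasserstein_distance

    def action_histograms(trajectories, action_types):
        """Per-trajectory frequency of each action type (click, type, tool_call, ...)."""
        hist = np.zeros((len(trajectories), len(action_types)))
        for i, traj in enumerate(trajectories):
            for step in traj:
                hist[i, action_types.index(step["type"])] += 1
            hist[i] /= max(len(traj), 1)
        return hist

    def compare_distributions(synthetic, real, action_types):
        syn_h = action_histograms(synthetic, action_types)
        real_h = action_histograms(real, action_types)
        report = {}
        for j, name in enumerate(action_types):
            stat, p = ks_2samp(syn_h[:, j], real_h[:, j])          # two-sample KS test
            emd = wasserstein_distance(syn_h[:, j], real_h[:, j])  # 1-D earth mover's distance
            report[name] = {"ks_stat": stat, "ks_p": p, "emd": emd}
        # Switching-point statistic: position of the first GUI-to-tool switch.
        def first_switch(traj):
            return next((k for k, s in enumerate(traj) if s["type"] == "tool_call"), len(traj))
        report["first_switch_emd"] = wasserstein_distance(
            [first_switch(t) for t in synthetic], [first_switch(t) for t in real])
        return report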

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We value the constructive criticism and have prepared detailed responses to each major comment. We believe the revisions outlined will address the concerns and improve the clarity and rigor of the manuscript.

read point-by-point responses
  1. Referee: [§3] The pipeline is load-bearing for all transfer and SOTA claims, yet the manuscript provides no quantitative distributional validation (e.g., KS tests, EMD on action histograms, switching-point statistics, or tool-call frequency comparisons) against held-out real user trajectories. Without such checks, it remains possible that reported gains optimize for synthesis artifacts rather than genuine GUI-tool orchestration.

    Authors: We agree that providing quantitative validation of the synthetic trajectories against real distributions would strengthen the claims. Although the original manuscript focused on end-to-end performance on the real OSWorld-MCP benchmark to demonstrate transfer, we will revise §3 to include distributional analyses. Specifically, we will report Kolmogorov-Smirnov tests, Earth Mover's Distance on action histograms, switching-point statistics, and tool-call frequency comparisons using held-out real trajectories. This addition will help confirm that the Interleaved GUI-Tool Trajectory Scaling Pipeline generates data aligned with real user behaviors. revision: yes

  2. Referee: [§4] The headline numbers (46.85% accuracy, ~66% relative gain, +3.9% over GUI-only) are presented without ablations isolating the contribution of each training stage, without error bars or statistical significance, and without explicit confirmation that baselines use identical model scale and evaluation protocol. These omissions make it difficult to attribute gains specifically to the proposed orchestration method.

    Authors: We acknowledge these omissions in the experimental presentation. In the revised version, we will expand §4 with detailed ablations for each component of the staged training (warmup SFT, Tool-Bootstrapped GUI RFT, and Online Agentic RL). We will also include error bars from multiple evaluation runs, report p-values for statistical significance, and explicitly confirm that all compared baselines were run with identical model scales and under the same evaluation protocol on OSWorld-MCP. These changes will better isolate the contributions of our orchestration approach. revision: yes

  3. Referee: [§4.2–4.3] The Tool-Efficient Path Reward is defined externally in terms of tool usage and path length; the paper does not show that this reward correlates with human-judged task success or that the learned policy generalizes beyond the synthetic distribution, which is required to support the claim of “optimal GUI-Tool path selection.”

    Authors: The Tool-Efficient Path Reward combines task success signals with penalties for inefficient tool usage and longer paths, and is applied during online RL in a high-fidelity environment that mirrors real GUI-Tool interactions. While we did not include a separate human correlation study, the final evaluation on the real-world OSWorld-MCP benchmark demonstrates generalization beyond synthetic data, with improvements over GUI-only baselines indicating effective path selection. To further address this, we will add in the revision a discussion of how the reward design aligns with task success and include additional experiments on held-out real tasks to show generalization. We maintain that the SOTA results support the optimality claims. revision: partial
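
The human-correlation study discussed in point 3 reduces to a standard statistic once per-episode labels exist. A sketch with hypothetical inputs (equal-length arrays of episode rewards and binary human success judgments):

    from scipy.stats import pearsonr, spearmanr

    def reward_success_correlation(rewards, human_success):
        """Correlate the Tool-Efficient Path Reward with human-judged task
        success across evaluation episodes; both inputs are hypothetical
        per-episode arrays collected during evaluation."""
        return {"pearson": pearsonr(rewards, human_success),
                "spearman": spearmanr(rewards, human_success)}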

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmark evaluation.

full rationale

The paper's chain consists of (1) a data-generation pipeline that repurposes static GUI trajectories to synthesize interleaved GUI-Tool data, (2) staged training (SFT + single-turn RL + online agentic RL) using an externally defined Tool-Efficient Path Reward based on tool usage and path length, and (3) direct measurement of accuracy on the held-out OSWorld-MCP benchmark. None of these steps reduce by construction to their inputs; the synthesis is a standard augmentation method, the reward is not fitted from model outputs, and the headline numbers are empirical results rather than self-predictions or self-citations. No uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that synthetic trajectories generated from static GUI data plus a tool library are distributionally close enough to real hybrid trajectories for effective policy learning; no explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5642 in / 1101 out tokens · 104647 ms · 2026-05-13T03:42:30.762348+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 19 internal anchors

  1. [1]

Agent S2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025

  2. [2]

    Claude opus 4.5, 2026

Anthropic. Claude opus 4.5, 2026. URL https://www.anthropic.com/news/claude-opus-4-5. Accessed: 2026-04-20

  3. [3]

    Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

  4. [4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

Windows agent arena: Evaluating multi-modal os agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  7. [7]

    Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning

    Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

  8. [8]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  9. [9]

    The unreasonable effectiveness of scaling agents for computer use.arXiv preprint arXiv:2510.02250, 2025

    Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, and Xin Eric Wang. The unreasonable effectiveness of scaling agents for computer use.arXiv preprint arXiv:2510.02250, 2025

  10. [10]

    Gemini: The most capable and general model we’ve built, 2026

Google DeepMind. Gemini: The most capable and general model we’ve built, 2026. URL https://deepmind.google/models/gemini/pro/. Accessed: 2026-04-28

  11. [11]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  12. [12]

    Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563, 2025

    Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563, 2025

  13. [13]

    Cua-suite: Massive human-annotated video demonstrations for computer-use agents.arXiv preprint arXiv:2603.24440, 2026

    Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, and Sai Rajeswar. Cua-suite: Massive human-annotated video demonstrations for computer-use agents.arXiv preprint arXiv:2603.24440, 2026

  14. [14]

    Efficient multi-turn rl for gui agents via decoupled training and adaptive data curation.arXiv preprint arXiv:2509.23866, 2025

Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, et al. Efficient multi-turn rl for gui agents via decoupled training and adaptive data curation. arXiv preprint arXiv:2509.23866, 2025

  15. [15]

Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents. arXiv preprint arXiv:2508.13186, 2025

Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, et al. Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents. arXiv preprint arXiv:2508.13186, 2025

  16. [16]

Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc. arXiv preprint arXiv:2502.14282, 2025

Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, et al. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc. arXiv preprint arXiv:2502.14282, 2025

  17. [17]

    Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820, 2024

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820, 2024

  18. [18]

    Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data. arXiv preprint arXiv:2509.15221, 2025

  19. [19]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

  20. [20]

    Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

    Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

  21. [21]

    Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 1(2):3, 2025

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 1(2):3, 2025

  22. [22]

    Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  23. [23]

    Gui-360: A comprehensive dataset and benchmark for computer- using agents.To appear, 2025

    Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, et al. Gui-360: A comprehensive dataset and benchmark for computer- using agents.To appear, 2025

  24. [24]

    Introducing operator, 2026

    OpenAI. Introducing operator, 2026. URL https://openai.com/index/ introducing-operator/. Accessed: 2026-04-20

  25. [25]

    Openclaw, 2026

OpenClaw. Openclaw, 2026. URL https://github.com/openclaw/openclaw. Accessed: 2026-04-20

  26. [26]

    Gorilla: Large Language Model Connected with Massive APIs

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis, 2023. URL https://arxiv.org/abs/2305.15334

  27. [27]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

  28. [28]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  29. [29]

    Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  32. [32]

    CoAct-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

    Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

  33. [33]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  34. [34]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  35. [35]

    Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

  36. [36]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos.arXiv preprint arXiv:2512.16918, 2025

  37. [37]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  38. [38]

    Acting less is reasoning more! teaching model to act efficiently, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870, 2025

  39. [39]

    Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.arXiv preprint arXiv:2406.01014, 2024

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.arXiv preprint arXiv:2406.01014, 2024

  40. [40]

    Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

  41. [41]

    OpenClaw-RL: Train Any Agent Simply by Talking

    Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.arXiv preprint arXiv:2603.10165, 2026

  42. [42]

    Rlanything: Forge environment, policy, and reward model in completely dynamic rl system.arXiv preprint arXiv:2602.02488, 2026

    Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. Rlanything: Forge environment, policy, and reward model in completely dynamic rl system.arXiv preprint arXiv:2602.02488, 2026

  43. [43]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733, 2025

  44. [44]

    Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, et al. Agentic-mme: What agentic capability really brings to multimodal intelligence? arXiv preprint arXiv:2604.03016, 2026

  45. [45]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  46. [46]

Mobile-agent-v3.5: Multi-platform fundamental gui agents

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  47. [47]

    Mobilerl: Online agentic reinforcement learning for mobile gui agents

    Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents. arXiv preprint arXiv:2509.18119, 2025

  48. [48]

    Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

  49. [49]

    Step-gui technical report.arXiv preprint arXiv:2512.15431, 2025

    Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, et al. Step-gui technical report.arXiv preprint arXiv:2512.15431, 2025

  50. [50]

    Mcpworld: A unified benchmarking testbed for api, gui, and hybrid computer use agents.arXiv preprint arXiv:2506.07672, 2025

    Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, et al. Mcpworld: A unified benchmarking testbed for api, gui, and hybrid computer use agents.arXiv preprint arXiv:2506.07672, 2025

  51. [51]

    Os-symphony: A holistic framework for robust and generalist computer-using agent.arXiv preprint arXiv:2601.07779, 2026

    Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, et al. Os-symphony: A holistic framework for robust and generalist computer-using agent.arXiv preprint arXiv:2601.07779, 2026

  52. [52]

    Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

    Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, et al. Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

  53. [53]

    java21" shown on the file path of the file manager. Text 1 between text Click once at the position before

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791, 2025

  54. [54]

    Ultracua: A foundation model for computer use agents with hybrid action.arXiv preprint arXiv:2510.17790, 2025

    Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, et al. Ultracua: A foundation model for computer use agents with hybrid action.arXiv preprint arXiv:2510.17790, 2025

  55. [55]

    Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, et al. Claw-eval: Toward trustworthy evaluation of autonomous agents.arXiv preprint arXiv:2604.06132, 2026

  56. [56]

    Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025

  57. [57]

    Agentfold: Long-horizon web agents with proactive context management.CoRR, abs/2510.24699, 2025

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

  58. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  59. [59]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

  60. [60]

    Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents

    Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12367–12375, 2026

  61. [61]

Api agents vs. gui agents: Divergence and convergence

    Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Api agents vs. gui agents: Divergence and convergence.arXiv preprint arXiv:2503.11069, 2025

  62. [62]

    Ufo3: Weaving the digital agent galaxy.arXiv preprint arXiv:2511.11332, 2025

    Chaoyun Zhang, Liqun Li, He Huang, Chiming Ni, Bo Qiao, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, et al. Ufo3: Weaving the digital agent galaxy.arXiv preprint arXiv:2511.11332, 2025

  63. [63]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025
