pith. machine review for the scientific record.

arxiv: 2504.10458 · v4 · submitted 2025-04-14 · 💻 cs.CV · cs.CL · cs.HC

Recognition: 3 theorem links · Lean Theorem

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.HC
keywords GUI agents · reinforcement fine-tuning · vision-language models · unified action space · policy optimization · cross-platform agents · generalization · R1-style training

The pith

GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GUI-R1 as a reinforcement fine-tuning method that improves large vision-language models for controlling graphical interfaces. It replaces heavy supervised fine-tuning with policy optimization on a unified action space, using a few thousand carefully chosen examples drawn from Windows, Linux, macOS, Android, and web environments. With this setup the model exceeds prior state-of-the-art results that required millions of labeled samples. The approach targets high-level tasks rather than low-level pixel clicking and reports gains on eight benchmarks that span mobile, desktop, and web platforms. A sympathetic reader sees the work as evidence that reinforcement learning can reduce data requirements while improving generalization to interfaces never seen in training.

Core claim

GUI-R1 is the first reinforcement learning framework that enhances the GUI capabilities of large vision-language models through unified action space rule modeling. By applying Group Relative Policy Optimization on a small curated dataset of 3K examples collected across five operating systems, the method surpasses previous supervised approaches such as OS-Atlas that used 13M examples, delivering higher success rates on eight benchmarks covering mobile, desktop, and web platforms.

What carries the argument

Unified action space rule modeling inside a Group Relative Policy Optimization loop that updates the vision-language model from a few thousand multi-platform demonstrations.
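The group-relative part of GRPO can be pictured in a few lines: for each instruction, a group of candidate responses is sampled, each is scored by a rule-based reward, and each sample's advantage is its reward normalized against its own group, with no learned value critic. This is a generic GRPO sketch, not code from the paper:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Group-relative advantage as in GRPO: normalize each sampled
    trajectory's reward against its own sampling group, removing
    the need for a learned value critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1e-8  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# One instruction, a group of four sampled action predictions scored
# by a rule-based reward (1.0 if the predicted action is correct, else 0.0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight a clipped policy-gradient update of the vision-language model; correct samples in a mostly-wrong group get the largest positive weight.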

If this is right

  • GUI agents can be trained for new platforms with orders-of-magnitude less labeled data.
  • Policy optimization can replace or augment supervised fine-tuning for high-level interface tasks.
  • A single model can handle mobile, desktop, and web environments after exposure to a modest shared dataset.
  • Real-world deployment of GUI agents becomes feasible with smaller, platform-agnostic training corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement recipe might extend to other embodied control domains that currently rely on massive supervised datasets.
  • If the unified action space proves robust, future work could test whether it supports zero-shot transfer between entirely different interface styles.
  • Developers might experiment with mixing the 3K seed set with synthetic trajectories generated by the model itself to further reduce human curation effort.
  • The reported gains suggest that reasoning-style reinforcement loops can improve perception-action loops even when the input is a screenshot rather than text.

Load-bearing premise

A small set of carefully chosen high-quality examples plus a shared action vocabulary is enough for the model to handle new interfaces without large-scale supervised training.
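As an illustration of what "shared action vocabulary plus rule modeling" could look like, here is a minimal rule-based reward over a hypothetical unified action schema. The action names, tag format, and scoring weights are assumptions for the sketch, not the paper's specification:

```python
import re

# Hypothetical shared action vocabulary; the paper's exact schema
# is not reproduced here, so these names are illustrative only.
UNIFIED_ACTIONS = {"click", "type", "scroll", "press_key", "finish"}

def rule_reward(model_output: str, gold_action: str, gold_args: str) -> float:
    """Rule-based reward in the spirit of R1-style training: score
    format compliance plus action/argument correctness, with no
    learned reward model."""
    m = re.search(r"<action>(\w+)\((.*?)\)</action>", model_output)
    if not m:
        return 0.0  # format reward: output must contain a parseable action
    action, args = m.group(1), m.group(2)
    if action not in UNIFIED_ACTIONS:
        return 0.0  # must stay inside the shared vocabulary
    reward = 0.5 if action == gold_action else 0.0
    if reward and args.strip() == gold_args.strip():
        reward += 0.5  # argument match (e.g. target coordinates or text)
    return reward

# A fully correct prediction earns the full reward.
score = rule_reward("<action>click(120, 40)</action>", "click", "120, 40")  # → 1.0
```

Because every platform's demonstrations are expressed in the same vocabulary, one reward function of this shape can score trajectories from Android, desktop, and web alike.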

What would settle it

A new benchmark interface where GUI-R1's accuracy falls below that of a supervised model trained on the same 3K examples, or where the performance gap to OS-Atlas disappears.

read the original abstract

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GUI-R1, the first R1-style reinforcement learning framework for GUI agents. It trains LVLMs via Group Relative Policy Optimization (GRPO) on a small curated dataset of 3K high-quality trajectories spanning Windows, Linux, macOS, Android, and Web platforms, using a unified action space rule model. The central claim is that this yields superior performance over prior SOTA methods such as OS-Atlas (trained on 13M examples) across eight benchmarks on mobile, desktop, and web, using only 0.02% of the data volume.

Significance. If the performance delta is reproducible and the RL contribution is isolated, the result would indicate that unified-action-space GRPO on carefully curated small data can outperform large-scale SFT baselines for GUI agents. This would support a shift toward data-efficient RL paradigms for real-world interface agents and reduce reliance on massive supervised datasets.

major comments (3)
  1. [Experiments] Experiments section: no ablation applies standard supervised fine-tuning to the identical 3K curated trajectories and unified action-space rules. Without this control, the headline claim that GRPO (rather than curation quality or base-model strength) drives the gains over OS-Atlas cannot be isolated and remains load-bearing for the central argument.
  2. [Results] Results and evaluation sections: benchmark definitions, exact task splits, statistical significance tests (e.g., confidence intervals or p-values), and precise baseline re-implementation details are not provided. This makes it impossible to verify the reported superiority across the eight benchmarks.
  3. [Method] Data curation and method sections: the process for selecting the 3K examples and formalizing the unified action-space rules is described only at high level. More concrete specification of selection criteria and rule encoding is required to assess reproducibility and rule out selection effects.
minor comments (1)
  1. [Abstract] Abstract and introduction: the 0.02% data claim should be accompanied by a precise citation or table entry for the 13M figure used by OS-Atlas.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements, which we believe will strengthen the clarity and rigor of our claims regarding the effectiveness of GRPO on curated small-scale data for GUI agents.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: no ablation applies standard supervised fine-tuning to the identical 3K curated trajectories and unified action-space rules. Without this control, the headline claim that GRPO (rather than curation quality or base-model strength) drives the gains over OS-Atlas cannot be isolated and remains load-bearing for the central argument.

    Authors: We agree that directly comparing GRPO to standard SFT on the exact same 3K trajectories is essential to isolate the RL contribution. While our primary comparisons were against large-scale SFT baselines like OS-Atlas, we will add this ablation experiment in the revised manuscript. The new results will demonstrate performance differences attributable to the policy optimization step under identical data and action-space conditions. revision: yes

  2. Referee: [Results] Results and evaluation sections: benchmark definitions, exact task splits, statistical significance tests (e.g., confidence intervals or p-values), and precise baseline re-implementation details are not provided. This makes it impossible to verify the reported superiority across the eight benchmarks.

    Authors: We will expand the results and evaluation sections in the revision to include detailed benchmark definitions, exact task splits, re-implementation specifics for all baselines, and statistical measures such as confidence intervals. These additions will enable full verification and reproducibility of the reported performance gains across the eight benchmarks. revision: yes

  3. Referee: [Method] Data curation and method sections: the process for selecting the 3K examples and formalizing the unified action-space rules is described only at high level. More concrete specification of selection criteria and rule encoding is required to assess reproducibility and rule out selection effects.

    Authors: We acknowledge that the current description of data curation and unified action-space rule formalization is high-level. In the revised manuscript, we will provide concrete details on the selection criteria for the 3K trajectories (e.g., quality filters and platform coverage) and the precise encoding of the unified action-space rules to support reproducibility and address potential selection effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical application of external RL methods

full rationale

The paper applies GRPO and RFT techniques cited from external DeepSeek-R1 work to a curated 3K-example GUI dataset with unified action space. No mathematical derivation chain exists that reduces by construction to self-defined inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The performance claims rest on benchmark comparisons rather than any self-referential reduction. This is a standard empirical proposal with independent content from the cited external RL algorithms.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that GRPO-based policy optimization on a small curated multi-platform dataset produces generalizable GUI action policies; no explicit free parameters or invented entities are named, but the effectiveness of the unified action space modeling is taken as given.

axioms (1)
  • domain assumption Reinforcement fine-tuning with GRPO can efficiently improve LVLM action prediction on GUI screenshots without large-scale supervised data
    Invoked when stating that RFT from DeepSeek-R1 transfers to GUI agents

pith-pipeline@v0.9.0 · 5572 in / 1183 out tokens · 37333 ms · 2026-05-15T02:06:38.256306+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

    cs.CV 2026-05 conditional novelty 7.0

    GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.

  3. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  4. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  5. Benchmarking and Improving GUI Agents in High-Dynamic Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...

  6. Benchmarking and Improving GUI Agents in High-Dynamic Environments

    cs.CV 2026-04 conditional novelty 7.0

    DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...

  7. OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

  8. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  9. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  10. How Mobile World Model Guides GUI Agents?

    cs.AI 2026-05 unverdicted novelty 6.0

    Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

  11. LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

  12. BAMI: Training-Free Bias Mitigation in GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

  13. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  14. AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

    cs.CV 2026-04 unverdicted novelty 6.0

    AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.

  15. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  16. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  17. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  18. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  19. Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

    cs.AI 2026-04 unverdicted novelty 5.0

    LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...

  20. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  21. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

  22. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  23. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

  24. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 21 Pith papers · 9 internal anchors

  1. [1]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

  2. [2]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  3. [3]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

  6. [6]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

  7. [7]

    R1-V: Reinforcing super generalization ability in vision-language models with less than $3

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02

  8. [8]

    VLM-R1: A stable and generalizable r1-style large vision-language model

    Haozhan Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. https://github.com/om-ai-lab/VLM-R1, 2025. Accessed: 2025-02-15

  9. [9]

    Ui-r1: Enhancing action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    Cognitive Architectures for Language Agents

    Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2023

  12. [12]

    A Survey on Large Language Model Based Autonomous Agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  13. [13]

    Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration

    Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280, 2023

  14. [14]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

  15. [15]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  17. [18]

    Training agents by reinforcing reasoning, 2025

    Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Training agents by reinforcing reasoning, 2025

  18. [19]

    MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

    Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse. arXiv preprint arXiv:2503.18470, 2025

  19. [20]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

  20. [21]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. In NeurIPS, pages 30811–30849, 2024

  21. [22]

    UIBert: Learning Generic Multimodal Representations for UI Understanding

    Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. Uibert: Learning generic multimodal representations for ui understanding, 2021

  22. [23]

    AMEX: Android Multi-Annotation Expo Dataset for Mobile GUI Agents

    Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490, 2024

  23. [24]

    Mapping Natural Language Instructions to Mobile UI Action Sequences

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776, 2020

  24. [25]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In ACL, 2024

  25. [26]

    EasyR1: An efficient, scalable, multi-modality rl training framework

    Yaowei Zheng, Junting Lu, Shenzhi Wang, and Y Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework, 2025

  26. [27]

    On the Effects of Data Scale on Computer Control Agents

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679, 2024

  27. [28]

    GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. arXiv preprint arXiv:2406.08451, 2024

  28. [29]

    ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. Workshop on Reasoning and Planning for Large Language Models, 2025

  29. [30]

    GUICourse: From General Vision Language Models to Versatile GUI Agents

    Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317, 2024

  30. [31]

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Al-Shikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In ECCV, pages 161–178. Springer, 2024