ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
The presented framework supplies an integrated open-source pipeline that trains GUI agents through reinforcement learning, standardizes their evaluation across benchmarks, and deploys them to real mobile devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework addresses three gaps: environment instability and closed pipelines in online RL training, evaluation protocols that drift silently across works, and trained agents that rarely reach real users on real devices. It offers open-source reinforcement learning infrastructure with support for both parallel virtual environments and real physical devices, a fully standardized evaluation pipeline across six benchmarks and more than eleven models that achieves 95.8 percent reproduction against official baselines, and deployment to Android, HarmonyOS, and iOS through more than twelve chat platforms with hybrid CLI-GUI control and persistent memory. Training within this pipeline yields improved agent performance.
What carries the argument
The unified framework itself carries the argument: it combines reinforcement learning training infrastructure with parallel virtual and real-device support, a standardized evaluation pipeline across multiple benchmarks, and deployment to real devices via chat platforms with hybrid CLI-GUI control and persistent memory.
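As an illustration of the hybrid control idea, the sketch below shows one plausible routing policy: take a programmatic path when the target app exposes one, and fall back to raw GUI events otherwise. Every name here (GuiAction, CliAction, HybridController, the adb-style command) is hypothetical and does not come from the ClawGUI codebase.

```python
# Hypothetical sketch of hybrid CLI-GUI control; all names are invented
# for illustration and do not correspond to the ClawGUI codebase.
from dataclasses import dataclass

@dataclass
class GuiAction:
    kind: str          # "tap", "swipe", or "type"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class CliAction:
    command: str       # a shell command, e.g. issued via `adb shell`

class HybridController:
    """Routes each agent step to a CLI backend when the target app
    exposes one, otherwise falls back to raw GUI events."""

    def __init__(self, cli_capable_apps: set[str]):
        self.cli_capable_apps = cli_capable_apps

    def route(self, app: str, step: GuiAction) -> GuiAction | CliAction:
        # Prefer the programmatic path for apps with a known CLI surface;
        # the long tail of apps gets taps and swipes on the rendered UI.
        if app in self.cli_capable_apps and step.kind == "type":
            return CliAction(command=f"input text '{step.text}'")
        return step

controller = HybridController(cli_capable_apps={"com.android.settings"})
print(controller.route("com.android.settings", GuiAction("type", text="wifi")))
print(controller.route("com.example.game", GuiAction("tap", x=120, y=640)))
```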
Load-bearing premise
The benefits of the framework rest on the premise that its standardized evaluation pipeline will prevent silent drifts in results when used independently by other groups without extra tuning.
What would settle it
Running the standardized evaluation pipeline on the same models and benchmarks by an outside team and finding success rates that differ noticeably from the published baselines would show that the standardization does not fully eliminate drifts.
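The excerpt does not define how the 95.8 percent reproduction rate is computed. One plausible operationalization, assumed here purely for illustration, counts a (model, benchmark) cell as reproduced when the re-run success rate lands within a relative tolerance of the official number; an outside team's check could then look like the following sketch.

```python
# Hypothetical reproduction check; the tolerance and this definition of
# "reproduction rate" are assumptions, not taken from the paper.

def reproduction_rate(official: dict, reproduced: dict, rel_tol: float = 0.05) -> float:
    """Fraction of (model, benchmark) cells whose reproduced success rate
    lies within rel_tol (relative) of the officially reported one."""
    cells = official.keys() & reproduced.keys()
    hits = sum(
        abs(reproduced[c] - official[c]) <= rel_tol * official[c]
        for c in cells
    )
    return hits / len(cells)

# Invented numbers for illustration only; "AgentX" is a placeholder model.
official = {("MAI-UI-2B", "MobileWorld"): 11.1, ("AgentX", "AndroidLab"): 32.4}
reproduced = {("MAI-UI-2B", "MobileWorld"): 10.8, ("AgentX", "AndroidLab"): 29.0}
print(f"reproduction rate: {reproduction_rate(official, reproduced):.1%}")
```

Under this reading, an independent re-run that yields a noticeably lower rate would be the falsifying observation described above.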
Original abstract
GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
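The abstract names GiGPO with a Process Reward Model for dense step-level supervision but gives no equations. The sketch below is one simplified reading, in the spirit of group-relative methods such as GRPO/GiGPO: PRM scores stand in for dense step rewards and are normalized within a group of rollouts of the same task. The PRM here is a stub, and none of this is the authors' implementation.

```python
# A simplified, assumed reading of dense step-level supervision:
# group-relative advantages over PRM step scores, in the spirit of
# GiGPO/GRPO. Not the paper's implementation; the PRM is a stub.
import numpy as np

def prm_score(state, action) -> float:
    """Stub Process Reward Model returning a step-level reward in [0, 1].
    A real PRM would score (screenshot, action) pairs with a model."""
    return float(np.random.rand())

def group_relative_advantages(step_rewards: list[list[float]]) -> list[np.ndarray]:
    """Given PRM step rewards for a group of rollouts of the same task,
    normalize each step reward against the group mean and std so that
    above-average steps receive positive advantage."""
    flat = np.concatenate([np.asarray(r) for r in step_rewards])
    mu, sigma = flat.mean(), flat.std() + 1e-8
    return [(np.asarray(r) - mu) / sigma for r in step_rewards]

# Four rollouts of the same GUI task with variable episode lengths.
rollouts = [[prm_score(None, None) for _ in range(t)] for t in (3, 5, 4, 6)]
advantages = group_relative_advantages(rollouts)
# Each advantage would then weight the log-probability of its step's
# action inside a clipped policy-gradient loss, as in PPO/GRPO-style updates.
```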
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ClawGUI, a unified open-source framework for GUI agents comprising ClawGUI-RL (RL training infrastructure using GiGPO and a Process Reward Model for dense supervision, supporting virtual and real devices), ClawGUI-Eval (standardized evaluation pipeline across 6 benchmarks and 11+ models with 95.8% reproduction of official baselines), and ClawGUI-Agent (deployment to Android, HarmonyOS, and iOS via 12+ platforms with hybrid control and memory). End-to-end training yields ClawGUI-2B achieving 17.1% success rate on MobileWorld GUI-Only, a 6% gain over the same-scale MAI-UI-2B baseline.
Significance. If the central claims hold, the work provides valuable infrastructure that could reduce environment instability in online RL, curb drifting evaluation protocols, and enable real-device deployment for GUI agents. The reported performance improvement demonstrates the pipeline's potential utility on a challenging benchmark, though the absolute success rate remains modest.
major comments (2)
- [ClawGUI-Eval] ClawGUI-Eval pipeline: The 95.8% reproduction rate against official baselines is central to the claim that the framework eliminates drifting evaluation protocols, yet the manuscript provides no details on how baselines were matched (e.g., exact action-space definitions, success-criterion thresholds, prompt templates, environment seeds, or any benchmark-specific tuning). Without these, it is unclear whether an independent re-implementation using only the released code would recover the same numbers (a hypothetical sketch of such a per-benchmark record appears after this list).
- [Results] Results section (MobileWorld GUI-Only): The 17.1% success rate for ClawGUI-2B and the 6.0% gain over MAI-UI-2B are reported without error bars, number of evaluation runs, variance across seeds, or statistical significance tests. This information is load-bearing for assessing whether the improvement is reliable and reproducible.
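To make the first major comment concrete: the record below shows the kind of per-benchmark specification the referee asks for. Only the field names follow the comment's enumeration (action space, success criterion, prompt template, seeds, tuning); every value is invented for illustration.

```python
# Hypothetical per-benchmark evaluation record; field names follow the
# referee's enumeration, all values are invented for illustration.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BenchmarkEvalConfig:
    benchmark: str
    action_space: tuple[str, ...]      # exact action-space definition
    success_threshold: float           # success-criterion threshold
    prompt_template: str               # verbatim template used for all models
    env_seeds: tuple[int, ...]         # environment seeds, fixed per run
    benchmark_specific_tuning: dict = field(default_factory=dict)

mobileworld = BenchmarkEvalConfig(
    benchmark="MobileWorld-GUI-Only",
    action_space=("tap", "swipe", "type", "back", "home", "wait"),
    success_threshold=1.0,             # task scored as all-or-nothing
    prompt_template="You are a mobile GUI agent. Task: {task}\nScreen: {obs}",
    env_seeds=(0, 1, 2, 3, 4),
    benchmark_specific_tuning={},      # empty if no per-benchmark tuning
)
```

Publishing one such record per benchmark, alongside the code, is what would let an independent team attempt the reproduction the referee describes.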
minor comments (1)
- [Abstract and Introduction] The abstract and introduction refer to '11+ models' and '6 benchmarks' without listing them explicitly; adding a table or appendix enumeration would improve clarity and allow readers to assess coverage.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below and describe the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [ClawGUI-Eval] ClawGUI-Eval pipeline: The 95.8% reproduction rate against official baselines is central to the claim that the framework eliminates drifting evaluation protocols, yet the manuscript provides no details on how baselines were matched (e.g., exact action-space definitions, success-criterion thresholds, prompt templates, environment seeds, or any benchmark-specific tuning). Without these, it is unclear whether an independent re-implementation using only the released code would recover the same numbers.
  Authors: We acknowledge that the manuscript does not currently provide exhaustive textual documentation of the exact baseline-matching procedures. The released ClawGUI-Eval codebase contains the evaluation scripts, configuration files, and prompts used to achieve the reported 95.8% reproduction rate. In the revised version we will add a dedicated appendix (Appendix C) that explicitly lists, for each of the six benchmarks: the precise action-space definitions, success-criterion thresholds, prompt templates, environment seeds, and any benchmark-specific tuning steps. This addition will allow readers to verify the reproduction numbers directly from the paper without needing to inspect the code repository. revision: yes
- Referee: [Results] Results section (MobileWorld GUI-Only): The 17.1% success rate for ClawGUI-2B and the 6.0% gain over MAI-UI-2B are reported without error bars, number of evaluation runs, variance across seeds, or statistical significance tests. This information is load-bearing for assessing whether the improvement is reliable and reproducible.
  Authors: We agree that statistical details are necessary to establish the reliability of the reported improvement. In the revised Results section we will update the MobileWorld GUI-Only numbers to report the mean success rate over five independent evaluation runs that use distinct environment seeds. We will include standard error bars, the observed variance across runs, and the result of a paired t-test confirming that the 6.0% gain over the MAI-UI-2B baseline is statistically significant. These values will appear both in the main text and in the corresponding table. revision: yes
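The promised statistics are straightforward to compute. The sketch below shows the shape of that analysis with placeholder per-seed success rates (not results from the paper), using scipy's standard paired t-test.

```python
# Sketch of the promised statistics; the success rates below are
# placeholder values, not results from the paper.
import numpy as np
from scipy.stats import ttest_rel

# Per-seed success rates (%) over five evaluation runs, paired by seed.
clawgui_2b = np.array([17.5, 16.8, 17.3, 16.9, 17.0])  # placeholder
mai_ui_2b  = np.array([11.4, 10.9, 11.2, 11.0, 11.1])  # placeholder

def mean_and_se(x: np.ndarray) -> tuple[float, float]:
    """Mean and standard error of the mean."""
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

for name, runs in [("ClawGUI-2B", clawgui_2b), ("MAI-UI-2B", mai_ui_2b)]:
    m, se = mean_and_se(runs)
    print(f"{name}: {m:.1f}% +/- {se:.2f}")

# Paired t-test: both models are evaluated on the same seeds, so runs pair up.
t, p = ttest_rel(clawgui_2b, mai_ui_2b)
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```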
Circularity Check
No circularity: paper contains no derivations or equations
Full rationale
The manuscript presents an engineering framework (ClawGUI-RL, ClawGUI-Eval, ClawGUI-Agent) and reports empirical results such as 17.1% success rate and 95.8% reproduction rate. No mathematical derivations, equations, fitted parameters, or first-principles claims appear in the provided text. Central assertions are statements about pipeline performance and standardization rather than quantities defined in terms of themselves or reduced by self-citation chains. The reproduction claim is an empirical measurement against baselines, not a self-referential construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Process reward models can provide stable dense supervision for GUI agent RL training
- domain assumption: Standardized evaluation pipelines can eliminate silent protocol drift across independent works
invented entities (2)
- ClawGUI-RL infrastructure: no independent evidence
- ClawGUI-Eval pipeline: no independent evidence
Forward citations
Cited by 2 Pith papers
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.