Mobile-Agent-v3: Fundamental Agents for GUI Automation
Pith reviewed 2026-05-20 11:55 UTC · model grok-4.3
The pith
GUI-Owl and Mobile-Agent-v3 set new open-source records for GUI agents on Android and desktop benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUI-Owl achieves state-of-the-art results among open-source end-to-end models on ten GUI benchmarks by incorporating large-scale environment infrastructure for self-evolving trajectory production, diverse foundational agent capabilities for end-to-end decision-making, and scalable environment reinforcement learning with Trajectory-aware Relative Policy Optimization. This leads to Mobile-Agent-v3 improving the scores to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks.
What carries the argument
Self-Evolving GUI Trajectory Production framework that generates high-quality interaction data via automated query generation, correctness validation, and iterative refinement in a self-improving loop.
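The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `generate_queries`, `rollout`, `validate`, and `update_policy` are hypothetical callables standing in for the query generator, the agent acting in a virtual environment, the correctness checker, and the retraining step. The `demo` function uses toy stubs only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Trajectory:
    query: str
    actions: List[str]


def self_evolving_loop(
    generate_queries: Callable[[int], Sequence[str]],
    rollout: Callable[[str], Trajectory],
    validate: Callable[[Trajectory], bool],
    update_policy: Callable[[List[Trajectory]], None],
    rounds: int = 3,
) -> List[Trajectory]:
    """Collect validated trajectories over several rounds (hypothetical sketch)."""
    dataset: List[Trajectory] = []
    for round_idx in range(rounds):
        # 1. Automated query generation for this round.
        queries = generate_queries(round_idx)
        # 2. Roll out the current policy on each query.
        candidates = [rollout(q) for q in queries]
        # 3. Correctness validation: keep only trajectories judged correct.
        accepted = [t for t in candidates if validate(t)]
        dataset.extend(accepted)
        # 4. Refinement: feed the accepted data back to improve the policy.
        update_policy(dataset)
    return dataset


def demo() -> List[Trajectory]:
    """Toy stubs: the 'policy' solves one more task index after each update."""
    skill: Dict[str, int] = {"level": 0}

    def generate_queries(round_idx: int) -> List[str]:
        return [f"task-{round_idx}-{i}" for i in range(4)]

    def rollout(query: str) -> Trajectory:
        task_index = int(query.rsplit("-", 1)[1])
        solved = task_index <= skill["level"]
        return Trajectory(query, ["click", "done" if solved else "fail"])

    def validate(trajectory: Trajectory) -> bool:
        return trajectory.actions[-1] == "done"

    def update_policy(data: List[Trajectory]) -> None:
        skill["level"] = min(3, skill["level"] + 1)

    return self_evolving_loop(generate_queries, rollout, validate, update_policy)
```

In this toy setup the number of accepted trajectories grows each round because the stub policy improves after every update, which is the self-improving behavior the framework is claimed to provide.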
If this is right
- The model supports end-to-end decision making and serves as a modular component in multi-agent systems.
- Scalable asynchronous RL training with TRPO improves online performance on complex tasks such as OSWorld.
- The infrastructure supports diverse data pipelines across Android, Ubuntu, macOS, and Windows while cutting manual annotation.
- Performance gains show stronger integration of UI grounding, planning, action semantics, and reasoning.
Where Pith is reading between the lines
- The self-evolving data loop could transfer to training agents for web interfaces or other software environments.
- Accurate virtual environments might enable direct deployment of these agents on user-owned devices with minimal retraining.
- Combining the framework with existing multi-agent setups could produce more general automation for everyday computing.
Load-bearing premise
Cloud-based virtual environments accurately reproduce the timing, rendering, and error modes of real user devices so that collected trajectories transfer without large distribution shift.
What would settle it
Running the trained Mobile-Agent-v3 agents on physical Android devices and real desktop machines and checking whether success rates match the reported virtual benchmark numbers.
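One way to run that check is a two-proportion z-test on task success counts from the virtual benchmark and from physical devices. The sketch below uses only the Python standard library; the function name and the task counts in the usage note are illustrative assumptions, not figures from the paper.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float, float]:
    """Return (rate difference, z statistic, two-sided p-value).

    Uses the pooled-variance z-test for two independent proportions.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both samples are all-success or all-failure: no evidence of a gap.
        return p_a - p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_a - p_b, z, p_value
```

For example, `two_proportion_z(85, 116, 70, 116)` would compare a hypothetical 85 of 116 virtual successes against a hypothetical 70 of 116 successes on physical devices. A small p-value would suggest that the virtual numbers do not transfer.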
original abstract
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GUI-Owl, a 7B foundational end-to-end GUI agent model that achieves SOTA results among open-source models on ten GUI benchmarks across desktop and mobile settings, reporting 66.4 on AndroidWorld and 29.4 on OSWorld. It further presents Mobile-Agent-v3, a general-purpose framework that improves these scores to 73.3 and 37.7 respectively and claims new SOTA for open-source GUI agent frameworks. Core contributions include a cloud-based virtual environment infrastructure enabling a Self-Evolving GUI Trajectory Production framework (with automated query generation and correctness validation), integration of UI grounding/planning/reasoning capabilities, and a scalable asynchronous RL setup using Trajectory-aware Relative Policy Optimization (TRPO) that achieves 34.9 on OSWorld. Models and code are released at https://github.com/X-PLUG/MobileAgent.
Significance. If the benchmark gains are robust, the work would meaningfully advance open-source GUI agents by demonstrating a scalable, low-annotation pipeline for generating interaction trajectories and aligning them via RL. The concrete numbers on AndroidWorld and OSWorld, combined with the open release of code and models, provide a useful baseline and resource for the community. The self-evolving loop and TRPO formulation represent practical engineering contributions that could generalize to other agent settings.
major comments (2)
- [Section 3] Section 3 (Large-scale Environment Infrastructure and Self-Evolving GUI Trajectory Production): The central empirical claims rest on trajectories and policies produced inside the authors' cloud-based virtual Android/Ubuntu/macOS/Windows instances. The manuscript provides no experiments or metrics quantifying fidelity to the AndroidWorld and OSWorld benchmark environments with respect to screen rendering, input latency, or failure modes. This is load-bearing for the reported improvements (e.g., GUI-Owl-7B at 66.4/29.4 and Mobile-Agent-v3 at 73.3/37.7), as any systematic mismatch would imply distribution shift and prevent apples-to-apples comparison with prior baselines.
- [Scalable Environment RL] Scalable Environment RL section: The description of the fully asynchronous training framework and the introduced TRPO variant lacks (a) the explicit reward function or correctness-validation criteria used inside the self-evolving loop, (b) ablation tables isolating the contribution of TRPO versus standard methods, and (c) error bars or statistical tests on the 34.9 OSWorld score. These omissions make it difficult to assess whether the RL component is responsible for the observed gains or whether the results are sensitive to implementation details.
minor comments (2)
- The abstract states results on 'ten GUI benchmarks' but only details AndroidWorld and OSWorld; a single summary table aggregating all ten would improve readability and allow direct comparison with prior work.
- Notation for TRPO (Trajectory-aware Relative Policy Optimization) is introduced without a formal equation or pseudocode; adding a concise algorithmic box would clarify the method and distinguish it from Trust Region Policy Optimization, which shares the TRPO acronym.
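The text reviewed here does not give the update rule for Trajectory-aware Relative Policy Optimization. Purely to illustrate what such an algorithmic box might contain, the sketch below normalizes trajectory-level rewards within a group of rollouts and assigns each trajectory's advantage to all of its steps. This group-relative normalization is an assumption borrowed from the general family of group-relative policy-optimization methods; the function name and signature are hypothetical and it should not be read as the authors' formulation.

```python
from statistics import mean, pstdev
from typing import List


def trajectory_relative_advantages(
    rewards: List[float], steps_per_trajectory: List[int], eps: float = 1e-6
) -> List[List[float]]:
    """Broadcast a normalized trajectory-level advantage to every step.

    ASSUMPTION: advantages are the trajectory reward minus the group mean,
    divided by the group standard deviation. The source does not confirm this.
    """
    if len(rewards) != len(steps_per_trajectory):
        raise ValueError("one reward and one step count per trajectory required")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    advantages: List[List[float]] = []
    for reward, n_steps in zip(rewards, steps_per_trajectory):
        advantage = (reward - group_mean) / (group_std + eps)
        # Every step in the trajectory shares the same advantage value.
        advantages.append([advantage] * n_steps)
    return advantages
```

Under this assumption, a group in which every rollout gets the same reward yields zero advantages, so only tasks with mixed outcomes would contribute a learning signal.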
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and strengthen the empirical presentation without misrepresenting the current manuscript.
point-by-point responses
-
Referee: [Section 3] Section 3 (Large-scale Environment Infrastructure and Self-Evolving GUI Trajectory Production): The central empirical claims rest on trajectories and policies produced inside the authors' cloud-based virtual Android/Ubuntu/macOS/Windows instances. The manuscript provides no experiments or metrics quantifying fidelity to the AndroidWorld and OSWorld benchmark environments with respect to screen rendering, input latency, or failure modes. This is load-bearing for the reported improvements (e.g., GUI-Owl-7B at 66.4/29.4 and Mobile-Agent-v3 at 73.3/37.7), as any systematic mismatch would imply distribution shift and prevent apples-to-apples comparison with prior baselines.
Authors: We appreciate the referee's emphasis on environment fidelity, which is indeed important for validating the training pipeline. Our cloud-based virtual environments are configured using standard Android emulators and desktop virtualization stacks chosen to match the OS versions, screen resolutions, and action spaces specified in AndroidWorld and OSWorld. The self-evolving trajectories are generated and validated against the same UI element hierarchies and interaction semantics used in the benchmarks. However, the manuscript does not currently include quantitative side-by-side metrics on rendering pixel fidelity, input latency distributions, or failure-mode statistics. To address this concern directly, we will add a new subsection in Section 3 that details the environment configuration parameters, provides qualitative alignment arguments, and discusses why minor discrepancies are unlikely to explain the consistent gains observed across ten benchmarks. We believe this addition will allow readers to better evaluate potential distribution shift.
revision: yes
-
Referee: [Scalable Environment RL] Scalable Environment RL section: The description of the fully asynchronous training framework and the introduced TRPO variant lacks (a) the explicit reward function or correctness-validation criteria used inside the self-evolving loop, (b) ablation tables isolating the contribution of TRPO versus standard methods, and (c) error bars or statistical tests on the 34.9 OSWorld score. These omissions make it difficult to assess whether the RL component is responsible for the observed gains or whether the results are sensitive to implementation details.
Authors: We agree that these elements are necessary for full reproducibility and for isolating the contribution of Trajectory-aware Relative Policy Optimization (TRPO). In the revised manuscript we will (a) explicitly state the reward function and the automated correctness-validation criteria applied within the self-evolving loop, (b) add ablation tables comparing TRPO against standard PPO and other online RL baselines under identical data and environment conditions, and (c) report error bars together with statistical significance tests (e.g., standard deviation and p-values across multiple random seeds) for the 34.9 OSWorld result. These revisions will clarify the role of the asynchronous RL framework and the specific TRPO formulation in the reported performance.
revision: yes
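The error bars and significance tests promised in item (c) could be computed along the following lines. This is a sketch using only the standard library: it reports mean, sample standard deviation, and standard error across seeds, plus Welch's t statistic with Welch-Satterthwaite degrees of freedom for comparing two methods. The function names are hypothetical and the seed scores in the usage note are made up for illustration.

```python
import math
from statistics import mean, stdev, variance
from typing import Sequence, Tuple


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    sample_std = stdev(scores)
    return mean(scores), sample_std, sample_std / math.sqrt(len(scores))


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance test."""
    var_term_a = variance(a) / len(a)
    var_term_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(var_term_a + var_term_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (var_term_a + var_term_b) ** 2 / (
        var_term_a ** 2 / (len(a) - 1) + var_term_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof
```

For instance, `summarize_seeds([34.0, 35.0, 36.0])` gives a mean of 35.0 with a sample standard deviation of 1.0 for three hypothetical seeds. The t statistic and degrees of freedom from `welch_t` can then be looked up against a t distribution to obtain a p-value.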
Circularity Check
No circularity in empirical benchmark claims or self-evolving data pipeline
full rationale
The paper reports measured performance on external public benchmarks (AndroidWorld, OSWorld) after training and evaluation in cloud virtual environments. The Self-Evolving GUI Trajectory Production framework is described as an iterative empirical data-generation and RL procedure that uses the model to refine trajectories, but the final reported scores are not derived by construction from fitted parameters or self-referential definitions; they remain independently falsifiable on held-out benchmarks. No equations, uniqueness theorems, or self-citation chains are invoked to force the results. This is a standard empirical agent paper whose central claims rest on external evaluation rather than tautological reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Virtual cloud environments reproduce real-device timing, rendering, and failure modes sufficiently for policy transfer.
Forward citations
Cited by 23 Pith papers
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
Faithful Mobile GUI Agents with Guided Advantage Estimator
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
-
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.
-
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
-
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
-
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
-
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
-
SE-GA: Memory-Augmented Self-Evolution for GUI Agents
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
-
TopoClaw: A Human-Centric and Topology-Aware Agent Operating System
TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware governance.
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.