MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Bojiang Zhou; Dingbang Wu; Haiyang Wang; Han Xiao; Lue Fan; Rui Hao; Shuzhe Wu; Zhaoxiang Zhang; Zhenghong Li; Zheng Ju

arxiv: 2605.26114 · v2 · pith:NX5YKRNLnew · submitted 2026-05-25 · 💻 cs.AI · cs.CL

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

Pith reviewed 2026-06-29 21:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords mobile simulation · GUI agents · reinforcement learning · sim-to-real transfer · deterministic judging · parallel environments · structured state · verifiable evaluation

The pith

MobileGym uses structured JSON state and deterministic judging to enable scalable RL training for mobile GUI agents that transfers to real devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MobileGym as a browser-hosted simulation environment that represents mobile app states as structured JSON for forking, comparison, and deterministic outcome judging. This design supports hundreds of parallel instances on a single server at about 400 MB of memory each and provides dense rewards for online reinforcement learning. The accompanying benchmark supplies 416 parameterized task templates across 28 apps, with built-in judges that avoid free-text matching failures. A case study reports that GRPO training on Qwen3-VL-4B-Instruct produces a 12.8 percentage point improvement on the 256-task test set, and that on a 59-task real-device subset, real-device execution retains 95.1 percent of the simulation-side training gain.

Core claim

MobileGym captures the full environment state as structured JSON that can be configured, forked, and compared, paired with a layered state model and declarative task framework that enables a single programmatic judging mechanism to deliver both deterministic verdicts and dense RL rewards. A single server can host hundreds of instances with roughly 400 MB memory per instance and three-second cold starts. In the reported Sim-to-Real case study, GRPO training on Qwen3-VL-4B-Instruct yields a +12.8 percentage point gain on the 256-task test set, while real-device execution on the 59-task subset retains 95.1 percent of the simulation-side training gain.

What carries the argument

Structured JSON state representation together with deterministic state-based judging that supplies both evaluation verdicts and dense RL rewards.
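To make the mechanism concrete, here is a minimal sketch of state forking and state-diff judging over a JSON-like dictionary. It is not MobileGym's implementation: the function names (`fork`, `diff`, `judge`), the slash-separated path format, the example state, and the reward rule (fraction of expected changes achieved) are all assumptions chosen for illustration.

```python
import copy

def fork(state):
    """Return an independent copy so parallel rollouts cannot affect each other."""
    return copy.deepcopy(state)

def diff(before, after, path=""):
    """Map the path of each changed leaf to its (old, new) values."""
    if isinstance(before, dict) and isinstance(after, dict):
        changes = {}
        for key in sorted(set(before) | set(after)):
            changes.update(diff(before.get(key), after.get(key), f"{path}/{key}"))
        return changes
    return {} if before == after else {path: (before, after)}

def judge(initial, final, expected):
    """Return (success, reward) from the state diff alone.

    `expected` maps a path to the value it must hold at the end.
    Success needs every expected change and no other change (hypothetical rule).
    """
    changes = diff(initial, final)
    hits = sum(1 for p, want in expected.items()
               if p in changes and changes[p][1] == want)
    side_effects = [p for p in changes if p not in expected]
    success = hits == len(expected) and not side_effects
    reward = hits / len(expected) if expected else 0.0
    return success, reward

# Hypothetical initial state and task target.
initial = {"settings": {"wifi": True, "bluetooth": False}, "messages": {"drafts": 0}}
expected = {"/settings/wifi": False, "/settings/bluetooth": True}

rollout_a = fork(initial)
rollout_a["settings"]["wifi"] = False
rollout_a["settings"]["bluetooth"] = True

rollout_b = fork(initial)
rollout_b["settings"]["wifi"] = False  # only half of the task is done

print(judge(initial, rollout_a, expected))  # (True, 1.0)
print(judge(initial, rollout_b, expected))  # (False, 0.5)
```

Because the verdict depends only on the two states, the same inputs always give the same output, and the partial credit in `rollout_b` shows how one judge could supply both a pass/fail verdict and a graded reward.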

If this is right

Online RL for mobile GUI agents becomes feasible at scale using modest server hardware for hundreds of parallel rollouts.
Task evaluation becomes reliable through structured AnswerSheet protocols that eliminate free-text matching failures.
State forking and direct comparison support efficient exploration and debugging during agent training.
Agents trained entirely in simulation produce performance gains that largely persist when deployed on physical mobile devices.
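The contrast between free-text matching and typed-field checking can be sketched as follows. The paper's Figure 4 caption says only that AnswerSheet uses GUI form filling and type-specific checks over typed fields; the field types (`integer`, `date`, `choice`), the schema layout, and the function names below are hypothetical.

```python
from datetime import date

def normalize(kind, raw):
    """Parse a submitted string into a comparable value for its field type."""
    text = raw.strip()
    if kind == "integer":
        return int(text.replace(",", ""))
    if kind == "date":
        return date.fromisoformat(text)
    if kind == "choice":
        return text.casefold()
    raise ValueError(f"unknown field type: {kind}")

def check_sheet(schema, submitted, gold):
    """Accept only if every typed field parses and equals the gold value."""
    for name, kind in schema.items():
        try:
            if normalize(kind, submitted[name]) != normalize(kind, gold[name]):
                return False
        except (KeyError, ValueError):
            return False  # missing or unparseable field is a failure
    return True

# Hypothetical task: report an unread count and a due date.
schema = {"unread_count": "integer", "due": "date"}
gold = {"unread_count": "1250", "due": "2026-05-25"}

equivalent = {"unread_count": " 1,250 ", "due": "2026-05-25"}
leaked = {"unread_count": "maybe 12 or 1250", "due": "2026-05-25"}

print(check_sheet(schema, equivalent, gold))          # True
print(check_sheet(schema, leaked, gold))              # False
print(gold["unread_count"] in leaked["unread_count"])  # True: substring match is fooled
```

A naive exact-string comparison would reject `equivalent`, and a substring heuristic would accept `leaked`; the typed check gets both cases right under these assumptions.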

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The JSON state model could be adapted to create similar verifiable environments for desktop or web GUI agents.
Parameterized task templates with deterministic judges could support automated curriculum learning by tracking measurable state progress.
The low cold-start time and memory footprint suggest the platform could dynamically scale instance counts during large training runs.
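A back-of-the-envelope check of the "hundreds of instances per server" figure, using the reported footprint of about 400 MB per instance. The server sizes and the 8 GB reserve for the operating system and trainer are assumptions, and the estimate is a memory bound only: CPU and I/O limits are ignored.

```python
def max_instances(server_ram_gb, per_instance_mb=400.0, reserved_gb=8.0):
    """Upper bound on concurrent instances from memory alone."""
    usable_mb = max(server_ram_gb - reserved_gb, 0) * 1024
    return int(usable_mb // per_instance_mb)

# Hypothetical server sizes in GB of RAM.
for ram in (64, 128, 256):
    print(ram, max_instances(ram))
# 64 143
# 128 307
# 256 634
```

Under these assumptions a 128 GB machine is bounded at roughly 300 instances, which is consistent with the order of magnitude the paper reports.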

Load-bearing premise

The structured JSON state representation and deterministic judging mechanism accurately capture task outcomes for everyday mobile apps without access to proprietary backends or full device emulation fidelity.

What would settle it

If real-device execution on the 59-task subset shows retention of the simulation training gain substantially below 95.1 percent, for example below 70 percent, the transfer result would be falsified.
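The retention figure can be read as a ratio of two gains. The sketch below assumes the definition retention = (real trained − real base) / (sim trained − sim base) on the same task subset; the source does not spell out the formula, and the success rates used here are made-up placeholders, not values from the paper.

```python
def gain_retention(sim_base, sim_trained, real_base, real_trained):
    """Fraction of the simulation-side gain that survives on real devices."""
    sim_gain = sim_trained - sim_base
    if sim_gain <= 0:
        raise ValueError("retention is undefined without a positive simulation gain")
    return (real_trained - real_base) / sim_gain

# Placeholder success rates in percent (NOT the paper's numbers).
retention = gain_retention(sim_base=30.0, sim_trained=50.0,
                           real_base=28.0, real_trained=46.0)
print(round(retention, 3))   # 0.9
print(retention < 0.70)      # False: above the 70 percent threshold named above
```

With this definition, a replication would compute the same ratio from its own four success rates and compare it against the threshold.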

Figures

Figures reproduced from arXiv: 2605.26114 by Bojiang Zhou, Dingbang Wu, Haiyang Wang, Han Xiao, Lue Fan, Rui Hao, Shuzhe Wu, Zhaoxiang Zhang, Zhenghong Li, Zheng Ju, Zichen Liu.

**Figure 1.** Example screens from MOBILEGYM. Annotated launcher and messaging screens showing MOBILEGYM’s configurable and sandboxed capabilities.

**Figure 2.** End-to-end workflow of MOBILEGYM. A structured state supports task instantiation, parallel rollout forking, and state-diff verification. The resulting judgments are then converted into benchmark metrics and RL rewards.

**Figure 3.** System capabilities and state model of MOBILEGYM. App views are produced by composing read-mostly World Data, a per-environment Runtime Overlay, and the OS Runtime. The resulting structured environment state supports snapshot/reset/fork and deterministic state-diff judging.

**Figure 4.** AnswerSheet protocol. Free-text heuristics can reject equivalent answers or accept leaked reasoning that contains the gold answer. AnswerSheet instead uses GUI form filling and type-specific checks over typed fields.

**Figure 5.** Sim-to-Real transfer of GRPO training gains. Per-bucket Success Rate on the 59-task signal-bucket subset and the overall Signal Total. In the legend, Sim/Real denotes the evaluation environment and Base/Trained denotes before/after GRPO. Sim columns are 4-seed averages; Real columns are pass@1 and all manually audited (Appendix J).

**Figure 6.** Sim-to-Real OOD generalization on Reddit_CreatePostToCommunity. The real-device r/China_irl community requires a flair tag before submission. Top row, trained model recovery (4 keyframes from a 22-step trajectory): step 13 clicks the grayed “Post”; step 15 attends to the asterisk on the Add tags & flair pill and infers that flair is a required field; step 16 picks the “Tech” flair; step 18 the “Post” button…

Original abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
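Why dense rewards matter for GRPO can be seen from its group-relative advantage, which standardizes each rollout's reward against the other rollouts of the same task. The sketch below uses the commonly described form A_i = (r_i − mean) / (std + ε) with a population standard deviation; the group size, the ε value, and the reward values are assumptions, and the paper's exact training configuration is not given in the source.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts forked from the same initial state.
binary_all_fail = [0.0, 0.0, 0.0, 0.0]   # pass/fail verdicts only
dense_partial = [0.0, 0.5, 0.0, 0.25]    # graded judge rewards

print(group_relative_advantages(binary_all_fail))  # [0.0, 0.0, 0.0, 0.0]
print([round(a, 2) for a in group_relative_advantages(dense_partial)])
```

When every rollout in a group fails, binary rewards give all-zero advantages and therefore no learning signal, while graded rewards still rank the rollouts that made partial progress.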

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MobileGym gives a lightweight browser simulator with JSON state capture for verifiable mobile GUI tasks and cheap parallel rollouts, plus a 416-task benchmark; the 95.1% sim-to-real retention on 59 tasks is the claim that needs checking.

MobileGym is a browser-hosted simulator that captures mobile app states as structured JSON for deterministic task judging and supports cheap parallel rollouts. The main contribution is this setup plus a benchmark of 416 task templates across 28 apps, with a case study showing RL gains that mostly carry over to real devices on a subset.

They do a few things right. The layered state model and declarative tasks make it practical to define many tasks without heavy coding. Parallel hosting at 400 MB per instance and 3-second cold starts is a clear win for scaling RL experiments. The AnswerSheet protocol for evaluation avoids some common matching issues. The GRPO experiment on Qwen3-VL gives a +12.8 point lift on the 256-task test set, which is a solid data point.

The soft spot is the sim-to-real result. Real-device execution retains 95.1% of the simulation gain on 59 tasks. That number is interesting, but it depends on the JSON judge matching actual outcomes. Since the environment skips proprietary backends and full emulation, any task hinging on unexposed server state or timing can diverge between simulation and device. The paper does not say how the 59-task subset was picked, so the selection may favor tasks where simulated and real outcomes agree. Without more detail on mismatches or a broader comparison, the retention figure is hard to generalize from.

This paper is for researchers in mobile GUI agents who want to run larger RL loops without burning through device fleets. The benchmark could serve as a starting point for standardized testing. The work shows clear thinking on the engineering constraints of agent training.

I would send it to peer review. The platform is new enough and the empirical claims are specific enough to warrant referee input, even if revisions are needed on the evaluation details.

Referee Report

1 major / 1 minor

Summary. The manuscript presents MobileGym, a browser-hosted lightweight simulation platform for mobile GUI agent research. It claims to enable verifiable task outcomes via a layered structured-JSON state model and deterministic programmatic judging, while supporting highly parallel RL rollouts (hundreds of instances per server at ~400 MB memory and ~3 s cold start). The accompanying MobileGym-Bench supplies 416 parameterized task templates (256 test, 160 train) across 28 apps with an AnswerSheet protocol. A Sim-to-Real case study reports that GRPO on Qwen3-VL-4B-Instruct yields a +12.8 percentage point gain on the 256-task test set, with real-device execution on a 59-task subset retaining 95.1% of the simulation-side training gain.

Significance. If the fidelity and scalability claims hold, MobileGym would provide a practical, low-cost alternative to full device emulation for everyday mobile apps, enabling reproducible online RL with dense rewards and deterministic evaluation. The structured state representation, declarative task framework, and parallel rollout efficiency are concrete engineering contributions that could accelerate GUI agent research. The benchmark size and reported resource numbers add practical value; the sim-to-real retention figure, if shown to be robust, would strengthen the case for simulation-based training.

major comments (1)

[Sim-to-Real case study] (abstract and associated results): The selection criteria and composition of the 59-task real-device signal subset are not specified. This detail is load-bearing for the 95.1% retention claim, because any systematic exclusion of tasks whose outcomes depend on unmodeled server responses, timing, or proprietary backends would make the retention percentage non-generalizable to the full 256-task set.

minor comments (1)

[Abstract] The memory and cold-start figures (~400 MB, ~3 s) would be more informative with a brief statement of the measurement conditions or hardware baseline used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the sim-to-real case study. We address the concern point by point below.

Referee: [Sim-to-Real case study] (abstract and associated results): The selection criteria and composition of the 59-task real-device signal subset are not specified. This detail is load-bearing for the 95.1% retention claim, because any systematic exclusion of tasks whose outcomes depend on unmodeled server responses, timing, or proprietary backends would make the retention percentage non-generalizable to the full 256-task set.

Authors: We agree that the selection criteria and composition of the 59-task subset must be specified for the retention claim to be interpretable. The current manuscript does not provide these details. In the revised version we will add an explicit subsection describing the subset construction process, including the sampling strategy across the 28 apps and 256 test tasks, the criteria applied to ensure feasibility on physical devices (e.g., exclusion of tasks requiring live external services or precise timing not captured in the JSON state model), and a breakdown of the subset by app category and task type. We will also report any observed differences between the subset and the full test set.

Revision: yes
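One simple way to report how the subset differs from the full test set is to compare their category distributions. The sketch below computes per-category shares and the total variation distance between them; the task records, the `app` key, and the choice of distance are illustrative assumptions, not something the source specifies.

```python
from collections import Counter

def composition(tasks, key):
    """Share of tasks per category, for a non-empty task list."""
    counts = Counter(task[key] for task in tasks)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def composition_gap(full, subset, key):
    """Total variation distance between the two category distributions (0 to 1)."""
    p, q = composition(full, key), composition(subset, key)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

# Hypothetical task records; a real audit would use the benchmark's own metadata.
full_set = [{"app": "mail"}] * 2 + [{"app": "maps"}] * 2
subset = [{"app": "mail"}]

print(composition(full_set, "app"))               # {'mail': 0.5, 'maps': 0.5}
print(composition_gap(full_set, subset, "app"))   # 0.5
```

A gap near 0 would indicate that the subset mirrors the full set on that attribute, while a large gap would flag the kind of selection effect the referee is asking about.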

Circularity Check

0 steps flagged

No circularity: empirical platform and benchmark results with no derivations or self-referential predictions

The paper introduces MobileGym as a simulation platform using structured JSON state and deterministic judging, then reports direct empirical outcomes from GRPO training (+12.8 pp on 256-task test set) and sim-to-real retention (95.1% on 59-task subset). No equations, derivations, fitted parameters, or first-principles claims exist that could reduce to inputs by construction. The central results are measurements on the described environment and real devices rather than predictions derived from the platform's own definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing support. The skeptic concern about JSON fidelity matching real outcomes is a question of assumption validity and generalizability, not circularity per the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated; the platform itself constitutes the contribution with implicit assumptions about state fidelity.

pith-pipeline@v0.9.1-grok · 5801 in / 1195 out tokens · 24681 ms · 2026-06-29T21:24:29.921448+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages

[1]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 16022–16076

Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 16022–16076. Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, and M...
[2]

browser refresh = device reboot

Openapps: Simulating environment variations to measure ui-agent reliability. arXiv preprint arXiv:2511.20766. Venus-Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, and 8 othe...

work page arXiv 2026
[3]

The primary 8-model calibration yields +21.3/+25.4/+11.1/+0.9 pt, while the 4-model calibration yields +23.0/+22.5/+7.3/+0.7 pt

Sim-to-Real lift concentrates on L1–L2 and diminishes sharply at L3–L4 under both calibrations. The primary 8-model calibration yields +21.3/+25.4/+11.1/+0.9 pt, while the 4-model calibration yields +23.0/+22.5/+7.3/+0.7 pt. In both cases, most of the training lift lies in L1–L2 and nearly vanishes on L4
[4]

declarative-statement

L4 isolates the frontier under both calibrations. Under the 8-model calibration, only Gemini 3.1 Pro stays meaningfully above the floor on L4 (21.9%), while every other model is ≤ 6.2%. Under the 4-model calibration, Gemini remains the only model above 10% on L4 (12.2%), while all other models are ≤ 8.1%. The trained 4B model also exceeds AutoGLM-Phone-...

2026
