pith · machine review for the scientific record

arxiv: 2604.02345 · v1 · submitted 2026-02-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords synthetic · agents · dynamics · ui-oceanus · data · environmental · forward · generative

The pith

UI-Oceanus shows that continual pre-training on forward-dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling with data volume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the high cost of collecting human demonstrations for training AI agents that click, type, and navigate graphical interfaces. Instead of imitation or distillation from a stronger model, the approach lets agents explore interfaces on their own. The key learning signal is forward prediction: the agent tries to generate the next screen state that will follow an action, and the actual system outcome provides direct feedback. This builds an internal model of interface physics. The authors report that this forward-dynamics objective works better than trying to infer past actions from state pairs. They pre-train models on large amounts of this synthetic data and measure gains on navigation tasks. Performance improves with more data, and the gains are larger in live settings than in offline tests.
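The exploration-and-prediction loop described above can be sketched with a toy frequency-based world model. Everything here is illustrative: the paper trains generative vision-language models on rendered screens, not a lookup table, and the screen and action names below are invented for the sketch.

```python
from collections import Counter, defaultdict

# Toy exploration log: (screen, action, next_screen) transitions.
# All screen and action names are illustrative, not from the paper.
TRANSITIONS = [
    ("home", "tap_search", "search"),
    ("home", "tap_search", "search"),
    ("home", "tap_profile", "profile"),
    ("search", "type_query", "results"),
    ("results", "tap_back", "search"),
]

class FrequencyWorldModel:
    """Forward dynamics by majority vote: (screen, action) -> most likely next screen."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def fit(self, transitions):
        for screen, action, nxt in transitions:
            self.table[(screen, action)][nxt] += 1
        return self

    def predict(self, screen, action):
        counts = self.table.get((screen, action))
        return counts.most_common(1)[0][0] if counts else None

def forward_dynamics_accuracy(model, transitions):
    """Score predictions against what the system actually did (the ground truth)."""
    hits = sum(model.predict(s, a) == nxt for s, a, nxt in transitions)
    return hits / len(transitions)

model = FrequencyWorldModel().fit(TRANSITIONS)
print(forward_dynamics_accuracy(model, TRANSITIONS))  # 1.0 on this toy log
```

The point the sketch preserves is that the supervision is free and externally verified: the next screen comes from executing the action, not from a human label or a teacher model.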

Core claim

models utilizing Continual Pre-Training (CPT) on synthetic dynamics outperform non-CPT baselines with an average success rate improvement of 7% on offline benchmarks, which amplifies to a 16.8% gain in real-world online navigation. Furthermore, we observe that navigation performance scales with synthetic data volume.

Load-bearing premise

That forward dynamics prediction, identified via a systematic investigation of self-supervised objectives, is the primary scalable driver, and that autonomous exploration, verified by system execution, yields unbiased, high-density generative supervision for a robust internal world model.

Figures

Figures reproduced from arXiv: 2604.02345 by Dezhi Ran, Gang Cao, Haochuan Lu, Kang Qin, Linyi Li, Mengzhou Wu, Pingzhe Qu, Songhe Zhu, Tao Xie, Wei Dai, Wei Yang, Xiaode Zhang, Xin Chen, Xinyi Wang, Yuan Cao, Yuetang Deng, Yuzhe Guo, Zhi Gong, Zhongpu Wang.

Figure 1: Constructing Generalist GUI Agents via Scalable World Model Learning. (Top) We first establish a robust physical foundation by learning a forward dynamics world model from massive, autonomously explored transitions. (Bottom) We then leverage this internalized world model to instantiate a generalist GUI agent through agentic post-training. To address these limitations, we propose UI-Oceanus, a self-supervis…
Figure 2: Overview of the proposed UI-Oceanus framework. UI-Oceanus consists of four sequential stages: (1) Scalable Acquisition, which autonomously explores diverse GUI applications to generate large-scale raw interaction trajectories; (2) Multi-Step Data Filtering Pipeline, which systematically filters and deduplicates raw interactions based on structural, visual, and semantic criteria; (3) Grounded Instruction Ge…
Figure 3: Scaling behavior of Qwen3-VL series models, from 0% to 100%. Evaluation Protocol. To support large-scale experiments, we evaluate performance on a held-out offline benchmark comprising 8K diverse mini-program tasks not seen during training. Following prior works (Wu et al., 2024b; Zhang et al., 2024; Li et al., 2024), we report Exact Match (EM), which requires both the action type and its parameters to be…
Figure 4: Training Loss Comparison. Inverse Dynamics (orange) exhibits rapid saturation, indicating insufficient task difficulty. In contrast, Forward Dynamics (blue) maintains a higher loss level, providing the sustained gradient signal necessary for effective representation learning.
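The Exact Match (EM) criterion quoted in the Figure 3 caption (action type and parameters must both match) can be sketched as follows. The dictionary-shaped action format is an illustrative assumption, since the caption is truncated and the paper's actual action schema is not shown here.

```python
def exact_match(pred, gold):
    """EM scores 1 only when the action type and every parameter match exactly."""
    return int(pred["type"] == gold["type"] and pred["params"] == gold["params"])

gold = {"type": "tap", "params": {"x": 120, "y": 340}}
print(exact_match({"type": "tap", "params": {"x": 120, "y": 340}}, gold))     # 1
print(exact_match({"type": "tap", "params": {"x": 121, "y": 340}}, gold))     # 0: wrong parameter
print(exact_match({"type": "scroll", "params": {"x": 120, "y": 340}}, gold))  # 0: wrong action type
```

Under this reading, an off-by-one coordinate counts as a full miss, which makes EM a strict lower bound on task success.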
read the original abstract

Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the "distillation ceiling" of synthetic teacher supervision. To transcend these limitations, we propose UI-Oceanus, a framework that shifts the learning focus from mimicking high-level trajectories to mastering interaction physics via ground-truth environmental feedback. Through a systematic investigation of self-supervised objectives, we identify that forward dynamics, defined as the generative prediction of future interface states, acts as the primary driver for scalability and significantly outweighs inverse inference. UI-Oceanus leverages this insight by converting low-cost autonomous exploration, which is verified directly by system execution, into high-density generative supervision to construct a robust internal world model. Experimental evaluations across a series of models demonstrate the decisive superiority of our approach: models utilizing Continual Pre-Training (CPT) on synthetic dynamics outperform non-CPT baselines with an average success rate improvement of 7% on offline benchmarks, which amplifies to a 16.8% gain in real-world online navigation. Furthermore, we observe that navigation performance scales with synthetic data volume. These results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with robust cross-domain adaptability and compositional generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces UI-Oceanus, a framework for scaling GUI agents that shifts focus from trajectory imitation to mastering interaction physics via ground-truth feedback. It claims that a systematic investigation of self-supervised objectives identifies forward dynamics (generative future-state prediction) as the dominant scalable driver over inverse inference; autonomous exploration verified by system execution is converted into high-density synthetic supervision for continual pre-training (CPT) of an internal world model. Models using CPT on this synthetic dynamics data reportedly achieve 7% average success-rate gains on offline benchmarks and 16.8% gains in real-world online navigation, with performance scaling with synthetic data volume.

Significance. If the central claims hold after proper controls, the work would offer a concrete, data-scalable route to robust GUI agents that reduces dependence on human demonstrations and teacher distillation, with potential benefits for cross-domain generalization. The emphasis on verifiable system execution as an external ground truth and the reported scaling behavior are positive features that could influence future agent training pipelines.

major comments (3)
  1. Abstract: the assertion that forward dynamics 'significantly outweighs inverse inference' rests on an unspecified 'systematic investigation of self-supervised objectives,' yet no description of the candidate objectives, matched data volumes, CPT schedules, or quantitative metrics used in that comparison is supplied, leaving the identification of forward dynamics as the primary driver unsupported.
  2. Abstract: the 7% offline and 16.8% online success-rate improvements are reported only for CPT on synthetic dynamics; no ablation results compare forward dynamics against other self-supervised objectives (e.g., inverse inference or reconstruction) at identical data volume and CPT schedule, so the gains cannot be isolated from the effects of additional pre-training data or the CPT procedure itself.
  3. Abstract: the experimental claims cite specific percentage gains and a scaling observation but provide no information on baselines, number of runs, statistical tests, data exclusion criteria, or evaluation protocols, rendering the numbers unverifiable and the soundness of the central empirical claim low.
minor comments (1)
  1. Abstract: the phrase 'high-density generative supervision' is used without a quantitative definition or comparison to the density of human-demonstration data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each of the major comments point by point below, and we will make revisions to the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: Abstract: the assertion that forward dynamics 'significantly outweighs inverse inference' rests on an unspecified 'systematic investigation of self-supervised objectives,' yet no description of the candidate objectives, matched data volumes, CPT schedules, or quantitative metrics used in that comparison is supplied, leaving the identification of forward dynamics as the primary driver unsupported.

    Authors: We agree that the abstract, due to length constraints, does not detail the systematic investigation. The main body of the manuscript describes the self-supervised objectives considered and the comparison metrics in Section 3. To strengthen the presentation, we will revise the abstract to briefly outline the key objectives compared (forward dynamics, inverse inference, and reconstruction) and the primary metric (downstream success rate) used to identify forward dynamics as the dominant objective. revision: yes

  2. Referee: Abstract: the 7% offline and 16.8% online success-rate improvements are reported only for CPT on synthetic dynamics; no ablation results compare forward dynamics against other self-supervised objectives (e.g., inverse inference or reconstruction) at identical data volume and CPT schedule, so the gains cannot be isolated from the effects of additional pre-training data or the CPT procedure itself.

    Authors: We acknowledge this limitation in the current presentation. While the paper focuses on the forward dynamics approach, we will add ablation studies in the revised manuscript that compare forward dynamics prediction against inverse inference and reconstruction objectives using identical data volumes and CPT schedules to isolate the contribution of each objective. revision: yes

  3. Referee: Abstract: the experimental claims cite specific percentage gains and a scaling observation but provide no information on baselines, number of runs, statistical tests, data exclusion criteria, or evaluation protocols, rendering the numbers unverifiable and the soundness of the central empirical claim low.

    Authors: We agree that additional details on the experimental setup are needed for verifiability. In the revised version, we will expand the experimental section to include the baseline models used, the number of independent runs (e.g., 5 runs per condition), statistical significance tests (e.g., paired t-tests), data exclusion criteria, and detailed evaluation protocols for both offline and online settings. revision: yes
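The evaluation protocol the simulated rebuttal proposes (paired runs per condition plus a paired t-test) can be sketched as follows. The success rates below are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def paired_t(xs, ys):
    """Paired t statistic over matched per-run measurements."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of the differences
    return mean / math.sqrt(var / n)

# Hypothetical per-run success rates for 5 paired runs (not the paper's data).
cpt = [0.61, 0.63, 0.60, 0.64, 0.62]
baseline = [0.54, 0.55, 0.53, 0.57, 0.56]
t_stat = paired_t(cpt, baseline)  # about 22.1 on these made-up numbers
```

Pairing matters here because the same seeds and task splits induce correlated noise across conditions; a paired test removes that shared variance, which an unpaired comparison would leave in.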

Circularity Check

0 steps flagged

No significant circularity; derivation relies on claimed external investigation and system-verified data

full rationale

The paper states that a systematic investigation of self-supervised objectives identified forward dynamics as the primary scalable driver, then applies CPT on synthetic data generated via autonomous exploration that is verified directly by system execution. No equations, fitted parameters, or self-citations are shown that reduce the performance claims (7% offline, 16.8% online gains) to the inputs by construction. The gains are presented as experimental outcomes on benchmarks, and the verification step supplies an external ground truth independent of the modeling choice. The derivation chain therefore remains self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that forward dynamics is the dominant scalable objective and that system-verified autonomous exploration produces reliable generative supervision without hidden biases.

axioms (1)
  • domain assumption Forward dynamics prediction acts as the primary driver for scalability and significantly outweighs inverse inference
    Identified through systematic investigation of self-supervised objectives as stated in the abstract.
invented entities (1)
  • internal world model no independent evidence
    purpose: To master interaction physics via generative prediction of future interface states
    Constructed from high-density generative supervision derived from autonomous exploration

pith-pipeline@v0.9.0 · 5574 in / 1244 out tokens · 47263 ms · 2026-05-16T02:50:26.495744+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Corpus complexity matters in pretraining language models

    Ameeta Agrawal and Suresh Singh. Corpus complexity matters in pretraining language models. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 257--263, 2023

  2. [2]

    Claude Opus 4.5, 2025a

    Anthropic. Claude Opus 4.5, 2025a. URL https://www.anthropic.com/claude/opus. Accessed: 2026-01-29

  3. [3]

    Claude Sonnet 4.5, 2025b

    Anthropic. Claude Sonnet 4.5, 2025b. URL https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2026-01-29

  4. [4]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619--15629, 2023

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  6. [6]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  7. [7]

    Seed1.8 model card: Towards generalized real-world agency, 2025

    ByteDance-Seed. Seed1.8 model card: Towards generalized real-world agency, 2025. Accessed: 2026-01-29

  8. [8]

    Guicourse: From general vision language models to versatile GUI agents

    Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile GUI agents. arXiv preprint arXiv:2406.11317, 2024

  9. [9]

    Seeclick: Harnessing GUI grounding for advanced visual GUI agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935, 2024

  10. [10]

    Uishift: Enhancing vlm-based gui agents through self-supervised reinforcement learning

    Longxi Gao, Li Zhang, and Mengwei Xu. Uishift: Enhancing vlm-based gui agents through self-supervised reinforcement learning. arXiv preprint arXiv:2505.12493, 2025

  11. [11]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11): 665--673, 2020

  12. [12]

    Android monkey

    Google. Android Monkey. https://developer.android.com/studio/test/monkey, 2021. Accessed: 2026-01-03

  13. [13]

    Gemini 3 flash, 2025

    Google. Gemini 3 Flash, 2025. URL https://deepmind.google/models/gemini/flash/. Accessed: 2026-01-29

  14. [14]

    Knowledge distillation: A survey

    Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6): 1789--1819, 2021

  15. [15]

    Practical GUI testing of Android applications via model abstraction and refinement

    Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. Practical GUI testing of Android applications via model abstraction and refinement. In ICSE, pages 269--280, 2019

  16. [16]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018

  17. [17]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  18. [18]

    Mastering diverse control tasks through world models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, pages 1--7, 2025

  19. [19]

    Cogagent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for GUI agents. In CVPR, pages 14281--14290, 2024

  20. [20]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1): 1--62, 2022

  21. [21]

    Mobileworldbench: Towards semantic world modeling for mobile agents

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Mobileworldbench: Towards semantic world modeling for mobile agents. arXiv preprint arXiv:2512.14014, 2025

  22. [22]

    On the effects of data scale on UI control agents

    Wei Li, William E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. NeurIPS, 37: 92130--92154, 2024

  23. [23]

    Widget captioning: Generating natural language description for mobile user interface elements

    Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295, 2020

  24. [24]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19498--19508, 2025 a

  25. [25]

    Gui-rewalk: Massive data generation for gui agent via stochastic exploration and intent-aware reasoning

    Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, and Zhaojian Li. Gui-rewalk: Massive data generation for gui agent via stochastic exploration and intent-aware reasoning. arXiv preprint arXiv:2509.15738, 2025 b

  26. [26]

    GUI odyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices

    Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. GUI odyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices. arXiv preprint arXiv:2406.08451, 2024

  27. [27]

    Vimo: A generative visual gui world model for app agents

    Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025

  28. [28]

    When less is more: Investigating data pruning for pretraining llms at scale

    Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564, 2023

  29. [29]

    Update to gpt-5 system card: Gpt-5.2

    OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, 12 2025. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf. Accessed: 2026-01-29

  30. [30]

    Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents

    Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6300--6323, 2025

  31. [31]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  32. [32]

    Scaling synthetic task generation for agents via exploration

    Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration. arXiv preprint arXiv:2509.25047, 2025

  33. [33]

    Badge: prioritizing UI events with hierarchical multi-armed bandits for automated UI testing

    Dezhi Ran, Hao Wang, Wenyu Wang, and Tao Xie. Badge: prioritizing UI events with hierarchical multi-armed bandits for automated UI testing. In ICSE, pages 894--905, 2023

  34. [34]

    Androidinthewild: A large-scale dataset for Android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for Android device control. NeurIPS, 36: 59708--59728, 2023

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  36. [36]

    Guided, stochastic model-based gui testing of android apps

    Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. Guided, stochastic model-based gui testing of android apps. In Proceedings of the 2017 11th joint meeting on foundations of software engineering, pages 245--256, 2017

  37. [37]

    Os-genesis: Automating gui agent trajectory construction via reverse task synthesis

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555--5579, 2025

  38. [38]

    Magicgui: A foundational mobile gui agent with scalable data pipeline and reinforcement fine-tuning

    Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhiheng Xi, Zhihui Cao, Hailiang Pang, et al. Magicgui: A foundational mobile gui agent with scalable data pipeline and reinforcement fine-tuning. arXiv preprint arXiv:2508.03700, 2025

  39. [39]

    Weixin mini program platform capabilities, 2024

    Tencent. Weixin Mini Program platform capabilities, 2024. URL https://developers.weixin.qq.com/miniprogram/dev/platform-capabilities/miniapp/intro/. Accessed: 2026-01-29

  40. [40]

    Behavioral Cloning from Observation

    Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018

  41. [41]

    Rethinking the role of text complexity in language model pretraining

    Dan John Velasco and Matthew Theodore Roque. Rethinking the role of text complexity in language model pretraining. In Proceedings of the First BabyLM Workshop, pages 1--28, 2025

  42. [42]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

  43. [43]

    Vagen:reinforcing world model reasoning for multi-turn vlm agents, 2025

    Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, and Manling Li. Vagen:reinforcing world model reasoning for multi-turn vlm agents, 2025. URL https://vagen-ai.github.io/

  44. [44]

    Gui agents with foundation models: A comprehensive survey

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024

  45. [45]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824--24837, 2022

  46. [46]

    Skill-adaptive imitation learning for UI test reuse

    Mengzhou Wu, Hao Wang, Jun Ren, Yuan Cao, Yuetong Li, Alex Jiang, Dezhi Ran, Yitao Hu, Wei Yang, and Tao Xie. Skill-adaptive imitation learning for UI test reuse. arXiv preprint arXiv:2409.13311, 2024 a

  47. [47]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024 b

  48. [48]

    Mobilerl: Online agentic reinforcement learning for mobile gui agents, 2025 a

    Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents, 2025 a . URL https://arxiv.org/abs/2509.18119

  49. [49]

    Aguvis: Unified pure vision agents for autonomous GUI interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. In ICML, 2025 b

  50. [50]

    Mobile-agent-v3: Foundamental agents for gui automation

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Foundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025

  51. [51]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  52. [52]

    On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025 a

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025 a . URL https://arxiv.org/abs/2512.07783

  53. [53]

    Progrm: Build better gui agents with progress rewards

    Danyang Zhang, Situo Zhang, Ziyue Yang, Zichen Zhu, Zihan Zhao, Ruisheng Cao, Lu Chen, and Kai Yu. Progrm: Build better gui agents with progress rewards. arXiv preprint arXiv:2505.18121, 2025 b

  54. [54]

    Android in the zoo: Chain-of-action-thought for GUI agents

    Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for GUI agents. arXiv preprint arXiv:2403.02713, 2024

  55. [55]

    Agent learning via early experience

    Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025 c

  56. [56]

    Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms

    Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, and Shi-Min Hu. Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795, 2025 d

  59. [59]

    GUI as a state transition graph