ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Guohong Liu; Jialei Ye; Jian Luan; Pengzhi Gao; Wei Liu; Yuanchun Li; Yunxin Liu

arxiv: 2605.25160 · v2 · pith:TCICKQUYnew · submitted 2026-05-24 · 💻 cs.AI

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

Pith reviewed 2026-06-30 10:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords GUI agents · synthetic environments · verifiable rewards · mobile GUI benchmark · large-scale synthesis · web-based interfaces · agent evaluation · cross-platform GUI

The pith

ScaleWoB generates high-fidelity GUI environments as backend-free webpages with verifiable rewards for scalable agent evaluation across platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a synthesis framework that converts GUI application designs into interactive web pages accessible by URL. These pages include built-in reward signals and support reset and state control without backend servers or virtual machines. The approach yields over 100 environments and 1000 tasks spanning mobile, desktop, and automotive interfaces. A released subset forms a benchmark of 120 tasks on 63 mobile apps where five current agents average 27.92 percent success, falling to 17.82 percent on long-horizon items, while humans reach 92.08 percent. Performance measured in the synthetic settings correlates with behavior on real applications.

Core claim

ScaleWoB produces 100+ synthesized interactive environments and 1000+ verifiable tasks as backend-free webpages accessible via URL, including a public benchmark of 120 challenging tasks across 63 simulated mobile applications, on which state-of-the-art mobile GUI agents achieve an average success rate of only 27.92 percent (dropping to 17.82 percent on the long-horizon subset) while humans reach 92.08 percent, with the synthetic assessments generalizing to real apps.

What carries the argument

A synthesis pipeline that converts GUI specifications into backend-free interactive webpages equipped with verifiable reward functions and state reset capabilities.
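As a rough illustration of the contract such an environment has to expose, the sketch below models a toy wallet top-up task (loosely inspired by the Figure 4 example) as in-memory state with a reset, an action handler, and a reward check on the final state. The class name `ToyWalletEnv`, its methods, and the action format are all hypothetical. The paper's environments are webpages, and this Python stand-in only mirrors the reset / act / verify idea, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyWalletEnv:
    """Hypothetical stand-in for one synthesized environment (not the paper's API)."""
    target_topup: int = 100   # task goal: top up the wallet by this amount
    initial_balance: int = 0
    balance: int = 0

    def reset(self) -> dict:
        """Restore the initial state; nothing outside this object is touched."""
        self.balance = self.initial_balance
        return self.state()

    def state(self) -> dict:
        return {"balance": self.balance}

    def act(self, action: dict) -> dict:
        """Apply one UI-level action; unknown actions leave the state unchanged."""
        if action.get("type") == "top_up" and action.get("amount", 0) > 0:
            self.balance += action["amount"]
        return self.state()

    def reward(self) -> float:
        """Verifiable reward: 1.0 only if the state change matches the goal exactly."""
        return 1.0 if self.balance - self.initial_balance == self.target_topup else 0.0


env = ToyWalletEnv()
env.reset()
env.act({"type": "top_up", "amount": 100})
success = env.reward()       # reward after completing the task
env.reset()
after_reset = env.reward()   # reward after returning to the initial state
```

Because the reward is computed from state rather than from the agent's own report, the same check can be rerun after every reset without human judgment.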

If this is right

GUI agent training and evaluation can proceed at large scale with near-zero setup cost and without dependence on device emulators or cloud instances.
Reproducible, resettable tasks become available for long-horizon mobile, desktop, and in-vehicle scenarios using a single pipeline.
New benchmarks can be generated and shared simply by publishing URLs rather than distributing virtual-machine images.
The gap between current agent performance and human performance on long-horizon tasks can be quantified under controlled conditions.
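The headline figures are averages of per-task outcomes, reported overall and on a long-horizon subset. A minimal sketch of that aggregation follows. The record layout (`agent`, `long_horizon`, `success`) and the sample rows are made up for illustration and are not results from the paper; averaging per-agent rates is also an assumption about how the mean is formed.

```python
from collections import defaultdict


def success_rates(records):
    """Return (overall %, long-horizon %), each averaged over per-agent rates."""
    per_agent = defaultdict(lambda: {"all": [], "long": []})
    for r in records:
        per_agent[r["agent"]]["all"].append(r["success"])
        if r["long_horizon"]:
            per_agent[r["agent"]]["long"].append(r["success"])

    def mean_pct(key):
        # One success percentage per agent, then the unweighted mean across agents.
        rates = [100.0 * sum(v[key]) / len(v[key]) for v in per_agent.values() if v[key]]
        return sum(rates) / len(rates) if rates else float("nan")

    return mean_pct("all"), mean_pct("long")


# Illustrative rows only (hypothetical agents "a" and "b").
records = [
    {"agent": "a", "long_horizon": False, "success": True},
    {"agent": "a", "long_horizon": True, "success": False},
    {"agent": "b", "long_horizon": False, "success": True},
    {"agent": "b", "long_horizon": True, "success": True},
]
overall, long_horizon = success_rates(records)
```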

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis method could support iterative training loops in which coding agents generate or refine environment specifications for GUI agents.
The low-resource web format opens the possibility of running large-scale agent experiments on consumer hardware or in browser-based sandboxes.
Similar synthesis pipelines might be applied to other interface domains such as web browsers or game UIs to create comparable verifiable benchmarks.

Load-bearing premise

The synthesized web pages replicate the visual layout, interaction dynamics, and reward outcomes of real GUI applications closely enough that agent success rates and rankings transfer to actual apps.

What would settle it

Measure the same set of agents on both the synthetic mobile environments and the corresponding real mobile applications and observe whether success rates and relative rankings remain consistent.
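One way to make "relative rankings remain consistent" concrete is a rank correlation between the two sets of success rates. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman is an assumption here (the paper's own comparison method is not described on this page), and the two input lists are invented placeholder values, not measurements.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Placeholder success rates for five hypothetical agents (not from the paper).
synthetic = [0.10, 0.20, 0.30, 0.40, 0.50]
real = [0.15, 0.18, 0.35, 0.33, 0.60]
rho = spearman(synthetic, real)
```

A coefficient near 1 would indicate that agents are ordered the same way in both settings; with only a handful of agents, the estimate is coarse and should be read alongside the raw per-agent rates.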

Figures

Figures reproduced from arXiv: 2605.25160 by Guohong Liu, Jialei Ye, Jian Luan, Pengzhi Gao, Wei Liu, Yuanchun Li, Yunxin Liu.

**Figure 2.** Two-stage environment synthesis pipeline of SimuWoB.

**Figure 3.** Automatic issue inspection and correction workflow.

**Figure 4.** Example task in SimuWoB. The agent is asked to top up the wallet by 100 euros in a …

**Figure 5.** Experimental results of different agents on SimuWoB. For local models, we evaluate only …

**Figure 6.** Success rate of different task categories across evaluated agents in SimuWoB.

**Figure 7.** Case study of a long-horizon failure: the agent executes UI operations correctly but does not persist key information in context, leading to an incorrect final answer. Mobile GUI agents fall short in long-horizon tasks. Results in …

**Figure 8.** Case study of a vague-description failure: the agent fails to locate the task entry point due to a lack of proactive exploration capabilities following initial failures. Tasks with vague descriptions or inconspicuous functional entry points can confuse the agent. Our analysis shows that agent performance degrades when instructions are underspecified and the true entry point is visually inconspicuous (e.g.,…

**Figure 9.** Fine-grained control results. Agents perform poorly on tasks that require fine-grained control. Fine-grained control is a common requirement in real-world mobile tasks, including dragging a slider to a target position, setting date/time values with pickers, confirming payment via drag gestures, and invoking context menus through long presses. Compared with standard click-and-type tasks, these operations …

Original abstract

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training in realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that produces high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms, including mobile, desktop, and automotive/in-vehicle interfaces, based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScaleWoB gives a practical synthesis pipeline for turning GUI apps into low-overhead web environments with verifiable rewards and releases a mobile benchmark, but the transfer claim to real apps rests on a comparison whose strength is not clear from the abstract.

The main takeaway is that this work supplies a pipeline for generating backend-free webpage versions of GUI apps across mobile, desktop, and automotive platforms, complete with automatic rewards, at a scale of 100+ environments and 1000+ tasks. They also release a benchmark of 120 tasks from 63 simulated mobile apps.

The engineering is the useful part. The environments run via simple URLs with near-zero setup and low resources, which directly tackles the reproducibility and reset problems that come with real apps or VMs. The reported numbers show current agents averaging 27.92% success (17.82% on long-horizon tasks) against a 92.08% human baseline, and the multi-platform coverage is broader than most prior setups limited to open-source apps.

The soft spot is the generalization result. The abstract states that assessments on the synthetic tasks transfer to real apps based on a comparison, but without quantitative fidelity metrics (action equivalence, state-transition match rates, or statistical correlation of outcomes), it is hard to know how much the synthetic environments actually preserve the original dynamics. If the paper only shows a small uncontrolled sample rather than controlled equivalence data, that part of the argument stays thin.

This is aimed at researchers who need scalable environments for GUI agent training or evaluation and are tired of setup friction. The released benchmark and pipeline give it concrete value even if the transfer evidence needs tightening. It deserves a serious referee because the scale, the artifact, and the headroom numbers are substantive enough to check in detail.

Referee Report

1 major / 0 minor

Summary. The paper introduces ScaleWoB, a framework that leverages coding agents to synthesize high-fidelity, backend-free webpage environments for GUI agents across mobile, desktop, and automotive platforms. These environments provide verifiable rewards, require near-zero setup, and scale to 100+ environments and 1000+ tasks; the authors release a benchmark of 120 tasks across 63 simulated mobile apps. Experiments on five state-of-the-art mobile GUI agents report average success rates of 27.92% (17.82% on long-horizon tasks) versus 92.08% for humans, and a comparison on real-world sample tasks is presented to argue that synthetic assessments generalize to real apps.

Significance. If the fidelity and transfer claims hold, the work offers a practical, low-resource alternative to VMs or real-device testing for large-scale GUI agent evaluation and training. The release of a fully synthesized mobile benchmark and the empirical demonstration of substantial headroom in current agents are concrete contributions. The coding-agent synthesis pipeline is a notable strength for reproducibility and scalability.

major comments (1)

[Abstract] Abstract (generalization claim): the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.
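To make one of the requested fidelity metrics concrete, the sketch below computes a state-transition match rate: the same action sequence is replayed in a synthetic environment and in its real counterpart, and the post-action states are compared step by step. The function name, the dictionary state representation, and the sample traces are hypothetical; the paper does not define this metric, so this is only one plausible reading of it.

```python
def transition_match_rate(synthetic_states, real_states):
    """Fraction of steps whose post-action state agrees between two traces.

    Both traces are assumed to come from replaying the same action sequence.
    """
    if len(synthetic_states) != len(real_states):
        raise ValueError("traces must cover the same action sequence")
    if not synthetic_states:
        return float("nan")
    matches = sum(s == r for s, r in zip(synthetic_states, real_states))
    return matches / len(synthetic_states)


# Made-up traces: the third step lands on a different screen in the real app.
synthetic_trace = [{"screen": "home"}, {"screen": "wallet"}, {"screen": "topup"}, {"screen": "done"}]
real_trace = [{"screen": "home"}, {"screen": "wallet"}, {"screen": "confirm"}, {"screen": "done"}]
rate = transition_match_rate(synthetic_trace, real_trace)
```

Exact equality is a strict criterion; in practice a comparison would likely need an abstraction of state (for example, screen identity plus key field values) chosen so that cosmetic differences do not count as mismatches.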

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the generalization claim. We agree that quantitative support is needed to strengthen the assertion and will revise accordingly.

Referee: [Abstract] Abstract (generalization claim): the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.

Authors: We acknowledge the validity of this observation. The current manuscript supports the generalization claim via a qualitative comparison on real-world sample tasks (detailed in the experiments section), which shows consistent agent behavior patterns. However, to make the claim more rigorous and address the lack of quantitative metrics, we will add action-equivalence rates, visual similarity scores, and statistical correlations between synthetic and real success rates in the revised version. These additions will be incorporated into the relevant experimental analysis and referenced in the abstract.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical synthesis and evaluation framework

The paper describes a pipeline for synthesizing backend-free webpage environments from coding agents, then reports measured success rates of GUI agents on 120 tasks and a separate real-app comparison. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The central assertions rest on experimental outcomes (agent success rates, human baselines, generalization checks) rather than any reduction of outputs to inputs by construction. This is the expected non-finding for an applied systems paper whose load-bearing content is the synthesis method and the measured transfer gap.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that web-based simulations can faithfully replicate GUI interactions and reward structures; no free parameters or invented entities are identifiable from the abstract.

axioms (1)

domain assumption: Web-based simulations can provide high-fidelity replicas of GUI interactions and reward structures of real apps.
This underpins the claims of high-fidelity synthesis, verifiable rewards, and generalization to real apps.

Reference graph

Works this paper leans on

69 extracted references · 53 canonical work pages · 22 internal anchors

[1]

Autodroid: Llm-powered task automation in android, 2024

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Autodroid: Llm-powered task automation in android, 2024. URL https://arxiv.org/abs/2308.15272

work page arXiv 2024
[2]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URL https://arxiv.org/abs/2412.04454

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Os-copilot: Towards generalist computer agents with self-improvement, 2024

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024. URL https://arxiv.org/abs/2402.07456

work page arXiv 2024
[5]

Aria-ui: Visual grounding for gui instructions, 2025

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions, 2025. URLhttps://arxiv.org/abs/2412.16256

work page arXiv 2025
[6]

Android in the zoo: Chain-of-action-thought for gui agents, 2024

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents, 2024. URL https://arxiv.org/abs/2403. 02713

2024
[7]

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan. Mobile-agent-v3: Fundamental agents for gui automation, 2025. URLhttps://arxiv.org/abs/2508.15144

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. URL https: //arxiv.org/abs/2408.07199

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Agent s: An open agentic framework that uses computers like a human, 2024

Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s: An open agentic framework that uses computers like a human, 2024. URL https://arxiv.org/abs/2410. 08164

2024
[10]

Autoglm: Autonomous foundation agents for guis, 2024

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

work page arXiv 2024
[11]

Step-gui technical report, 2025

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin ...

work page arXiv 2025
[12]

Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, and Ming Yan. Mobile-agent-v3.5: Multi-platform fundamental gui agents, 2026. URL https://arxiv.org/abs/2602.16855

work page arXiv 2026
[13]

Mai-ui technical report: Real-world centric foundation gui agents, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. Mai-ui technical report: Real-world centric foundation gui agents, 2025. URLhttps://arxiv.org/abs/2512.22047. 10

work page arXiv 2025
[14]

Androidlab: Training and systematic benchmarking of android autonomous agents,

Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, and Yuxiao Dong. Androidlab: Training and systematic benchmarking of android autonomous agents,
[15]

URLhttps://arxiv.org/abs/2410.24024

work page arXiv
[16]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2025. URLhttps://arxiv.org/abs/2405.14573

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URLhttps://arxiv.org/abs/2404.07972

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Crab: Cross-environment agent benchmark for multimodal language model agents, 2025

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, and Guohao Li. Crab: Cross-environment agent benchmark for multimodal language model agents, 2025. URLhttps://arxiv.org/abs/2407.01511

work page arXiv 2025
[19]

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, and Zhiyong Wu. Scienceboard: Evaluating multimodal autonomous agents in realistic scientific workflo...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Windows Agent Arena: Evaluating multi-modal OS agents at scale.arXiv preprint arXiv:2409.08264,

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows agent arena: Evaluating multi-modal os agents at scale, 2024. URLhttps://arxiv.org/abs/2409.08264

work page arXiv 2024
[21]

Mobileagentbench: An efficient and user-friendly benchmark for mobile llm agents, 2024

Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, and Shoufa Chen. Mobileagentbench: An efficient and user-friendly benchmark for mobile llm agents, 2024. URLhttps://arxiv.org/abs/2406.08184

work page arXiv 2024
[22]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments, 2025. URL https://arxiv.org/abs/2512.19432

work page arXiv 2025
[24]

Weblinux: a scalable in-browser and client- side linux and ide

Rémi Sharrock, Lawrence Angrave, and Ella Hamonic. Weblinux: a scalable in-browser and client- side linux and ide. InProceedings of the Fifth Annual ACM Conference on Learning at Scale, L@S ’18, New York, NY , USA, 2018. Association for Computing Machinery. ISBN 9781450358866. doi: 10.1145/3231644.3231703. URLhttps://doi.org/10.1145/3231644.3231703

work page doi:10.1145/3231644.3231703 2018
[25]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024. URL https://arxiv.org/abs/2401.13919

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, and Xipeng Qiu. Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026. URLhttps://arxiv.org/abs/2601.15876

work page arXiv 2026
[28]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024. URL https://arxiv.org/ abs/2401.10935

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024. URLhttps://arxiv.org/abs/2410.23218. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat- Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://arxiv.org/abs/2504.07981

work page arXiv 2025
[31]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URLhttps://arxiv.org/abs/2307.13854

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications, 2025

Nam Huynh and Beiyu Lin. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications, 2025. URL https://arxiv.org/abs/2503. 01245

2025
[33]

From code foundation models to agents and applications: A comprehensive survey and practical guide to code intelligence, 2025

Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, Changzai Pan, Ensheng Shi, Yingshui Tan, Renshuai Tao, Jiajun Wu, Xianjie Wu, Zhenhe Wu, Daoguang Zan, Chenchen Zhang, Wei Zhang, He Zhu, Terry Yue Zhuo, Kerui Cao, Xianfu Cheng, Jun Dong, Shengjie Fang, Zhiwei Fei, Xiangyuan Guan, Qipeng Guo,...

work page arXiv 2025
[34]

Software development life cycle perspective: A survey of benchmarks for code large language models and agents,

Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, and Bin Shi. Software development life cycle perspective: A survey of benchmarks for code large language models and agents,
[35]

URLhttps://arxiv.org/abs/2505.05283

work page arXiv
[36]

Challenges and paths towards ai for software engineering, 2025

Alex Gu, Naman Jain, Wen-Ding Li, Manish Shetty, Yijia Shao, Ziyang Li, Diyi Yang, Kevin Ellis, Koushik Sen, and Armando Solar-Lezama. Challenges and paths towards ai for software engineering, 2025. URL https://arxiv.org/abs/2503.22625

work page arXiv 2025
[37]

ByteDance Seed 1.8

ByteDance. ByteDance Seed 1.8. https://seed.bytedance.com/en/seed1_8, 2026

2026
[38]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: A family of highly capable multimodal models, 2025. URL https://arxiv. org/abs/2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Large language models: A survey, 2025

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2025. URL https://arxiv.org/abs/2402. 06196

2025
[40]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2025. URLhttps://arxiv.org/abs/2303.18223

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Cogagent: A visual language model for gui agents, 2024

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2024. URLhttps://arxiv.org/abs/2312.08914

work page arXiv 2024
[42]

Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Autowebglm: A large language model-based web navigating agent, 2024

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: A large language model-based web navigating agent, 2024. URLhttps://arxiv.org/abs/2404.03648

work page arXiv 2024
[44]

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis, 2024. URLhttps://arxiv.org/abs/2307.12856

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Omniparser: A unified framework for text spotting, key information extraction and table recognition, 2024

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. Omniparser: A unified framework for text spotting, key information extraction and table recognition, 2024. URLhttps://arxiv.org/abs/2403.19128

work page arXiv 2024
[46]

UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, and Dejing Dou. Uground: Towards unified visual grounding with unrolled transformers, 2026. URL https://arxiv.org/abs/ 2510.03853. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, and Thilo Stadelmann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions, 2025. URL https://arxiv. org/abs/2501.16150

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Os agents: A survey on mllm- based agents for general computing devices use, 2025

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, a...

work page arXiv 2025
[49]

Android in the wild: A large-scale dataset for android device control, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control, 2023. URLhttps://arxiv.org/abs/2307.10088

work page arXiv 2023
[50]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants, 2023. URLhttps://arxiv.org/abs/2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Mapping natural language instructions to mobile ui action sequences, 2020

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences, 2020. URL https://arxiv.org/abs/2005.03776

[52]

Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. URL https://arxiv.org/abs/2104.08560

[54]

Meta-gui: Towards multi-modal conversational agents on mobile gui, 2022

Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. Meta-gui: Towards multi-modal conversational agents on mobile gui, 2022. URL https://arxiv.org/abs/2205.11029

[55]

Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web, 2024

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web, 2024. URL https://arxiv.org/abs/2402.17553

[56]

On the effects of data scale on ui control agents, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents, 2024. URL https://arxiv.org/abs/2406.03679

[57]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2025. URL https://arxiv.org/abs/2308.03688

[58]

Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration, 2018. URL https://arxiv.org/abs/1802.08802

[59]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, February 2023

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023. URL https://arxiv.org/abs/2207.01206

[60]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024. URL https://arxiv.org/abs/2401.13649

[61]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718

[62]

Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction, 2024

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, and Kai Yu. Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction, 2024. URL https://arxiv.org/abs/2305.08144

[63]

A3: Android agent arena for mobile gui agents with essential-state procedural evaluation, 2026

Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, and Hongsheng Li. A3: Android agent arena for mobile gui agents with essential-state procedural evaluation, 2026. URL https://arxiv.org/abs/2501.01149

[64]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://api.semanticscholar.org/CorpusID:268232499

[65]

A new era of intelligence with Gemini 3

Google. A new era of intelligence with Gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/, November 2025.

A SimuWoB Environment Synthesizing

Following the pipeline of Figure 2, we first had the model draft a detailed PRD document based on the given metadata, then asked it to write code based on the document. Here follows an...

[66]

Actions: Single-tap on the video area to evoke the control layer; use gravity sensor or tap the button to switch to [full-screen landscape mode]. Membership Subscription Flow(...) Short Video Browsing Flow(...) Visual Interface Guidelines Color Palette • Primary Brand Color: iQIYI Green (#00CC36). Represents vitality and youthfulness. Used for the logo, s...
[67]

Clear Information Hierarchy: Through the card-style design featuring “large images with minimal text”, users can quickly capture the visual focus while scrolling rapidly
[68]

Contextual Design: Strictly distinguishes between the “browsing for content” scenario (bright, efficient) and the “watching content” scenario (dark, immersive), aligning with user mental models
[69]

Monetization Integration: The VIP membership design is not just a functional entry point but an independent visual system that effectively stimulates users’ desire to pay through color psychology
[70]

Ecosystem Loop: Cleverly embeds short videos (Suike) and community (Discovery) into the bottom navigation, forming a content consumption loop of “long-form video attracts → community discussion → short-form video kills time”. After writing, it reviewed the existing codebase, proposed a series of items to be added or modified, updated the PRD document acco...

[71]

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
