ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis
Pith reviewed 2026-06-30 10:54 UTC · model grok-4.3
The pith
ScaleWoB generates high-fidelity GUI environments as backend-free webpages with verifiable rewards for scalable agent evaluation across platforms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScaleWoB produces 100+ synthesized interactive environments and 1000+ verifiable tasks as backend-free webpages accessible via URL. These include a public benchmark of 120 challenging tasks across 63 simulated mobile applications, on which five state-of-the-art mobile GUI agents achieve an average success rate of only 27.92 percent (17.82 percent on the long-horizon subset) while humans reach 92.08 percent. The authors report that assessments made in the synthetic environments generalize to real apps.
What carries the argument
A synthesis pipeline that converts GUI specifications into backend-free interactive webpages equipped with verifiable reward functions and state reset capabilities.
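The shape of such an environment can be illustrated with a minimal sketch. Everything in it is hypothetical: the `ToyEnv` class, its state fields, and the `cart_has_item` check are illustrative assumptions, not the paper's API. It only shows how client-side state, a reset, and a state-based reward check can fit together without a backend.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

State = Dict[str, Any]


@dataclass
class ToyEnv:
    """Hypothetical stand-in for one synthesized environment (not the paper's API)."""
    initial_state: State
    reward_fn: Callable[[State], bool]
    state: State = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.reset()

    def reset(self) -> State:
        # Restore a deep copy so earlier episodes cannot leak into later ones.
        self.state = copy.deepcopy(self.initial_state)
        return self.state

    def step(self, action: Dict[str, Any]) -> State:
        # A single toy action; a real environment would handle taps, swipes, text, etc.
        if action.get("type") == "add_to_cart":
            self.state["cart"].append(action["item"])
        return self.state

    def reward(self) -> float:
        # The reward is computed from the final state alone, so it can be re-checked.
        return 1.0 if self.reward_fn(self.state) else 0.0


def cart_has_item(state: State) -> bool:
    """Hypothetical task check: 'add milk to the cart'."""
    return "milk" in state["cart"]


env = ToyEnv(initial_state={"cart": []}, reward_fn=cart_has_item)
env.step({"type": "add_to_cart", "item": "milk"})
success = env.reward()      # task satisfied in the current state
env.reset()
after_reset = env.reward()  # state restored, task no longer satisfied
```

Checking the reward against the state rather than against the action trace is one plausible design choice here: it accepts any action sequence that reaches the goal state.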
If this is right
- GUI agent training and evaluation can proceed at large scale with near-zero setup cost and without dependence on device emulators or cloud instances.
- Reproducible, resettable tasks become available for long-horizon mobile, desktop, and in-vehicle scenarios using a single pipeline.
- New benchmarks can be generated and shared simply by publishing URLs rather than distributing virtual-machine images.
- The gap between current agent performance and human performance on long-horizon tasks can be quantified under controlled conditions.
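As a small worked example of that quantification, the sketch below computes the human-agent gap in percentage points from the figures quoted in the abstract. One assumption is made and labeled in the code: the source gives no separate human rate for the long-horizon subset, so the overall human rate is used for both comparisons.

```python
# Success rates in percent, as quoted in the abstract.
AGENT_AVG = 27.92           # average over five mobile GUI agents, all 120 tasks
AGENT_LONG_HORIZON = 17.82  # same agents, long-horizon subset
HUMAN = 92.08               # human success rate (overall)


def gap_points(human: float, agent: float) -> float:
    """Gap in percentage points, rounded to two decimals."""
    return round(human - agent, 2)


overall_gap = gap_points(HUMAN, AGENT_AVG)

# Assumption: the overall human rate is reused here because no human rate
# for the long-horizon subset is given in the source.
long_horizon_gap = gap_points(HUMAN, AGENT_LONG_HORIZON)

# How many times more often humans succeed than the average agent.
human_to_agent_ratio = HUMAN / AGENT_AVG

print(f"overall gap: {overall_gap} points")
print(f"long-horizon gap: {long_horizon_gap} points")
print(f"human/agent ratio: {human_to_agent_ratio:.2f}x")
```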
Where Pith is reading between the lines
- The same synthesis method could support iterative training loops in which coding agents generate or refine environment specifications for GUI agents.
- The low-resource web format opens the possibility of running large-scale agent experiments on consumer hardware or in browser-based sandboxes.
- Similar synthesis pipelines might be applied to other interface domains such as web browsers or game UIs to create comparable verifiable benchmarks.
Load-bearing premise
The synthesized web pages replicate the visual layout, interaction dynamics, and reward outcomes of real GUI applications closely enough that agent success rates and rankings transfer to actual apps.
What would settle it
Measure the same set of agents on both the synthetic mobile environments and the corresponding real mobile applications and observe whether success rates and relative rankings remain consistent.
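One way to make "relative rankings remain consistent" concrete is a rank correlation between per-agent success rates in the two settings. The sketch below implements Spearman's rho in plain Python. The success rates are made-up placeholders, not results from the paper, and the choice of Spearman's rho is an assumption rather than the authors' stated method.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder success rates for five hypothetical agents (NOT from the paper).
synthetic_rates = [0.10, 0.20, 0.25, 0.30, 0.45]
real_rates = [0.15, 0.22, 0.20, 0.35, 0.50]

rho = spearman(synthetic_rates, real_rates)
print(f"Spearman rho = {rho:.2f}")
```

With only five agents, a rank correlation has little statistical power, so a check of this kind would be more convincing alongside per-task agreement between the synthetic and real versions of each task.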
Original abstract
GUI agents powered by large language models are advancing rapidly, creating urgent needs for evaluation and training based on realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that can produce high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms including mobile, desktop, and automotive/in-vehicle interfaces based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ScaleWoB, a framework that leverages coding agents to synthesize high-fidelity, backend-free webpage environments for GUI agents across mobile, desktop, and automotive platforms. These environments provide verifiable rewards, require near-zero setup, and scale to 100+ environments and 1000+ tasks; the authors release a benchmark of 120 tasks across 63 simulated mobile apps. Experiments on five state-of-the-art mobile GUI agents report average success rates of 27.92% (17.82% on long-horizon tasks) versus 92.08% for humans, and a comparison on real-world sample tasks is presented to argue that synthetic assessments generalize to real apps.
Significance. If the fidelity and transfer claims hold, the work offers a practical, low-resource alternative to VMs or real-device testing for large-scale GUI agent evaluation and training. The release of a fully synthesized mobile benchmark and the empirical demonstration of substantial headroom in current agents are concrete contributions. The coding-agent synthesis pipeline is a notable strength for reproducibility and scalability.
major comments (1)
- [Abstract] Generalization claim: the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the generalization claim. We agree that quantitative support is needed to strengthen the assertion and will revise accordingly.
-
Referee: [Abstract] Generalization claim: the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.
Authors: We acknowledge the validity of this observation. The current manuscript supports the generalization claim via a qualitative comparison on real-world sample tasks (detailed in the experiments section), which shows consistent agent behavior patterns. However, to make the claim more rigorous and address the lack of quantitative metrics, we will add action-equivalence rates, visual similarity scores, and statistical correlations between synthetic and real success rates in the revised version. These additions will be incorporated into the relevant experimental analysis and referenced in the abstract.
Revision: yes
Circularity Check
No circularity; purely empirical synthesis and evaluation framework
The paper describes a pipeline that uses coding agents to synthesize backend-free webpage environments, then reports measured success rates of GUI agents on 120 tasks and a separate real-app comparison. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The central assertions rest on experimental outcomes (agent success rates, human baselines, generalization checks) rather than any reduction of outputs to inputs by construction. This is the expected non-finding for an applied systems paper whose load-bearing content is the synthesis method and the measured transfer gap.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: web-based simulations can provide high-fidelity replicas of the GUI interactions and reward structures of real apps.