arxiv: 2605.06761 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CV· cs.LG

Recognition: no theorem link

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

O\u{g}uzhan Fatih Kar , Roman Bachmann , Yuanzheng Gong , Anders Boesen Lindbo Larsen , Afshin Dehghan

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3

classification 💻 cs.AI cs.CVcs.LG

keywords visual web agentsweb navigationreinforcement learningenvironment synthesisHTTP cachingscalable trainingreproducible environmentsLLM-based synthesis

0 comments

The pith

Weblica uses HTTP caching and LLM synthesis to create thousands of stable web environments for training visual agents that navigate better than similar-sized baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Weblica as a way to overcome the scarcity of diverse, scalable training data for visual web navigation agents by turning real websites into reproducible environments. It does this through two techniques: capturing and replaying HTTP interactions to keep visual states stable while preserving clicks and dynamics, and using LLMs to generate new environments grounded in actual sites and basic navigation skills. Scaling reinforcement learning across these thousands of environments produces Weblica-8B, which beats open-weight models of comparable size on standard benchmarks, requires fewer inference steps, improves when given more test-time compute, and approaches the performance of closed API models. A reader would care because current web agent training is bottlenecked by live-site variability and limited simulation coverage, and solving that could make robust open agents practical.

Core claim

Weblica (Web Replica) constructs reproducible and scalable web environments by combining HTTP-level caching, which replays stable visual states while maintaining interactive behavior, with LLM-based environment synthesis anchored in real-world websites and core navigation skills. This framework enables scaling RL training to thousands of diverse environments and tasks, yielding Weblica-8B that outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and remains competitive with API models.

What carries the argument

The Weblica framework, which relies on HTTP-level caching to capture and replay interactive visual states plus LLM-driven synthesis of new environments from real sites and navigation primitives.

If this is right

Reinforcement learning for web agents can be scaled to far larger numbers of tasks without continuous live web access during training.
Training runs become more reproducible because cached environments remove the variability caused by changing live sites.
Allocating extra compute at test time yields further gains, indicating the model can be improved post-training without retraining.
Open-weight models can reach performance levels close to closed API systems when environment diversity is increased through synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same caching-plus-synthesis approach might transfer to training agents in other interactive digital domains such as mobile apps or desktop software.
If the synthesized environments prove sufficiently representative, human-designed evaluation suites could be partially replaced by automatically generated ones.
Extending the method to longer-horizon or multi-step tasks could expose where caching fails to preserve complex state dependencies.

Load-bearing premise

Cached HTTP states will continue to behave like live websites without artifacts or missing interactions, and the LLM-generated environments will represent enough real-world web diversity and dynamics to avoid biasing what the agent learns.

What would settle it

Running the trained agents on live uncached versions of the benchmark websites and checking whether success rates or step counts degrade substantially relative to the cached training environments.

read the original abstract

The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Weblica scales reproducible web environments via HTTP caching plus LLM synthesis, but the fidelity of those environments to real dynamics is not directly validated.

read the letter

Colleague, the main point is that this paper introduces Weblica as a way to build thousands of training environments for visual web agents by combining HTTP-level caching for replay with LLM synthesis of tasks from real sites. That specific combination lets them move RL training past the usual limits of offline trajectories or a few hand-built simulations, and they claim their resulting 8B model beats similar open-weight baselines on web navigation tasks while using fewer steps and scaling with extra test-time compute. If the environments are faithful, this addresses a real practical bottleneck in the area. The framework itself looks straightforward to implement and emphasizes reproducibility, which is a clear strength given how fast web pages change. They ground the synthesis in actual websites and core navigation skills, which gives it more grounding than pure synthetic generation. The soft spots are around validation. The abstract and claims rest on the idea that cached HTTP responses preserve interactive behavior without artifacts and that the synthesized tasks reflect real web diversity, but there are no side-by-side traces, success-rate comparisons between live and cached versions, or statistics on action or state diversity to check this. Performance numbers are stated without the benchmark details, baselines, or error bars that would let a reader judge the effect sizes. These gaps make the results harder to interpret right now. This is aimed at people building and training web agents who need larger-scale RL data. A reader in that niche would pick up usable ideas on environment construction even if they have to add their own checks. It is solid enough on the problem and method to deserve peer review rather than a desk reject, though reviewers will likely ask for more direct evidence on the caching and synthesis premises.

Referee Report

3 major / 1 minor

Summary. The paper introduces Weblica, a framework for scalable, reproducible training of visual web agents. It combines HTTP-level caching to replay stable visual states while preserving interactivity with LLM-based synthesis of environments grounded in real websites and core navigation skills. The authors scale RL training across thousands of such environments and report that their Weblica-8B model outperforms open-weight baselines of similar size on multiple web navigation benchmarks, uses fewer inference steps, exhibits favorable scaling with test-time compute, and is competitive with API-based models.

Significance. If the core premises hold, the work offers a practical route to large-scale RL for web agents by addressing data scarcity and non-reproducibility in dynamic web settings. The explicit focus on reproducibility via caching and the scaling to thousands of environments are concrete strengths that could enable follow-on research; the test-time scaling observation is also potentially useful if quantified.

major comments (3)

[Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).
[Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.
[Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.

minor comments (1)

[Abstract] Abstract: Key quantitative results (e.g., exact benchmark scores, step reductions, or scaling curves) are omitted, reducing the reader's ability to gauge the magnitude of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify how to strengthen the presentation of our framework and results. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses

Referee: [Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).

Authors: We agree that explicit quantitative validation would make the caching claim more robust. The design replays exact HTTP responses to maintain visual and interactive fidelity, but we will add a new subsection in Section 3 with side-by-side comparisons: agent success rates on live vs. cached versions of the same sites, interaction trace similarity metrics, and state-drift statistics over multiple episodes. These results will be reported with the revised manuscript. revision: yes
Referee: [Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.

Authors: We acknowledge the need for fuller experimental reporting. The current manuscript lists the open-weight baselines and reports aggregate metrics, but we will expand Section 4 to explicitly name all baselines, report the number of evaluation runs (with seed details), include error bars or standard deviations, and add statistical significance tests (e.g., paired t-tests) where appropriate. This will directly support the robustness of the performance and efficiency claims. revision: yes
Referee: [Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.

Authors: We recognize that quantitative support for representativeness would strengthen the synthesis claims. While the synthesis is grounded in real websites and core navigation skills, we will add diversity metrics in Section 3.2, including action-distribution histograms, entropy measures over DOM state changes, and direct comparisons against publicly available real-web task corpora. These additions will be included in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks

full rationale

The paper introduces a framework for scalable web environments via HTTP caching and LLM synthesis, scales RL training, and reports empirical outperformance of Weblica-8B on multiple web navigation benchmarks. No equations, fitted parameters, or predictions are defined in terms of themselves. Central claims are validated against independent external test sets rather than reducing to training inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing for the results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on two domain assumptions about caching fidelity and LLM synthesis quality; no free parameters or new physical entities are introduced in the abstract.

axioms (2)

domain assumption HTTP-level caching can capture and replay stable visual states while preserving interactive behavior
Core premise enabling reproducible environments; stated in the framework description.
domain assumption LLM-based synthesis can generate environments grounded in real-world websites and core navigation skills
Used to scale from limited real sites to thousands of diverse tasks.

pith-pipeline@v0.9.0 · 5468 in / 1331 out tokens · 45872 ms · 2026-05-11T01:22:34.753401+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 15 internal anchors

[1]

Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, et al. Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

work page arXiv 2025
[2]

Claude computer use.https://www.anthropic.com/news/3-5-models-and-computer-use, 2024

Anthropic. Claude computer use.https://www.anthropic.com/news/3-5-models-and-computer-use, 2024. Accessed: 2026- 04-30

work page 2024
[3]

Claude code.https://www.anthropic.com/claude-code, 2025

Anthropic. Claude code.https://www.anthropic.com/claude-code, 2025. Accessed: 2026-04-30

work page 2025
[4]

Introducing Claude Opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, November 2025

Anthropic. Introducing Claude Opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, November 2025. Accessed: 2026-04-30

work page 2025
[5]

arXiv preprint arXiv:2511.19663 , year=

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, et al. Fara-7b: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025

work page arXiv 2025
[6]

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Webgym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[8]

Gemini computer use.https://ai.google.dev/gemini-api/docs/computer-use

Google. Gemini computer use.https://ai.google.dev/gemini-api/docs/computer-use. Accessed: 2026-04-30

work page 2026
[9]

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

A real-world WebAgent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

work page arXiv 2023
[11]

Webvoyager: Building an end-to-end web agent with large multimodal models,

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

work page arXiv 2024
[12]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

work page 2024
[13]

Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

work page 2025
[14]

Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, pages 1–3, 2025

Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, pages 1–3, 2025

work page 2025
[15]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Dhillon, David Brandfonbrener, and Rishabh Agarwal

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.arXiv preprint arXiv:2510.13786, 2025

work page arXiv 2025
[18]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024. 11

work page 2024
[19]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

work page 2025
[20]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Under- standing r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review arXiv 2025
[21]

Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

work page arXiv 2025
[22]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[23]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Computer use.https://developers.openai.com/api/docs/guides/tools-computer-use

OpenAI. Computer use.https://developers.openai.com/api/docs/guides/tools-computer-use. Accessed: 2026-04-30

work page 2026
[25]

Introducing GPT-5.2.https://openai.com/index/introducing-gpt-5-2/, December 2025

OpenAI. Introducing GPT-5.2.https://openai.com/index/introducing-gpt-5-2/, December 2025. Accessed: 2026-04-30

work page 2025
[26]

Codex.https://chatgpt.com/codex/, 2025

OpenAI. Codex.https://chatgpt.com/codex/, 2025. Accessed: 2026-04-30

work page 2025
[27]

Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning.arXiv:2411.02337, 2024

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

work page arXiv 2024
[28]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page Pith review arXiv 2025
[29]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

World of bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. InInternational Conference on Machine Learning, pages 3135–3144. PMLR, 2017

work page 2017
[31]

Introducing navigator.https://yutori.com/blog/introducing-navigator, 2025

The Yutori Team. Introducing navigator.https://yutori.com/blog/introducing-navigator, 2025. Yutori Blog

work page 2025
[32]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Fengyi Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-image: An efficient image generation foundation model with single- stream diffusion transformer...

work page internal anchor Pith review arXiv 2025
[33]

Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, and Ruslan Salakhutdinov. Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

work page arXiv 2025
[34]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review arXiv 2025
[35]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

work page internal anchor Pith review arXiv 2024
[36]

Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

work page arXiv 2025
[37]

Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

work page arXiv 2025
[38]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 12

work page internal anchor Pith review arXiv 2024
[39]

Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

work page arXiv 2025
[40]

arXiv preprint arXiv:2412.09605 (2024)

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2024

work page arXiv 2024
[41]

Xue et al

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

work page arXiv 2025
[42]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

work page internal anchor Pith review arXiv 2023
[44]

Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

work page 2022
[45]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

work page 2022
[46]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

The bitter lesson for web agents.https://yutori.com/blog/the-bitter-lesson-for-web-agents, December 2025

Yutori. The bitter lesson for web agents.https://yutori.com/blog/the-bitter-lesson-for-web-agents, December 2025. Ac- cessed: 2026-04-30

work page 2025
[48]

Boyuan Zheng, Michael Y

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024

work page arXiv 2024
[49]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

work page 2023
[50]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 13 Appendix A Additional Quantitative Results 15 A.1 Results across model sizes . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2023