pith. machine review for the scientific record. sign in

arxiv: 2605.06761 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CV· cs.LG

Recognition: no theorem link

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3

classification 💻 cs.AI cs.CVcs.LG
keywords visual web agentsweb navigationreinforcement learningenvironment synthesisHTTP cachingscalable trainingreproducible environmentsLLM-based synthesis
0
0 comments X

The pith

Weblica uses HTTP caching and LLM synthesis to create thousands of stable web environments for training visual agents that navigate better than similar-sized baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Weblica as a way to overcome the scarcity of diverse, scalable training data for visual web navigation agents by turning real websites into reproducible environments. It does this through two techniques: capturing and replaying HTTP interactions to keep visual states stable while preserving clicks and dynamics, and using LLMs to generate new environments grounded in actual sites and basic navigation skills. Scaling reinforcement learning across these thousands of environments produces Weblica-8B, which beats open-weight models of comparable size on standard benchmarks, requires fewer inference steps, improves when given more test-time compute, and approaches the performance of closed API models. A reader would care because current web agent training is bottlenecked by live-site variability and limited simulation coverage, and solving that could make robust open agents practical.

Core claim

Weblica (Web Replica) constructs reproducible and scalable web environments by combining HTTP-level caching, which replays stable visual states while maintaining interactive behavior, with LLM-based environment synthesis anchored in real-world websites and core navigation skills. This framework enables scaling RL training to thousands of diverse environments and tasks, yielding Weblica-8B that outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and remains competitive with API models.

What carries the argument

The Weblica framework, which relies on HTTP-level caching to capture and replay interactive visual states plus LLM-driven synthesis of new environments from real sites and navigation primitives.

If this is right

  • Reinforcement learning for web agents can be scaled to far larger numbers of tasks without continuous live web access during training.
  • Training runs become more reproducible because cached environments remove the variability caused by changing live sites.
  • Allocating extra compute at test time yields further gains, indicating the model can be improved post-training without retraining.
  • Open-weight models can reach performance levels close to closed API systems when environment diversity is increased through synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same caching-plus-synthesis approach might transfer to training agents in other interactive digital domains such as mobile apps or desktop software.
  • If the synthesized environments prove sufficiently representative, human-designed evaluation suites could be partially replaced by automatically generated ones.
  • Extending the method to longer-horizon or multi-step tasks could expose where caching fails to preserve complex state dependencies.

Load-bearing premise

Cached HTTP states will continue to behave like live websites without artifacts or missing interactions, and the LLM-generated environments will represent enough real-world web diversity and dynamics to avoid biasing what the agent learns.

What would settle it

Running the trained agents on live uncached versions of the benchmark websites and checking whether success rates or step counts degrade substantially relative to the cached training environments.

read the original abstract

The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Weblica, a framework for scalable, reproducible training of visual web agents. It combines HTTP-level caching to replay stable visual states while preserving interactivity with LLM-based synthesis of environments grounded in real websites and core navigation skills. The authors scale RL training across thousands of such environments and report that their Weblica-8B model outperforms open-weight baselines of similar size on multiple web navigation benchmarks, uses fewer inference steps, exhibits favorable scaling with test-time compute, and is competitive with API-based models.

Significance. If the core premises hold, the work offers a practical route to large-scale RL for web agents by addressing data scarcity and non-reproducibility in dynamic web settings. The explicit focus on reproducibility via caching and the scaling to thousands of environments are concrete strengths that could enable follow-on research; the test-time scaling observation is also potentially useful if quantified.

major comments (3)
  1. [Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).
  2. [Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.
  3. [Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.
minor comments (1)
  1. [Abstract] Abstract: Key quantitative results (e.g., exact benchmark scores, step reductions, or scaling curves) are omitted, reducing the reader's ability to gauge the magnitude of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify how to strengthen the presentation of our framework and results. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).

    Authors: We agree that explicit quantitative validation would make the caching claim more robust. The design replays exact HTTP responses to maintain visual and interactive fidelity, but we will add a new subsection in Section 3 with side-by-side comparisons: agent success rates on live vs. cached versions of the same sites, interaction trace similarity metrics, and state-drift statistics over multiple episodes. These results will be reported with the revised manuscript. revision: yes

  2. Referee: [Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.

    Authors: We acknowledge the need for fuller experimental reporting. The current manuscript lists the open-weight baselines and reports aggregate metrics, but we will expand Section 4 to explicitly name all baselines, report the number of evaluation runs (with seed details), include error bars or standard deviations, and add statistical significance tests (e.g., paired t-tests) where appropriate. This will directly support the robustness of the performance and efficiency claims. revision: yes

  3. Referee: [Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.

    Authors: We recognize that quantitative support for representativeness would strengthen the synthesis claims. While the synthesis is grounded in real websites and core navigation skills, we will add diversity metrics in Section 3.2, including action-distribution histograms, entropy measures over DOM state changes, and direct comparisons against publicly available real-web task corpora. These additions will be included in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks

full rationale

The paper introduces a framework for scalable web environments via HTTP caching and LLM synthesis, scales RL training, and reports empirical outperformance of Weblica-8B on multiple web navigation benchmarks. No equations, fitted parameters, or predictions are defined in terms of themselves. Central claims are validated against independent external test sets rather than reducing to training inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing for the results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on two domain assumptions about caching fidelity and LLM synthesis quality; no free parameters or new physical entities are introduced in the abstract.

axioms (2)
  • domain assumption HTTP-level caching can capture and replay stable visual states while preserving interactive behavior
    Core premise enabling reproducible environments; stated in the framework description.
  • domain assumption LLM-based synthesis can generate environments grounded in real-world websites and core navigation skills
    Used to scale from limited real sites to thousands of diverse tasks.

pith-pipeline@v0.9.0 · 5468 in / 1331 out tokens · 45872 ms · 2026-05-11T01:22:34.753401+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 15 internal anchors

  1. [1]

    Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

    Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, et al. Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

  2. [2]

    Claude computer use.https://www.anthropic.com/news/3-5-models-and-computer-use, 2024

    Anthropic. Claude computer use.https://www.anthropic.com/news/3-5-models-and-computer-use, 2024. Accessed: 2026- 04-30

  3. [3]

    Claude code.https://www.anthropic.com/claude-code, 2025

    Anthropic. Claude code.https://www.anthropic.com/claude-code, 2025. Accessed: 2026-04-30

  4. [4]

    Introducing Claude Opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, November 2025

    Anthropic. Introducing Claude Opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, November 2025. Accessed: 2026-04-30

  5. [5]

    arXiv preprint arXiv:2511.19663 , year=

    Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, et al. Fara-7b: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025

  6. [6]

    WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

    Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Webgym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439, 2026

  7. [7]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  8. [8]

    Gemini computer use.https://ai.google.dev/gemini-api/docs/computer-use

    Google. Gemini computer use.https://ai.google.dev/gemini-api/docs/computer-use. Accessed: 2026-04-30

  9. [9]

    MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026

  10. [10]

    A real-world WebAgent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

  11. [11]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

  12. [12]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  13. [13]

    Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

  14. [14]

    Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, pages 1–3, 2025

    Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, pages 1–3, 2025

  15. [15]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  17. [17]

    Dhillon, David Brandfonbrener, and Rishabh Agarwal

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.arXiv preprint arXiv:2510.13786, 2025

  18. [18]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024. 11

  19. [19]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

  20. [20]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Under- standing r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  21. [21]

    Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025

    Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

  22. [22]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  23. [23]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  24. [24]

    Computer use.https://developers.openai.com/api/docs/guides/tools-computer-use

    OpenAI. Computer use.https://developers.openai.com/api/docs/guides/tools-computer-use. Accessed: 2026-04-30

  25. [25]

    Introducing GPT-5.2.https://openai.com/index/introducing-gpt-5-2/, December 2025

    OpenAI. Introducing GPT-5.2.https://openai.com/index/introducing-gpt-5-2/, December 2025. Accessed: 2026-04-30

  26. [26]

    Codex.https://chatgpt.com/codex/, 2025

    OpenAI. Codex.https://chatgpt.com/codex/, 2025. Accessed: 2026-04-30

  27. [27]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning.arXiv:2411.02337, 2024

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

  28. [28]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  29. [29]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  30. [30]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. InInternational Conference on Machine Learning, pages 3135–3144. PMLR, 2017

  31. [31]

    Introducing navigator.https://yutori.com/blog/introducing-navigator, 2025

    The Yutori Team. Introducing navigator.https://yutori.com/blog/introducing-navigator, 2025. Yutori Blog

  32. [32]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Fengyi Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-image: An efficient image generation foundation model with single- stream diffusion transformer...

  33. [33]

    Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

    Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, and Ruslan Salakhutdinov. Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

  34. [34]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  35. [35]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

  36. [36]

    Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

  37. [37]

    Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

    Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

  38. [38]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 12

  39. [39]

    Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

    Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

  40. [40]

    arXiv preprint arXiv:2412.09605 (2024)

    Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2024

  41. [41]

    Xue et al

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

  42. [42]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  43. [43]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

  44. [44]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  45. [45]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  46. [46]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  47. [47]

    The bitter lesson for web agents.https://yutori.com/blog/the-bitter-lesson-for-web-agents, December 2025

    Yutori. The bitter lesson for web agents.https://yutori.com/blog/the-bitter-lesson-for-web-agents, December 2025. Ac- cessed: 2026-04-30

  48. [48]

    Boyuan Zheng, Michael Y

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024

  49. [49]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  50. [50]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 13 Appendix A Additional Quantitative Results 15 A.1 Results across model sizes . . . . . . . . . . . . . . . . . ...