Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3
The pith
Weblica uses HTTP caching and LLM synthesis to create thousands of stable web environments for training visual agents that navigate better than similar-sized baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Weblica (Web Replica) constructs reproducible, scalable web environments by combining HTTP-level caching, which replays stable visual states while preserving interactive behavior, with LLM-based environment synthesis grounded in real-world websites and core navigation skills. The framework scales RL training to thousands of diverse environments and tasks, yielding Weblica-8B, which outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and remains competitive with API models.
What carries the argument
The Weblica framework, which relies on HTTP-level caching to capture and replay interactive visual states plus LLM-driven synthesis of new environments from real sites and navigation primitives.
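The record-and-replay idea behind the caching component can be illustrated with a minimal cache keyed on the request itself. This is a hypothetical simplification written for this review, not the paper's implementation: a real proxy would also normalize headers, handle streaming responses, and cope with nondeterministic requests.

```python
import hashlib


class ReplayCache:
    """Minimal record-and-replay store for HTTP exchanges.

    Record mode saves responses keyed by (method, URL, body); replay
    mode serves the stored response so a page renders identically
    across training episodes, while interactions that map to cached
    requests still work.
    """

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(method, url, body=b""):
        h = hashlib.sha256()
        for part in (method.encode(), url.encode(), body):
            h.update(part)
            h.update(b"\x00")  # separator so parts cannot collide
        return h.hexdigest()

    def record(self, method, url, body, response):
        self._store[self._key(method, url, body)] = response

    def replay(self, method, url, body=b""):
        # A cache miss means the snapshot is incomplete for this
        # interaction -- the failure mode flagged in the review below.
        return self._store.get(self._key(method, url, body))
```

The interesting property is the miss case: any interaction whose request was never recorded has no cached response, which is exactly where "preserving interactive behavior" can silently break.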
If this is right
- Reinforcement learning for web agents can be scaled to far larger numbers of tasks without continuous live web access during training.
- Training runs become more reproducible because cached environments remove the variability caused by changing live sites.
- Allocating extra compute at test time yields further gains, indicating the model can be improved post-training without retraining.
- Open-weight models can reach performance levels close to closed API systems when environment diversity is increased through synthesis.
Where Pith is reading between the lines
- The same caching-plus-synthesis approach might transfer to training agents in other interactive digital domains such as mobile apps or desktop software.
- If the synthesized environments prove sufficiently representative, human-designed evaluation suites could be partially replaced by automatically generated ones.
- Extending the method to longer-horizon or multi-step tasks could expose where caching fails to preserve complex state dependencies.
Load-bearing premise
Cached HTTP states will continue to behave like live websites without artifacts or missing interactions, and the LLM-generated environments will represent enough real-world web diversity and dynamics to avoid biasing what the agent learns.
What would settle it
Running the trained agents on live uncached versions of the benchmark websites and checking whether success rates or step counts degrade substantially relative to the cached training environments.
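That experiment reduces to a simple degradation statistic over matched task sets. The function below is an illustrative check, not something reported in the paper; the task lists and threshold would be the experimenter's choice.

```python
def degradation(cached_successes, live_successes):
    """Fractional drop in success rate when moving from cached to live
    sites, given per-task 0/1 outcomes on matched task sets.

    A value near 0 supports the claim that cached environments behave
    like the live web; a large positive value suggests the cache leaks
    an easier distribution into training. Assumes at least one cached
    success (nonzero denominator).
    """
    cached_rate = sum(cached_successes) / len(cached_successes)
    live_rate = sum(live_successes) / len(live_successes)
    return (cached_rate - live_rate) / cached_rate
```

For example, a cached success rate of 0.75 falling to 0.50 on live sites is a one-third relative degradation, which would substantially weaken the reproducibility claim.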
Original abstract
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Weblica, a framework for scalable, reproducible training of visual web agents. It combines HTTP-level caching to replay stable visual states while preserving interactivity with LLM-based synthesis of environments grounded in real websites and core navigation skills. The authors scale RL training across thousands of such environments and report that their Weblica-8B model outperforms open-weight baselines of similar size on multiple web navigation benchmarks, uses fewer inference steps, exhibits favorable scaling with test-time compute, and is competitive with API-based models.
Significance. If the core premises hold, the work offers a practical route to large-scale RL for web agents by addressing data scarcity and non-reproducibility in dynamic web settings. The explicit focus on reproducibility via caching and the scaling to thousands of environments are concrete strengths that could enable follow-on research; the test-time scaling observation is also potentially useful if quantified.
Major comments (3)
- [Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).
- [Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.
- [Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.
Minor comments (1)
- [Abstract] Abstract: Key quantitative results (e.g., exact benchmark scores, step reductions, or scaling curves) are omitted, reducing the reader's ability to gauge the magnitude of the reported improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify how to strengthen the presentation of our framework and results. We address each major comment below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
Referee: [Section 3] Framework description (Section 3): The central claim that HTTP-level caching 'preserves interactive behavior' without artifacts is load-bearing for all downstream results, yet no quantitative validation is supplied (e.g., side-by-side success-rate deltas, interaction-trace comparisons, or state-drift statistics between live and cached sites).
Authors: We agree that explicit quantitative validation would make the caching claim more robust. The design replays exact HTTP responses to maintain visual and interactive fidelity, but we will add a new subsection in Section 3 with side-by-side comparisons: agent success rates on live vs. cached versions of the same sites, interaction trace similarity metrics, and state-drift statistics over multiple episodes. These results will be reported with the revised manuscript.
Revision: yes
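One concrete form the promised "interaction trace similarity metrics" could take is a normalized edit distance between action traces recorded on the live and cached versions of a site. The metric choice here is this review's illustration, not the authors':

```python
def trace_similarity(live, cached):
    """Normalized similarity in [0, 1] between two action traces,
    computed as 1 - Levenshtein(live, cached) / max(len).

    Identical traces score 1.0; a score well below 1.0 indicates the
    cached site forced different interactions than the live one.
    """
    if not live and not cached:
        return 1.0
    m, n = len(live), len(cached)
    prev = list(range(n + 1))          # edit-distance DP, row by row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if live[i - 1] == cached[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution/match
        prev = cur
    return 1.0 - prev[n] / max(m, n)
```

A trace that drops one of four steps on the cached site, for instance, scores 0.75, and averaging this over many episodes gives exactly the kind of state-drift statistic the referee asks for.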
Referee: [Section 4] Experiments and evaluation (Section 4): The headline claim that Weblica-8B 'outperforms open-weight baselines of similar size' and 'uses fewer inference steps' is presented without reported baselines, number of runs, error bars, or statistical tests, making it impossible to assess whether the gains are robust or load-bearing for the scaling and competitiveness assertions.
Authors: We acknowledge the need for fuller experimental reporting. The current manuscript lists the open-weight baselines and reports aggregate metrics, but we will expand Section 4 to explicitly name all baselines, report the number of evaluation runs (with seed details), include error bars or standard deviations, and add statistical significance tests (e.g., paired t-tests) where appropriate. This will directly support the robustness of the performance and efficiency claims.
Revision: yes
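The paired t-test the authors mention is cheap to compute over matched per-benchmark scores. The sketch below uses only the standard library and hypothetical score vectors; a real analysis would likely call `scipy.stats.ttest_rel` and report the p-value alongside effect sizes.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t-statistic for matched per-benchmark scores of two
    models. Large |t| suggests the mean difference is unlikely to be
    noise; compare against a t distribution with n-1 degrees of
    freedom to obtain a p-value.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n-1 denominator)
    if sd == 0:
        raise ValueError("zero variance in paired differences")
    return mean(diffs) / (sd / sqrt(n))
```

With five matched benchmarks and small but consistent gains, the statistic can be large even when the raw deltas look modest, which is why per-benchmark pairing matters more than aggregate means here.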
Referee: [Section 3.2] LLM synthesis subsection (Section 3.2): The assertion that LLM-grounded synthesis produces 'representative' task distributions and dynamics lacks any supporting diversity metrics (e.g., entropy over DOM changes, action-distribution statistics, or comparison to real-web task corpora), which directly underpins the claim of capturing web diversity at scale.
Authors: We recognize that quantitative support for representativeness would strengthen the synthesis claims. While the synthesis is grounded in real websites and core navigation skills, we will add diversity metrics in Section 3.2, including action-distribution histograms, entropy measures over DOM state changes, and direct comparisons against publicly available real-web task corpora. These additions will be included in the revision.
Revision: yes
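An entropy measure of the kind proposed above is straightforward: Shannon entropy over the empirical action distribution across synthesized environments. The function and the example action labels are this review's illustration of the metric, not figures from the paper.

```python
from collections import Counter
from math import log2


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical action distribution.

    Higher entropy means the synthesized tasks exercise a broader mix
    of navigation skills; entropy near zero flags collapse onto a
    single dominant action, a sign of biased synthesis.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A corpus that only ever clicks scores 0 bits, while a uniform mix over four action types scores 2 bits; comparing these numbers between synthesized and real-web corpora is one way to operationalize "representative."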
Circularity Check
No circularity: empirical performance claims rest on external benchmarks
Full rationale
The paper introduces a framework for scalable web environments via HTTP caching and LLM synthesis, scales RL training, and reports empirical outperformance of Weblica-8B on multiple web navigation benchmarks. No equations, fitted parameters, or predictions are defined in terms of themselves. Central claims are validated against independent external test sets rather than reducing to training inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing for the results. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: HTTP-level caching can capture and replay stable visual states while preserving interactive behavior
- Domain assumption: LLM-based synthesis can generate environments grounded in real-world websites and core navigation skills