pith. machine review for the scientific record.

arxiv: 2604.08516 · v1 · submitted 2026-04-09 · 💻 cs.CV

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Ali Farhadi, Boyuan Zheng, Caleb Ouellette, Diego Llanes, Harsh Trivedi, Peter Sushko, Piper Wolters, Ranjay Krishna, Rock Yuren Pang, Taira Anderson, Tanmay Gupta, Taylor Blanton, Winson Han, Yue Yang, Zhongzheng Ren, Zixian Ma

Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords: web agents, visual language models, browser navigation, multimodal action prediction, open source AI, GUI perception

The pith

MolmoWeb-8B is an open visual web agent that predicts browser actions from screenshots and instructions alone, outperforming set-of-marks agents built on GPT-4o on WebVoyager and Online-Mind2Web.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases MolmoWebMix, a mixture of over 100K synthetic trajectories, more than 30K human demonstrations, and GUI perception data, used to train the MolmoWeb agents. These agents act as instruction-conditioned policies that map a task description and a webpage screenshot directly to the next browser action. On standard browser-use benchmarks the 8B version outperforms open-weight models of similar scale as well as set-of-marks agents built on much larger closed models. The work supplies the full training mixture, model weights, code, and evaluation harness so others can reproduce and extend the results without proprietary components.

Core claim

MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs, and reach state-of-the-art scores on WebVoyager, Online-Mind2Web, and DeepShop while also improving further with best-of-N test-time scaling.

What carries the argument

Instruction-conditioned visual-language action policy that maps a task instruction and screenshot to a browser action such as click, type, or scroll.
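A minimal sketch of what such a policy interface could look like, to make the input and output contract concrete. Everything named here is hypothetical and not taken from the MolmoWeb release: the action classes, the `Done` terminal action, the pixel-coordinate convention, and the `observe`/`execute` callbacks are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Click:
    x: int  # pixel coordinates in the screenshot (assumed convention)
    y: int


@dataclass(frozen=True)
class Type:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int  # positive scrolls down (assumed convention)


@dataclass(frozen=True)
class Done:
    answer: str  # hypothetical terminal action; not named in the source


Action = Union[Click, Type, Scroll, Done]

# The policy sees only the instruction and raw screenshot bytes:
# no HTML, no accessibility tree, no site-specific API.
Policy = Callable[[str, bytes], Action]


def run_episode(
    policy: Policy,
    instruction: str,
    observe: Callable[[], bytes],
    execute: Callable[[Action], None],
    max_steps: int = 30,
) -> List[Action]:
    """Roll out one episode: screenshot in, action out, until Done or step limit."""
    trajectory: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()
        action = policy(instruction, screenshot)
        trajectory.append(action)
        if isinstance(action, Done):
            break
        execute(action)
    return trajectory
```

Keeping the browser behind `observe` and `execute` means the same loop works with any automation backend, which is one plausible reading of why a screenshot-only policy is easy to port across sites.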

If this is right

  • Web agents can achieve high task success rates without any access to page source code or accessibility trees.
  • Test-time scaling through parallel rollouts and best-of-N selection raises the WebVoyager pass rate from 78.2% (pass@1) to 94.7% (pass@4).
  • Full release of training data, checkpoints, and evaluation code removes the reproducibility barrier that previously limited open research on web agents.
  • Consistent gains across 4B and 8B sizes show that scale within the open visual-only paradigm continues to improve navigation performance.
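The pass@1 and pass@4 figures behind the test-time scaling point can be read as follows: a task counts as passed at k if at least one of k rollouts succeeds. The sketch below computes that quantity from per-task rollout outcomes. Whether the paper computes pass@4 exactly this way is an assumption, and the outcome table is made up for illustration.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k rollouts."""
    if not outcomes:
        raise ValueError("need at least one task")
    if any(len(task) < k for task in outcomes):
        raise ValueError("every task needs at least k rollouts")
    solved = sum(any(task[:k]) for task in outcomes)
    return solved / len(outcomes)


# Illustrative (invented) outcomes: 4 tasks, 4 rollouts each.
toy = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, True],
]
print(pass_at_k(toy, 1))  # 0.25
print(pass_at_k(toy, 4))  # 0.75
```

Computed this way, pass@k is an upper bound on what any best-of-N selector can reach with the same rollouts, since a selector can only pick a successful rollout if one exists.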

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same screenshot-to-action training recipe could be applied to desktop GUI agents or mobile interfaces that also lack reliable HTML equivalents.
  • Open data mixtures of the size released here may accelerate progress on other embodied agents that must act from raw visual observations.
  • If the visual-only approach continues to close the gap with closed models, future web interfaces may need to optimize for visual clarity rather than structured markup.

Load-bearing premise

That the chosen benchmarks represent the full range of real-world web navigation difficulty, and that visual-only action prediction generalizes reliably beyond the evaluated tasks.

What would settle it

A new suite of web tasks drawn from previously unseen sites and interaction patterns where MolmoWeb-8B no longer exceeds GPT-4o-based set-of-marks agents would falsify the performance claim.

Original abstract

Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B size, on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents achieve state-of-the-art results outperforming similar scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MolmoWebMix, a dataset combining over 100K synthetic browser task trajectories from multiple pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data (referring expression grounding and screenshot QA). It presents MolmoWeb, a family of 4B and 8B open multimodal models that function as instruction-conditioned visual-language action policies, predicting browser actions from task instructions and webpage screenshots without HTML, accessibility trees, or APIs. The models are claimed to achieve SOTA on WebVoyager, Online-Mind2Web, and DeepShop, outperforming open-weight models (Fara-7B, UI-Tars-1.5-7B, Holo1-7B) and set-of-marks agents based on larger closed models like GPT-4o, with additional gains from test-time scaling via parallel rollouts and best-of-N selection (e.g., 94.7% pass@4 vs. 78.2% pass@1 on WebVoyager). The authors commit to releasing models, data, code, and an evaluation harness.

Significance. If the empirical claims hold after verification of training details and absence of leakage, the work would be significant for enabling reproducible research on open web agents. The fully open release of data, checkpoints, and code directly addresses the reproducibility gap with proprietary systems and supports community progress. The demonstration of test-time scaling on visual policies and the focus on screenshot-only action prediction provide concrete, falsifiable results that could influence future multimodal agent design.

major comments (3)
  1. [Abstract and Experiments] Abstract and Experiments section: The central SOTA claims (MolmoWeb-8B surpassing GPT-4o SoM agents on WebVoyager/Online-Mind2Web) are load-bearing but rest on benchmark comparisons without reported details on baseline implementations (e.g., exact SoM prompting, action space alignment, or visual input processing for closed models), training procedure (optimizer, schedule, data mixing ratios), or statistical significance (number of runs, standard errors). This prevents assessment of whether the reported margins are reliable.
  2. [Data and Experiments] Data and Experiments sections: The 100K+ synthetic trajectories plus 30K human demos are central to the performance claims, yet no analysis addresses potential distribution overlap or leakage with the test sets of WebVoyager, Online-Mind2Web, or DeepShop. Without OOD splits or explicit checks for visual pattern matching vs. true generalization, the assumption that these benchmarks establish reliable visual-only action prediction (especially for dynamic JS or inaccessible text) remains untested.
  3. [Experiments] Experiments section: The test-time scaling results (pass@4 gains) are promising but lack ablation on rollout diversity, selection criteria, or failure modes; it is unclear whether the best-of-N improvement holds under realistic latency constraints or on tasks where visual cues alone are insufficient.
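The request for run counts and standard errors in the first major comment can be made concrete with two standard formulas: the standard error of the mean across independent runs, and the binomial standard error of a single success rate over n tasks. This is a sketch of what such reporting could compute, not the paper's procedure; the function names and the example numbers are illustrative assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_stderr(run_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean success rate and its standard error across independent runs."""
    if len(run_rates) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(run_rates), stdev(run_rates) / math.sqrt(len(run_rates))


def binomial_stderr(success_rate: float, num_tasks: int) -> float:
    """Standard error of one run's success rate, treating tasks as independent."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return math.sqrt(success_rate * (1.0 - success_rate) / num_tasks)


# Invented numbers, for illustration only.
m, se = mean_and_stderr([0.70, 0.80, 0.90])
print(round(m, 3), round(se, 4))      # 0.8 0.0577
print(binomial_stderr(0.5, 100))      # 0.05
```

A margin between two agents is only informative when it is large relative to these standard errors, which is the point of the referee's request.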
minor comments (2)
  1. [Abstract] Abstract: The description of MolmoWebMix components could be clarified with a brief breakdown of the complementary generation pipelines to improve readability for readers unfamiliar with web-agent data synthesis.
  2. [Abstract] Notation: The paper uses 'pass@1' and 'pass@4' without an explicit definition in the abstract; a short parenthetical (e.g., 'pass rate with 1 or 4 attempts') would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for improving reproducibility and rigor, which we address below. We will revise the manuscript to incorporate additional details, analyses, and clarifications as outlined in our point-by-point responses.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The central SOTA claims (MolmoWeb-8B surpassing GPT-4o SoM agents on WebVoyager/Online-Mind2Web) are load-bearing but rest on benchmark comparisons without reported details on baseline implementations (e.g., exact SoM prompting, action space alignment, or visual input processing for closed models), training procedure (optimizer, schedule, data mixing ratios), or statistical significance (number of runs, standard errors). This prevents assessment of whether the reported margins are reliable.

    Authors: We agree that greater transparency on these elements is necessary to substantiate the SOTA claims. In the revised manuscript, we will expand the Experiments and Appendix sections to provide: full training details including the optimizer (AdamW), learning rate schedule, number of epochs, batch size, and precise data mixing ratios from MolmoWebMix; explicit descriptions of baseline implementations, including the SoM prompting templates, action space alignment, and visual preprocessing steps used for GPT-4o-based agents (following standard practices from prior literature); and statistical significance via results averaged over multiple independent runs with standard errors. These additions will enable readers to assess the reliability of the reported performance margins.

    revision: yes

  2. Referee: [Data and Experiments] Data and Experiments sections: The 100K+ synthetic trajectories plus 30K human demos are central to the performance claims, yet no analysis addresses potential distribution overlap or leakage with the test sets of WebVoyager, Online-Mind2Web, or DeepShop. Without OOD splits or explicit checks for visual pattern matching vs. true generalization, the assumption that these benchmarks establish reliable visual-only action prediction (especially for dynamic JS or inaccessible text) remains untested.

    Authors: We acknowledge the importance of rigorously checking for leakage to support claims of generalization. In the revised version, we will add a new subsection in the Data section that reports: similarity analyses between MolmoWebMix trajectories and benchmark test sets using both textual embeddings of instructions and visual embeddings of screenshots; identification and discussion of any near-duplicates; and OOD splits by holding out specific task categories or domains. We will also explicitly discuss limitations for dynamic JavaScript-heavy pages and inaccessible text, clarifying how the visual-only policy is intended to address these via screenshot-based reasoning. While exhaustive verification across all synthetic pipelines is resource-intensive, we will transparently document the checks performed.

    revision: yes

  3. Referee: [Experiments] Experiments section: The test-time scaling results (pass@4 gains) are promising but lack ablation on rollout diversity, selection criteria, or failure modes; it is unclear whether the best-of-N improvement holds under realistic latency constraints or on tasks where visual cues alone are insufficient.

    Authors: We agree that additional ablations would strengthen the test-time scaling results. In the revised Experiments section, we will include: ablations varying rollout diversity through different sampling temperatures and strategies; details on selection criteria (including comparisons of best-of-N against other practical selectors); analysis of failure modes where parallel rollouts do not yield gains; and evaluation of latency trade-offs under realistic constraints. We will also report performance on task subsets where visual cues are limited (e.g., text-heavy or dynamic elements) to assess when the visual-only approach is sufficient. These will be presented with concrete numbers and discussion.

    revision: yes
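The similarity analysis proposed in the second response could, in its simplest form, flag benchmark items whose embedding lies very close to some training item. The sketch below does this with cosine similarity over precomputed vectors. The embedding model, the 0.95 threshold, and the function names are all assumptions for illustration; the rebuttal does not specify them.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_near_duplicates(
    train: Sequence[Sequence[float]],
    test: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical cutoff; would need tuning
) -> List[Tuple[int, int, float]]:
    """Return (test_index, train_index, similarity) for pairs at or above threshold."""
    flagged = []
    for i, test_vec in enumerate(test):
        for j, train_vec in enumerate(train):
            sim = cosine(test_vec, train_vec)
            if sim >= threshold:
                flagged.append((i, j, sim))
    return flagged


# Toy vectors standing in for instruction or screenshot embeddings.
train_emb = [[1.0, 0.0], [0.0, 1.0]]
test_emb = [[2.0, 0.0], [1.0, 1.0]]
print(flag_near_duplicates(train_emb, test_emb))  # [(0, 0, 1.0)]
```

The brute-force double loop is quadratic in the number of items, so at the scale of 100K+ trajectories an approximate nearest-neighbor index would likely be needed in practice.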

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper introduces a new dataset (MolmoWebMix) and models (MolmoWeb-4B/8B) trained as visual-language action policies, then reports performance on independent external benchmarks (WebVoyager, Online-Mind2Web, DeepShop). No mathematical derivation chain, first-principles predictions, or equations exist that could reduce to fitted parameters or self-referential definitions. Outperformance claims rest on direct benchmark measurements rather than any constructed equivalence. Self-citations, if present, are not load-bearing for the central empirical results, which remain falsifiable via the released evaluation harness and data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central performance claims rest on the validity of existing web-agent benchmarks and standard supervised learning assumptions for trajectory data; no new entities are postulated and no free parameters are introduced beyond typical model training choices.

free parameters (1)
  • model parameter counts (4B and 8B)
    Practical sizes selected for open release; not derived from data in the abstract.
axioms (1)
  • domain assumption: Existing benchmarks such as WebVoyager and Online-Mind2Web provide a faithful measure of web-agent capability
    Invoked when claiming state-of-the-art results.

pith-pipeline@v0.9.0 · 5742 in / 1437 out tokens · 47721 ms · 2026-05-10T17:56:52.839394+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  2. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  3. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  4. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

  5. WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent

    cs.AI 2026-04 unverdicted novelty 4.0

    WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic ...

Reference graph

Works this paper leans on

85 extracted references · 39 canonical work pages · cited by 4 Pith papers · 12 internal anchors

  1. [1]

    Facts and figures 2024: Internet use

    International Telecommunication Union (ITU). Facts and figures 2024: Internet use. Web page, 2024. URL https://www.itu.int/itu-d/reports/statistics/2024/11/10/ff24-internet-use/. Accessed: 2026-03-05

  2. [2]

    Briefing note: Digital skills and digital inclusion

    OECD. Briefing note: Digital skills and digital inclusion. PDF, 2023. URL https:// www.oecd.org/content/dam/oecd/en/about/projects/cfe/oecd-city-network-on-jobs-and-skills/ Briefing-note-Digital-skills-and-digital-inclusion.pdf. Accessed: 2026-03-05

  3. [3]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021

  4. [4]

    Disability and health

    World Health Organization (WHO). Disability and health. Web page, 2023. URL https://www.who.int/ news-room/fact-sheets/detail/disability-and-health. Accessed: 2026-03-05

  5. [5]

    Diverse abilities and barriers

    W3C Web Accessibility Initiative (WAI). Diverse abilities and barriers. Web page, 2024. URLhttps://www.w3. org/WAI/people-use-web/abilities-barriers/. Accessed: 2026-03-05

  6. [6]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2024

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2024

  7. [7]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.ArXiv, abs/2307.13854, 2023. URLhttps://api.semanticscholar.org/CorpusID:260164780

  8. [8]

    World of bits: An open-domain platform for web-based agents

    Tian Tian Shi, Andrej Karpathy, Linxi Fan, Julio Hernandez, Percy Liang, et al. World of bits: An open-domain platform for web-based agents. InProceedings of the 34th International Conference on Machine Learning (ICML),

  9. [9]

    URLhttps://proceedings.mlr.press/v70/shi17a/shi17a.pdf

  10. [11]

    Computer use | openai api

    OpenAI. Computer use | openai api. Documentation. URLhttps://developers.openai.com/api/docs/guides/ tools-computer-use/. Accessed: 2026-03-05

  11. [12]

    Openai for developers in 2025

    OpenAI. Openai for developers in 2025. Blog post, 2025. URL https://developers.openai.com/blog/ openai-for-developers-2025/. Accessed: 2026-03-05

  12. [13]

    Computer use | gemini api | google ai for developers

    Google. Computer use | gemini api | google ai for developers. Documentation. URLhttps://ai.google.dev/ gemini-api/docs/computer-use. Accessed: 2026-03-05

  13. [14]

    State of the art: Reproducibility in artificial intelligence

    Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. InAAAI Conference on Artificial Intelligence (AAAI), 2018. URL https://ojs.aaai.org/index.php/AAAI/article/ view/11503

  14. [15]

    Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of Machine Learning Research, 22(164), 2021

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivère, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of Machine Learning Research, 22(164), 2021. URLhttps: //jmlr.org/papers/v22/20-303.html

  15. [16]

    Artificial intelligence risk management framework (ai rmf 1.0)

    National Institute of Standards and Technology (NIST). Artificial intelligence risk management framework (ai rmf 1.0). NIST Special Publication, 2023. URLhttps://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf. Accessed: 2026-03-05

  16. [17]

    Taxonomy of risks posed by language models

    Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, John Mellor, et al. Taxonomy of risks posed by language models. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022. URLhttps://facctconference.org/static/pdfs_2022/facct22-3533088.pdf

  17. [18]

    arXiv preprint arXiv:2511.19663 , year=

    Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, and Andrew Zhao. Fara-7b: An efficient agentic model for computer use.arXiv:2511.19663, 2025

  18. [19]

    Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Bir’e, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d’Andign’e, Hubert de la Jonquiere, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin 16 Derupti, Michael Eickenberg, Marcello Fede...

  19. [20]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V.arXiv preprint arXiv:2310.11441, 2023

  20. [21]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InACL, 2024

  21. [22]

    Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

    Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

  22. [23]

    arXiv preprint arXiv:2601.10611 , year=

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmo2: Open weights and data for vision-language m...

  23. [24]

    An illusion of progress? assessing the current state of web agents

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Xiaodong Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.ArXiv, abs/2504.01382, 2025. URL https://api.semanticscholar.org/CorpusID:277502135

  24. [25]

    Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025

    Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuyi Chen. Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025. URLhttps://api.semanticscholar. org/CorpusID:279118560

  25. [26]

    Introducing navigator

    The Yutori Team. Introducing navigator. https://yutori.com/blog/introducing-navigator, 2025. Yutori Blog

  26. [27]

    Ui-tars-1.5.https://seed-tars.com/1.5, 2025

    ByteDance Seed. Ui-tars-1.5.https://seed-tars.com/1.5, 2025

  27. [28]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  28. [29]

    The BrowserGym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,

    Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Alexandre Drouin, Massimo Caccia, L’eo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, and Nicolas Chapados. The browsergym ecosystem for ...

  29. [30]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  30. [31]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URLhttps://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  31. [32]

    Gemini3Promodelcard, 2025

    Google. Gemini3Promodelcard, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf

  32. [33]

    UGround: Towards Unified Visual Grounding with Unrolled Transformers

    Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, and Dejing Dou. Uground: Towards unified visual grounding with unrolled transformers.ArXiv, abs/2510.03853, 2025. URL https: //api.semanticscholar.org/CorpusID:281843677

  33. [34]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InAnnual Meeting of the Association for Computational Linguistics, 2024. URLhttps://api.semanticscholar.org/CorpusID:267069082

  34. [35]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 17

  35. [36]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  36. [37]

    A real-world webagent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world WebAgent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

  37. [38]

    Xu, Shuyan Zhou, and Graham Neubig

    Yueqi Song, Frank F. Xu, Shuyan Zhou, and Graham Neubig. Beyond browsing: Api-based web agents. In Annual Meeting of the Association for Computational Linguistics, 2024. URLhttps://api.semanticscholar. org/CorpusID:273507298

  38. [39]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents.arXiv preprint arXiv:2407.18901,

    H. Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Raj Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents.ArXiv, abs/2407.18901, 2024. URL https://api.semanticscholar.org/CorpusID: 271516633

  39. [40]

    Autowebglm: Bootstrap and reinforce a large lan- guage model-based web navigating agent

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. AutoWebGLM: Bootstrap and reinforce a large language model-based web navigating agent, 2024. URLhttps://arxiv.org/abs/2404.03648

  40. [41]

    arXiv preprint arXiv:2411.04468 , year=

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent system for solving complex tasks, 2...

  41. [42]

    Tree Search for Language Model Agents,

    Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. arXiv preprint arXiv:2407.01476, 2024

  42. [43]

    arXiv preprint arXiv:2407.13032 , year=

    Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, and Ravi Kokku. Agent-e: From autonomous web navigation to foundational design principles in agentic systems.ArXiv, abs/2407.13032,

  43. [44]

    URLhttps://api.semanticscholar.org/CorpusID:271270241

  44. [45]

    Gpt-4v(ision) is a generalist web agent, if grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. InInternational Conference on Machine Learning, 2024. URLhttps://api.semanticscholar.org/CorpusID: 266741821

  45. [46]

    Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2023. URLhttps://arxiv.org/abs/2312.08914

  46. [47]

    You only look at screens: Multimodal chain-of-action agents.arXiv preprint arXiv:2309.11436, 2023

    Zhuosheng Zhan and Aston Zhang. You only look at screens: Multimodal chain-of-action agents.arXiv preprint arXiv:2309.11436, 2023

  47. [48]

    Computer use | gemini API documentation, 2026

    Google. Computer use | gemini API documentation, 2026. URLhttps://ai.google.dev/gemini-api/docs/ computer-use. Accessed: 2026-03-04

  48. [49]

    Computer use | openai api documentation, 2025

    OpenAI. Computer use | openai api documentation, 2025. URLhttps://developers.openai.com/api/docs/ guides/tools-computer-use/. Accessed: 2026-03-04

  49. [50]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haolin Chen, Zhaojian Li, Haihua Y...

  50. [51]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

  51. [52]

    OpenCUA: Open foundations for computer-use agents

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Bo Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Hua...

  52. [53]

    Agent Q: Advanced reasoning and learning for autonomous AI agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous AI agents, 2024. URL https://arxiv.org/abs/2408.07199

  53. [54]

    Ferret-UI: Grounded mobile UI understanding with multimodal LLMs

    Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs. In European Conference on Computer Vision, 2024. URL https://api.semanticscholar.org/CorpusID:269005503

  54. [55]

    Ferret-UI 2: Mastering universal user interface understanding across platforms

    Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI 2: Mastering universal user interface understanding across platforms. ArXiv, abs/2410.18967, 2024. URL https://api.semanticscholar.org/CorpusID:273549934

  55. [56]

    ScreenSpot-Pro: GUI grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. Proceedings of the 33rd ACM International Conference on Multimedia, 2025. URL https://api.semanticscholar.org/CorpusID:277740982

  56. [57]

    GUI-Actor: Coordinate-free visual grounding for GUI agents

    Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Lidén, Qingwei Lin, Huan Zhang, Tongxing Zhang, Jianbing Zhang, Dongmei Zhang, and Jianfeng Gao. GUI-Actor: Coordinate-free visual grounding for GUI agents. ArXiv, abs/2506.03143, 2025. URL https://api.semanticscholar.org/CorpusID:279118510

  58. [59]

    ScreenAI: A vision-language model for UI and infographics understanding

    Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. ScreenAI: A vision-language model for UI and infographics understanding. ArXiv, abs/2402.04615, 2024. URL https://api.semanticscholar.org/CorpusID:267523393

  59. [60]

    OmniParser for pure vision based GUI agent

    Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for pure vision based GUI agent. ArXiv, abs/2408.00203, 2024. URL https://api.semanticscholar.org/CorpusID:271601072

  60. [61]

    OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

    Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, and Xiang Bai. OmniParser V2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models. ArXiv, abs/2502.16161, 2025. URL https://api.semanticscholar.org/CorpusID:276575751

  61. [62]

    World of Bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi (Jim) Fan, Josefa Z. Hernández, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, 2017. URL https://api.semanticscholar.org/CorpusID:34953552

  62. [63]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. ArXiv, abs/1802.08802, 2018. URL https://api.semanticscholar.org/CorpusID:3530344

  63. [64]

    WebShop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  64. [65]

    VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. ArXiv, abs/2401.13649, 2024. URL https://api.semanticscholar.org/CorpusID:267199749

  65. [66]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2404.07972

  66. [67]

    WebLINX: Real-world website navigation with multi-turn dialogue

    Xing Han Lù, Zdeněk Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024

  67. [68]

    WorkArena: How capable are web agents at solving common knowledge work tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam Hadj Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? ArXiv, abs/2403.07718, 2024. URL https://api.semanticscholar.org/CorpusID:268363855

  68. [69]

    WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks

    Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier de Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. WorkArena++: Towards compositional planning and reasoning-based common knowledge work tasks. ArXiv, abs/2407.05291, 2024. URL https://api.semanticscholar.org/CorpusID:271051028

  69. [70]

    AssistantBench: Can web agents solve realistic and time-consuming tasks?

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? In Conference on Empirical Methods in Natural Language Processing, 2024. URL https://api.semanticscholar.org/CorpusID:271328691

  70. [71]

    WebCanvas: Benchmarking web agents in online environments

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. WebCanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024

  71. [72]

    GPT-4o System Card

    OpenAI. GPT-4o system card. arXiv:2410.21276, 2024

  72. [73]

    Scaling synthetic data creation with 1,000,000,000 personas

    Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. ArXiv, abs/2406.20094, 2024. URL https://api.semanticscholar.org/CorpusID:270845490.

    A Overview

    In this appendix, we provide the following:

    • Details for human (section B) and synthetic (section C) trajectory generation pipelines, includi...

  73. [74]

    {{ question }}

    Generate tasks that require navigating to a target page and optionally answering a question based on the target page. When requiring question answering, the last step should be find: answer to "{{ question }}"

  74. [75]

    Typically the tasks should have 3-10 steps

  75. [77]

    Some websites may provide advanced search functionality - make use of these in your tasks whenever possible

  76. [78]

    last 30 days

    When providing dates for booking or reservation tasks provide dates in 2026. For tasks that require searching existing data provide dates before 2025 or provide relative dates like "last 30 days", "next 2 days", "a week from now" etc

  77. [81]

    B.2.3 LLM sampled tasks with navigation and QA

    Generate website relevant tasks, specially search keywords, filters, and form details WEBSITE: { url }

    Similar to the manually written steps, we generate instructions at 3 levels of specificity for training from these LLM-sampled step-by-step tasks.

    B.2.3 LLM sampled tasks with navigation and QA. While well structured and easy to follow...

  78. [82]

    Generate tasks that require navigating to a target page and optionally answering a question based on the target page

  79. [83]

    There are only two kinds of tasks to generate: navigation only tasks that only contain a single navigation step; and 2 step task that contain a navigation step followed by a question

  80. [84]

    Generally doable in 5-10 min

    Task should neither be too easy nor too complex. Generally doable in 5-10 min

Showing first 80 references.