MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3
The pith
MolmoWeb-8B is an open visual web agent that predicts browser actions from screenshots and instructions alone, outperforming set-of-marks agents built on GPT-4o on WebVoyager and Online-Mind2Web.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, with no access to HTML, accessibility trees, or specialized APIs. They reach state-of-the-art scores among similar-scale open-weight models on WebVoyager, Online-Mind2Web, and DeepShop, and improve further with best-of-N test-time scaling.
What carries the argument
Instruction-conditioned visual-language action policy that maps a task instruction and screenshot to a browser action such as click, type, or scroll.
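To make the input/output contract concrete, here is a minimal Python sketch of a screenshot-only action policy and the loop that would drive it. Everything named here (`BrowserAction`, `Policy`, `run_episode`, the `"stop"` action, the step limit) is a hypothetical illustration, not the paper's interface; the summary above only states that the policy maps an instruction and a screenshot to an action such as click, type, or scroll.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class BrowserAction:
    """One predicted browser action (field names are illustrative assumptions)."""
    kind: str                    # "click", "type", "scroll", or an assumed terminal "stop"
    x: Optional[int] = None      # click target in screenshot pixel coordinates
    y: Optional[int] = None
    text: Optional[str] = None   # text to enter for "type"
    dy: Optional[int] = None     # vertical scroll amount for "scroll"


# The policy sees only the instruction and raw screenshot bytes:
# no HTML, no accessibility tree, no site-specific API.
Policy = Callable[[str, bytes], BrowserAction]


def run_episode(
    policy: Policy,
    instruction: str,
    take_screenshot: Callable[[], bytes],
    execute: Callable[[BrowserAction], None],
    max_steps: int = 30,
) -> List[BrowserAction]:
    """Repeatedly ask the policy for the next action until it stops or the step budget runs out."""
    trajectory: List[BrowserAction] = []
    for _ in range(max_steps):
        action = policy(instruction, take_screenshot())
        trajectory.append(action)
        if action.kind == "stop":
            break
        execute(action)  # the browser applies the action; the next screenshot reflects it
    return trajectory
```

In this sketch the browser is hidden behind two callables, so the same loop could be driven by a real browser controller or by a scripted stub in tests.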
If this is right
- Web agents can achieve high task success rates without any access to page source code or accessibility trees.
- Test-time scaling through parallel rollouts and best-of-N selection raises the WebVoyager pass rate from 78.2% (pass@1) to 94.7% (pass@4), and the Online-Mind2Web pass rate from 35.3% to 60.5%.
- The promised release of training data, checkpoints, and evaluation code would remove the reproducibility barrier that previously limited open research on web agents.
- Consistent gains across 4B and 8B sizes show that scale within the open visual-only paradigm continues to improve navigation performance.
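The pass@1 and pass@4 figures in the list above can be made concrete with a short calculation. The sketch below uses the standard unbiased pass@k estimator (the probability that at least one of k rollouts, drawn from n recorded rollouts with c successes, succeeds) and averages it over tasks. Whether the paper computes pass@4 this way is an assumption; the function names are hypothetical.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k rollouts succeeds) from c successes in n rollouts."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failed rollouts.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(successes_per_task: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimate over a benchmark (n rollouts recorded per task)."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    return sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
```

As a sanity check on the reported numbers: if every task had the same independent 78.2% chance of success, four attempts would succeed at least once with probability 1 - (1 - 0.782)^4, roughly 99.8%. The reported 94.7% is lower, which is consistent with failures being concentrated on a subset of harder tasks rather than spread evenly. Note also that pass@k counts a task as solved if any rollout succeeds, so it is an upper bound on what a practical best-of-N selector can achieve.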
Where Pith is reading between the lines
- The same screenshot-to-action training recipe could be applied to desktop GUI agents or mobile interfaces that also lack reliable HTML equivalents.
- Open data mixtures of the size released here may accelerate progress on other embodied agents that must act from raw visual observations.
- If the visual-only approach continues to close the gap with closed models, future web interfaces may need to optimize for visual clarity rather than structured markup.
Load-bearing premise
That the chosen benchmarks represent the full range of real-world web navigation difficulty, and that visual-only action prediction generalizes reliably beyond the evaluated tasks.
What would settle it
A new suite of web tasks drawn from previously unseen sites and interaction patterns where MolmoWeb-8B no longer exceeds GPT-4o-based set-of-marks agents would falsify the performance claim.
Original abstract
Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B size, on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents achieve state-of-the-art results outperforming similar scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MolmoWebMix, a dataset combining over 100K synthetic browser task trajectories from multiple pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data (referring expression grounding and screenshot QA). It presents MolmoWeb, a family of 4B and 8B open multimodal models that function as instruction-conditioned visual-language action policies, predicting browser actions from task instructions and webpage screenshots without HTML, accessibility trees, or APIs. The models are claimed to achieve SOTA on WebVoyager, Online-Mind2Web, and DeepShop, outperforming open-weight models (Fara-7B, UI-Tars-1.5-7B, Holo1-7B) and set-of-marks agents based on larger closed models like GPT-4o, with additional gains from test-time scaling via parallel rollouts and best-of-N selection (e.g., 94.7% pass@4 vs. 78.2% pass@1 on WebVoyager). The authors commit to releasing models, data, code, and an evaluation harness.
Significance. If the empirical claims hold after verification of training details and absence of leakage, the work would be significant for enabling reproducible research on open web agents. The fully open release of data, checkpoints, and code directly addresses the reproducibility gap with proprietary systems and supports community progress. The demonstration of test-time scaling on visual policies and the focus on screenshot-only action prediction provide concrete, falsifiable results that could influence future multimodal agent design.
major comments (3)
- [Abstract and Experiments] Abstract and Experiments section: The central SOTA claims (MolmoWeb-8B surpassing GPT-4o SoM agents on WebVoyager/Online-Mind2Web) are load-bearing but rest on benchmark comparisons without reported details on baseline implementations (e.g., exact SoM prompting, action space alignment, or visual input processing for closed models), training procedure (optimizer, schedule, data mixing ratios), or statistical significance (number of runs, standard errors). This prevents assessment of whether the reported margins are reliable.
- [Data and Experiments] Data and Experiments sections: The 100K+ synthetic trajectories plus 30K human demos are central to the performance claims, yet no analysis addresses potential distribution overlap or leakage with the test sets of WebVoyager, Online-Mind2Web, or DeepShop. Without OOD splits or explicit checks for visual pattern matching vs. true generalization, the assumption that these benchmarks establish reliable visual-only action prediction (especially for dynamic JS or inaccessible text) remains untested.
- [Experiments] Experiments section: The test-time scaling results (pass@4 gains) are promising but lack ablation on rollout diversity, selection criteria, or failure modes; it is unclear whether the best-of-N improvement holds under realistic latency constraints or on tasks where visual cues alone are insufficient.
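The first major comment asks for run counts and standard errors. A minimal sketch of that kind of reporting is shown below: it computes a mean and standard error over independent runs and applies a rough normal-approximation check on whether the gap between two systems exceeds run-to-run noise. The function names and the 1.96 threshold are illustrative assumptions, not the authors' procedure, and with only a handful of runs a t-based interval would be more appropriate.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_stderr(run_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over independent evaluation runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(run_scores), stdev(run_scores) / sqrt(len(run_scores))


def margin_exceeds_noise(a_runs: Sequence[float], b_runs: Sequence[float], z: float = 1.96) -> bool:
    """Rough check: is the gap between two systems larger than z standard errors of the difference?"""
    mean_a, se_a = mean_and_stderr(a_runs)
    mean_b, se_b = mean_and_stderr(b_runs)
    se_diff = sqrt(se_a ** 2 + se_b ** 2)  # assumes the two sets of runs are independent
    return abs(mean_a - mean_b) > z * se_diff
```

A single-run comparison cannot be passed through this check at all, which is the referee's point: without repeated runs there is no estimate of the noise against which to judge a margin.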
minor comments (2)
- [Abstract] Abstract: The description of MolmoWebMix components could be clarified with a brief breakdown of the complementary generation pipelines to improve readability for readers unfamiliar with web-agent data synthesis.
- [Abstract] Notation: The paper uses 'pass@1' and 'pass@4' without an explicit definition in the abstract; a short parenthetical (e.g., 'pass rate with 1 or 4 attempts') would aid clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important areas for improving reproducibility and rigor, which we address below. We will revise the manuscript to incorporate additional details, analyses, and clarifications as outlined in our point-by-point responses.
Point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: The central SOTA claims (MolmoWeb-8B surpassing GPT-4o SoM agents on WebVoyager/Online-Mind2Web) are load-bearing but rest on benchmark comparisons without reported details on baseline implementations (e.g., exact SoM prompting, action space alignment, or visual input processing for closed models), training procedure (optimizer, schedule, data mixing ratios), or statistical significance (number of runs, standard errors). This prevents assessment of whether the reported margins are reliable.
Authors: We agree that greater transparency on these elements is necessary to substantiate the SOTA claims. In the revised manuscript, we will expand the Experiments and Appendix sections to provide: full training details including the optimizer (AdamW), learning rate schedule, number of epochs, batch size, and precise data mixing ratios from MolmoWebMix; explicit descriptions of baseline implementations, including the SoM prompting templates, action space alignment, and visual preprocessing steps used for GPT-4o-based agents (following standard practices from prior literature); and statistical significance via results averaged over multiple independent runs with standard errors. These additions will enable readers to assess the reliability of the reported performance margins. revision: yes
-
Referee: [Data and Experiments] Data and Experiments sections: The 100K+ synthetic trajectories plus 30K human demos are central to the performance claims, yet no analysis addresses potential distribution overlap or leakage with the test sets of WebVoyager, Online-Mind2Web, or DeepShop. Without OOD splits or explicit checks for visual pattern matching vs. true generalization, the assumption that these benchmarks establish reliable visual-only action prediction (especially for dynamic JS or inaccessible text) remains untested.
Authors: We acknowledge the importance of rigorously checking for leakage to support claims of generalization. In the revised version, we will add a new subsection in the Data section that reports: similarity analyses between MolmoWebMix trajectories and benchmark test sets using both textual embeddings of instructions and visual embeddings of screenshots; identification and discussion of any near-duplicates; and OOD splits by holding out specific task categories or domains. We will also explicitly discuss limitations for dynamic JavaScript-heavy pages and inaccessible text, clarifying how the visual-only policy is intended to address these via screenshot-based reasoning. While exhaustive verification across all synthetic pipelines is resource-intensive, we will transparently document the checks performed. revision: yes
-
Referee: [Experiments] Experiments section: The test-time scaling results (pass@4 gains) are promising but lack ablation on rollout diversity, selection criteria, or failure modes; it is unclear whether the best-of-N improvement holds under realistic latency constraints or on tasks where visual cues alone are insufficient.
Authors: We agree that additional ablations would strengthen the test-time scaling results. In the revised Experiments section, we will include: ablations varying rollout diversity through different sampling temperatures and strategies; details on selection criteria (including comparisons of best-of-N against other practical selectors); analysis of failure modes where parallel rollouts do not yield gains; and evaluation of latency trade-offs under realistic constraints. We will also report performance on task subsets where visual cues are limited (e.g., text-heavy or dynamic elements) to assess when the visual-only approach is sufficient. These will be presented with concrete numbers and discussion. revision: yes
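The similarity analysis promised in the second response could look roughly like the sketch below: compare embedding vectors of training items against benchmark items and flag pairs whose cosine similarity crosses a threshold. The embeddings are assumed to be precomputed by some text or image encoder (none is specified in the rebuttal), and the function names and the 0.95 threshold are arbitrary illustrative choices.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_near_duplicates(
    train_embeddings: Sequence[Sequence[float]],
    test_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed cutoff; a real analysis would need to calibrate it
) -> List[Tuple[int, int, float]]:
    """Return (train_index, test_index, similarity) for every pair at or above the threshold."""
    flagged = []
    for i, train_vec in enumerate(train_embeddings):
        for j, test_vec in enumerate(test_embeddings):
            similarity = cosine(train_vec, test_vec)
            if similarity >= threshold:
                flagged.append((i, j, similarity))
    return flagged
```

The all-pairs loop is quadratic and is only meant to show the logic; at the scale of 100K+ trajectories an approximate nearest-neighbor index would be the practical choice.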
Circularity Check
No circularity: empirical results on external benchmarks
Full rationale
The paper introduces a new dataset (MolmoWebMix) and models (MolmoWeb-4B/8B) trained as visual-language action policies, then reports performance on independent external benchmarks (WebVoyager, Online-Mind2Web, DeepShop). No mathematical derivation chain, first-principles predictions, or equations exist that could reduce to fitted parameters or self-referential definitions. Outperformance claims rest on direct benchmark measurements rather than any constructed equivalence. Self-citations, if present, are not load-bearing for the central empirical results, which remain falsifiable via the released evaluation harness and data.
Axiom & Free-Parameter Ledger
free parameters (1)
- model parameter counts (4B and 8B)
axioms (1)
- domain assumption: Existing benchmarks such as WebVoyager and Online-Mind2Web provide a faithful measure of web-agent capability
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML..."
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We train MolmoWeb end-to-end via supervised fine-tuning (SFT) on all training data... mixing ratios... ablate different mixtures"
Forward citations
Cited by 5 Pith papers
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
-
WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic ...
Reference graph
Works this paper leans on
-
[1]
Facts and figures 2024: Internet use
International Telecommunication Union (ITU). Facts and figures 2024: Internet use. Web page, 2024. URL https://www.itu.int/itu-d/reports/statistics/2024/11/10/ff24-internet-use/. Accessed: 2026-03-05
2024
-
[2]
Briefing note: Digital skills and digital inclusion
OECD. Briefing note: Digital skills and digital inclusion. PDF, 2023. URL https:// www.oecd.org/content/dam/oecd/en/about/projects/cfe/oecd-city-network-on-jobs-and-skills/ Briefing-note-Digital-skills-and-digital-inclusion.pdf. Accessed: 2026-03-05
2023
-
[3]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021
work page internal anchor Pith review arXiv 2021
-
[4]
Disability and health
World Health Organization (WHO). Disability and health. Web page, 2023. URL https://www.who.int/ news-room/fact-sheets/detail/disability-and-health. Accessed: 2026-03-05
2023
-
[5]
Diverse abilities and barriers
W3C Web Accessibility Initiative (WAI). Diverse abilities and barriers. Web page, 2024. URLhttps://www.w3. org/WAI/people-use-web/abilities-barriers/. Accessed: 2026-03-05
2024
-
[6]
Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2024
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2024
2024
-
[7]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.ArXiv, abs/2307.13854, 2023. URLhttps://api.semanticscholar.org/CorpusID:260164780
work page internal anchor Pith review arXiv 2023
-
[8]
World of bits: An open-domain platform for web-based agents
Tian Tian Shi, Andrej Karpathy, Linxi Fan, Julio Hernandez, Percy Liang, et al. World of bits: An open-domain platform for web-based agents. InProceedings of the 34th International Conference on Machine Learning (ICML),
-
[9]
URLhttps://proceedings.mlr.press/v70/shi17a/shi17a.pdf
-
[11]
Computer use | openai api
OpenAI. Computer use | openai api. Documentation. URLhttps://developers.openai.com/api/docs/guides/ tools-computer-use/. Accessed: 2026-03-05
2026
-
[12]
Openai for developers in 2025
OpenAI. Openai for developers in 2025. Blog post, 2025. URL https://developers.openai.com/blog/ openai-for-developers-2025/. Accessed: 2026-03-05
2025
-
[13]
Computer use | gemini api | google ai for developers
Google. Computer use | gemini api | google ai for developers. Documentation. URLhttps://ai.google.dev/ gemini-api/docs/computer-use. Accessed: 2026-03-05
2026
-
[14]
State of the art: Reproducibility in artificial intelligence
Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. InAAAI Conference on Artificial Intelligence (AAAI), 2018. URL https://ojs.aaai.org/index.php/AAAI/article/ view/11503
2018
-
[15]
Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of Machine Learning Research, 22(164), 2021
Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivère, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program).Journal of Machine Learning Research, 22(164), 2021. URLhttps: //jmlr.org/papers/v22/20-303.html
2019
-
[16]
Artificial intelligence risk management framework (ai rmf 1.0)
National Institute of Standards and Technology (NIST). Artificial intelligence risk management framework (ai rmf 1.0). NIST Special Publication, 2023. URLhttps://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf. Accessed: 2026-03-05
2023
-
[17]
Taxonomy of risks posed by language models
Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, John Mellor, et al. Taxonomy of risks posed by language models. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022. URLhttps://facctconference.org/static/pdfs_2022/facct22-3533088.pdf
2022
-
[18]
arXiv preprint arXiv:2511.19663 , year=
Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, and Andrew Zhao. Fara-7b: An efficient agentic model for computer use.arXiv:2511.19663, 2025
-
[19]
Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Bir’e, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d’Andign’e, Hubert de la Jonquiere, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin 16 Derupti, Michael Eickenberg, Marcello Fede...
-
[20]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V.arXiv preprint arXiv:2310.11441, 2023
work page internal anchor Pith review arXiv 2023
-
[21]
Webvoyager: Building an end-to-end web agent with large multimodal models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InACL, 2024
2024
-
[22]
Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026
Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296
-
[23]
arXiv preprint arXiv:2601.10611 , year=
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmo2: Open weights and data for vision-language m...
-
[24]
An illusion of progress? assessing the current state of web agents
Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Xiaodong Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.ArXiv, abs/2504.01382, 2025. URL https://api.semanticscholar.org/CorpusID:277502135
-
[25]
Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025
Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuyi Chen. Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025. URLhttps://api.semanticscholar. org/CorpusID:279118560
-
[26]
Introducing navigator
The Yutori Team. Introducing navigator. https://yutori.com/blog/introducing-navigator, 2025. Yutori Blog
2025
-
[27]
Ui-tars-1.5.https://seed-tars.com/1.5, 2025
ByteDance Seed. Ui-tars-1.5.https://seed-tars.com/1.5, 2025
2025
-
[28]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025
work page internal anchor Pith review arXiv 2025
-
[29]
The BrowserGym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,
Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Alexandre Drouin, Massimo Caccia, L’eo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, and Nicolas Chapados. The browsergym ecosystem for ...
-
[30]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URLhttps://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
2024
-
[32]
Gemini3Promodelcard, 2025
Google. Gemini3Promodelcard, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf
2025
-
[33]
UGround: Towards Unified Visual Grounding with Unrolled Transformers
Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, and Dejing Dou. Uground: Towards unified visual grounding with unrolled transformers.ArXiv, abs/2510.03853, 2025. URL https: //api.semanticscholar.org/CorpusID:281843677
work page internal anchor Pith review arXiv 2025
-
[34]
Seeclick: Harnessing gui grounding for advanced visual gui agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InAnnual Meeting of the Association for Computational Linguistics, 2024. URLhttps://api.semanticscholar.org/CorpusID:267069082
2024
-
[35]
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 17
work page internal anchor Pith review arXiv 2024
-
[36]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023
2023
-
[37]
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world WebAgent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023
-
[38]
Xu, Shuyan Zhou, and Graham Neubig
Yueqi Song, Frank F. Xu, Shuyan Zhou, and Graham Neubig. Beyond browsing: Api-based web agents. In Annual Meeting of the Association for Computational Linguistics, 2024. URLhttps://api.semanticscholar. org/CorpusID:273507298
2024
-
[39]
H. Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Raj Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents.ArXiv, abs/2407.18901, 2024. URL https://api.semanticscholar.org/CorpusID: 271516633
-
[40]
Autowebglm: Bootstrap and reinforce a large lan- guage model-based web navigating agent
Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. AutoWebGLM: Bootstrap and reinforce a large language model-based web navigating agent, 2024. URLhttps://arxiv.org/abs/2404.03648
-
[41]
arXiv preprint arXiv:2411.04468 , year=
Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent system for solving complex tasks, 2...
-
[42]
Tree Search for Language Model Agents,
Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. arXiv preprint arXiv:2407.01476, 2024
-
[43]
arXiv preprint arXiv:2407.13032 , year=
Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, and Ravi Kokku. Agent-e: From autonomous web navigation to foundational design principles in agentic systems.ArXiv, abs/2407.13032,
-
[44]
URLhttps://api.semanticscholar.org/CorpusID:271270241
-
[45]
Gpt-4v(ision) is a generalist web agent, if grounded
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. InInternational Conference on Machine Learning, 2024. URLhttps://api.semanticscholar.org/CorpusID: 266741821
2024
-
[46]
Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2023. URLhttps://arxiv.org/abs/2312.08914
-
[47]
You only look at screens: Multimodal chain-of-action agents.arXiv preprint arXiv:2309.11436, 2023
Zhuosheng Zhan and Aston Zhang. You only look at screens: Multimodal chain-of-action agents.arXiv preprint arXiv:2309.11436, 2023
-
[48]
Computer use | gemini API documentation, 2026
Google. Computer use | gemini API documentation, 2026. URLhttps://ai.google.dev/gemini-api/docs/ computer-use. Accessed: 2026-03-04
2026
-
[49]
Computer use | openai api documentation, 2025
OpenAI. Computer use | openai api documentation, 2025. URLhttps://developers.openai.com/api/docs/ guides/tools-computer-use/. Accessed: 2026-03-04
2025
-
[50]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haolin Chen, Zhaojian Li, Haihua Y...
work page internal anchor Pith review arXiv 2025
-
[51]
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025
work page internal anchor Pith review arXiv 2025
-
[52]
Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Bo Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Hua...
-
[53]
Agent Q: Advanced reasoning and learning for autonomous AI agents, 2024
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous AI agents, 2024. URLhttps://arxiv.org/abs/2408. 07199
2024
-
[54]
Ferret-ui: Grounded mobile ui understanding with multimodal llms
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. InEuropean Conference on Computer Vision, 2024. URLhttps://api.semanticscholar.org/CorpusID:269005503
2024
-
[55]
Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui 2: Mastering universal user interface understanding across platforms.ArXiv, abs/2410.18967, 2024. URLhttps://api.semanticscholar.org/CorpusID:273549934
-
[56]
Screenspot-pro: Gui grounding for professional high-resolution computer use.Proceedings of the 33rd ACM International Conference on Multimedia, 2025
Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use.Proceedings of the 33rd ACM International Conference on Multimedia, 2025. URLhttps://api.semanticscholar.org/CorpusID:277740982
2025
-
[57]
Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Lidén, Qingwei Lin, Huan Zhang, Tongxing Zhang, Jianbing Zhang, Dongmei Zhang, and Jianfeng Gao. GUI-Actor: Coordinate-free visual grounding for GUI agents. ArXiv, abs/2506.03143, 2025. URL https://api.semanticscholar.org/CorpusID:279118510
-
[59]
Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. ScreenAI: A vision-language model for UI and infographics understanding. ArXiv, abs/2402.04615, 2024. URL https://api.semanticscholar.org/CorpusID:267523393
-
[60]
Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent. ArXiv, abs/2408.00203, 2024. URL https://api.semanticscholar.org/CorpusID:271601072
-
[61]
Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, and Xiang Bai. Omniparser v2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models. ArXiv, abs/2502.16161, 2025. URL https://api.semanticscholar.org/CorpusID:276575751
-
[62]
Tianlin Shi, Andrej Karpathy, Linxi (Jim) Fan, Josefa Z. Hernández, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, 2017. URL https://api.semanticscholar.org/CorpusID:34953552
-
[63]
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. ArXiv, abs/1802.08802, 2018. URL https://api.semanticscholar.org/CorpusID:3530344
-
[64]
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022
-
[65]
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. ArXiv, abs/2401.13649, 2024. URL https://api.semanticscholar.org/CorpusID:267199749
-
[66]
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2404.07972
-
[67]
Xing Han Lù, Zdeněk Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024
-
[68]
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam Hadj Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? ArXiv, abs/2403.07718, 2024. URL https://api.semanticscholar.org/CorpusID:268363855
-
[69]
Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier de Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks. ArXiv, abs/2407.05291, 2024. URL https://api.semanticscholar.org/CorpusID:271051028
-
[70]
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? In Conference on Empirical Methods in Natural Language Processing, 2024. URL https://api.semanticscholar.org/CorpusID:271328691
-
[71]
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. WebCanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024
-
[72]
OpenAI. GPT-4o system card. arXiv:2410.21276, 2024
-
[73]
Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. ArXiv, abs/2406.20094, 2024. URL https://api.semanticscholar.org/CorpusID:270845490.
A Overview
In this appendix, we provide the following:
• Details for human (section B) and synthetic (section C) trajectory generation pipelines, includi...
-
[74]
Generate tasks that require navigating to a target page and optionally answering a question based on the target page. When requiring question answering, the last step should be find: answer to "{{ question }}"
-
[75]
Typically the tasks should have 3-10 steps
-
[77]
Some websites may provide advanced search functionality - make use of these in your tasks whenever possible
-
[78]
When providing dates for booking or reservation tasks provide dates in 2026. For tasks that require searching existing data provide dates before 2025 or provide relative dates like "last 30 days", "next 2 days", "a week from now" etc
-
[81]
Generate website relevant tasks, specially search keywords, filters, and form details
WEBSITE: {url}
Similar to the manually written steps, we generate instructions at 3 levels of specificity for training from these LLM-sampled step-by-step tasks.
B.2.3 LLM sampled tasks with navigation and QA
While well structured and easy to follow...
-
[82]
Generate tasks that require navigating to a target page and optionally answering a question based on the target page
-
[83]
There are only two kinds of tasks to generate: navigation only tasks that only contain a single navigation step; and 2 step task that contain a navigation step followed by a question
-
[84]
Task should neither be too easy nor too complex. Generally doable in 5-10 min