WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Pith reviewed 2026-05-15 22:40 UTC · model grok-4.3
The pith
WebVoyager shows that large multimodal models can drive an end-to-end agent that completes open-ended tasks on live websites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebVoyager is a Large Multimodal Model (LMM) powered web agent that completes user instructions end-to-end by interacting with real-world websites. On a new benchmark compiled from tasks on 15 popular sites, it records a 59.1 percent task success rate, exceeding both GPT-4 (All Tools) and the text-only version of WebVoyager. The automatic evaluation protocol, which relies on GPT-4V's multimodal understanding, reaches 85.3 percent agreement with human judgments of task completion.
What carries the argument
WebVoyager, an end-to-end agent that takes both screenshot images and text from live web pages as input and outputs actions to navigate and complete tasks.
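The loop this describes is easy to make concrete: capture a screenshot and the visible text of the current page, ask the model for one action, execute it, and repeat until the model answers. The sketch below is a minimal illustration of that observation-act cycle, not the paper's implementation; the Playwright calls are real library APIs, while query_lmm, the prompt wording, and the action vocabulary (Click, Type, Scroll, ANSWER) are hypothetical stand-ins for the agent's actual prompt and action space.

```python
# Minimal sketch of an LMM-driven browsing loop (illustrative, not WebVoyager's code).
# The Playwright calls are real; query_lmm and the action vocabulary are
# hypothetical stand-ins for the agent's actual prompt and action space.
from playwright.sync_api import sync_playwright

def query_lmm(prompt: str, screenshot_png: bytes) -> str:
    """Hypothetical helper: send the prompt plus screenshot to a multimodal
    model and return a reply such as 'Click 12' or 'ANSWER; <text>'."""
    raise NotImplementedError

def run_task(task: str, start_url: str, max_steps: int = 15) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        answer = None
        for _ in range(max_steps):
            screenshot = page.screenshot()           # visual observation
            visible_text = page.inner_text("body")   # textual observation
            prompt = (
                f"Task: {task}\n"
                f"Visible page text (truncated): {visible_text[:2000]}\n"
                "Reply with one action: Click <n> | Type <n>; <text> | "
                "Scroll up/down | ANSWER; <text>"
            )
            reply = query_lmm(prompt, screenshot)
            if "ANSWER" in reply:                    # the model declares the task done
                answer = reply.split(";", 1)[-1].strip()
                break
            # Executing Click/Type would require grounding the numeric label
            # back to a concrete page element; that mapping is elided here.
        browser.close()
        return answer
```

Running a loop like this against a live site, rather than a cached snapshot or simulator, is what the end-to-end claim amounts to.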
If this is right
- Web agents can now operate directly on live sites instead of static snapshots or simulators.
- Multimodal input yields measurable gains over text-only agents on realistic web tasks.
- An automatic multimodal judge can substitute for human evaluation at 85 percent agreement.
- Success on fifteen distinct sites indicates the method generalizes across common web interfaces.
Where Pith is reading between the lines
- If success rates continue to rise, such agents could handle routine personal tasks like form filling or information gathering without human oversight.
- The same multimodal loop might transfer to other screen-based environments such as desktop software or mobile apps.
- Pairing the agent with external tools or memory could address remaining failure cases on complex multi-step tasks.
Load-bearing premise
The assumption that GPT-4V's judgments of task completion on website screenshots match how humans would rate the same outcomes.
What would settle it
A side-by-side study in which human raters independently score the same set of WebVoyager task traces; a human-rated success rate differing from the reported 59.1 percent by more than ten points would undermine the GPT-4V-based evaluation.
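Such a study boils down to comparing two verdict lists over the same task traces. The sketch below shows that comparison under stated assumptions: agreement_report is a hypothetical helper, the verdict lists are placeholders rather than the paper's data, and the normal-approximation interval is one simple way to judge whether the human-rated success rate sits within ten points of the automatic figure.

```python
# Sketch of the side-by-side check described above: same tasks, two judges.
# The verdict lists are placeholders; nothing here reproduces the paper's data.
from math import sqrt

def agreement_report(auto: list[bool], human: list[bool]) -> dict:
    assert len(auto) == len(human) and auto
    n = len(auto)
    agree = sum(a == h for a, h in zip(auto, human)) / n
    auto_sr = sum(auto) / n
    human_sr = sum(human) / n
    # Normal-approximation 95% interval on the human success rate, to see
    # whether it sits within ~10 points of the automatic figure.
    half_width = 1.96 * sqrt(human_sr * (1 - human_sr) / n)
    return {
        "n": n,
        "judge_human_agreement": agree,     # paper reports 0.853 on a held-out set
        "automatic_success_rate": auto_sr,  # paper reports 0.591 overall
        "human_success_rate": human_sr,
        "human_sr_95ci": (human_sr - half_width, human_sr + half_width),
    }
```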
Original abstract
The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebVoyager, an LMM-powered web agent for end-to-end interaction with real-world websites to complete open-ended user instructions. It compiles a benchmark of tasks from 15 popular sites and proposes an automatic evaluation protocol that leverages GPT-4V to judge task success, reporting a 59.1% success rate that exceeds both GPT-4 (All Tools) and a text-only WebVoyager ablation, with the GPT-4V judge achieving 85.3% agreement with human raters on a held-out set.
Significance. If the evaluation protocol is shown to be robust, the result would be significant for demonstrating that multimodal models can outperform text-only agents on dynamic, real websites rather than simulators or static snapshots, providing a concrete step toward practical autonomous web agents.
Major comments (1)
- §4.2: The headline 59.1% success rate and all comparative claims rest exclusively on GPT-4V judgments of open-ended tasks against live page states. While 85.3% agreement with humans is reported on a held-out subset, the manuscript provides no human labels on the full test set, no inter-judge variance across alternative LMMs, and no ablation measuring how the judge handles partial or context-dependent completions; this leaves the reported performance gap vulnerable to systematic misjudgment.
Minor comments (1)
- The benchmark construction (task selection criteria, website sampling, and any train/test split details) is described only at a high level; adding an explicit table or appendix listing the 15 sites and representative task templates would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of demonstrating multimodal agents on live websites. We address the concern about the robustness of the GPT-4V evaluation protocol in detail below.
Point-by-point responses
- Referee: §4.2: The headline 59.1% success rate and all comparative claims rest exclusively on GPT-4V judgments of open-ended tasks against live page states. While 85.3% agreement with humans is reported on a held-out subset, the manuscript provides no human labels on the full test set, no inter-judge variance across alternative LMMs, and no ablation measuring how the judge handles partial or context-dependent completions; this leaves the reported performance gap vulnerable to systematic misjudgment.
Authors: We appreciate the referee's emphasis on rigorous validation of the automatic evaluator. The 85.3% agreement was measured on a held-out set of 200 tasks (stratified across the 15 sites and task categories) that were independently labeled by two human annotators, with disagreements resolved by discussion. We acknowledge that labeling the entire test set would further strengthen confidence and that the current manuscript does not include inter-judge comparisons with other LMMs or explicit ablations on partial/context-dependent cases. In the revised manuscript we will: (1) report human judgments on an additional 150 tasks drawn from the full test set, (2) add a comparison using Claude-3-Opus as an alternative judge on the same held-out set, and (3) include a qualitative analysis of 50 edge cases (including partial completions and context-dependent tasks) with both GPT-4V and human verdicts. These additions will be presented in an expanded §4.2 and a new appendix. We believe the core comparative claims remain supported by the existing evidence, but the requested extensions will make the evaluation protocol more transparent.
Revision: partial
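The evaluation protocol at issue amounts to showing a multimodal judge the task instruction, the agent's final screenshots, and its textual response, then asking for a binary verdict. Below is a hedged sketch of such a judge; the message structure follows the OpenAI chat completions API for image inputs, but the model name, prompt wording, and verdict parsing are assumptions rather than the paper's exact protocol.

```python
# Sketch of a screenshot-based success judge (illustrative; prompt wording and
# verdict parsing are assumptions, not the paper's exact evaluation protocol).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_success(task: str, final_screenshots: list[bytes], agent_answer: str) -> bool:
    content = [{
        "type": "text",
        "text": (
            "You are evaluating a web agent.\n"
            f"Task instruction: {task}\n"
            f"Agent's final response: {agent_answer}\n"
            "Based only on the screenshots and response, reply with exactly "
            "'SUCCESS' or 'NOT SUCCESS'."
        ),
    }]
    for png in final_screenshots:
        b64 = base64.b64encode(png).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    reply = client.chat.completions.create(
        model="gpt-4o",  # any multimodal judge; the paper used GPT-4V
        messages=[{"role": "user", "content": content}],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return verdict.startswith("SUCCESS")
```

The inter-judge comparison proposed in the rebuttal would amount to running the same traces through a function like this with a second model and computing agreement between the two verdict lists.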
Circularity Check
No significant circularity in WebVoyager evaluation or claims
Full rationale
The paper's core result is an empirical success rate (59.1%) on a benchmark of tasks compiled from 15 real-world websites. The automatic evaluation protocol using GPT-4V is described separately and validated against human judgments on a held-out set (85.3% agreement), providing an independent check rather than a self-referential definition. No equations, predictions, or derivations reduce by construction to fitted inputs, self-citations, or renamed known results; the benchmark tasks and success criteria are external to the model. The derivation chain consists of system description followed by standard empirical measurement, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Large multimodal models can reliably interpret webpage screenshots and choose appropriate actions.
- Domain assumption: Task completion can be judged from the final visual state of the webpage.
Forward citations
Cited by 22 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
Computer-use agents show attack success rates above 90% on benign instructions that produce harm via context or execution, with safety-aligned Claude 4.5 Sonnet at 73% ASR rising to 92.7% in multi-agent deployments.
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
WASP benchmark reveals web agents are vulnerable to simple prompt injections with partial success rates up to 86%, but agents frequently fail to complete attacker objectives.
- Laundering AI Authority with Adversarial Examples
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
- Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...
- Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Reference graph
Works this paper leans on
- [1] Mind2Web: Towards a Generalist Agent for the Web. arXiv preprint arXiv:2306.06070, 2023.
- [2] GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983, 2023.
- [3] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint, 2023.
- [4] VisualWebArena (Koh et al., 2024): extends WebArena with additional websites and tasks that focus on visual reasoning, to facilitate research on vision-based web agents.
- [5] SeeClick (Cheng et al., 2024): finetunes an LMM to interact with web pages using screenshots alone as input.
- [6] AppAgent (Zhang et al., 2023): builds agents that operate smartphone apps using GPT-4V as the backbone.