WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Pith reviewed 2026-05-15 22:40 UTC · model grok-4.3
The pith
WebVoyager shows that large multimodal models can drive an end-to-end agent that completes open-ended tasks on live websites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebVoyager is a Large Multimodal Model (LMM) powered web agent that completes user instructions end-to-end by interacting with real-world websites. On a new benchmark compiled from tasks on 15 popular sites, it records a 59.1 percent task success rate, exceeding both GPT-4 (All Tools) and the text-only version of WebVoyager. The automatic evaluation protocol, which relies on GPT-4V's multimodal understanding, reaches 85.3 percent agreement with human judgments of task completion.
What carries the argument
WebVoyager, an end-to-end agent that takes both screenshot images and text from live web pages as input and outputs actions to navigate and complete tasks.
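The loop this describes is easy to make concrete: capture a screenshot and the visible text of the current page, ask the model for one action, execute it, and repeat until the model answers. The sketch below is a minimal illustration of that observation-act cycle, not the paper's implementation; the Playwright calls are real library APIs, while query_lmm, the prompt wording, and the action vocabulary (Click, Type, Scroll, ANSWER) are hypothetical stand-ins for the agent's actual prompt and action space.

```python
# Minimal sketch of an LMM-driven browsing loop (illustrative, not WebVoyager's code).
# The Playwright calls are real; query_lmm and the action vocabulary are
# hypothetical stand-ins for the agent's actual prompt and action space.
from playwright.sync_api import sync_playwright

def query_lmm(prompt: str, screenshot_png: bytes) -> str:
    """Hypothetical helper: send the prompt plus screenshot to a multimodal
    model and return a reply such as 'Click 12' or 'ANSWER; <text>'."""
    raise NotImplementedError

def run_task(task: str, start_url: str, max_steps: int = 15) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        answer = None
        for _ in range(max_steps):
            screenshot = page.screenshot()           # visual observation
            visible_text = page.inner_text("body")   # textual observation
            prompt = (
                f"Task: {task}\n"
                f"Visible page text (truncated): {visible_text[:2000]}\n"
                "Reply with one action: Click <n> | Type <n>; <text> | "
                "Scroll up/down | ANSWER; <text>"
            )
            reply = query_lmm(prompt, screenshot)
            if "ANSWER" in reply:                    # the model declares the task done
                answer = reply.split(";", 1)[-1].strip()
                break
            # Executing Click/Type would require grounding the numeric label
            # back to a concrete page element; that mapping is elided here.
        browser.close()
        return answer
```

Running a loop like this against a live site, rather than a cached snapshot or simulator, is what the end-to-end claim amounts to.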
If this is right
- Web agents can now operate directly on live sites instead of static snapshots or simulators.
- Multimodal input yields measurable gains over text-only agents on realistic web tasks.
- An automatic multimodal judge can substitute for human evaluation at 85 percent agreement.
- Success on fifteen distinct sites indicates the method generalizes across common web interfaces.
Where Pith is reading between the lines
- If success rates continue to rise, such agents could handle routine personal tasks like form filling or information gathering without human oversight.
- The same multimodal loop might transfer to other screen-based environments such as desktop software or mobile apps.
- Pairing the agent with external tools or memory could address remaining failure cases on complex multi-step tasks.
Load-bearing premise
The assumption that GPT-4V's judgments of task completion on website screenshots match how humans would rate the same outcomes.
What would settle it
A side-by-side study in which human raters independently score the same set of WebVoyager task traces; a human-rated success rate differing from the reported 59.1 percent by more than ten points would undermine the GPT-4V-based evaluation.
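Such a study boils down to comparing two verdict lists over the same task traces. The sketch below shows that comparison under stated assumptions: agreement_report is a hypothetical helper, the verdict lists are placeholders rather than the paper's data, and the normal-approximation interval is one simple way to judge whether the human-rated success rate sits within ten points of the automatic figure.

```python
# Sketch of the side-by-side check described above: same tasks, two judges.
# The verdict lists are placeholders; nothing here reproduces the paper's data.
from math import sqrt

def agreement_report(auto: list[bool], human: list[bool]) -> dict:
    assert len(auto) == len(human) and auto
    n = len(auto)
    agree = sum(a == h for a, h in zip(auto, human)) / n
    auto_sr = sum(auto) / n
    human_sr = sum(human) / n
    # Normal-approximation 95% interval on the human success rate, to see
    # whether it sits within ~10 points of the automatic figure.
    half_width = 1.96 * sqrt(human_sr * (1 - human_sr) / n)
    return {
        "n": n,
        "judge_human_agreement": agree,     # paper reports 0.853 on a held-out set
        "automatic_success_rate": auto_sr,  # paper reports 0.591 overall
        "human_success_rate": human_sr,
        "human_sr_95ci": (human_sr - half_width, human_sr + half_width),
    }
```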
Original abstract
The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebVoyager, an LMM-powered web agent for end-to-end interaction with real-world websites to complete open-ended user instructions. It compiles a benchmark of tasks from 15 popular sites and proposes an automatic evaluation protocol that leverages GPT-4V to judge task success, reporting a 59.1% success rate that exceeds both GPT-4 (All Tools) and a text-only WebVoyager ablation, with the GPT-4V judge achieving 85.3% agreement with human raters on a held-out set.
Significance. If the evaluation protocol is shown to be robust, the result would be significant for demonstrating that multimodal models can outperform text-only agents on dynamic, real websites rather than simulators or static snapshots, providing a concrete step toward practical autonomous web agents.
Major comments (1)
- §4.2: The headline 59.1% success rate and all comparative claims rest exclusively on GPT-4V judgments of open-ended tasks against live page states. While 85.3% agreement with humans is reported on a held-out subset, the manuscript provides no human labels on the full test set, no inter-judge variance across alternative LMMs, and no ablation measuring how the judge handles partial or context-dependent completions; this leaves the reported performance gap vulnerable to systematic misjudgment.
Minor comments (1)
- The benchmark construction (task selection criteria, website sampling, and any train/test split details) is described only at a high level; adding an explicit table or appendix listing the 15 sites and representative task templates would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of demonstrating multimodal agents on live websites. We address the concern about the robustness of the GPT-4V evaluation protocol in detail below.
Point-by-point responses
- Referee: §4.2: The headline 59.1% success rate and all comparative claims rest exclusively on GPT-4V judgments of open-ended tasks against live page states. While 85.3% agreement with humans is reported on a held-out subset, the manuscript provides no human labels on the full test set, no inter-judge variance across alternative LMMs, and no ablation measuring how the judge handles partial or context-dependent completions; this leaves the reported performance gap vulnerable to systematic misjudgment.
Authors: We appreciate the referee's emphasis on rigorous validation of the automatic evaluator. The 85.3% agreement was measured on a held-out set of 200 tasks (stratified across the 15 sites and task categories) that were independently labeled by two human annotators, with disagreements resolved by discussion. We acknowledge that labeling the entire test set would further strengthen confidence and that the current manuscript does not include inter-judge comparisons with other LMMs or explicit ablations on partial/context-dependent cases. In the revised manuscript we will: (1) report human judgments on an additional 150 tasks drawn from the full test set, (2) add a comparison using Claude-3-Opus as an alternative judge on the same held-out set, and (3) include a qualitative analysis of 50 edge cases (including partial completions and context-dependent tasks) with both GPT-4V and human verdicts. These additions will be presented in an expanded §4.2 and a new appendix. We believe the core comparative claims remain supported by the existing evidence, but the requested extensions will make the evaluation protocol more transparent.
Revision: partial
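The evaluation protocol at issue amounts to showing a multimodal judge the task instruction, the agent's final screenshots, and its textual response, then asking for a binary verdict. Below is a hedged sketch of such a judge; the message structure follows the OpenAI chat completions API for image inputs, but the model name, prompt wording, and verdict parsing are assumptions rather than the paper's exact protocol.

```python
# Sketch of a screenshot-based success judge (illustrative; prompt wording and
# verdict parsing are assumptions, not the paper's exact evaluation protocol).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_success(task: str, final_screenshots: list[bytes], agent_answer: str) -> bool:
    content = [{
        "type": "text",
        "text": (
            "You are evaluating a web agent.\n"
            f"Task instruction: {task}\n"
            f"Agent's final response: {agent_answer}\n"
            "Based only on the screenshots and response, reply with exactly "
            "'SUCCESS' or 'NOT SUCCESS'."
        ),
    }]
    for png in final_screenshots:
        b64 = base64.b64encode(png).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    reply = client.chat.completions.create(
        model="gpt-4o",  # any multimodal judge; the paper used GPT-4V
        messages=[{"role": "user", "content": content}],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return verdict.startswith("SUCCESS")
```

The inter-judge comparison proposed in the rebuttal would amount to running the same traces through a function like this with a second model and computing agreement between the two verdict lists.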
Circularity Check
No significant circularity in WebVoyager evaluation or claims
Full rationale
The paper's core result is an empirical success rate (59.1%) on a benchmark of tasks compiled from 15 real-world websites. The automatic evaluation protocol using GPT-4V is described separately and validated against human judgments on a held-out set (85.3% agreement), providing an independent check rather than a self-referential definition. No equations, predictions, or derivations reduce by construction to fitted inputs, self-citations, or renamed known results; the benchmark tasks and success criteria are external to the model. The derivation chain consists of system description followed by standard empirical measurement, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Large multimodal models can reliably interpret webpage screenshots and choose appropriate actions.
- Domain assumption: Task completion can be judged from the final visual state of the webpage.
Forward citations
Cited by 22 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
Computer-use agents show attack success rates above 90% on benign instructions that produce harm via context or execution, with safety-aligned Claude 4.5 Sonnet at 73% ASR rising to 92.7% in multi-agent deployments.
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
WASP benchmark reveals web agents are vulnerable to simple prompt injections with partial success rates up to 86%, but agents frequently fail to complete attacker objectives.
- Laundering AI Authority with Adversarial Examples
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
- Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.
- GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...
- Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Reference graph
Works this paper leans on
- [1] Mind2Web: Towards a Generalist Agent for the Web. arXiv preprint arXiv:2306.06070, 2023.
- [2] GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983, 2023.
- [3] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint, 2023.
- [4] VisualWebArena (Koh et al., 2024): extends WebArena with additional websites and tasks that focus on visual reasoning, to facilitate research on vision-based web agents.
- [5] SeeClick (Cheng et al., 2024): finetunes an LMM to interact with web pages using screenshots alone as input.
- [6] AppAgent (Zhang et al., 2023): builds agents that operate smartphone apps using GPT-4V as the backbone.