arxiv: 2401.13649 · v2 · pith:JD2ZQ3EInew · submitted 2024-01-24 · 💻 cs.LG · cs.CL· cs.CV

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh , Robert Lo , Lawrence Jang , Vikram Duvvur , Ming Chong Lim , Po-Yu Huang , Graham Neubig , Shuyan Zhou

show 2 more authors

Ruslan Salakhutdinov Daniel Fried

This is my paper

Pith reviewed 2026-05-17 15:15 UTC · model grok-4.3

classification 💻 cs.LG cs.CLcs.CV

keywords multimodal agentsweb benchmarksvisual groundingautonomous agentsLLM evaluationweb navigationagent limitationsvisually grounded tasks

0 comments

The pith

VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisualWebArena as a benchmark for testing autonomous agents on web tasks that rely on visual information such as images and layouts. It evaluates several state-of-the-art language models and multimodal agents on these tasks. The results highlight that text-only agents face major limitations because they cannot use visual cues effectively. Even multimodal agents show notable gaps in their ability to interpret instructions and execute actions correctly on complex websites. This evaluation framework helps identify what is needed to build more capable web agents for real-world use.

Core claim

VisualWebArena consists of a set of diverse and complex web-based tasks designed to evaluate the capabilities of multimodal autonomous agents. Agents must process image-text inputs, interpret natural language instructions, and perform actions on websites to meet user objectives. Extensive evaluations reveal limitations in text-only LLM agents and gaps in the performance of state-of-the-art multimodal models.

What carries the argument

The VisualWebArena benchmark, which comprises realistic visually grounded tasks on various websites.

If this is right

Text-only approaches are inadequate for most web automation tasks that involve visual elements.
Multimodal agents require further development to handle image interpretation and action execution reliably.
The benchmark serves as a tool to measure and improve the performance of future autonomous web agents.
Insights from the analysis point toward specific areas where current models fall short in visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be developed for other domains like mobile apps or desktop interfaces to test visual agent capabilities more broadly.
If the tasks are representative, improving visual processing in agents could lead to better automation of everyday computer tasks.
Connections to other agent evaluation methods might help isolate whether the gaps are specific to web navigation or general to multimodal reasoning.

Load-bearing premise

The chosen websites and task templates are representative of the visual and interaction challenges in real-world web use.

What would settle it

If a multimodal agent achieves high success rates on the benchmark tasks without demonstrating effective use of visual information, such as by succeeding equally well when images are removed or obscured.

read the original abstract

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VisualWebArena introduces a practical benchmark for multimodal web agents but the narrow selection of sites limits how much we can generalize from the results.

read the letter

The key thing here is that VisualWebArena is a benchmark for multimodal agents doing visually grounded tasks on actual websites, and it demonstrates clear limitations in both text-only and current multimodal models. What stands out as new is the shift to using live website screenshots and multimodal inputs instead of text-only web agent benchmarks. The paper evaluates several models quantitatively and adds qualitative error analysis to point out where they fall short. It does well in releasing the code, baselines, and data, which supports reproducibility. The abstract indicates they have metrics and analysis across models. The soft spot is around how representative the four chosen sites and task templates are. The stress-test concern about covering dynamic UIs or other variations seems fair, as the paper has limited justification for the site selection even if task construction is solid. This could mean the identified gaps don't apply as broadly as claimed. This work is aimed at people building and evaluating autonomous agents for web tasks. Anyone working on multimodal LLMs or web automation would get value from the benchmark and the insights. I think it deserves a serious referee. The benchmark fills a gap and has public artifacts, so it should go to peer review with attention to the generalizability question.

Referee Report

2 major / 2 minor

Summary. The paper introduces VisualWebArena, a benchmark for evaluating multimodal autonomous agents on realistic visually grounded web tasks. It includes a collection of tasks across selected websites (e.g., Reddit, Amazon, GitHub) that require agents to process image-text inputs, interpret instructions, and execute actions. Through quantitative performance metrics and qualitative error analysis of state-of-the-art LLM-based and multimodal agents, the authors identify limitations of text-only agents and gaps in current multimodal capabilities, while releasing code, baselines, and data publicly.

Significance. If the tasks prove representative, this benchmark fills a notable gap in web agent evaluation by emphasizing visual grounding, which is essential for most real interfaces. The empirical analysis and public resources offer concrete directions for improving multimodal agents and could accelerate progress in the field.

major comments (2)

[Benchmark construction] Benchmark construction section: the selection of only four websites and associated task templates receives limited justification regarding diversity and coverage of real-world visual challenges (e.g., dynamic JavaScript-heavy UIs, mobile views, or dense text-image mixes). Since the central claims about limitations of text-only agents and gaps in multimodal agents rest on these tasks being representative, additional evidence or explicit discussion of selection criteria and edge-case coverage is needed to support generalizability.
[Evaluation and analysis] Evaluation and analysis section: while quantitative success rates and qualitative error breakdowns are reported across models, the absence of detailed task-selection criteria or ablation on visual vs. textual components makes it harder to isolate whether the observed gaps are due to inherent multimodal shortcomings or to the specific distribution of visual cues in the chosen sites.

minor comments (2)

[Abstract] Abstract: the phrasing 'comprises of' is grammatically imprecise and should be revised to 'comprises' or 'consists of'.
[Abstract] Abstract: specifying the exact multimodal models evaluated (rather than 'several multimodal models') would improve immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment below and revised the paper accordingly to strengthen the justification for benchmark design and evaluation analysis.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction section: the selection of only four websites and associated task templates receives limited justification regarding diversity and coverage of real-world visual challenges (e.g., dynamic JavaScript-heavy UIs, mobile views, or dense text-image mixes). Since the central claims about limitations of text-only agents and gaps in multimodal agents rest on these tasks being representative, additional evidence or explicit discussion of selection criteria and edge-case coverage is needed to support generalizability.

Authors: We agree that additional justification is needed to support the representativeness of the benchmark. In the revised manuscript, we have substantially expanded the Benchmark Construction section to explicitly detail our website selection criteria: we prioritized popular, publicly accessible sites spanning diverse domains (e-commerce on Amazon, social discussion on Reddit, collaborative development on GitHub, and information lookup on a fourth site) that exhibit varied visual interfaces. We now discuss coverage of real-world challenges including dynamic JS elements, dense text-image combinations, and interactive components, with concrete examples of how task templates incorporate these. While we acknowledge that exhaustive coverage of all UI types (such as mobile views) is beyond the scope of this initial benchmark, we provide evidence from pilot studies showing these sites capture key visual grounding requirements that text-only agents fail on. This supports the generalizability of our claims about multimodal limitations. revision: yes
Referee: [Evaluation and analysis] Evaluation and analysis section: while quantitative success rates and qualitative error breakdowns are reported across models, the absence of detailed task-selection criteria or ablation on visual vs. textual components makes it harder to isolate whether the observed gaps are due to inherent multimodal shortcomings or to the specific distribution of visual cues in the chosen sites.

Authors: We appreciate this observation and have revised the Evaluation and Analysis section to include more detailed task-selection criteria, explaining how tasks were curated to necessitate visual information (e.g., identifying UI elements or content only discernible from screenshots rather than HTML text). To better isolate multimodal gaps, we have added an ablation study comparing agent performance with and without visual inputs on a subset of tasks where visual cues are essential. The results, now reported in the paper, confirm that performance drops significantly without visuals, supporting that the gaps are due to multimodal shortcomings rather than site-specific distributions. We discuss limitations of this ablation approach and how it aligns with the benchmark's focus on visually grounded tasks. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

full rationale

The paper presents an empirical benchmark for multimodal web agents with no mathematical derivations, fitted parameters, or load-bearing self-citations that reduce claims to inputs by construction. Performance results are measured directly against external real-world websites and tasks rather than being defined in terms of the benchmark itself. The identification of agent limitations arises from quantitative and qualitative analysis on these independent sites, rendering the evaluation self-contained without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are introduced; the contribution is an empirical evaluation framework.

pith-pipeline@v0.9.0 · 5572 in / 935 out tokens · 41202 ms · 2026-05-17T15:15:32.796807+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents.
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
cs.LG 2026-04 conditional novelty 7.0

GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV 2026-04 unverdicted novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
cs.SE 2026-03 unverdicted novelty 7.0

Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
cs.CR 2025-10 unverdicted novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
cs.AI 2024-05 accept novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 6.0

ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
cs.MM 2026-05 unverdicted novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 unverdicted novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
cs.AI 2026-03 conditional novelty 6.0

AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downs...
WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
cs.AI 2026-03 unverdicted novelty 6.0

WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
cs.CV 2024-08 unverdicted novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
cs.HC 2026-02 unverdicted novelty 5.0

Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
cs.CL 2025-03 unverdicted novelty 5.0

Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking
cs.SE 2026-05 unverdicted novelty 4.0

Claude outperformed other LLM families in generating functional single-file HTML under fixed public conditions, but neither technical variables nor prompt details reliably predicted 24-hour social media impressions.
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
cs.MA 2026-05 unverdicted novelty 4.0

The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 19 Pith papers · 5 internal anchors

[1]

Scaling Instruction-Finetuned Language Models

Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416. Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Mind2web: Towards a generalist agent for the web. NeurIPS. Stan Franklin and Art Graesser. 1996. Is it an agent, or just a program?: A taxonomy for autonomous agents. In International workshop on agent theories, architectures, and languages, pages 21–35. Springer. Gemini Team Google. 2023. Gemini: a family of highly capable multimodal models. arXiv prepri...

work page internal anchor Pith review Pith/arXiv arXiv 1996
[3]

Language models can solve computer tasks. NeurIPS. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi

work page
[4]

Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large lan- guage models. ICML. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. ECCV. Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tian- lin Shi, and...

work page internal anchor Pith review Pith/arXiv arXiv 2014
[5]

GAIA: a benchmark for General AI Assistants

Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983. OpenAI. 2023. Gpt-4 technical report. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- roll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Training language models to follow instruc- tions with human feedback. NeurIPS. Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susan- nah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446. T...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

You should only issue an action that is valid given the current observation

work page
[8]

You should only issue one action at a time

work page
[9]

You should follow the examples to reason step by step and then issue the next action

work page
[10]

In summary, the next action I will perform is

Generate the action in the correct format. Start with a “In summary, the next action I will perform is” phrase, followed by action insideˋˋˋˋˋˋ. For example, “In summary, the next action I will perform isˋˋˋclick [1234]ˋˋˋ”

work page
[11]

Don’t generate anything after stop

Issue stop action when you think you have achieved the objective. Don’t generate anything after stop. Figure 16: System message used in our SoM agent. 24 example_user: IMAGES: (1) current page screenshot OBSERV ATION:

work page
[12]

[IMG] [Image, description: hp fx-7010dn fax machine, url: B08GKZ3ZKD.0.jpg]

work page
[13]

[A] [HP CB782A#ABA 640 Inkjet Fax Machine (Renewed)] [ ] [StaticText] [$279.49]

work page
[14]

[BUTTON] [Add to Cart]

work page
[15]

[A] [Add to Wish List]

work page
[16]

This page list the information of HP Inkjet Fax Machine, which is the product identified in the objective

[A] [Add to Compare] URL: http://onestopmarket.com/office-products/office-electronics.html OBJECTIVE: What is the price of HP Inkjet Fax Machine? PREVIOUS ACTION: None example_assistant: Let’s think step-by-step. This page list the information of HP Inkjet Fax Machine, which is the product identified in the objective. Its price is $279.49. I think I have ...

work page
[17]

[IMG] [ ] [ ] [StaticText] [Submitted by kneechalice t3_yid9lu 1 year ago]

work page
[18]

[A] [45 comments] [ ] [StaticText] [[I ate] Maple Pecan Croissant Submitted by AccordingtoJP t3_y3hrpn 1 year ago]

work page
[19]

[IMG] [ ] [ ] [StaticText] [Submitted by AccordingtoJP t3_y3hrpn 1 year ago]

work page
[20]

PREVIOUS ACTION: None example_assistant: Let’s think step-by-step

[A] [204 comments] URL: http://reddit.com OBJECTIVE: Tell me what the top comment on the croissant post says. PREVIOUS ACTION: None example_assistant: Let’s think step-by-step. This page has a post titled ’[I ate] Maple Pecan Croissant’, which is the post mentioned in the objective. In order to find the top comment, I will navigate into the comments secti...

work page
[21]

[SELECT] [Select a category]

work page
[22]

[BUTTON] [Search] [ ] [StaticText] [Latest Listings] [ ] [StaticText] [Atlas Powered Audio System w/ Tripod 150.00 $ Music instruments Borough of Red Lion (Pennsylvania) 2023/11/16]

work page 2023
[23]

[IMG] [Atlas Powered Audio System w/ Tripod]

work page
[24]

[A] [Atlas Powered Audio System w/ Tripod] [ ] [StaticText] [150.00 $] [ ] [StaticText] [Neptune Gaming Console 350.00 $ Video gaming Pennwyn (Pennsylvania) 2023/11/16]

work page 2023
[25]

[IMG] [Neptune Gaming Console]

work page
[26]

PREVIOUS ACTION: None example_assistant: Let’s think step-by-step

[A] [Neptune Gaming Console] [ ] [StaticText] [350.00 $] URL: http://classifieds.com OBJECTIVE: Help me find the cheapest dark colored guitar. PREVIOUS ACTION: None example_assistant: Let’s think step-by-step. The objective is to find the cheapest dark colored guitar on the site. The site has a search box whose ID is [5]. I can search for guitars by enter...

work page