VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Pith reviewed 2026-05-17 15:15 UTC · model grok-4.3
The pith
VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VisualWebArena consists of a set of diverse and complex web-based tasks designed to evaluate the capabilities of multimodal autonomous agents. Agents must process image-text inputs, interpret natural language instructions, and perform actions on websites to meet user objectives. Extensive evaluations reveal limitations in text-only LLM agents and gaps in the performance of state-of-the-art multimodal models.
What carries the argument
The VisualWebArena benchmark, which comprises realistic visually grounded tasks on various websites.
If this is right
- Text-only approaches are inadequate for most web automation tasks that involve visual elements.
- Multimodal agents require further development to handle image interpretation and action execution reliably.
- The benchmark serves as a tool to measure and improve the performance of future autonomous web agents.
- Insights from the analysis point toward specific areas where current models fall short in visual reasoning.
Where Pith is reading between the lines
- Similar benchmarks could be developed for other domains like mobile apps or desktop interfaces to test visual agent capabilities more broadly.
- If the tasks are representative, improving visual processing in agents could lead to better automation of everyday computer tasks.
- Connections to other agent evaluation methods might help isolate whether the gaps are specific to web navigation or general to multimodal reasoning.
Load-bearing premise
The chosen websites and task templates are representative of the visual and interaction challenges in real-world web use.
What would settle it
A multimodal agent that achieves high success rates on the benchmark tasks without making effective use of visual information, for example by succeeding equally well when images are removed or obscured, would undercut the premise that the tasks are visually grounded.
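One way to make this check concrete, sketched under assumptions: run the same agent on the same tasks twice, once with images intact and once with images removed or obscured, and compare success rates. The function names `success_rate` and `visual_reliance_gap` and the per-task boolean outcome format are hypothetical, and the outcomes in the example are placeholders for illustration, not results from the paper.

```python
from typing import Mapping


def success_rate(outcomes: Mapping[str, bool]) -> float:
    """Return the fraction of tasks marked successful."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return sum(outcomes.values()) / len(outcomes)


def visual_reliance_gap(with_images: Mapping[str, bool],
                        without_images: Mapping[str, bool]) -> float:
    """Success rate with images minus success rate without, on shared tasks."""
    shared = with_images.keys() & without_images.keys()
    return (success_rate({t: with_images[t] for t in shared})
            - success_rate({t: without_images[t] for t in shared}))


# Placeholder outcomes for illustration only (not results from the paper).
with_images = {"task_1": True, "task_2": True, "task_3": False, "task_4": True}
without_images = {"task_1": True, "task_2": False, "task_3": False, "task_4": False}

gap = visual_reliance_gap(with_images, without_images)
print(f"success-rate gap: {gap:.2f}")
```

A gap near zero would suggest the agent is not relying on the images; a large positive gap would suggest that it is.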
Original abstract
Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa.
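The interaction pattern the abstract describes (observe image-text input, read an instruction, issue actions on a site) can be sketched as a simple loop. This is a minimal illustration under assumptions, not the benchmark's actual interface: `Observation`, `ToyEnv`, `run_episode`, `scripted_policy`, and the action strings are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    screenshot: bytes  # rendered page image (empty placeholder in this toy example)
    page_text: str     # textual view of the page


# A policy maps (instruction, observation) to an action string.
Policy = Callable[[str, Observation], str]


class ToyEnv:
    """Hypothetical stand-in for a website: success means clicking element 1."""

    def __init__(self) -> None:
        self.done = False
        self.success = False

    def observe(self) -> Observation:
        return Observation(screenshot=b"", page_text="[1] button 'buy'")

    def step(self, action: str) -> None:
        if action == "click [1]":
            self.success = True
            self.done = True
        elif action == "stop":
            self.done = True


def run_episode(env: ToyEnv, policy: Policy, instruction: str,
                max_steps: int = 5) -> bool:
    """Alternate observing and acting until the episode ends or steps run out."""
    for _ in range(max_steps):
        if env.done:
            break
        env.step(policy(instruction, env.observe()))
    return env.success


def scripted_policy(instruction: str, obs: Observation) -> str:
    """Trivial rule-based policy; a real agent would call a multimodal model."""
    return "click [1]" if "buy" in obs.page_text else "stop"


result = run_episode(ToyEnv(), scripted_policy, "Buy the item")
print(result)
```

In a real evaluation the policy would receive an actual screenshot, and success would be judged by a task-specific checker rather than a hard-coded click.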
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VisualWebArena, a benchmark for evaluating multimodal autonomous agents on realistic visually grounded web tasks. It includes a collection of tasks across three website environments (Classifieds, Shopping, and Reddit) that require agents to process image-text inputs, interpret instructions, and execute actions. Through quantitative performance metrics and qualitative error analysis of state-of-the-art LLM-based and multimodal agents, the authors identify limitations of text-only agents and gaps in current multimodal capabilities, while releasing code, baselines, and data publicly.
Significance. If the tasks prove representative, this benchmark fills a notable gap in web agent evaluation by emphasizing visual grounding, which is essential for most real interfaces. The empirical analysis and public resources offer concrete directions for improving multimodal agents and could accelerate progress in the field.
Major comments (2)
- [Benchmark construction] Benchmark construction section: the selection of only three websites and associated task templates receives limited justification regarding diversity and coverage of real-world visual challenges (e.g., dynamic JavaScript-heavy UIs, mobile views, or dense text-image mixes). Since the central claims about limitations of text-only agents and gaps in multimodal agents rest on these tasks being representative, additional evidence or explicit discussion of selection criteria and edge-case coverage is needed to support generalizability.
- [Evaluation and analysis] Evaluation and analysis section: while quantitative success rates and qualitative error breakdowns are reported across models, the absence of detailed task-selection criteria or ablation on visual vs. textual components makes it harder to isolate whether the observed gaps are due to inherent multimodal shortcomings or to the specific distribution of visual cues in the chosen sites.
Minor comments (2)
- [Abstract] Abstract: the phrasing 'comprises of' is grammatically imprecise and should be revised to 'comprises' or 'consists of'.
- [Abstract] Abstract: specifying the exact multimodal models evaluated (rather than 'several multimodal models') would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment below and revised the paper accordingly to strengthen the justification for benchmark design and evaluation analysis.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: the selection of only three websites and associated task templates receives limited justification regarding diversity and coverage of real-world visual challenges (e.g., dynamic JavaScript-heavy UIs, mobile views, or dense text-image mixes). Since the central claims about limitations of text-only agents and gaps in multimodal agents rest on these tasks being representative, additional evidence or explicit discussion of selection criteria and edge-case coverage is needed to support generalizability.
  Authors: We agree that additional justification is needed to support the representativeness of the benchmark. In the revised manuscript, we have substantially expanded the Benchmark Construction section to explicitly detail our website selection criteria: we prioritized sites spanning diverse domains (classified listings on Classifieds, e-commerce on Shopping, and social discussion on Reddit) that exhibit varied visual interfaces. We now discuss coverage of real-world challenges including dynamic JS elements, dense text-image combinations, and interactive components, with concrete examples of how task templates incorporate these. While we acknowledge that exhaustive coverage of all UI types (such as mobile views) is beyond the scope of this initial benchmark, we provide evidence from pilot studies showing these sites capture key visual grounding requirements that text-only agents fail on. This supports the generalizability of our claims about multimodal limitations.
  Revision: yes
- Referee: [Evaluation and analysis] Evaluation and analysis section: while quantitative success rates and qualitative error breakdowns are reported across models, the absence of detailed task-selection criteria or ablation on visual vs. textual components makes it harder to isolate whether the observed gaps are due to inherent multimodal shortcomings or to the specific distribution of visual cues in the chosen sites.
  Authors: We appreciate this observation and have revised the Evaluation and Analysis section to include more detailed task-selection criteria, explaining how tasks were curated to necessitate visual information (e.g., identifying UI elements or content only discernible from screenshots rather than HTML text). To better isolate multimodal gaps, we have added an ablation study comparing agent performance with and without visual inputs on a subset of tasks where visual cues are essential. The results, now reported in the paper, confirm that performance drops significantly without visuals, supporting that the gaps are due to multimodal shortcomings rather than site-specific distributions. We discuss limitations of this ablation approach and how it aligns with the benchmark's focus on visually grounded tasks.
  Revision: yes
Circularity Check
No circularity in empirical benchmark evaluation
Full rationale
The paper presents an empirical benchmark for multimodal web agents with no mathematical derivations, fitted parameters, or load-bearing self-citations that reduce claims to inputs by construction. Performance results are measured directly against external real-world websites and tasks rather than being defined in terms of the benchmark itself. The identification of agent limitations arises from quantitative and qualitative analysis on these independent sites, rendering the evaluation self-contained without any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents."
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
  GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
  Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
  Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
- SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
  SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
  AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
  MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
  ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
- AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
  AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downs...
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
  WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
- From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
  Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
- Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
  Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
- Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
  Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
- The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking
  Claude outperformed other LLM families in generating functional single-file HTML under fixed public conditions, but neither technical variables nor prompt details reliably predicted 24-hour social media impressions.
- Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
  The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
- Agentic Reasoning for Large Language Models
  The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...