super hub Canonical reference

WebArena: A Realistic Web Environment for Building Autonomous Agents

Abishek Sridhar, Frank F. Xu, Hao Zhu, Robert Lo, Shuyan Zhou, Xuhui Zhou · 2023 · cs.AI · arXiv 2307.13854

Canonical reference. 76% of citing Pith papers cite this work as background.

235 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 235 citing papers more from Abishek Sridhar arXiv PDF

abstract

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 35 dataset 4 baseline 2 method 1

citation-polarity summary

background 32 use dataset 4 baseline 2 support 2 unclear 1 use method 1

claims ledger

abstract With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop

authors

Abishek Sridhar Frank F. Xu Hao Zhu Robert Lo Shuyan Zhou Xuhui Zhou

co-cited works

representative citing papers

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

cs.AI · 2026-06-01 · conditional · novelty 8.0

Current benchmarks overlook abstention competence in agents due to compliance bias; a new three-gap taxonomy and metrics (Safety Rate, Usability Rate, Informed Refusal Rate) demonstrate tunable safety-usability tradeoffs in preliminary tests across five model families.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

cs.AI · 2026-05-12 · conditional · novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.

Same-Origin Policy for Agentic Browsers

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

The paper builds SOPBench showing frequent SOP violations in agentic browsers and introduces SOPGuard to enforce the policy with low overhead in BrowserOS.

HLL: Can Agents Cross Humanity's Last Line of Verification?

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

cs.CR · 2026-05-26 · unverdicted · novelty 7.0

AgenticVBench evaluates frontier VLMs on 100 real-world video post-production tasks across four families, with the best agent stack scoring just over 30% versus human experts.

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

cs.SE · 2026-05-26 · conditional · novelty 7.0

LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.

JobBench: Aligning Agent Work With Human Will

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

cs.CR · 2026-05-25 · unverdicted · novelty 7.0

CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

cs.SE · 2026-05-22 · unverdicted · novelty 7.0

VISTA is a new benchmark for end-to-end visual spec-to-web-app generation by LLM agents, featuring five prompt conditions, manual UI annotations, multi-metric evaluation, and results on four agent systems showing partial decoupling of visual and functional performance.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

citing papers explorer

Showing 43 of 43 citing papers after filters.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty cs.CL · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 59 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages cs.CL · 2026-06-27 · unverdicted · none · ref 14 · internal anchor
SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 99 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
MemGym: a Long-Horizon Memory Environment for LLM Agents cs.CL · 2026-05-20 · unverdicted · none · ref 60 · internal anchor
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents cs.CL · 2026-05-18 · unverdicted · none · ref 31 · 2 links · internal anchor
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation cs.CL · 2026-05-12 · conditional · none · ref 33 · 2 links · internal anchor
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems cs.CL · 2026-05-10 · unverdicted · none · ref 28 · internal anchor
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 68 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios cs.CL · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
DV-World is a benchmark of 260 tasks across spreadsheet manipulation, visual evolution, and interactive intent alignment that shows state-of-the-art AI models achieve less than 50% overall performance on real-world data visualization challenges.
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents cs.CL · 2026-03-16 · unverdicted · none · ref 8 · internal anchor
CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents cs.CL · 2026-03-16 · unverdicted · none · ref 7 · internal anchor
TelcoAgent-Bench is a new framework that evaluates how well multilingual LLM agents recognize intents, execute troubleshooting steps, and stay consistent across variations in telecom scenarios.
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments cs.CL · 2025-06-09 · unverdicted · none · ref 17 · internal anchor
EconWebArena is a new benchmark with 360 curated economic tasks across 82 authoritative websites for evaluating multimodal web agents on navigation, grounding, and data extraction.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 34
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Signal-Driven Observation for Long-Horizon Web Agents cs.CL · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents cs.CL · 2026-05-28 · unverdicted · none · ref 6 · internal anchor
LiteCoder-Terminal-Gen creates synthetic terminal datasets that, after SFT and DMPO on Qwen models, yield 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro.
SEAL: Synergistic Co-Evolution of Agents and Learning Environments cs.CL · 2026-05-23 · unverdicted · none · ref 42 · internal anchor
SEAL co-evolves LLM agents and environments via shared turn-level failure diagnoses, yielding +8.25 to +26.25 point gains on tool-use tasks with only 400 samples.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations cs.CL · 2026-05-21 · unverdicted · none · ref 46 · internal anchor
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate cs.CL · 2026-05-20 · unverdicted · none · ref 62 · internal anchor
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents cs.CL · 2026-05-20 · unverdicted · none · ref 45 · internal anchor
Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents cs.CL · 2026-05-19 · unverdicted · none · ref 44 · internal anchor
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? cs.CL · 2026-05-15 · unverdicted · none · ref 66 · internal anchor
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
Argus: Evidence Assembly for Scalable Deep Research Agents cs.CL · 2026-05-15 · unverdicted · none · ref 17 · 2 links · internal anchor
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction cs.CL · 2026-05-11 · unverdicted · none · ref 8 · 2 links · internal anchor
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
Milestone-Guided Policy Learning for Long-Horizon Language Agents cs.CL · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving cs.CL · 2026-04-08 · unverdicted · none · ref 44 · internal anchor
TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI cs.CL · 2026-03-16 · unverdicted · none · ref 23 · internal anchor
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies cs.CL · 2026-02-10 · conditional · none · ref 43 · internal anchor
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 274 · internal anchor
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency cs.CL · 2025-11-29 · unverdicted · none · ref 5 · internal anchor
CryptoBench is a new dynamic benchmark for LLM agents in cryptocurrency that reveals a retrieval-prediction imbalance in model performance.
A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains cs.CL · 2025-08-18 · unverdicted · none · ref 22 · internal anchor
The paper proposes Amazon-Bench, a functionality-grounded benchmark for web agents in e-commerce that generates diverse task queries from webpage elements and evaluates both task performance and safety risks.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 39 · internal anchor
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 56 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
MARCA: A Checklist-Based Benchmark for Multilingual Web Search cs.CL · 2026-04-15 · accept · none · ref 33
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 60 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings cs.CL · 2026-04-26 · unverdicted · none · ref 3 · internal anchor
MTRouter learns turn-level model utility predictors from logged trajectories using history-model joint embeddings, delivering 58.7% cost reduction on ScienceWorld and 43.4% on HLE while matching or exceeding GPT-5 performance.
AlphaEval: Evaluating Agents in Production cs.CL · 2026-04-14 · unverdicted · none · ref 17 · internal anchor
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
Near-Miss: Latent Policy Failure Detection in Agentic Workflows cs.CL · 2026-03-31 · unverdicted · none · ref 2 · internal anchor
A new metric detects latent policy failures in 8-17% of agent trajectories with mutating tool calls on the Airlines benchmark, even when final outcomes are correct.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks cs.CL · 2025-03-12 · unverdicted · none · ref 61 · internal anchor
Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 83
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability cs.CL · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub cs.CL · 2026-03-19 · unverdicted · none · ref 23 · internal anchor
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models cs.CL · 2026-04-22 · unverdicted · none · ref 57
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

WebArena: A Realistic Web Environment for Building Autonomous Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer