super hub Canonical reference

WebArena: A Realistic Web Environment for Building Autonomous Agents

Abishek Sridhar, Frank F. Xu, Hao Zhu, Robert Lo, Shuyan Zhou, Xuhui Zhou · 2023 · cs.AI · arXiv 2307.13854

Canonical reference. 76% of citing Pith papers cite this work as background.

194 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 194 citing papers more from Abishek Sridhar arXiv PDF

abstract

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 35 dataset 4 baseline 2 method 1

citation-polarity summary

background 32 use dataset 4 baseline 2 support 2 unclear 1 use method 1

claims ledger

abstract With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop

authors

Abishek Sridhar Frank F. Xu Hao Zhu Robert Lo Shuyan Zhou Xuhui Zhou

co-cited works

representative citing papers

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

cs.AI · 2026-05-12 · conditional · novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

State-Centric Decision Process

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.

citing papers explorer

Showing 33 of 33 citing papers after filters.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare cs.AI · 2026-05-12 · conditional · none · ref 44 · internal anchor
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 59 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 67 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection cs.CR · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 68 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios cs.CR · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 38 · internal anchor
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 50 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Inference-Time Budget Control for LLM Search Agents cs.AI · 2026-05-07 · unverdicted · none · ref 77 · internal anchor
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 96 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment cs.AI · 2025-06-09 · unverdicted · none · ref 27 · internal anchor
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains cs.AI · 2024-06-17 · unverdicted · none · ref 29 · internal anchor
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 63
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment cs.AI · 2026-05-12 · unverdicted · none · ref 50 · internal anchor
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox cs.AI · 2026-05-11 · unverdicted · none · ref 16 · 2 links · internal anchor
ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration cs.LG · 2026-05-09 · unverdicted · none · ref 61 · internal anchor
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 128 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 54 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training cs.LG · 2026-04-26 · unverdicted · none · ref 70 · internal anchor
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 173 · internal anchor
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 135 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum cs.AI · 2026-04-20 · unverdicted · none · ref 12
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology cs.AI · 2026-04-19 · unverdicted · none · ref 56
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
FileGram: Grounding Agent Personalization in File-System Behavioral Traces cs.CV · 2026-04-06 · unverdicted · none · ref 28
FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work cs.AI · 2026-05-07 · conditional · none · ref 39 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 45 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 199 · internal anchor
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 139 · internal anchor
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications cs.IR · 2026-05-08 · unreviewed · ref 82 · 2 links · internal anchor

WebArena: A Realistic Web Environment for Building Autonomous Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer