pith. machine review for the scientific record.

arxiv: 2604.08523 · v1 · submitted 2026-04-09 · 💻 cs.CL · cs.AI

Recognition: unknown

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:29 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords AI agents · web navigation · benchmark · online tasks · evaluation framework · ClawBench · frontier models · task completion

The pith

AI agents complete only a small portion of everyday online tasks on live websites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClawBench as an evaluation framework with 153 tasks that mirror routine activities people perform online, such as purchases, bookings, and job applications across 144 production platforms in 15 categories. These tasks demand multi-step navigation, extraction of information from user documents, and precise completion of detailed forms, all on dynamic live sites rather than static sandboxes. Evaluations of seven frontier models show low success, with the strongest result being Claude Sonnet 4.6 at 33.3 percent task completion. A lightweight interception layer blocks only final submissions to keep evaluations safe while preserving real complexity. The work positions progress on ClawBench as a step toward agents that can serve as dependable general-purpose online assistants.
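To make the task format concrete, here is a minimal sketch of what a ClawBench-style task record might look like. The schema and every field name below (category, platform_url, user_documents, verification_conditions) are illustrative assumptions; the paper states only that tasks are human-authored with explicit verification conditions and span 15 categories on 144 live platforms.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Hypothetical ClawBench-style task record (illustrative only)."""
    task_id: str
    category: str                   # one of 15 fine-grained categories
    platform_url: str               # one of 144 live production platforms
    instruction: str                # natural-language goal for the agent
    user_documents: list[str] = field(default_factory=list)        # documents the agent must read
    verification_conditions: list[str] = field(default_factory=list)  # explicit pass criteria

# Example instance mirroring the kinds of tasks the paper describes
example = Task(
    task_id="jobs-0042",
    category="job applications",
    platform_url="https://example-jobs.com/apply",   # placeholder URL
    instruction="Apply for the posted data-analyst role using the attached resume.",
    user_documents=["resume.pdf"],
    verification_conditions=[
        "Application form fields match the resume contents",
        "Final submission request was issued (and intercepted)",
    ],
)
```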

Core claim

ClawBench consists of 153 tasks spanning 15 categories on 144 live platforms that require agents to obtain information from documents, execute multi-step workflows, and perform write-heavy form filling on production websites. The framework employs a lightweight interception layer that captures and blocks only the final submission request, allowing safe testing without real-world effects. Evaluations across seven frontier models reveal that both proprietary and open-source systems finish only a small fraction of the tasks, with Claude Sonnet 4.6 reaching 33.3 percent success.

What carries the argument

ClawBench, the evaluation framework that runs agents directly on live production websites while using an interception layer to prevent actual submissions and thereby maintain safety.
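A minimal sketch of how an interception layer of this kind could be built with Playwright request routing. This is an assumed mechanism, not the paper's implementation; the URL markers and the fake-success response are placeholders for whatever criteria and behavior the authors actually use.

```python
# Sketch: block only requests that look like a final submission, let everything
# else (navigation, assets, form autofill XHRs) pass through untouched.
from playwright.sync_api import sync_playwright

SUBMIT_MARKERS = ("/checkout", "/submit", "/apply", "/book")  # hypothetical patterns

def maybe_block(route, request):
    if request.method == "POST" and any(m in request.url for m in SUBMIT_MARKERS):
        # Pretend the submission succeeded so the agent sees a normal flow,
        # but never let the request reach the live server.
        route.fulfill(status=200, content_type="application/json", body='{"status": "ok"}')
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", maybe_block)    # intercept all traffic, block only submissions
    page.goto("https://example.org")   # the agent would drive this page
    browser.close()
```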

If this is right

  • Current frontier models lack the capabilities needed to automate most routine online tasks reliably.
  • Benchmarks must shift from static sandboxes to production environments to capture real dynamic challenges.
  • Agents require stronger skills in document information use, long-horizon planning, and accurate form completion.
  • Progress measured by ClawBench would directly advance agents toward functioning as general-purpose assistants.
  • Both proprietary and open-source models exhibit similar limitations on these practical web workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better results on ClawBench would likely translate to agents handling more personal administrative work without human oversight.
  • The benchmark could be extended to include tasks with financial or legal consequences to test higher-stakes reliability.
  • Training focused on live-site navigation and error recovery might close the gap shown in the current evaluations.

Load-bearing premise

The 153 tasks are representative of everyday online activities and the interception layer preserves full task complexity without introducing evaluation artifacts.

What would settle it

A new model that completes more than 50 percent of the same 153 tasks under the identical live-site and interception conditions would indicate the reported performance ceiling no longer holds.

Figures

Figures reproduced from arXiv: 2604.08523 by Dongfu Jiang, Huaisong Zhang, Huifeng Yin, Junwen Miao, Kai Zou, Kelsey R. Allen, Liang Chen, Minyi Lei, Penghui Du, Ping Nie, Songcheng Cai, Wendong Xu, Wenhu Chen, Xian Wu, Xiaochen Wang, Xuan Lu, Yi Lu, Yipeng Zhu, Yubo Wang, Yunzhuo Hao, Yuxuan Zhang.

Figure 1: CLAWBENCH overview. Left: 153 tasks across 15 life categories. Middle: existing benchmarks evaluate agents in offline sandboxes with static HTML and fixed DOM structures; CLAWBENCH evaluates on live websites with real-world complexity and provides rich, traceable verdicts via an agentic evaluator. Right: Claude-Sonnet-4.6 and GPT-5.4 achieve 65-75% task completion on established benchmarks such as OSWorld…
Figure 2: Main results: success rate on CLAWBENCH for 7 frontier models. Even the strongest model (Claude Sonnet 4.6) completes only 33.3% of tasks, while two of seven models score below 5%.
Figure 3: The CLAWBENCH evaluation pipeline. Setup: a human-authored task with explicit verification conditions. Execution: the agent operates in a real browser while five layers of behavioral data are recorded. Evaluation: the recorded trajectory is scored against a human ground-truth trajectory via an Agentic Evaluator, producing a binary pass/fail verdict with step-level justification. We evaluate 7 frontier models… (see the evaluator interface sketch after the figure list)
Figure 4: Task taxonomy of CLAWBENCH. Inner ring: 8 high-level category groups; outer ring: 15 fine-grained categories. The dataset spans 153 tasks across diverse real-world domains.
[Companion chart, "Benchmark Saturation": Claude-Sonnet-4.6 score (%) across PinchBench, Claw-Eval, OSWorld-Verified, WebArena-Verified, WildClawBench, and ClawBench; top models are saturating existing benchmarks while ClawBench remains far from saturated.]
Figure 6: Agentic Evaluator Inference Pipeline. The evaluator determines whether a browser…
Figure 7: Evaluation protocol. The evaluator takes as input the task instruction together with…
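Figures 3, 6, and 7 describe an Agentic Evaluator that compares the recorded agent trajectory against a human ground-truth trajectory and returns a binary pass/fail verdict with step-level justification. The sketch below only illustrates the shape of that interface with a crude action-matching rule; the paper's evaluator is an LLM judge reasoning over five layers of recorded behavioral data, and all names here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    step_justifications: list[str]   # one short rationale per verified step

def evaluate_trajectory(agent_steps: list[dict], reference_steps: list[dict]) -> Verdict:
    """Crude stand-in for the paper's LLM-driven Agentic Evaluator.

    Passes only if every reference step's action appears somewhere in the agent
    trajectory; the real evaluator reasons over recorded behavioral data,
    which this sketch does not attempt to model.
    """
    justifications = []
    passed = True
    agent_actions = [s.get("action") for s in agent_steps]
    for i, ref in enumerate(reference_steps):
        if ref["action"] in agent_actions:
            justifications.append(f"step {i}: matched reference action '{ref['action']}'")
        else:
            justifications.append(f"step {i}: missing reference action '{ref['action']}'")
            passed = False
    return Verdict(passed=passed, step_justifications=justifications)

# Toy usage
verdict = evaluate_trajectory(
    agent_steps=[{"action": "open_form"}, {"action": "fill_name"}],
    reference_steps=[{"action": "open_form"}, {"action": "fill_name"}, {"action": "submit"}],
)
print(verdict.passed)   # False: the 'submit' step is missing
```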
original abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper introduces ClawBench, a benchmark of 153 everyday online tasks spanning 144 live platforms in 15 categories. Unlike prior benchmarks using static pages, it runs on production websites with a lightweight interception layer to safely block final submissions. Evaluations of seven frontier models reveal low completion rates, with Claude Sonnet 4.6 achieving only 33.3%, suggesting significant limitations in current AI agents for real-world web tasks.

Significance. Should the evaluation methodology prove robust, ClawBench offers a valuable, realistic testbed for AI agent capabilities in practical scenarios such as purchases, bookings, and job applications. The focus on dynamic live platforms and demanding write-heavy operations (e.g., detailed form filling and document-based information extraction) provides a stronger signal than sandbox-based evaluations. The reported results establish a clear baseline for measuring progress toward reliable general-purpose agents.

major comments (1)
  1. [§3 (Benchmark Design)] The description of the interception layer asserts that it 'preserves the full complexity, dynamic nature, and challenges of real-world web interaction' by blocking only the final submission. However, the manuscript does not include any validation such as ablation studies comparing performance with and without the layer, analysis of altered page states or feedback loops, or human performance baselines on the same tasks. This is a load-bearing assumption for the central claim that the low success rates (e.g., 33.3% for Claude Sonnet 4.6) reflect inherent model limitations rather than artifacts from the evaluation setup.
minor comments (2)
  1. [Abstract] It would improve clarity to report success rates for all evaluated models rather than highlighting only the best one (Claude Sonnet 4.6 at 33.3%).
  2. [Experiments] The experiments do not report the number of trials per task, the variance across runs, or statistical tests for the reported performance differences; including these would help assess the reliability of the findings (see the sketch below).
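To illustrate minor comment 2: even with a single pass/fail run per task, the sampling uncertainty on a 33.3% success rate over 153 tasks can be bounded with a standard Wilson interval. The 51/153 split below is an assumption consistent with the reported percentage, not a figure taken from the paper.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# 51/153 ≈ 33.3% (assumed split; the paper reports only the percentage)
lo, hi = wilson_interval(51, 153)
print(f"{lo:.1%} – {hi:.1%}")   # roughly 26% – 41%
```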

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's detailed feedback on our ClawBench benchmark paper, the positive remarks on its significance, and the push for a more robust evaluation methodology. Below, we provide a point-by-point response to the major comment.

point-by-point responses
  1. Referee: [§3 (Benchmark Design)] The description of the interception layer asserts that it 'preserves the full complexity, dynamic nature, and challenges of real-world web interaction' by blocking only the final submission. However, the manuscript does not include any validation such as ablation studies comparing performance with and without the layer, analysis of altered page states or feedback loops, or human performance baselines on the same tasks. This is a load-bearing assumption for the central claim that the low success rates (e.g., 33.3% for Claude Sonnet 4.6) reflect inherent model limitations rather than artifacts from the evaluation setup.

    Authors: We agree that empirical validation of the interception layer's neutrality would strengthen our claims. The layer is implemented as a minimal proxy that only prevents the final submission HTTP request from reaching the server, without modifying any preceding network responses, DOM elements, or JavaScript behavior. All agent actions, including navigation, clicking, typing, and reading page content, occur exactly as they would in an unmediated session. To address this, we have revised Section 3 to include a more precise specification of the interception logic, including the criteria used to identify the 'final submission' request. We have also added a limitations paragraph noting the absence of ablations and human baselines, and we will prioritize collecting human performance data on a subset of tasks for a follow-up study. This revision clarifies the methodology and acknowledges the assumption's importance. revision: yes
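The rebuttal turns on how the 'final submission' request is identified. Below is a minimal sketch of what such matching criteria could look like as a pure predicate; the method, path, and form-field heuristics are illustrative assumptions, not the authors' specification.

```python
import re
from urllib.parse import urlparse

# Illustrative heuristics only: a state-changing method plus a submission-like
# path or a form payload carrying checkout/confirmation details.
SUBMIT_PATH = re.compile(r"/(checkout|submit|apply|book(ing)?|order)s?(/|$)", re.I)
SUBMIT_FIELDS = {"payment", "card_number", "confirm", "signature"}

def is_final_submission(method: str, url: str, form_fields: set[str]) -> bool:
    if method.upper() not in {"POST", "PUT"}:
        return False                    # reads and navigation are never blocked
    path = urlparse(url).path
    return bool(SUBMIT_PATH.search(path)) or bool(form_fields & SUBMIT_FIELDS)

# Example: a checkout POST is intercepted, a search GET is not.
print(is_final_submission("POST", "https://shop.example/checkout", {"card_number"}))  # True
print(is_final_submission("GET", "https://shop.example/search?q=shoes", set()))       # False
```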

Circularity Check

0 steps flagged

Pure empirical benchmark paper with no derivations or self-referential reductions.

full rationale

ClawBench is an evaluation framework that defines 153 tasks on live platforms and reports direct success rates for 7 models (e.g., Claude Sonnet 4.6 at 33.3%). The manuscript contains no equations, fitted parameters, predictions derived from prior inputs, or mathematical derivations. The central claims are empirical measurements on the introduced benchmark; the interception layer is presented as a design choice that preserves complexity, not as a result derived from or reducing to the reported percentages. No self-citation chains, ansatzes, or renamings of known results appear in the load-bearing steps. The paper is self-contained as a benchmark introduction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation framework rests on assumptions about task representativeness and the non-interference of the safety interception mechanism.

axioms (2)
  • domain assumption: The selected 153 tasks represent typical everyday online activities across 15 categories.
    The benchmark's validity depends on this selection being representative of routine life and work tasks.
  • domain assumption: Intercepting only the final submission request preserves evaluation validity while ensuring safety.
    This is invoked to justify testing on live sites without real-world side effects.

pith-pipeline@v0.9.0 · 5591 in / 1258 out tokens · 92227 ms · 2026-05-10T17:29:22.834511+00:00 · methodology


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL · 2026-05 · unverdicted · novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

    cs.CR · 2026-05 · conditional · novelty 8.0

    LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...

  3. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  4. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  5. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  6. NeuroClaw Technical Report

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    NeuroClaw introduces a three-tier multi-agent framework and NeuroBench benchmark that improve executability and reproducibility scores for neuroimaging tasks when used with multimodal LLMs.

  7. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · cited by 7 Pith papers · 4 internal anchors

  1. [1] Anthropic. Introducing computer use. https://www.anthropic.com/news/3-5-models-and-computer-use, 2025. Accessed 2026-03-20. Also cited: Anthropic. Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025; Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6.

  2. [2] Devin Ersoy, Brandon Lee, Ananth Shreekumar, Arjun Arunasalam, Muhammad Ibrahim, Antonio Bianchi, and Z. Berkay Celik. Investigating the impact of dark patterns on LLM-based web agents. arXiv preprint arXiv:2510.18113.

  3. [3] Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, et al. REAL: Benchmarking autonomous agents on deterministic simulations of real websites. arXiv preprint arXiv:2504.11543.

  4. [4] A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023; Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics (ACL).

  5. [5] Zefang Liu and Yinzhu Quan. EconWebArena: Benchmarking autonomous agents on economic tasks in realistic web environments. arXiv preprint arXiv:2506.08136.

  6. [6] OpenAI. Introducing Operator. https://openai.com/index/introducing-operator/.

  7. [7] Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. WebCanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373.

  8. [8] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276. Accessed 2026-03-20.

  9. [9] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453 [cs.CL]. https://arxiv.org/abs/2508.20453.

  10. [10] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8938–8968.

  11. [11] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.

  12. [12] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR).