pith. machine review for the scientific record.

arxiv: 2604.08523 · v1 · submitted 2026-04-09 · 💻 cs.CL · cs.AI

Recognition: unknown

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:29 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords AI agents · web navigation · benchmark · online tasks · evaluation framework · ClawBench · frontier models · task completion

The pith

AI agents complete only a small portion of everyday online tasks on live websites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClawBench as an evaluation framework with 153 tasks that mirror routine activities people perform online, such as purchases, bookings, and job applications across 144 production platforms in 15 categories. These tasks demand multi-step navigation, extraction of information from user documents, and precise completion of detailed forms, all on dynamic live sites rather than static sandboxes. Evaluations of seven frontier models show low success, with the strongest result being Claude Sonnet 4.6 at 33.3 percent task completion. A lightweight interception layer blocks only final submissions to keep evaluations safe while preserving real complexity. The work positions progress on ClawBench as a step toward agents that can serve as dependable general-purpose online assistants.
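To make the task format concrete, here is a minimal sketch of what a ClawBench-style task record might look like. The schema and every field name below (category, platform_url, user_documents, verification_conditions) are illustrative assumptions; the paper states only that tasks are human-authored with explicit verification conditions and span 15 categories on 144 live platforms.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Hypothetical ClawBench-style task record (illustrative only)."""
    task_id: str
    category: str                   # one of 15 fine-grained categories
    platform_url: str               # one of 144 live production platforms
    instruction: str                # natural-language goal for the agent
    user_documents: list[str] = field(default_factory=list)        # documents the agent must read
    verification_conditions: list[str] = field(default_factory=list)  # explicit pass criteria

# Example instance mirroring the kinds of tasks the paper describes
example = Task(
    task_id="jobs-0042",
    category="job applications",
    platform_url="https://example-jobs.com/apply",   # placeholder URL
    instruction="Apply for the posted data-analyst role using the attached resume.",
    user_documents=["resume.pdf"],
    verification_conditions=[
        "Application form fields match the resume contents",
        "Final submission request was issued (and intercepted)",
    ],
)
```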

Core claim

ClawBench consists of 153 tasks spanning 15 categories on 144 live platforms that require agents to obtain information from documents, execute multi-step workflows, and perform write-heavy form filling on production websites. The framework employs a lightweight interception layer that captures and blocks only the final submission request, allowing safe testing without real-world effects. Evaluations across seven frontier models reveal that both proprietary and open-source systems finish only a small fraction of the tasks, with Claude Sonnet 4.6 reaching 33.3 percent success.

What carries the argument

ClawBench, the evaluation framework that runs agents directly on live production websites while using an interception layer to prevent actual submissions and thereby maintain safety.
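A minimal sketch of how an interception layer of this kind could be built with Playwright request routing. This is an assumed mechanism, not the paper's implementation; the URL markers and the fake-success response are placeholders for whatever criteria and behavior the authors actually use.

```python
# Sketch: block only requests that look like a final submission, let everything
# else (navigation, assets, form autofill XHRs) pass through untouched.
from playwright.sync_api import sync_playwright

SUBMIT_MARKERS = ("/checkout", "/submit", "/apply", "/book")  # hypothetical patterns

def maybe_block(route, request):
    if request.method == "POST" and any(m in request.url for m in SUBMIT_MARKERS):
        # Pretend the submission succeeded so the agent sees a normal flow,
        # but never let the request reach the live server.
        route.fulfill(status=200, content_type="application/json", body='{"status": "ok"}')
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", maybe_block)    # intercept all traffic, block only submissions
    page.goto("https://example.org")   # the agent would drive this page
    browser.close()
```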

If this is right

  • Current frontier models lack the capabilities needed to automate most routine online tasks reliably.
  • Benchmarks must shift from static sandboxes to production environments to capture real dynamic challenges.
  • Agents require stronger skills in document information use, long-horizon planning, and accurate form completion.
  • Progress measured by ClawBench would directly advance agents toward functioning as general-purpose assistants.
  • Both proprietary and open-source models exhibit similar limitations on these practical web workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better results on ClawBench would likely translate to agents handling more personal administrative work without human oversight.
  • The benchmark could be extended to include tasks with financial or legal consequences to test higher-stakes reliability.
  • Training focused on live-site navigation and error recovery might close the gap shown in the current evaluations.

Load-bearing premise

The 153 tasks are representative of everyday online activities and the interception layer preserves full task complexity without introducing evaluation artifacts.

What would settle it

A new model that completes more than 50 percent of the same 153 tasks under the identical live-site and interception conditions would indicate the reported performance ceiling no longer holds.

Figures

Figures reproduced from arXiv: 2604.08523 by Dongfu Jiang, Huaisong Zhang, Huifeng Yin, Junwen Miao, Kai Zou, Kelsey R. Allen, Liang Chen, Minyi Lei, Penghui Du, Ping Nie, Songcheng Cai, Wendong Xu, Wenhu Chen, Xian Wu, Xiaochen Wang, Xuan Lu, Yi Lu, Yipeng Zhu, Yubo Wang, Yunzhuo Hao, Yuxuan Zhang.

Figure 1: CLAWBENCH overview. Left: 153 tasks across 15 life categories. Middle: existing benchmarks evaluate agents in offline sandboxes with static HTML and fixed DOM structures; CLAWBENCH evaluates on live websites with real-world complexity and provides rich, traceable verdicts via an agentic evaluator. Right: Claude-Sonnet-4.6 and GPT-5.4 achieve 65-75% task completion on established benchmarks such as OSWorld…
Figure 2: Main results: success rate on CLAWBENCH for 7 frontier models. Even the strongest model (Claude Sonnet 4.6) completes only 33.3% of tasks, while two of seven models score below 5%.
Figure 3: The CLAWBENCH evaluation pipeline. Setup: a human-authored task with explicit verification conditions. Execution: the agent operates in a real browser while five layers of behavioral data are recorded. Evaluation: the recorded trajectory is scored against a human ground-truth trajectory via an Agentic Evaluator, producing a binary pass/fail verdict with step-level justification. We evaluate 7 frontier models… (see the evaluator interface sketch after the figure list)
Figure 4: Task taxonomy of CLAWBENCH. Inner ring: 8 high-level category groups; outer ring: 15 fine-grained categories. The dataset spans 153 tasks across diverse real-world domains.
[Companion chart, "Benchmark Saturation": Claude-Sonnet-4.6 score (%) across PinchBench, Claw-Eval, OSWorld-Verified, WebArena-Verified, WildClawBench, and ClawBench; top models are saturating existing benchmarks while ClawBench remains far from saturated.]
Figure 6: Agentic Evaluator Inference Pipeline. The evaluator determines whether a browser…
Figure 7: Evaluation protocol. The evaluator takes as input the task instruction together with…
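Figures 3, 6, and 7 describe an Agentic Evaluator that compares the recorded agent trajectory against a human ground-truth trajectory and returns a binary pass/fail verdict with step-level justification. The sketch below only illustrates the shape of that interface with a crude action-matching rule; the paper's evaluator is an LLM judge reasoning over five layers of recorded behavioral data, and all names here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    step_justifications: list[str]   # one short rationale per verified step

def evaluate_trajectory(agent_steps: list[dict], reference_steps: list[dict]) -> Verdict:
    """Crude stand-in for the paper's LLM-driven Agentic Evaluator.

    Passes only if every reference step's action appears somewhere in the agent
    trajectory; the real evaluator reasons over recorded behavioral data,
    which this sketch does not attempt to model.
    """
    justifications = []
    passed = True
    agent_actions = [s.get("action") for s in agent_steps]
    for i, ref in enumerate(reference_steps):
        if ref["action"] in agent_actions:
            justifications.append(f"step {i}: matched reference action '{ref['action']}'")
        else:
            justifications.append(f"step {i}: missing reference action '{ref['action']}'")
            passed = False
    return Verdict(passed=passed, step_justifications=justifications)

# Toy usage
verdict = evaluate_trajectory(
    agent_steps=[{"action": "open_form"}, {"action": "fill_name"}],
    reference_steps=[{"action": "open_form"}, {"action": "fill_name"}, {"action": "submit"}],
)
print(verdict.passed)   # False: the 'submit' step is missing
```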
original abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper introduces ClawBench, a benchmark of 153 everyday online tasks spanning 144 live platforms in 15 categories. Unlike prior benchmarks using static pages, it runs on production websites with a lightweight interception layer to safely block final submissions. Evaluations of seven frontier models reveal low completion rates, with Claude Sonnet 4.6 achieving only 33.3%, suggesting significant limitations in current AI agents for real-world web tasks.

Significance. Should the evaluation methodology prove robust, ClawBench offers a valuable, realistic testbed for AI agent capabilities in practical scenarios such as purchases, bookings, and job applications. The focus on dynamic live platforms and demanding write-heavy operations (e.g., detailed form filling and document-based information extraction) provides a stronger signal than sandbox-based evaluations. The reported results establish a clear baseline for measuring progress toward reliable general-purpose agents.

major comments (1)
  1. [§3 (Benchmark Design)] The description of the interception layer asserts that it 'preserves the full complexity, dynamic nature, and challenges of real-world web interaction' by blocking only the final submission. However, the manuscript does not include any validation such as ablation studies comparing performance with and without the layer, analysis of altered page states or feedback loops, or human performance baselines on the same tasks. This is a load-bearing assumption for the central claim that the low success rates (e.g., 33.3% for Claude Sonnet 4.6) reflect inherent model limitations rather than artifacts from the evaluation setup.
minor comments (2)
  1. [Abstract] It would improve clarity to report success rates for all evaluated models rather than highlighting only the best one (Claude Sonnet 4.6 at 33.3%).
  2. [Experiments] The experiments do not report the number of trials per task, the variance across runs, or statistical tests for the reported performance differences; including these would help assess the reliability of the findings (see the sketch below).
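To illustrate minor comment 2: even with a single pass/fail run per task, the sampling uncertainty on a 33.3% success rate over 153 tasks can be bounded with a standard Wilson interval. The 51/153 split below is an assumption consistent with the reported percentage, not a figure taken from the paper.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# 51/153 ≈ 33.3% (assumed split; the paper reports only the percentage)
lo, hi = wilson_interval(51, 153)
print(f"{lo:.1%} – {hi:.1%}")   # roughly 26% – 41%
```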

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's detailed feedback on our ClawBench benchmark paper, the positive remarks on its significance, and the push for a more robust evaluation methodology. Below, we provide a point-by-point response to the major comment.

point-by-point responses
  1. Referee: [§3 (Benchmark Design)] The description of the interception layer asserts that it 'preserves the full complexity, dynamic nature, and challenges of real-world web interaction' by blocking only the final submission. However, the manuscript does not include any validation such as ablation studies comparing performance with and without the layer, analysis of altered page states or feedback loops, or human performance baselines on the same tasks. This is a load-bearing assumption for the central claim that the low success rates (e.g., 33.3% for Claude Sonnet 4.6) reflect inherent model limitations rather than artifacts from the evaluation setup.

    Authors: We agree that empirical validation of the interception layer's neutrality would strengthen our claims. The layer is implemented as a minimal proxy that only prevents the final submission HTTP request from reaching the server, without modifying any preceding network responses, DOM elements, or JavaScript behavior. All agent actions, including navigation, clicking, typing, and reading page content, occur exactly as they would in an unmediated session. To address this, we have revised Section 3 to include a more precise specification of the interception logic, including the criteria used to identify the 'final submission' request. We have also added a limitations paragraph noting the absence of ablations and human baselines, and we will prioritize collecting human performance data on a subset of tasks for a follow-up study. This revision clarifies the methodology and acknowledges the assumption's importance. revision: yes
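The rebuttal turns on how the 'final submission' request is identified. Below is a minimal sketch of what such matching criteria could look like as a pure predicate; the method, path, and form-field heuristics are illustrative assumptions, not the authors' specification.

```python
import re
from urllib.parse import urlparse

# Illustrative heuristics only: a state-changing method plus a submission-like
# path or a form payload carrying checkout/confirmation details.
SUBMIT_PATH = re.compile(r"/(checkout|submit|apply|book(ing)?|order)s?(/|$)", re.I)
SUBMIT_FIELDS = {"payment", "card_number", "confirm", "signature"}

def is_final_submission(method: str, url: str, form_fields: set[str]) -> bool:
    if method.upper() not in {"POST", "PUT"}:
        return False                    # reads and navigation are never blocked
    path = urlparse(url).path
    return bool(SUBMIT_PATH.search(path)) or bool(form_fields & SUBMIT_FIELDS)

# Example: a checkout POST is intercepted, a search GET is not.
print(is_final_submission("POST", "https://shop.example/checkout", {"card_number"}))  # True
print(is_final_submission("GET", "https://shop.example/search?q=shoes", set()))       # False
```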

Circularity Check

0 steps flagged

Pure empirical benchmark paper with no derivations or self-referential reductions.

full rationale

ClawBench is an evaluation framework that defines 153 tasks on live platforms and reports direct success rates for 7 models (e.g., Claude Sonnet 4.6 at 33.3%). The manuscript contains no equations, fitted parameters, predictions derived from prior inputs, or mathematical derivations. The central claims are empirical measurements on the introduced benchmark; the interception layer is presented as a design choice that preserves complexity, not as a result derived from or reducing to the reported percentages. No self-citation chains, ansatzes, or renamings of known results appear in the load-bearing steps. The paper is self-contained as a benchmark introduction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation framework rests on assumptions about task representativeness and the non-interference of the safety interception mechanism.

axioms (2)
  • domain assumption: The selected 153 tasks represent typical everyday online activities across 15 categories.
    The benchmark's validity depends on this selection being representative of routine life and work tasks.
  • domain assumption: Intercepting only the final submission request preserves evaluation validity while ensuring safety.
    This is invoked to justify testing on live sites without real-world side effects.

pith-pipeline@v0.9.0 · 5591 in / 1258 out tokens · 92227 ms · 2026-05-10T17:29:22.834511+00:00 · methodology


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL · 2026-05 · unverdicted · novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

    cs.CR · 2026-05 · conditional · novelty 8.0

    LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...

  3. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  4. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  5. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  6. NeuroClaw Technical Report

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    NeuroClaw introduces a three-tier multi-agent framework and NeuroBench benchmark that improve executability and reproducibility scores for neuroimaging tasks when used with multimodal LLMs.

  7. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · cited by 7 Pith papers · 4 internal anchors

  1. [1] Anthropic. Introducing computer use. https://www.anthropic.com/news/3-5-models-and-computer-use, 2025. Accessed 2026-03-20. Also cited: Anthropic. Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025; Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6.

  2. [2] Devin Ersoy, Brandon Lee, Ananth Shreekumar, Arjun Arunasalam, Muhammad Ibrahim, Antonio Bianchi, and Z. Berkay Celik. Investigating the impact of dark patterns on LLM-based web agents. arXiv preprint arXiv:2510.18113.

  3. [3] Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, et al. REAL: Benchmarking autonomous agents on deterministic simulations of real websites. arXiv preprint arXiv:2504.11543.

  4. [4] A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023; Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics (ACL).

  5. [5] Zefang Liu and Yinzhu Quan. EconWebArena: Benchmarking autonomous agents on economic tasks in realistic web environments. arXiv preprint arXiv:2506.08136.

  6. [6] OpenAI. Introducing Operator. https://openai.com/index/introducing-operator/.

  7. [7] Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. WebCanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373.

  8. [8] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276. Accessed 2026-03-20.

  9. [9] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453 [cs.CL]. https://arxiv.org/abs/2508.20453.

  10. [10] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8938–8968.

  11. [11] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.

  12. [12] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR).