
arxiv: 2606.03889 · v2 · pith:GN52DZPLnew · submitted 2026-06-02 · 💻 cs.CL

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

Pith reviewed 2026-06-28 09:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent benchmarks · realistic evaluation · developer agents · task reconstruction · model performance · live sessions · automatic scoring

The pith

RealClawBench converts live developer-agent sessions into 281 reproducible tasks via environment reconstruction and automatic scorers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build an evaluation set that matches the distribution and difficulty of the tasks users actually give to deployed agents, rather than synthetic ones. It does so by taking real OpenClaw sessions, rebuilding their execution environments, and adding deterministic scorers that turn implicit or locally dependent requests into automatically graded items. A sympathetic reader would care because this approach lets benchmark scores track genuine progress toward agents that succeed on the workloads that matter in practice. The released set of 281 tasks stays within a maximum Jensen-Shannon divergence of 0.0448 from the source distribution, and across the 14 models tested the strongest reaches only a 65.8 percent success rate.

Core claim

By applying reconstructed execution environments and deterministic verifiable scorers to real user sessions, RealClawBench produces 281 executable tasks that preserve the source distribution with a maximum Jensen-Shannon divergence of 0.0448. Evaluation of 14 models shows the strongest one succeeds on 65.8 percent of the tasks.
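The fidelity figure in this claim is a Jensen-Shannon divergence between the source pool and the released set. A minimal sketch of how such a maximum could be computed over several categorical dimensions is shown here. The row format, the function names, and the choice of base-2 logarithms are assumptions made for illustration; the text above does not state which logarithm base or binning the authors used.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn a mapping of category -> count into category -> probability."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m), skipping zero-probability categories of a.
        return sum(
            a[k] * math.log(a[k] / m[k], base)
            for k in keys
            if a.get(k, 0.0) > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


def max_js(source_rows, final_rows, dims):
    """Largest per-dimension divergence between source pool and final set."""
    return max(
        js_divergence(
            normalize(Counter(r[d] for r in source_rows)),
            normalize(Counter(r[d] for r in final_rows)),
        )
        for d in dims
    )
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), so a value of 0.0448 would sit near the low end of that range under this convention.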

What carries the argument

Reconstructed execution environments paired with deterministic verifiable scorers that convert live sessions into controlled, automatically scored benchmark tasks.
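To make the pairing concrete, here is a toy sketch of a task that bundles a reconstructed workspace with a deterministic scorer. Everything in it is hypothetical: the `Task` fields, the `run_task` helper, the example instruction, and the toy agents are illustrations only and are not taken from the RealClawBench release.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


@dataclass(frozen=True)
class Task:
    instruction: str                 # standalone request given to the agent
    seed_files: Dict[str, str]       # relative path -> content to recreate
    scorer: Callable[[Path], bool]   # deterministic check on the final workspace


def run_task(task, agent):
    """Rebuild the workspace, let the agent act, then score the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel, content in task.seed_files.items():
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content, encoding="utf-8")
        agent(task.instruction, root)
        return task.scorer(root)


def version_bumped(root):
    """Pass only if the version file holds exactly the expected value."""
    f = root / "pkg" / "VERSION"
    return f.exists() and f.read_text(encoding="utf-8").strip() == "1.2.4"


task = Task(
    instruction="Bump the patch version in pkg/VERSION.",
    seed_files={"pkg/VERSION": "1.2.3\n"},
    scorer=version_bumped,
)


def toy_agent(instruction, root):
    """Stand-in for a real agent: increments the patch component."""
    path = root / "pkg" / "VERSION"
    major, minor, patch = path.read_text(encoding="utf-8").strip().split(".")
    path.write_text(f"{major}.{minor}.{int(patch) + 1}\n", encoding="utf-8")


def noop_agent(instruction, root):
    """Stand-in for an agent that does nothing."""
```

Because the scorer inspects only the final state of a freshly rebuilt workspace, repeated runs of the same agent behavior yield the same verdict, which is the property the word "deterministic" points at.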

If this is right

  • Real sessions can be turned into benchmark tasks while keeping their statistical distribution intact.
  • Current models leave more than a third of realistic developer tasks unsolved.
  • Live-derived benchmarks give a direct signal of how close agents are to handling actual deployed workloads.
  • The construction method scales to larger pools while maintaining low divergence from the source.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reconstruction approach could be applied to sessions from other agent platforms to produce comparable realistic tests.
  • Gains on this benchmark would likely require better agent handling of local file systems and clarification of underspecified requests.
  • Repeated evaluation against live-derived sets could steer research toward capabilities that directly affect real usage rather than benchmark-specific tricks.

Load-bearing premise

That the reconstructed environments and scorers faithfully reproduce the original sessions without systematically altering their difficulty or success criteria.

What would settle it

A side-by-side run showing that success rates on the benchmark tasks differ markedly from success rates observed when the same agents or humans attempt the original unreconstructed sessions.
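One way such a side-by-side comparison could be quantified is a two-proportion z-test on the two success rates. The sketch below is an assumption about how a reader might run the check, not a procedure described in the paper, and the function name is hypothetical. If the same tasks appear in both arms, a paired test would be the more appropriate choice.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled standard error, which assumes
    the two samples are independent.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both arms are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A small p-value would indicate that reconstruction shifted the success rate, which is the outcome that would undercut the load-bearing premise above.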

Figures

Figures reproduced from arXiv: 2606.03889 by Guangxiang Zhao, Lin Sun, Tong Yang, Weihong Lin, Xiangzheng Zhang, Yaoming Li, Yilun Yao, Yuxuan Tian, Zhewen Tan, Zongwei Lv.

Figure 1. Conceptual overview of the realism gap and …
Figure 2. The pipeline samples deployed sessions, filters low-quality or unsafe cases, reconstructs execution environments, rewrites requests into standalone instructions, and builds deterministic verifiers. The resulting benchmark therefore measures agents on tasks closer to what users actually experience in deployment. Because the same construction process can be rerun on later sessions, REALCLAWBENCH a…
Figure 3. Task composition of the final evaluation set.
Figure 4. Construction-stage distribution fidelity. Across task type, user turns, tools defined, and tool calls, the final …
Figure 5. Accuracy-cost tradeoff between sample-average pass rate and per-case cost. The figure shows that higher spending does not directly translate into higher accuracy, and that several mid-cost models offer competitive frontier points.
Figure 6. Subtask robustness matrix showing model pass rates across the full subtask taxonomy. The matrix reveals …
Figure 7. Live-window retention after normalizing each …
Figure 8. Distribution drift under live benchmarking. The figure tests whether the later window remains comparable …
Figure 9. Scorer-validation metrics comparing agree…
Figure 10. Verdict confusion matrices with human labels as rows and scorer labels as columns. The full-evidence …
Original abstract

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
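The abstract says the 281 tasks were sampled so as to preserve the source distribution but does not describe the sampling procedure here. One common way to get that property along a single categorical dimension is proportional stratified sampling, sketched below. The function name, the dictionary row format, and the largest-remainder rounding are assumptions for illustration, not the authors' method.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, k, seed=0):
    """Draw k rows so each stratum keeps roughly its share of the pool."""
    if k > len(rows):
        raise ValueError("k cannot exceed the pool size")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible

    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    # Exact proportional quota per stratum, then round down.
    n = len(rows)
    exact = {s: k * len(items) / n for s, items in strata.items()}
    quota = {s: int(e) for s, e in exact.items()}

    # Hand the leftover slots to the strata with the largest remainders.
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    sample = []
    for s, items in strata.items():
        sample.extend(rng.sample(items, quota[s]))
    return sample
```

Stratifying on one key does not by itself guarantee low divergence on other dimensions such as user turns or tool calls, so a real pipeline would need either joint strata or a post-hoc check on each dimension.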

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RealClawBench, a benchmark framework derived from real OpenClaw developer-agent sessions. It uses reconstructed execution environments and deterministic verifiable scorers to convert live sessions into 281 reproducible, automatically scored tasks that preserve the source distribution (maximum JS divergence of 0.0448). Evaluation of 14 contemporary models shows the best system solves only 65.8% of tasks, indicating substantial headroom on realistic workloads.

Significance. If the reconstruction and scoring mechanisms faithfully reproduce original session difficulty and success criteria, the benchmark would provide a valuable resource for measuring agent performance on actual deployed workloads rather than synthetic tasks. The code release and distribution-matching approach are positive elements supporting potential reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim that the benchmark captures 'real-world difficulty' and provides a 'practical path toward benchmarks that better measure agent capability in actual use' rests on the unvalidated assumption that reconstructed environments and deterministic scorers introduce no systematic distortion relative to original sessions. Only surface-level task-type distribution matching (JS divergence 0.0448) is reported; no evidence is given on execution-path fidelity, scorer encoding of implicit user intent, or empirical match of success rates to the live sessions.
  2. [Abstract] Abstract: the reported 65.8% ceiling on 281 tasks is presented as evidence of headroom, but without any described validation (e.g., human review of scorer accuracy or ablation comparing reconstructed vs. original outcomes), it is unclear whether this figure reflects genuine model limitations or artifacts of the reconstruction process.
minor comments (1)
  1. [Abstract] The abstract states 'Code is available at:https://anonymous.4open.science/r/real-claw-bench-582B' but provides no further details on what artifacts (environments, scorers, task definitions) are included or how they can be used to reproduce the reported results.
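The human review of scorer accuracy that the major comments ask for is usually summarized with an agreement rate and a chance-corrected statistic. The sketch below computes raw agreement and Cohen's kappa from a confusion matrix with human labels as rows and scorer labels as columns, the orientation named in the Figure 10 caption. The function name is hypothetical, and whether the paper reports these particular metrics cannot be confirmed from the truncated captions.

```python
def agreement_stats(matrix):
    """Return (observed agreement, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] counts items a human labeled i and the scorer labeled j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Fraction of items on the diagonal, where both raters agree.
    observed = sum(matrix[i][i] for i in range(n)) / total

    # Agreement expected by chance from the two marginal distributions.
    row_marg = [sum(matrix[i]) / total for i in range(n)]
    col_marg = [sum(matrix[i][j] for i in range(n)) / total for j in range(n)]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))

    if expected == 1.0:
        # Both raters used a single label throughout: kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

Kappa near 1 would indicate that the deterministic scorers track human verdicts well beyond chance, while a value near 0 would support the concern that the reported pass rates partly reflect scorer artifacts.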

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation of reconstruction fidelity. We address each major point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the benchmark captures 'real-world difficulty' and provides a 'practical path toward benchmarks that better measure agent capability in actual use' rests on the unvalidated assumption that reconstructed environments and deterministic scorers introduce no systematic distortion relative to original sessions. Only surface-level task-type distribution matching (JS divergence 0.0448) is reported; no evidence is given on execution-path fidelity, scorer encoding of implicit user intent, or empirical match of success rates to the live sessions.

    Authors: The primary quantitative support for fidelity in the manuscript is the reported maximum Jensen-Shannon divergence of 0.0448 on task-type distributions between the 281 sampled tasks and the source pool. The methods describe environment reconstruction from session logs and the design of deterministic scorers tied to observable session outcomes. We did not perform ablations on execution-path equivalence, human review of scorer alignment with implicit intent, or direct comparisons of model success rates between reconstructed and original live sessions. We will add a limitations paragraph in revision to explicitly note these gaps and the practical challenges of obtaining such matches for live developer sessions. revision: partial

  2. Referee: [Abstract] Abstract: the reported 65.8% ceiling on 281 tasks is presented as evidence of headroom, but without any described validation (e.g., human review of scorer accuracy or ablation comparing reconstructed vs. original outcomes), it is unclear whether this figure reflects genuine model limitations or artifacts of the reconstruction process.

    Authors: The 65.8% figure is the highest success rate observed across the 14 models on the 281 reconstructed tasks. We present it as indicating headroom because the tasks are drawn from real sessions while preserving the measured distribution. We agree that the absence of human scorer validation or direct reconstructed-vs-original ablations leaves open the possibility of reconstruction artifacts. In revision we will update the abstract and results discussion to qualify the headroom claim as based on distribution-matched tasks and to reference the new limitations paragraph. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark is constructed via independent data collection and direct evaluation.

Full rationale

The paper presents RealClawBench as the result of sampling real sessions, reconstructing environments, and building deterministic scorers—an empirical data-collection pipeline with no equations, fitted parameters, or derivations. The reported 65.8% solve rate is a direct measurement on the released tasks; the JS divergence of 0.0448 is a post-hoc distribution statistic, not a prediction that reduces to any fitted input. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is therefore self-contained against external benchmarks and contains no reductions of the enumerated circular kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5776 in / 1141 out tokens · 17893 ms · 2026-06-28T09:54:26.365019+00:00 · methodology

