
arxiv: 2606.07591 · v3 · pith:IIHFIAMAnew · submitted 2026-05-28 · cs.LG · cs.AI · cs.CL

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Pith reviewed 2026-06-29 08:22 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL
keywords autonomous scientific research · AI agent benchmark · re-discovery evaluation · LLM research capability · scientific workflow · multimodal rubrics · end-to-end research tasks · agent evaluation protocol

The pith

A benchmark shows that the strongest AI agents and LLMs average only about 21 to 26 points when tasked with re-discovering results from real published papers, well below the 50-point mark for target-paper-level re-discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ResearchClawBench, a collection of 40 tasks drawn from actual papers across ten scientific domains, each supplied with related literature and raw data but not the target paper. Expert-designed rubrics break down the scientific artifacts into weighted criteria so that agent outputs can be scored for how well they recover the core findings. When seven autonomous research agents and seventeen LLMs are tested under a single protocol, the best agent reaches 21.5, the best LLM reaches 20.7, and the frontier average sits at 26.5. These low scores indicate that current systems cannot yet perform reliable end-to-end autonomous scientific research. The benchmark therefore supplies a concrete, reproducible yardstick for measuring future progress.

Core claim

ResearchClawBench evaluates autonomous research agents on 40 tasks grounded in published papers across ten domains. The strongest autonomous agent, Claude Code, averages 21.5 while the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, and the overall LLM frontier mean is 26.5. Failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. The benchmark supplies expert-curated multimodal rubrics that allow scoring of target-paper-level re-discovery while still permitting new discovery.
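
The summary does not say how the "LLM frontier mean" is computed, so the following is only a sketch under a stated assumption: that the frontier takes, for each task, the best score achieved by any model and then averages over tasks. Under that reading a frontier mean (26.5) can exceed the best single-model mean (20.7), because different models win on different tasks. The function names and the toy scores are hypothetical and not taken from the paper.

```python
from statistics import mean

def best_single_mean(scores):
    """Highest per-model average over all tasks."""
    return max(mean(task_scores) for task_scores in scores.values())

def frontier_mean(scores):
    """ASSUMED definition: mean over tasks of the best score any model got."""
    per_task = zip(*scores.values())  # regroup scores task by task
    return mean(max(task) for task in per_task)

# Hypothetical scores for three models on four tasks (illustrative only).
toy = {
    "model_a": [30.0, 10.0, 20.0, 20.0],
    "model_b": [10.0, 35.0, 15.0, 20.0],
    "model_c": [5.0, 5.0, 40.0, 10.0],
}

print(best_single_mean(toy), frontier_mean(toy))
```

In the toy data the best single model averages 20.0 while the assumed frontier mean is 31.25, which illustrates how the two headline numbers can differ without contradiction.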

What carries the argument

ResearchClawBench, a benchmark of 40 tasks each tied to a hidden published paper, equipped with expert-curated multimodal rubrics that decompose scientific artifacts into weighted scoring criteria.
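
As a rough illustration of what "weighted scoring criteria" can mean in practice, the sketch below aggregates per-criterion scores into one task score with a weighted average. This is an assumption for illustration: the summary does not give the paper's actual aggregation rule or score scale, and the `Criterion` class, the criterion names, and the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # judged score for this criterion (scale assumed 0 to 100)

def task_score(criteria):
    """Weighted average of criterion scores (ASSUMED aggregation rule)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight * c.score for c in criteria) / total_weight

# Hypothetical rubric for one task (names and values are illustrative).
example_rubric = [
    Criterion("core finding recovered", weight=5.0, score=50.0),
    Criterion("experimental protocol", weight=3.0, score=20.0),
    Criterion("key figure reproduced", weight=2.0, score=0.0),
]

print(task_score(example_rubric))
```

Normalizing by the total weight keeps task scores comparable even when rubrics for different papers use different numbers of criteria.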

If this is right

  • Development of scientific AI agents can now be tracked against a fixed set of re-discovery tasks with public rubrics.
  • Error patterns concentrated in protocol and evidence matching identify specific capabilities that must improve.
  • The rubric design leaves explicit room for agents to produce discoveries beyond the original papers.
  • A unified evaluation protocol enables direct comparison across different agent architectures and LLM backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be expanded with tasks that require generating and testing new hypotheses rather than recovering existing ones.
  • Persistent low scores point to a need for better long-horizon coordination between literature search, experiment design, and result interpretation.
  • Adding simulation or robotic execution layers to the tasks would expose whether current failures are mainly reasoning or execution bottlenecks.
  • Regular addition of newer papers would keep the benchmark from becoming a static target that agents overfit to.

Load-bearing premise

The expert-curated rubrics correctly identify the scientific core of each paper and fairly score agent outputs that may differ in form from the original publication.

What would settle it

An agent that scores above 70 on the majority of the 40 tasks under the same evaluation protocol would demonstrate reliable re-discovery of published scientific results.
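
The criterion above can be written as a small check. This is a sketch only: the function name, the reading of "majority" as strictly more than half of the tasks, and the toy score lists are assumptions made for illustration.

```python
def settles_it(task_scores, threshold=70.0):
    """True if strictly more than half of the tasks score above threshold."""
    above = sum(1 for s in task_scores if s > threshold)
    return above > len(task_scores) / 2

# Hypothetical per-task scores for a 40-task run (illustrative only).
passing_run = [75.0] * 21 + [20.0] * 19
borderline_run = [75.0] * 20 + [20.0] * 20

print(settles_it(passing_run), settles_it(borderline_run))
```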

Figures

Figures reproduced from arXiv: 2606.07591 by Bin Wang, Bo Zhang, Chaofan Hu, Chunfeng Song, Dongzhan Zhou, Fangchen Yu, Fenghua Ling, Guangtao Zhai, Haoxiang Yin, Haoxuan Li, Haoyu Zhou, Hengjian Gao, Jiamin Wu, Koutian Wu, Kun Li, Lei Bai, Lixue Cheng, Lu Mi, Mao Su, Mianxin Liu, Peng Ye, Qi Li, Qinglong Cao, Ruizhe Chen, Shengdu Chai, Sheng Xu, Shengyuan Xu, Shixiang Tang, Shuo Li, Siqi Sun, Tianfan Fu, Tianlin Ye, Wanghan Xu, Wanli Ouyang, Weijie Ma, Wenlong Zhang, Xiangyu Zhao, Xingjian Guo, Xinyu Gu, Xue Yang, Xuming He, Xuxuan Xie, Yifan Zhou, Yiheng Wang, Yixin Chen, Yuhao Zhou, Yuqiang Li, Zhangrui Zhao, Zhenfei Yin, Zhiwang Zhou, Zijie Guo.

Figure 1: Overview of ResearchClawBench. (a) ResearchClawBench spans 10 domains and 40 end-to-end tasks, covering diverse scientific questions and data modalities. (b) Overall scores of agents and LLMs; the 50-point line marks target-paper-level re-discovery, and scores above it indicate the discovery regime. (Source: arXiv:2606.07591v3 [cs.LG], 17 Jun 2026.)
Figure 1: Pump-off, pump-on at thetap = 0 deg, and pump-induced difference maps. The cyan marker denotes the processed replica target region used for raw-window validation.
3. Methods. 3.1 Replica-band energy test. For each processed replica entry with order $n = \pm 1$, I computed an inferred parent energy $E_{\text{parent}} = E_{\text{replica}} - n\hbar\omega$, using the pump energy stored in the processed feature file, pump_energy = 0.248 eV. A Floq…
Figure 2: Left: extracted Dirac-cone dispersion and identified replica features. Right: order-averaged replica-parent separations compared with the 0.248 eV pump photon energy.
4.2 Raw pump-induced signal near the replica region. The raw HDF5 maps support the presence of a pump-induced feature near the processed target region. Averaging pump-on minus pump-off intensity in the target window gives positive values for …
Figure 3: Pump-induced difference maps for thetap = 0 deg and 90 deg, an energy distribution curve through the target momentum, and comparison of raw-window signal with mean-subtracted processed polarization intensity.
4.3 Polarization dependence and Volkov final-state interpretation. The polarization CSV shows a weak but structured intensity variation. The fitted pi-periodic model gives: • model: I(theta)=c+a cos(2t…
Figure 4: Replica intensity versus pump polarization angle with a pi-periodic fit, shown both on linear and polar axes.
5. Validation and traceability. 5.1 Directly verified from workspace data. • The raw HDF5 axes, spectra shapes, and intensity ranges are summarized in outputs/data_overview.json. • The processed replicas are photon-spaced from their inferred parent energy by 0.248 eV for both first-order sidebands; …
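
The replica-band energy test quoted in the figure excerpts above uses the relation E_parent = E_replica − nħω with a pump photon energy of 0.248 eV. A minimal sketch of that arithmetic follows. Only the formula and the 0.248 eV value come from the excerpt; the function name and the toy replica energies are hypothetical.

```python
PUMP_ENERGY_EV = 0.248  # pump photon energy quoted in the excerpt

def inferred_parent_energy(e_replica_ev, order, pump_energy_ev=PUMP_ENERGY_EV):
    """E_parent = E_replica - n * (photon energy), for sideband order n."""
    return e_replica_ev - order * pump_energy_ev

# Hypothetical replica energies (eV) placed one photon above and below a
# parent band at -0.100 eV; these values are illustrative, not measured data.
replicas = [(0.148, +1), (-0.348, -1)]

for energy, order in replicas:
    print(order, round(inferred_parent_energy(energy, order), 3))
```

If both first-order sidebands map back to the same parent energy, as in this toy case, the replicas are spaced by one pump photon from the parent band, which is the consistency the excerpt describes.
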
Original abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ResearchClawBench, a benchmark with 40 tasks drawn from real published papers across 10 scientific domains. Each task supplies related literature and raw data while hiding the target paper; expert-curated multimodal rubrics decompose the target artifacts into weighted criteria for scoring autonomous re-discovery. Evaluations of seven auto-research agents and seventeen LLMs via ResearchHarness yield low averages (Claude Code at 21.5, Claude-Opus-4.7 at 20.7, frontier LLM mean 26.5), with errors concentrated in protocol mismatch, evidence mismatch, and missing scientific core. The work positions the benchmark as a reproducible frontier for measuring progress toward autonomous scientific research.

Significance. If the rubrics prove reliable, the benchmark supplies a concrete, reproducible yardstick that quantifies the distance between current AI systems and reliable end-to-end scientific re-discovery. The grounding in actual published papers, provision of raw data and literature, and explicit allowance for new discovery within the rubrics are constructive design choices that distinguish it from purely synthetic or narrow coding benchmarks.

major comments (3)
  1. [Abstract / Rubric Construction] Abstract and rubric description: The headline claim that systems remain 'far from reliable re-discovery' rests entirely on the reported scores (21.5–26.5). No details are supplied on how the expert-curated multimodal rubrics were constructed, what inter-rater agreement was achieved, or whether they were validated against human re-implementations that use different but scientifically equivalent protocols. Without this, it is impossible to determine whether the low scores reflect agent limitations or rubric choices that over-penalize format or protocol differences.
  2. [Error Analysis] Error analysis: The manuscript states that failures concentrate in 'experimental protocol mismatch' and 'evidence mismatch.' Because the rubrics are described only at a high level and no sensitivity analysis is reported, it remains unclear whether these categories would still dominate if the rubrics explicitly credited methodologically distinct but scientifically equivalent outputs, as the abstract claims they 'leave room for new discovery.'
  3. [Benchmark Construction] Task selection and generalizability: The 40 tasks are said to be 'grounded in a real published paper' across 10 domains, yet no explicit criteria for task selection, difficulty calibration, or domain representativeness are provided. This omission weakens the inference that the observed performance gap is representative of autonomous research capability in general rather than an artifact of the chosen papers.
minor comments (1)
  1. [Evaluation Protocol] The abstract refers to 'multimodal rubrics' but the manuscript should clarify whether scoring criteria include visual or data-visualization elements and how these are evaluated when agent outputs differ in presentation format.
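
The first major comment asks what inter-rater agreement was achieved. One common way to report it is Cohen's kappa for two raters judging the same criteria. The sketch below is illustrative and is not the paper's procedure: the binary met/not-met labels and the two rating lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on ten rubric criteria (1 = met, 0 = not met).
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa corrects raw agreement for agreement expected by chance, so it is more informative than a simple percentage when most criteria are scored as not met.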

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the manuscript's transparency and rigor without altering its core claims or results.

Point-by-point responses
  1. Referee: [Abstract / Rubric Construction] Abstract and rubric description: The headline claim that systems remain 'far from reliable re-discovery' rests entirely on the reported scores (21.5–26.5). No details are supplied on how the expert-curated multimodal rubrics were constructed, what inter-rater agreement was achieved, or whether they were validated against human re-implementations that use different but scientifically equivalent protocols. Without this, it is impossible to determine whether the low scores reflect agent limitations or rubric choices that over-penalize format or protocol differences.

    Authors: We agree that the manuscript would benefit from greater transparency on rubric construction. In the revised version we will add a dedicated subsection describing the process: domain experts decomposed each target paper into weighted criteria based on scientific importance, with explicit allowance for equivalent protocols and new discoveries. We will report any inter-rater agreement metrics that were collected during rubric finalization. A formal validation study against independent human re-implementations was not conducted, as the benchmark's primary purpose is to measure AI performance relative to the published artifacts; however, we will clarify this scope limitation and discuss its implications for interpreting the low scores. revision: yes

  2. Referee: [Error Analysis] Error analysis: The manuscript states that failures concentrate in 'experimental protocol mismatch' and 'evidence mismatch.' Because the rubrics are described only at a high level and no sensitivity analysis is reported, it remains unclear whether these categories would still dominate if the rubrics explicitly credited methodologically distinct but scientifically equivalent outputs, as the abstract claims they 'leave room for new discovery.'

    Authors: The error categories were derived from qualitative review of agent outputs against the rubrics. To directly address the concern, the revision will include a sensitivity analysis on a representative subset of tasks: we will re-score outputs while explicitly crediting methodologically distinct but scientifically equivalent approaches and report the resulting changes in error distributions and aggregate scores. This will test whether the dominant failure modes persist under more flexible interpretations consistent with the benchmark's design intent. revision: yes

  3. Referee: [Benchmark Construction] Task selection and generalizability: The 40 tasks are said to be 'grounded in a real published paper' across 10 domains, yet no explicit criteria for task selection, difficulty calibration, or domain representativeness are provided. This omission weakens the inference that the observed performance gap is representative of autonomous research capability in general rather than an artifact of the chosen papers.

    Authors: We will expand the Benchmark Construction section to list the explicit selection criteria (availability of open raw data and code, presence of multimodal artifacts, coverage across 10 domains, and feasibility of expert rubric creation). Difficulty was calibrated via internal pilot runs with frontier models; we will report these steps and any observed variance. While the 40 tasks cannot claim exhaustive representativeness of all scientific research, the selection aimed for diversity; the revision will include a limitations paragraph on generalizability. revision: yes
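
The sensitivity analysis promised in the second response can be pictured as scoring the same judgments twice, once strictly and once with credit for scientifically equivalent approaches. The sketch below is one possible shape for that comparison, not the authors' method: the three labels, the 0 to 100 scale, and the toy judgments are assumptions.

```python
def rubric_score(judgments, credit_equivalent):
    """Score (weight, label) pairs; label is 'match', 'equivalent' or 'miss'."""
    credited = {"match", "equivalent"} if credit_equivalent else {"match"}
    total = sum(weight for weight, _ in judgments)
    earned = sum(weight for weight, label in judgments if label in credited)
    return 100.0 * earned / total  # ASSUMED 0 to 100 scale

def sensitivity(judgments):
    """Return (strict score, lenient score, gain from crediting equivalents)."""
    strict = rubric_score(judgments, credit_equivalent=False)
    lenient = rubric_score(judgments, credit_equivalent=True)
    return strict, lenient, lenient - strict

# Hypothetical judgments for one task (weights and labels are illustrative).
toy_judgments = [(4, "match"), (3, "equivalent"), (3, "miss")]

print(sensitivity(toy_judgments))
```

A large gap between the strict and lenient scores would suggest that protocol mismatch penalties are driven by rubric wording, while a small gap would support the reading that the failures are substantive.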

Circularity Check

0 steps flagged

No significant circularity; benchmark uses external ground truth


The paper defines ResearchClawBench using 40 tasks drawn from independently published external papers as ground truth, with expert-curated rubrics applied to score agent and LLM outputs. Reported averages (e.g., 21.5 for Claude Code) are computed directly from these fixed external references and rubrics under a unified protocol. No equations, fitted parameters, self-referential predictions, or derivations appear in the abstract or described structure; the central claim of low re-discovery performance rests on empirical evaluation against outside artifacts rather than reducing to quantities defined within the benchmark itself. This satisfies the condition of being self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's central claim rests on the assumption that the chosen tasks and rubrics are representative and that hiding the target paper prevents trivial leakage; no free parameters or invented entities are introduced beyond the benchmark construction itself.

axioms (2)
  • domain assumption: The 40 tasks drawn from published papers are representative of end-to-end scientific research across the 10 domains.
    Invoked when generalizing the low scores to the broader claim that current systems are far from reliable autonomous research.
  • domain assumption: Expert-curated rubrics provide a valid decomposition of scientific artifacts into weighted, scorable criteria.
    This premise is required for the reported average scores to be interpreted as measures of research capability.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

    cs.AI 2026-06 unverdicted novelty 6.0

    Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.

  2. EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

    cs.AI 2026-06 unverdicted novelty 5.0

    EurekAgent achieves new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks by engineering agent environments for autonomous scientific discovery, including a 26-circle packing resu...
