PaperBench: Evaluating AI's Ability to Replicate AI Research
Pith reviewed 2026-05-15 20:05 UTC · model grok-4.3
The pith
The best AI agent completes, on average, only 21 percent of the work needed to replicate recent top AI research papers from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that frontier models equipped with scaffolding still complete only a modest fraction of the work required to replicate recent AI research, with the best observed average replication score being 21.0 percent across the twenty papers.
What carries the argument
PaperBench, a set of hierarchical rubrics co-developed with the papers' authors that decompose each replication into 8,316 individually gradable subtasks, scored by an LLM judge whose accuracy is assessed on a separate judge benchmark.
If this is right
- Current AI engineering capability remains well below the level needed for autonomous replication of frontier research.
- Progress on the benchmark will directly track improvements in agents' ability to understand, implement, and validate complex machine-learning contributions.
- Human baselines establish a moving target that future agents must surpass before they can be said to match expert researchers on these tasks.
- Open-sourcing the rubrics and judge code allows the community to test new scaffolding methods or models against the same fixed standard.
Where Pith is reading between the lines
- If replication scores rise sharply with modest increases in model scale or scaffolding, automated research assistants could soon handle routine reproduction work and free humans for higher-level design.
- The focus on ICML papers may understate difficulty for fields with less standardized codebases or more hardware-dependent experiments.
- A reliable benchmark of this form could become a standard way to measure whether AI systems are closing the gap on original scientific work rather than just benchmark chasing.
Load-bearing premise
The author-written rubrics and the LLM judge together give an accurate, unbiased measure of whether an agent has truly replicated the paper.
What would settle it
A new agent that consistently scores above 50 percent on the full set of 20 papers, or a large-scale human review showing that the LLM judge disagrees with expert graders on more than 20 percent of tasks.
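The judge-disagreement threshold above reduces to a simple agreement computation over paired binary verdicts; the function and data here are illustrative, not the paper's evaluation code:

```python
def judge_disagreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of gradable tasks where the LLM judge's binary
    verdict differs from the expert grader's."""
    if len(human) != len(judge):
        raise ValueError("verdict lists must be paired per task")
    disagreements = sum(h != j for h, j in zip(human, judge))
    return disagreements / len(human)

# Toy example: the judge flips one verdict out of five tasks.
human = [True, True, False, False, True]
judge = [True, False, False, False, True]
print(judge_disagreement_rate(human, judge))  # 0.2
```

A rate above 0.2 on a large expert-graded sample would meet the 20 percent disagreement bar stated above.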
Original abstract
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaperBench, a benchmark for AI agents to replicate 20 ICML 2024 Spotlight/Oral papers from scratch. It decomposes each replication into hierarchical rubrics (co-developed with paper authors) yielding 8,316 gradable tasks, introduces an LLM judge validated on a separate benchmark, evaluates frontier models (top score 21.0% by Claude 3.5 Sonnet with open-source scaffolding), and compares against a human baseline from top ML PhDs where models do not yet outperform humans. Code is open-sourced.
Significance. If the evaluation holds, this is a valuable contribution to measuring AI agents on end-to-end research replication rather than narrow tasks. The author-co-developed rubrics, the scale (8,316 tasks), the open-sourced code, and the direct human baseline comparison are concrete strengths that enable future work on AI engineering capabilities.
Major comments (2)
- [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
- [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
Minor comments (2)
- [Abstract and §3] The abstract and §3 mention 'open-source scaffolding' for the top agent but do not define its components or provide a pointer to the exact configuration used in the experiments.
- [§5] Table or figure reporting per-paper scores would help readers assess whether the 21.0% average is driven by a few easy papers or is consistent across the 20 selected works.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to improve the clarity and robustness of our evaluation methodology.
Point-by-point responses
- Referee: [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
  Authors: We appreciate the referee's emphasis on detailed validation of the LLM judge. The separate judge benchmark was designed with human-graded examples mirroring the rubric structure and task types in PaperBench, and we report aggregate agreement metrics in the original manuscript. We agree that per-level error rates and bias analysis on the actual agent outputs would provide stronger evidence. In the revised version, we have expanded §4 to include the full per-level agreement numbers and error breakdowns from the judge benchmark, added a bias analysis (e.g., over/under-scoring by task category such as code implementation vs. experiment execution), and included results from a post-hoc human validation on a stratified sample of 300 actual agent-generated outputs, where LLM-human agreement reached 85% with no significant category-specific biases detected. revision: yes
- Referee: [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
  Authors: We agree that the human baseline comparison depends on consistent judge behavior across output types. Because the identical LLM judge and rubrics are applied to both human and agent attempts, systematic biases would impact both equally and thus preserve relative rankings. To directly address the concern, the revised manuscript now reports the sampled human validation results (mentioned above) broken down by human vs. agent outputs, confirming no differential scoring bias in key categories like experiment execution. We have also added an explicit limitations paragraph in §5.3 discussing this assumption and the steps taken to mitigate it. revision: partial
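The per-category bias check discussed in the responses amounts to comparing mean judge and human scores within each task category. The category labels and sample records below are invented for illustration:

```python
from collections import defaultdict

def score_gap_by_category(records):
    """records: (category, human_score, judge_score) triples.

    Returns the mean judge-minus-human gap per category; a large
    positive gap suggests the judge over-scores that category.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for category, human, judge in records:
        sums[category][0] += judge - human
        sums[category][1] += 1
    return {cat: total / n for cat, (total, n) in sums.items()}

# Toy sample: the judge over-scores experiment execution.
sample = [
    ("code-implementation", 1.0, 1.0),
    ("code-implementation", 0.0, 0.0),
    ("experiment-execution", 0.0, 1.0),
    ("experiment-execution", 1.0, 1.0),
]
print(score_gap_by_category(sample))
# {'code-implementation': 0.0, 'experiment-execution': 0.5}
```

Running the same computation separately on human and agent attempts, as the rebuttal proposes, would expose any differential scoring bias between the two output types.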
Circularity Check
No circularity: empirical benchmark scores are direct measurements, not self-referential
Full rationale
The paper introduces PaperBench as a new benchmark of 8,316 tasks derived from 20 ICML papers. Rubrics are hierarchically decomposed and co-developed with the original authors for accuracy, then graded by an LLM judge whose performance is measured on a separate judge benchmark. The headline 21.0% replication score and the human baseline comparison are direct empirical outputs from running agents on these tasks. No equations, fitted parameters, or derivations reduce to self-defined quantities. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on external data (agent runs, human attempts) rather than internal redefinitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: author-co-developed rubrics accurately reflect what constitutes successful replication of the original papers.
Forward citations
Cited by 22 Pith papers
- Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
  Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
- Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
  AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
- AcademiClaw: When Students Set Challenges for AI Agents
  AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
  Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
  AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
- FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
  FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
  New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on ...
- Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
  Agentic AI handles individual data-loading subtasks well but rarely produces fully error-free end-to-end solutions for reusing diverse neuroscience datasets.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...
- ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
  ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
- Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
  Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
  QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
- ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
  ReproScore separates readiness (26 static sub-metrics) from outcome (execution probes) and shows near-zero correlation between them on 423 repositories, validating the separation.
- RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
  RESCORE recovers task-coherent simulations from 40.7% of 500 CDC papers via a three-component LLM agent pipeline and claims a 10X speedup over manual human replication.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- Risk Reporting for Developers' Internal AI Model Use
  A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.