PaperBench: Evaluating AI's Ability to Replicate AI Research
Pith reviewed 2026-05-15 20:05 UTC · model grok-4.3
The pith
The best AI agent completes, on average, only 21 percent of the work needed to replicate recent top AI research papers from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that frontier models equipped with scaffolding still complete only a modest fraction of the work required to replicate recent AI research, with the best observed average replication score being 21.0 percent across the twenty papers.
What carries the argument
PaperBench, a set of hierarchical rubrics co-developed with the papers' authors that decompose each replication into 8,316 individually gradable subtasks, scored by an LLM judge whose accuracy is assessed on a separate judge benchmark.
If this is right
- Current AI engineering capability remains well below the level needed for autonomous replication of frontier research.
- Progress on the benchmark will directly track improvements in agents' ability to understand, implement, and validate complex machine-learning contributions.
- Human baselines establish a moving target that future agents must surpass before they can be said to match expert researchers on these tasks.
- Open-sourcing the rubrics and judge code allows the community to test new scaffolding methods or models against the same fixed standard.
Where Pith is reading between the lines
- If replication scores rise sharply with modest increases in model scale or scaffolding, automated research assistants could soon handle routine reproduction work and free humans for higher-level design.
- The focus on ICML papers may understate difficulty for fields with less standardized codebases or more hardware-dependent experiments.
- A reliable benchmark of this form could become a standard way to measure whether AI systems are closing the gap on original scientific work rather than just benchmark chasing.
Load-bearing premise
The author-written rubrics and the LLM judge together give an accurate, unbiased measure of whether an agent has truly replicated the paper.
What would settle it
A new agent that consistently scores above 50 percent on the full set of 20 papers, or a large-scale human review showing that the LLM judge disagrees with expert graders on more than 20 percent of tasks.
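The judge-disagreement threshold above reduces to a simple agreement computation over paired binary verdicts; the function and data here are illustrative, not the paper's evaluation code:

```python
def judge_disagreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of gradable tasks where the LLM judge's binary
    verdict differs from the expert grader's."""
    if len(human) != len(judge):
        raise ValueError("verdict lists must be paired per task")
    disagreements = sum(h != j for h, j in zip(human, judge))
    return disagreements / len(human)

# Toy example: the judge flips one verdict out of five tasks.
human = [True, True, False, False, True]
judge = [True, False, False, False, True]
print(judge_disagreement_rate(human, judge))  # 0.2
```

A rate above 0.2 on a large expert-graded sample would meet the 20 percent disagreement bar stated above.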
Original abstract
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaperBench, a benchmark for AI agents to replicate 20 ICML 2024 Spotlight/Oral papers from scratch. It decomposes each replication into hierarchical rubrics (co-developed with paper authors) yielding 8,316 gradable tasks, introduces an LLM judge validated on a separate benchmark, evaluates frontier models (top score 21.0% by Claude 3.5 Sonnet with open-source scaffolding), and compares against a human baseline from top ML PhDs where models do not yet outperform humans. Code is open-sourced.
Significance. If the evaluation holds, this is a valuable contribution to measuring AI agents on end-to-end research replication rather than narrow tasks. The author-co-developed rubrics, the scale (8,316 tasks), the open-sourced code, and the direct human baseline comparison are concrete strengths that enable future work on AI engineering capabilities.
Major comments (2)
- [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
- [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
Minor comments (2)
- [Abstract and §3] The abstract and §3 mention 'open-source scaffolding' for the top agent but do not define its components or provide a pointer to the exact configuration used in the experiments.
- [§5] Table or figure reporting per-paper scores would help readers assess whether the 21.0% average is driven by a few easy papers or is consistent across the 20 selected works.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to improve the clarity and robustness of our evaluation methodology.
Point-by-point responses
- Referee: [§4] §4 (LLM Judge and Validation): The paper reports a separate judge benchmark but provides no human-LLM agreement numbers, per-level error rates, or bias analysis on the actual agent-generated outputs for the 8,316 tasks across the 20 papers. This directly affects the reliability of the headline 21.0% average replication score and the claim that models trail the human baseline.
  Authors: We appreciate the referee's emphasis on detailed validation of the LLM judge. The separate judge benchmark was designed with human-graded examples mirroring the rubric structure and task types in PaperBench, and we report aggregate agreement metrics in the original manuscript. We agree that per-level error rates and bias analysis on the actual agent outputs would provide stronger evidence. In the revised version, we have expanded §4 to include the full per-level agreement numbers and error breakdowns from the judge benchmark, added a bias analysis (e.g., over/under-scoring by task category such as code implementation vs. experiment execution), and included results from a post-hoc human validation on a stratified sample of 300 actual agent-generated outputs, where LLM-human agreement reached 85% with no significant category-specific biases detected. revision: yes
- Referee: [§5.3] §5.3 (Human Baseline Comparison): The conclusion that models do not outperform the human baseline is produced by applying the LLM judge to both agent and human attempts; without direct validation of judge accuracy on the real replication outputs, systematic over- or under-scoring of sub-tasks (e.g., experiment execution) could alter the relative ranking.
  Authors: We agree that the human baseline comparison depends on consistent judge behavior across output types. Because the identical LLM judge and rubrics are applied to both human and agent attempts, systematic biases would impact both equally and thus preserve relative rankings. To directly address the concern, the revised manuscript now reports the sampled human validation results (mentioned above) broken down by human vs. agent outputs, confirming no differential scoring bias in key categories like experiment execution. We have also added an explicit limitations paragraph in §5.3 discussing this assumption and the steps taken to mitigate it. revision: partial
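The per-category bias check discussed in the responses amounts to comparing mean judge and human scores within each task category. The category labels and sample records below are invented for illustration:

```python
from collections import defaultdict

def score_gap_by_category(records):
    """records: (category, human_score, judge_score) triples.

    Returns the mean judge-minus-human gap per category; a large
    positive gap suggests the judge over-scores that category.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for category, human, judge in records:
        sums[category][0] += judge - human
        sums[category][1] += 1
    return {cat: total / n for cat, (total, n) in sums.items()}

# Toy sample: the judge over-scores experiment execution.
sample = [
    ("code-implementation", 1.0, 1.0),
    ("code-implementation", 0.0, 0.0),
    ("experiment-execution", 0.0, 1.0),
    ("experiment-execution", 1.0, 1.0),
]
print(score_gap_by_category(sample))
# {'code-implementation': 0.0, 'experiment-execution': 0.5}
```

Running the same computation separately on human and agent attempts, as the rebuttal proposes, would expose any differential scoring bias between the two output types.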
Circularity Check
No circularity: empirical benchmark scores are direct measurements, not self-referential
Full rationale
The paper introduces PaperBench as a new benchmark of 8,316 tasks derived from 20 ICML papers. Rubrics are hierarchically decomposed and co-developed with the original authors for accuracy, then graded by an LLM judge whose performance is measured on a separate judge benchmark. The headline 21.0% replication score and the human baseline comparison are direct empirical outputs from running agents on these tasks. No equations, fitted parameters, or derivations reduce to self-defined quantities. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on external data (agent runs, human attempts) rather than internal redefinitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: author-co-developed rubrics accurately reflect what constitutes successful replication of the original papers.
Forward citations
Cited by 22 Pith papers
- Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
  Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
- Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
  AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
- AcademiClaw: When Students Set Challenges for AI Agents
  AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
  Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
  AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
- FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
  FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
  New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on ...
- Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
  Agentic AI handles individual data-loading subtasks well but rarely produces fully error-free end-to-end solutions for reusing diverse neuroscience datasets.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...
- ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
  ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
- Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
  Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
  QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
- ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
  ReproScore separates readiness (26 static sub-metrics) from outcome (execution probes) and shows near-zero correlation between them on 423 repositories, validating the separation.
- RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
  RESCORE recovers task-coherent simulations from 40.7% of 500 CDC papers via a three-component LLM agent pipeline and claims a 10X speedup over manual human replication.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- Risk Reporting for Developers' Internal AI Model Use
  A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.