ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Ameet Talwalkar; Jingwu Tang; Nihar B Shah; Qiuhong Anna Wei; Shanda Li; Tim Dettmers; Valerie Chen; Yiming Yang

arxiv: 2606.18237 · v1 · pith:UPRPS3TTnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI · cs.LG

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

Pith reviewed 2026-06-27 00:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: reproducibility · LLM agents · GitHub issues · machine learning papers · reproduction blockers · scalable evaluation · auditing

The pith

The best LLM agent tested surfaces at least one semantically related, human-reported reproducibility blocker for roughly 90 percent of the 1,149 machine learning papers studied, working from paper and repository text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReproRepo as a framework that treats existing GitHub issues on paper repositories as ready-made labels for real reproduction problems. This replaces the manual curation required by prior benchmarks and lets evaluation reach 1,149 recent machine learning papers. Four frontier agent setups are tested; the strongest (Codex with GPT-5.5) matches at least one semantically related human issue in about 90 percent of cases even though the agents never execute code. Agents prove good at locating visible failures and the right semantic area but weaker at pinning down exact details. The same framework can be reused to keep testing new agents on fresh papers.

Core claim

ReproRepo treats human-raised GitHub issues on paper repositories as naturally occurring supervision that marks genuine reproduction blockers. On a corpus of 1,149 recent machine learning papers, LLM agents that receive only the paper text and repository contents (no code execution) identify at least one semantically related blocker for approximately 90 percent of the papers, with the Codex-plus-GPT-5.5 configuration performing best. The agents are especially reliable at surfacing visible failures and identifying the correct semantic region yet remain limited in exact localization.
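
The "at least one related blocker" statistic is a paper-level hit rate, which is more lenient than recall over individual issues. A minimal sketch of the difference, assuming per-paper match flags are already available; the function names, paper ids, and toy flags are hypothetical and are not taken from the paper or its code release:

```python
from typing import Dict, List


def paper_level_hit_rate(matches: Dict[str, List[bool]]) -> float:
    """Fraction of papers with at least one matched human issue.

    `matches` maps a (hypothetical) paper id to one boolean per human-raised
    issue: True if some agent finding was judged semantically related to it.
    """
    if not matches:
        raise ValueError("no papers to score")
    hits = sum(1 for flags in matches.values() if any(flags))
    return hits / len(matches)


def issue_level_recall(matches: Dict[str, List[bool]]) -> float:
    """Fraction of all human issues that were matched, pooled across papers."""
    flags = [flag for per_paper in matches.values() for flag in per_paper]
    if not flags:
        raise ValueError("no issues to score")
    return sum(flags) / len(flags)


# Toy data, invented for illustration only.
toy = {
    "paper_a": [True, False, False],
    "paper_b": [False, False],
    "paper_c": [False, True],
    "paper_d": [True],
}
print(paper_level_hit_rate(toy))  # 0.75
print(issue_level_recall(toy))    # 0.375
```

On this toy input three of four papers count as hits while only three of eight issues are matched, so a high paper-level rate can coexist with much lower per-issue coverage.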

What carries the argument

ReproRepo, the framework that converts human-raised GitHub issues into scalable, naturally occurring labels for evaluating LLM agents on paper-repository pairs.

If this is right

Reproducibility checks can be run at the scale of thousands of papers using only existing issue data.
LLM agents supply a practical first filter that catches most visible blockers before any code is run.
Evaluation effort can shift from labeling new examples to refining how agents localize issues more precisely.
ReproRepo itself becomes a reusable testbed for comparing future agent versions on the same real-world task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Combining the current text-only agents with lightweight code-execution steps could close the remaining gap in exact localization.
The same GitHub-issue approach might transfer to other fields that maintain public code repositories with issue trackers.
Patterns in the issues that agents consistently miss could guide targeted improvements in agent prompting or retrieval.
Over time the growing set of agent outputs could itself become a dataset for training more specialized reproduction checkers.

Load-bearing premise

Human-raised GitHub issues accurately represent the true reproducibility blockers, and semantic relatedness between an agent's output and those issues is a sufficient signal that the agent has found the problem.

What would settle it

An independent expert review on a fresh sample of papers showing that the issues agents flag are not the actual blockers that prevent reproduction, or a new run on held-out papers where the semantic-match rate falls well below 90 percent.
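
One way to make "well below 90 percent" concrete is to put a confidence interval around the held-out match rate. The sketch below uses the standard Wilson score interval for a proportion; the sample size and match count are invented for illustration and are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical held-out run: 160 of 200 papers with a semantic match.
low, high = wilson_interval(successes=160, n=200)
print(f"held-out match rate 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

If the upper end of the interval sits below 0.90, as it does for this invented 160-of-200 example, the held-out rate would be hard to reconcile with the reported figure.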

Original abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
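
For readers who want to see what the raw supervision looks like, the sketch below lists issues for a repository through GitHub's public REST API. It is an illustration under stated assumptions, not the authors' collection pipeline: the helper names are hypothetical, no filtering for reproducibility relevance is attempted, and unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_ROOT = "https://api.github.com"


def issues_url(owner: str, repo: str, page: int = 1, per_page: int = 100) -> str:
    """Build the list-issues URL; state=all includes closed issues as well."""
    return (
        f"{API_ROOT}/repos/{owner}/{repo}/issues"
        f"?state=all&per_page={per_page}&page={page}"
    )


def drop_pull_requests(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """The issues endpoint also returns pull requests, which carry a
    "pull_request" key; keep only genuine issues."""
    return [item for item in items if "pull_request" not in item]


def fetch_issues(owner: str, repo: str, page: int = 1) -> List[Dict[str, Any]]:
    """Fetch one page of issues (performs a network request)."""
    request = urllib.request.Request(
        issues_url(owner, repo, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return drop_pull_requests(json.load(response))
```

A call such as `fetch_issues("LithiumDA", "ReproRepo")` would return one page of issue records, each with fields like `title` and `body` that a matching step could compare against agent findings.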

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReproRepo's scalable use of GitHub issues as supervision is a practical framing, but the 90% claim rests on unvalidated assumptions about what those issues actually measure.

The paper's main contribution is a framework that pulls human-raised GitHub issues from paper repositories to serve as supervision for testing LLM agents on reproducibility. The authors assemble 1,149 recent ML papers, run four agent setups without code execution, and report that the strongest one (Codex with GPT-5.5) produces output semantically related to at least one human issue for roughly 90% of the papers.

This approach has a clear practical upside. It avoids the heavy manual curation that limits earlier benchmarks, and the released code makes the setup reusable. Treating naturally occurring issues as labels is a distinct move that could let others run larger audits without starting from scratch.

The soft spot is the leap from the reported numbers to the claim that agents identify real reproducibility problems. The abstract gives no details on filtering issues for actual reproduction failures versus installation questions or feature requests, nor any manual validation or inter-annotator checks on the semantic relatedness judgments. Only repositories that already have issues are included, which narrows the population being tested. Without those steps, the 90% figure is difficult to read as strong evidence of diagnostic ability.
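
The inter-annotator check asked for here is usually reported as a chance-corrected agreement score. A minimal sketch using Cohen's kappa for two annotators follows; the label vectors are invented (1 = "related", 0 = "unrelated") and say nothing about what a real validation study on this corpus would find.

```python
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotators must label the same non-empty item set")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: share of items given the same label by both.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both labelled independently at their own base rates.
    p_expected = sum(
        (list(a).count(label) / n) * (list(b).count(label) / n) for label in labels
    )
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Invented judgments on ten (agent finding, human issue) pairs.
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.583
```

Here the annotators agree on 8 of 10 pairs, but kappa is only about 0.58 once chance agreement is removed, which is why raw agreement alone would not settle the validity question.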

This is for researchers building or evaluating LLM agents for code-related scientific tasks. A reader looking for new benchmark ideas will find something usable here, but anyone wanting solid evidence on agent performance will need the full methods and validation details.

I would send it to peer review with a request for explicit checks on issue content and relatedness quality.

Referee Report

3 major / 1 minor

Summary. The paper introduces ReproRepo, a scalable framework that treats human-raised GitHub issues on paper repositories as naturally occurring labels for reproducibility blockers. It evaluates four LLM agent configurations on 1,149 recent ML papers and reports that the strongest configuration (Codex with GPT-5.5) surfaces at least one semantically related human-reported blocker for ~90% of papers, even without code execution. The work positions this approach as a reusable alternative to manually curated reproducibility benchmarks and releases the associated code.

Significance. If the assumptions about issue validity and semantic relatedness hold, the framework provides a low-cost, scalable method for auditing LLM agents on realistic reproducibility tasks, which could accelerate evaluation beyond the small-scale manual benchmarks common in the field. The public code release is a concrete strength that supports future reuse and extension.

major comments (3)

[Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.
[Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.
[Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.
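
To illustrate why the first comment matters: any relatedness judgment needs a scoring rule and a cutoff, and the match decision moves with both. The sketch below uses bag-of-words cosine similarity as a deliberately simple stand-in. It is not the paper's matching procedure, which the abstract does not specify, and the example strings and thresholds are invented.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_related(finding: str, issue: str, threshold: float) -> bool:
    """Declare a match when similarity reaches the (assumed) threshold."""
    return cosine_similarity(finding, issue) >= threshold


finding = "missing pretrained checkpoint download link"
issue = "checkpoint link is missing in README"
print(is_related(finding, issue, threshold=0.5))  # True
print(is_related(finding, issue, threshold=0.6))  # False
```

The same pair flips from match to non-match when the cutoff moves from 0.5 to 0.6, so a reported match rate is only interpretable alongside the scoring rule and threshold that produced it.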

minor comments (1)

[Abstract] The model name 'Codex with GPT-5.5' is non-standard and should be clarified with exact API identifiers or version numbers used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major point below and commit to revisions that strengthen the clarity and rigor of the manuscript.

Referee: [Abstract] Abstract and results paragraph: The central quantitative claim (~90% of papers have at least one semantically related blocker surfaced) rests on an unvalidated proxy; the manuscript provides no description of how semantic relatedness is operationalized (e.g., embedding similarity threshold, LLM judge prompt, or human annotation protocol) nor any inter-annotator agreement or manual validation that the matched issues actually describe reproducibility failures rather than installation queries or feature requests.

Authors: We agree that additional detail is required. The current manuscript describes the matching procedure at a high level but does not provide the precise operationalization or validation statistics. In the revision we will expand the Methods section with the exact procedure (embedding model and threshold or LLM judge prompt), report inter-annotator agreement from a human validation study on a sampled subset, and clarify the criteria used to confirm that matched issues describe reproducibility blockers.
Revision: yes

Referee: [Abstract] Dataset construction (implied in abstract and methods): The 1,149-paper corpus is restricted to repositories that already contain GitHub issues; no statistics or filtering criteria are reported to confirm that the retained issues predominantly concern reproducibility blockers, which directly affects whether the 90% figure can be interpreted as evidence that agents identify real-world reproducibility problems.

Authors: We will add an explicit subsection on dataset construction that reports the repository and issue selection criteria together with summary statistics (e.g., proportion of issues manually categorized as reproducibility-related versus installation queries or feature requests) on a representative sample. This will allow readers to assess the composition of the supervision signal.
Revision: yes

Referee: [Results paragraph] Evaluation design: The claim that agents are 'particularly effective for surfacing visible failures' but 'insufficient in exact localization' is presented without accompanying quantitative breakdowns, example agent outputs, or error analysis that would allow readers to assess the distinction between semantic-region identification and actionable diagnosis.

Authors: We accept that the current presentation lacks supporting detail. The revision will include (i) quantitative breakdowns of success rates stratified by failure visibility and localization granularity, (ii) representative agent output examples, and (iii) a dedicated error-analysis subsection that distinguishes semantic-region matches from precise localization failures.
Revision: yes
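
A stratified breakdown of the kind promised in this response can be as simple as a per-group match rate. The sketch below assumes each human issue has already been tagged with a stratum; the stratum names "visible" and "silent" and all records are hypothetical placeholders, not categories or data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def stratified_match_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Match rate per stratum from (stratum, matched) records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for stratum, matched in records:
        totals[stratum] += 1
        hits[stratum] += int(matched)
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


# Invented records: one entry per human issue.
records = [
    ("visible", True), ("visible", True), ("visible", False),
    ("silent", True), ("silent", False), ("silent", False), ("silent", False),
]
for stratum, rate in stratified_match_rates(records).items():
    print(f"{stratum}: {rate:.2f}")
```

Reporting the per-stratum counts alongside the rates would let readers judge how much of an aggregate figure is carried by the easiest category.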

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external human-generated labels as independent ground truth

The paper's central evaluation compares LLM agent outputs against pre-existing human-raised GitHub issues on paper repositories, treating those issues as naturally occurring external supervision. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction; the reported ~90% figure is a direct empirical match rate against an independent dataset. The evaluation is grounded in these external labels, with no load-bearing self-citations or ansatzes that collapse the claim into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical framework built on existing public data.


[1] [1]

Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program).Journal of Machine Learning Research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Flo- rence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program).Journal of Machine Learning Research, 22(164):1–20, 2021. URLhttps://www.jmlr.org/papers/v22/20...

2019

[2] [2]

Daniel Nüst and Stephen J Eglen. CODECHECK: an open science initiative for the indepen- dent execution of computations underlying research articles during peer review to improve re- producibility.F1000Research, 10:253, 2021. doi: 10.12688/f1000research.51738.2. URLhttps: //f1000research.com/articles/10-253/v2. [version 2; peer review: 2 approved]

work page doi:10.12688/f1000research.51738.2 2021

[3] [3]

PaperBench: Evaluating AI’s ability to replicate AI research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedin...

2025

[4] [4]

Paper2Code: Automating code generation from scientific papers in machine learning

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. Paper2Code: Automating code generation from scientific papers in machine learning. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=3DcaUTjdKc

2026

[5] [5]

CORE- bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.Transactions on Machine Learning Research, 2024

Zachary S Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE- bench: Fostering the credibility of published research through a computational reproducibility agent benchmark.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https: //openreview.net/forum?id=BsMMc4MEGS

2024

[6] [6]

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. REPRO- bench: Can agentic AI systems assess the reproducibility of social science research? InFindings of the Association for Computational Linguistics: ACL 2025, pages 23616–23626, Vienna, Austria,

2025

[7] [7]

doi: 10.18653/v1/2025.findings-acl.1210

Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.1210. URL https://aclanthology.org/2025.findings-acl.1210/

work page doi:10.18653/v1/2025.findings-acl.1210 2025

[8] [8]

Replicationbench: Can AI agents replicate astrophysics research papers?arXiv preprint arXiv:2510.24591, 2025

Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian LV Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, et al. Replicationbench: Can AI agents replicate astrophysics research papers?arXiv preprint arXiv:2510.24591, 2025

arXiv 2025

[9] [9]

Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches

Syed Mehtab Hussain Shah, Frank Hopfgartner, and Arnim Bleier. Automating computational reproducibility in social science: Comparing prompt-based and agent-based approaches.arXiv preprint arXiv:2602.08561, 2026. doi: 10.48550/arXiv.2602.08561. URLhttps://arxiv.org/ abs/2602.08561. 12 ReproRepo : Scaling Reproducibility Audits with GitHub Repository Issues

[10] Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, and Maosong Sun. AutoReproduce: Automatic AI experiment reproduction with paper lineage. arXiv preprint arXiv:2505.20662, 2025. doi: 10.48550/arXiv.2505.20662. URL https://arxiv.org/abs/2505.20662. Accepted by ACL 2026 Main

[11] Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, and Chenhao Tan. The story is not the science: Execution-grounded evaluation of mechanistic interpretability research. arXiv preprint arXiv:2602.18458, 2026

[12] Yiqing Xu and Leo Yang Yang. Scaling reproducibility: An AI-assisted workflow for large-scale replication and reanalysis. arXiv preprint arXiv:2602.16733, 2026. doi: 10.48550/arXiv.2602.16733. URL https://arxiv.org/abs/2602.16733

[13] Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, and Elliott Ash. Read the paper, write the code: Agentic reproduction of social-science results. arXiv preprint arXiv:2604.21965, 2026

[14] Bang Nguyen, Dominik Soós, Qian Ma, Rochana R Obadage, Zack Ranjan, Sai Koneru, Anna Szabelska, Adam Gill, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, and Meng Jiang. ReplicatorBench: Benchmarking LLM agents for replicability in social and behavioral sciences. arXiv preprint arXiv:2602.11354, 2026. doi: 10.48550/arXiv.2602.11354. URL https://a...

[15] Ian Magnusson, Noah A Smith, and Jesse Dodge. Reproducibility in NLP: What have we learned from the checklist? In Findings of the Association for Computational Linguistics: ACL 2023, pages 12789–12811, 2023. doi: 10.18653/v1/2023.findings-acl.809. URL https://aclanthology.org/2023.findings-acl.809/

[16] Robert Stojnic. ML code completeness checklist. Papers with Code Blog, 2020. URL https://medium.com/paperswithcode/ml-code-completeness-checklist-e9127b168501

[17] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, volume 2024, pages 54107–54157, 2024. doi: 10.48550/arXiv.2310.06770. URL https://arxiv.org/abs/2310.06770

[18] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6s5uXNWGIh

[19] Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huert... SciCode: A research coding benchmark curated by scientists, 2024

[20] Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, Juntong Ni, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, and Xinya Du. LMR-BENCH: Evaluating LLM agent’s ability on reproducing language modeling research. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...

[21] Alexander Goldberg, Ihsan Ullah, Thanh Gia Hieu Khuong, Benedictus Kent Rachmat, Zhen Xu, Isabelle Guyon, and Nihar B Shah. Usefulness of LLMs as an author checklist assistant for scientific papers: NeurIPS’24 experiment. arXiv preprint arXiv:2411.03417, 2024. doi: 10.48550/arXiv.2411.03417. URL https://arxiv.org/abs/2411.03417

[22] Ryan Liu and Nihar Shah. ReviewerGPT? An exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622, 2023. AAAI 2024 Workshop on Scientific Document Understanding

[23] Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, and Stella Biderman. When AI co-scientists fail: SPOT-a benchmark for automated verification of scientific research. arXiv preprint arXiv:2505.11855, 2025. doi: 10.48550/arXiv.2505.11855. URL https://arxiv.org/abs/2505.11855

[24] Sarina Xi, Vishisht Rao, Justin Payan, and Nihar B Shah. FLAWS: A benchmark for error identification and localization in scientific papers. arXiv preprint arXiv:2511.21843, 2025. doi: 10.48550/arXiv.2511.21843. URL https://arxiv.org/abs/2511.21843

[25] Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, and Furong Huang. Soundnessbench: Can your AI scientist really tell good research ideas from bad ones?, 2026. URL https://arxiv.org/abs/2605.30329

[26] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. In I... 2025

[27] Ziming Luo, Atoosa Kasirzadeh, and Nihar B Shah. The more you automate, the less you see: The hidden pitfalls of AI scientist systems. In NeurIPS 2025 AI for Science Workshop, 2025. URL https://openreview.net/forum?id=7Sndugns1l

[28] Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, and Zhiting Hu. FIRE-bench: Evaluating agents on the rediscovery of scientific insights. arXiv preprint arXiv:2602.02905, 2026. doi: 10.48550/arXiv.2602.02905. URL https://arxiv.org/abs/2602.02905

[29] Mingyang Zhou, Quanming Yao, Lun Du, Lanning Wei, and Da Zheng. Reflective paper-to-code reproduction enabled by fine-grained verification. arXiv preprint arXiv:2508.16671, 2025. doi: 10.48550/arXiv.2508.16671. URL https://arxiv.org/abs/2508.16671

[30] Hui Chen, James Xu Zhao, Dongfu Jiang, Qianyun Guo, Jiefeng Chen, Yiwei Wang, Muhao Chen, See-Kiong Ng, Pang Wei Koh, and Bryan Hooi. FabScore: Fine-grained evaluation of fabrications in automated AI research. In ICML 2026 AI for Science Workshop, 2026. URL https://openreview.net/f...

[31] Linhao Zhang, Tong Xia, Jinghua Piao, Lizhen Cui, and Yong Li. PaperRepro: Automated computational reproducibility assessment for social science papers. arXiv preprint arXiv:2603.00058, 2026. doi: 10.48550/arXiv.2603.00058. URL https://arxiv.org/abs/2603.00058

[32] Jing Yang, Qiyao Wei, and Jiaxin Pei. Paper Copilot: Tracking the evolution of peer review in AI conferences. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=CyKVrhNABo

[33] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026

[34] Anthropic. System Card: Claude Opus 4.7. https://www.anthropic.com/system-cards, April

[35] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. Accessed 2026-05-25

[36] OpenAI. GPT-5.5 System Card. https://openai.com/index/gpt-5-5-system-card/, April

[37] reproducibility_assessment

Accessed 2026-05-25.

A. Artifact Use, Licenses, & Intended Use

Our study builds on existing public artifacts, including conference paper metadata, public GitHub repositories, GitHub issue threads, and repository links discovered from Paper Copilot and conference metadata. We use...