Repro-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
2 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: background (1)
citing papers explorer
- ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
  ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.
- ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
  ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming prior approaches on ReproBench and GoldStandardDB.