AI Coding Agents Can Reproduce Social Science Findings

Atoosa Kasirzadeh; Fabrizio Gilardi; Joshua Tucker; Meysam Alizadeh; Mohsen Mosleh

arxiv: 2606.11447 · v1 · pith:Q4HD2GBXnew · submitted 2026-06-09 · 💻 cs.CL

AI Coding Agents Can Reproduce Social Science Findings

Meysam Alizadeh , Mohsen Mosleh , Fabrizio Gilardi , Atoosa Kasirzadeh , Joshua Tucker This is my paper

Pith reviewed 2026-06-27 13:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords AI coding agentssocial science reproducibilityreproduction benchmarkClaude CodeCodexLLM evaluationcomputational workflowsprompt effects

0 comments

The pith

Claude Code and Codex reproduce a large share of social science findings from original data and code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SocSci-Repro-Bench, a set of 221 tasks drawn from studies in four disciplines where results are known to be either fully reproducible or impossible due to missing materials. It tests two frontier coding agents on these tasks to measure their ability to execute the original analyses correctly. Both agents succeed on many tasks, with Claude Code outperforming Codex and exceeding reproduction rates seen in earlier LLM evaluations. The agents also identify underlying research questions in the studies at high rates, and checks suggest success is not mainly from memorizing the papers. The work shows that giving agents the original paper PDF can help but risks introducing bias on tasks that cannot be reproduced, and that prompt wording can steer agents toward certain analytical choices.

Core claim

We introduce SocSci-Repro-Bench with 221 tasks spanning four disciplines and 13 domains, built only from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data. On this benchmark, both Claude Code and Codex reproduce a large share of the findings, with Claude Code substantially outperforming Codex at rates higher than those reported for general-purpose LLM-based agents on comparable tasks. The agents also perform strongly on identifying the research questions behind the studies, additional analyses indicate results are not primarily driven by memorization, and providing the paper PDF modestly improves performance whil

What carries the argument

SocSci-Repro-Bench, a collection of 221 reproduction tasks that isolates agent performance by restricting the set to studies with clear, known reproducibility status based on provided materials.

If this is right

Frontier coding agents can serve as reliable executors of computational workflows in social science research.
Prompt design requires care because subtle framing can nudge agents toward confirmatory specification search.
Providing original papers alongside code and data can improve results but risks biasing outcomes on non-reproducible tasks.
Agents demonstrate capability on reasoning tasks such as identifying underlying research questions in addition to code execution.
Systematic benchmarking becomes necessary as AI systems take on larger roles in scientific production.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The performance gap between agents may grow or shrink with newer models, suggesting ongoing monitoring of capability changes over time.
Integration of such agents into research could accelerate verification of existing findings but would require safeguards against prompt-induced biases.
Extending the benchmark approach to other fields or to studies using newer data sources could test how broadly the reproduction capacity holds.
The results raise the practical question of how to combine agent outputs with human oversight in publication pipelines.

Load-bearing premise

The benchmark selects only studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, so that success or failure can be attributed to the agent rather than the materials themselves.

What would settle it

Running the same agents on an independent collection of social science studies whose reproducibility status has been verified separately from the benchmark construction, and finding substantially lower reproduction rates.

Figures

Figures reproduced from arXiv: 2606.11447 by Atoosa Kasirzadeh, Fabrizio Gilardi, Joshua Tucker, Meysam Alizadeh, Mohsen Mosleh.

**Figure 1.** Figure 1: Comparison of Claude Code and Codex across three accuracy metrics and failure rates. (Left) Accuracy for all tasks (N = 221), non-reproducible tasks (N = 10), and all papers (N = 54). Both models achieve perfect accuracy on non-reproducible tasks, while Claude Code substantially outperforms Codex at both task (93.4% vs. 62.1%) and paper level (78.0% vs. 35.8%), where a paper was considered fully reproduced… view at source ↗

**Figure 2.** Figure 2: Stratified performance of Claude Code and Codex across programming languages and training-data cutoffs. a, Task-level accuracy stratified by the primary programming language of each replication package (Python, n = 49 tasks; R, n = 136; Stata, n = 36). Sample sizes differ across languages and comparisons are descriptive. b, Paper-level accuracy (all tasks correct) under the same language stratification (Py… view at source ↗

**Figure 3.** Figure 3: Evidence against direct memorization in AI coding agents assessed through metadata recovery from anonymized replication materials. Stacked bar charts show the percentage of papers (n = 54) for which each AI agent correctly recovered the title, authors, journal and publication year from fully anonymized replication code and data, compared against a gold-standard reference. (a) Claude Code attempted metadata… view at source ↗

**Figure 4.** Figure 4: Research question (RQ) extraction accuracy of Claude Code and Codex compared to the Gold standard. Three similarity metrics are shown in each panel: RQ-level semantic match rate (proportion of greedy-paired RQs that are semantically equivalent), paper-level full match rate (proportion of papers where all Gold RQs have a semantic match), and paper-level ≥60% match rate (proportion of papers where at least 6… view at source ↗

**Figure 5.** Figure 5: Accuracy under confirmatory prompt nudging. Mean accuracy of Claude Code and Codex across three independent runs when presented with confirmatory prompts designed to induce result-oriented specification search. Results are faceted by evaluation granularity: task-level (left) and paper-level (right). Claude Code maintained high accuracy on all tasks (94.1%) compared with Codex (74.1%), though both models sh… view at source ↗

**Figure 6.** Figure 6: Claude Code and Codex performance on the CORE-Bench social science reproducibility benchmark. (a) Non-anonymized condition, where agents have access to paper titles and author names, and (b) anonymized condition, where this metadata is removed. Each panel reports task-level and paper-level accuracy (left) and failure rates (right), averaged across three independent runs. with 0.0% failure rates (Fig. 6a). … view at source ↗

read the original abstract

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient, either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces SocSci-Repro-Bench with 221 tasks and reports solid reproduction rates for frontier coding agents, but the benchmark's own validation step is not described.

read the letter

The main point is that the authors created SocSci-Repro-Bench, a collection of 221 social science tasks across four disciplines, and found that Claude Code and Codex can reproduce a large share of the findings, with Claude doing noticeably better and both beating earlier reported rates.

The benchmark construction is the clearest new piece. It selects studies where materials either allow full reproduction or clearly do not because of missing data. This split aims to measure agent capacity directly. The paper also includes checks on a reasoning task, tests for memorization, and shows how adding the paper PDF or changing prompt framing affects outcomes. These are practical data points for anyone tracking AI use in research workflows.

The evaluation is empirical and straightforward. The performance numbers and the nudge result on specification search give concrete evidence that current agents can execute many computational social science steps.

The soft spot is in how the benchmark tasks were verified. The abstract states that the selected studies have results that are fully reproducible with available materials, yet it does not describe the authors running those materials themselves to confirm the outputs match the published findings. Without that check, success rates could partly reflect whether the scripts run at all rather than whether they recover the exact claimed results. This weakens the isolation claim and the comparison to prior benchmarks. No error bars or statistical tests appear in the reported results either.

The work is aimed at researchers building or evaluating AI tools for science and at computational social scientists interested in reproducibility. Readers working on agent benchmarks or AI-assisted analysis will get direct value from the task set and the numbers.

It deserves peer review. The benchmark and the agent comparisons are concrete enough to warrant referee scrutiny on the construction details and verification process.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SocSci-Repro-Bench, a benchmark of 221 tasks spanning four social science disciplines constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data. It evaluates two frontier coding agents (Claude Code and Codex) on reproduction of findings, reports that both achieve large success rates (Claude substantially outperforming Codex and exceeding prior benchmarks), shows strong performance on a research-question identification task, provides evidence against memorization as the driver, and demonstrates effects of PDF provision and prompt framing on confirmatory search.

Significance. If the benchmark tasks are correctly validated, the results would indicate that frontier AI coding agents can function as reliable executors of computational social science workflows at rates higher than previously documented, with implications for scaling reproducibility efforts; the work also supplies falsifiable predictions about prompt sensitivity and includes analyses separating memorization from genuine reproduction capacity.

major comments (2)

[Abstract / benchmark construction] Abstract / benchmark construction paragraph: the claim that tasks isolate agent reproduction capacity rests on selecting studies 'whose results are either fully reproducible with available materials' but reports no verification step in which the authors executed the replication code and confirmed that outputs match the published coefficients, p-values, or tables (within tolerance). Without this, measured success rates may reflect execution of whatever the script produces rather than reproduction of the claimed findings, directly undermining the isolation from prior benchmarks and the central comparison.
[Results / evaluation sections] Results reporting (abstract and implied evaluation sections): success rates and outperformance claims are presented without error bars, confidence intervals, or statistical tests comparing to prior reproducibility benchmarks, so the assertion that rates 'considerably exceed' previous work cannot be assessed for robustness.

minor comments (2)

[Methods / evaluation setup] Clarify the precise model versions or interfaces referred to by 'Claude Code' and 'Codex' and whether they are used with default settings or custom scaffolding.
[Results on PDF provision] The modest improvement from providing the original paper PDF is described qualitatively; a table or specific delta in success rates would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Abstract / benchmark construction] Abstract / benchmark construction paragraph: the claim that tasks isolate agent reproduction capacity rests on selecting studies 'whose results are either fully reproducible with available materials' but reports no verification step in which the authors executed the replication code and confirmed that outputs match the published coefficients, p-values, or tables (within tolerance). Without this, measured success rates may reflect execution of whatever the script produces rather than reproduction of the claimed findings, directly undermining the isolation from prior benchmarks and the central comparison.

Authors: We agree that an explicit verification step strengthens the benchmark's validity. The abstract describes selection criteria based on reproducibility with available materials, but does not report the execution and confirmation process. In the revised manuscript we will add a dedicated methods subsection describing how we ran the replication code for each task and verified that outputs matched published results (within tolerance), thereby better isolating agent reproduction capacity from material issues. revision: yes
Referee: [Results / evaluation sections] Results reporting (abstract and implied evaluation sections): success rates and outperformance claims are presented without error bars, confidence intervals, or statistical tests comparing to prior reproducibility benchmarks, so the assertion that rates 'considerably exceed' previous work cannot be assessed for robustness.

Authors: We concur that statistical reporting is needed for robust interpretation. The current version presents raw success rates without uncertainty estimates or formal comparisons. In the revision we will add confidence intervals to all reported rates and include appropriate statistical tests (e.g., proportion tests or bootstrap comparisons) against prior benchmarks to allow readers to evaluate the outperformance claims. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivation

full rationale

The paper reports direct empirical measurements of agent success rates on a newly constructed set of 221 tasks drawn from published studies. Reproduction rates are obtained by running the agents on the provided materials and counting matches to published outputs; no equations, fitted parameters, or self-referential definitions reduce these counts to the benchmark inputs themselves. No load-bearing self-citations or uniqueness theorems are invoked to justify the central performance claims. The evaluation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the 221 tasks accurately separate agent capability from material defects; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The selected studies are correctly classified as fully reproducible or non-reproducible solely on the basis of available materials.
This classification is used to isolate agents' reproduction capacity (abstract description of benchmark).

pith-pipeline@v0.9.1-grok · 5791 in / 1288 out tokens · 21209 ms · 2026-06-27T13:05:16.407122+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coding-agents can replicate scientific machine learning papers
cs.AI 2026-07 unverdicted novelty 7.0

Paper-replication is a workflow that enables coding agents to replicate computational claims from scientific ML papers by recording targets, reconstructing methods, running experiments, and validating evidence against...

Reference graph

Works this paper leans on

64 extracted references · 4 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

2026
[2]

Risks of ai scientists: prioritizing safeguarding over autonomy

Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. Risks of ai scientists: prioritizing safeguarding over autonomy. Nature Communications, 16(1):8317, 2025

2025
[3]

Agent laboratory: Using llm agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

2025
[4]

Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, pages 1–15, 2025

Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang. Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, pages 1–15, 2025

2025
[5]

Can generative ai improve social science?Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

Christopher A Bail. Can generative ai improve social science?Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

2024
[6]

Synthesizing scientific literature with retrieval-augmented language models.Nature, pages 1–7, 2026

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, et al. Synthesizing scientific literature with retrieval-augmented language models.Nature, pages 1–7, 2026

2026
[7]

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr BŁaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H Torr, et al. Agentslr: Au- tomating systematic literature reviews in epidemiology with agentic ai.arXiv preprint arXiv:2603.22327, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Ai and the transformation of social science research.Science, 380(6650):1108– 1109, 2023

Igor Grossmann, Matthew Feinberg, Dawn C Parker, Nicholas A Christakis, Philip E Tetlock, and William A Cunningham. Ai and the transformation of social science research.Science, 380(6650):1108– 1109, 2023

2023
[9]

Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

2023
[10]

CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark

Zachary S Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. Transactions on Machine Learning Research, 2024

2024
[11]

Holistic agent leaderboard: The missing infrastructure for ai agent evaluation,

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation.arXiv preprint arXiv:2510.11977, 2025

work page arXiv 2025
[12]

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. Repro-bench: Can agentic ai systems assess the reproducibility of social science research? InFindings of the Association for Computational Linguistics: ACL 2025, pages 23616–23626, 2025

2025
[13]

Predicting the replicability of social and behavioural science claims in covid-19 preprints.Nature human behaviour, 9(2):287–304, 2025

Alexandru Marcoci, David P Wilkinson, Ans Vercammen, Bonnie C Wintle, Anna Lou Abatayo, Ernest Baskin, Henk Berkman, Erin M Buchanan, Sara Capitán, Tabaré Capitán, et al. Predicting the replicability of social and behavioural science claims in covid-19 preprints.Nature human behaviour, 9(2):287–304, 2025

2025
[14]

Reproducibility and replicability in science

National Academies of Sciences, Medicine, Policy, Global Affairs, Board on Research Data, Informa- tion, Division on Engineering, Physical Sciences, Committee on Applied, Theoretical Statistics, et al. Reproducibility and replicability in science. National Academies Press, 2019. 15 Alizadeh et al

2019
[15]

Self-correction in science: The diagnostic and integrative motives for replication.Social Studies of Science, 51(4):583–605, 2021

David Peterson and Aaron Panofsky. Self-correction in science: The diagnostic and integrative motives for replication.Social Studies of Science, 51(4):583–605, 2021

2021
[16]

Certify reproducibility with confidential data.Science, 365(6449):127–128, 2019

Christophe Pérignon, Kamel Gadouche, Christophe Hurlin, Roxane Silberman, and Eric Debonnel. Certify reproducibility with confidential data.Science, 365(6449):127–128, 2019

2019
[17]

Markets for replication.Proceedings of the National Academy of Sciences, 112(50):15267–15268, 2015

Alec Brandon and John A List. Markets for replication.Proceedings of the National Academy of Sciences, 112(50):15267–15268, 2015

2015
[18]

How to make replication the norm.Nature, 554(7693):417–419, 2018

Paul Gertler, Sebastian Galiani, and Mauricio Romero. How to make replication the norm.Nature, 554(7693):417–419, 2018

2018
[19]

A manifesto for reproducible science.Nature human behaviour, 1(1):0021, 2017

Marcus R Munafò, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J Ware, and John PA Ioannidis. A manifesto for reproducible science.Nature human behaviour, 1(1):0021, 2017

2017
[20]

Reproducible research in computational science.Science, 334(6060):1226–1227, 2011

Roger D Peng. Reproducible research in computational science.Science, 334(6060):1226–1227, 2011

2011
[21]

State of the art: Reproducibility in artificial intelligence

Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[22]

An empirical analysis of journal policy effectiveness for computational reproducibility.Proceedings of the National Academy of Sciences, 115(11):2584–2589, 2018

Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An empirical analysis of journal policy effectiveness for computational reproducibility.Proceedings of the National Academy of Sciences, 115(11):2584–2589, 2018

2018
[23]

Improving reproducibility in machine learning research (a report from the neurips 2019reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

2021
[24]

Reproducibility and robustness of economics and political science research.Nature, 652(8108):151–156, 2026

Abel Brodeur, Derek Mikola, Nikolai Cook, Lenka Fiala, Thomas Brailey, Ryan Briggs, Alexandra De Gendre, Yannick Dupraz, Jacopo Gabani, Romain Gauriot, et al. Reproducibility and robustness of economics and political science research.Nature, 652(8108):151–156, 2026

2026
[25]

The significance of data-sharing policy.Journal of the European Economic Association, 21(3):1191–1226, 2023

Zohid Askarov, Anthony Doucouliagos, Hristos Doucouliagos, and Tom D Stanley. The significance of data-sharing policy.Journal of the European Economic Association, 21(3):1191–1226, 2023

2023
[26]

P-hacking, data type and data-sharing policy.The Economic Journal, 134(659):985–1018, 2024

Abel Brodeur, Nikolai Cook, and Carina Neisser. P-hacking, data type and data-sharing policy.The Economic Journal, 134(659):985–1018, 2024

2024
[27]

Science deserves better: the imperative to share complete replication files.PS: Political Science & Politics, 47(1):60–66, 2014

Allan Dafoe. Science deserves better: the imperative to share complete replication files.PS: Political Science & Politics, 47(1):60–66, 2014

2014
[28]

Replicability, robustness, and reproducibility in psychological science.Annual review of psychology, 73:719–748, 2022

Brian A Nosek, Tom E Hardwicke, Hannah Moshontz, Aurélien Allard, Katherine S Corker, Anna Dreber, Fiona Fidler, Joe Hilgard, Melissa Kline Struhl, Michèle B Nuijten, et al. Replicability, robustness, and reproducibility in psychological science.Annual review of psychology, 73:719–748, 2022

2022
[29]

Reproducibility in management science.Management Science, 70(3):1343–1356, 2024

Miloš Fišar, Ben Greiner, Christoph Huber, Elena Katok, Ali I Ozkes, and Management Science Repro- ducibility Collaboration. Reproducibility in management science.Management Science, 70(3):1343–1356, 2024

2024
[30]

Codeocean-a versatile platform for practical programming excercises in online environments

Thomas Staubitz, Hauke Klement, Ralf Teusner, Jan Renz, and Christoph Meinel. Codeocean-a versatile platform for practical programming excercises in online environments. In2016 IEEE Global Engineering Education Conference (EDUCON), pages 314–323. IEEE, 2016

2016
[31]

Time travel in llms: Tracing data contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[32]

Paperbench: Evaluating ai’s ability to replicate ai research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. Paperbench: Evaluating ai’s ability to replicate ai research. InForty-second International Conference on Machine Learning, 2025

2025
[33]

Do ai models perform human-like abstract reasoning across modalities?arXiv preprint arXiv:2510.02125, 2025

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W Tsai, Sivasankaran Rajamanickam, and Melanie Mitchell. Do ai models perform human-like abstract reasoning across modalities?arXiv preprint arXiv:2510.02125, 2025. 16 Alizadeh et al

work page arXiv 2025
[34]

Do claude code and codex p-hack? sycophancy and statistical analysis in large language models, 2026

Samuel GZ Asher, Janet Malzahn, Jessica M Persano, Elliot J Paschal, Andrew CW Myers, and Andrew B Hall. Do claude code and codex p-hack? sycophancy and statistical analysis in large language models, 2026

2026
[35]

Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty.Proceedings of the National Academy of Sciences, 119(44):e2203150119, 2022

Nate Breznau, Eike Mark Rinke, Alexander Wuttke, Hung HV Nguyen, Muna Adem, Jule Adriaans, Amalia Alvarez-Benjumea, Henrik K Andersen, Daniel Auer, Flavio Azevedo, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty.Proceedings of the National Academy of Sciences, 119(44):e2203150119, 2022

2022
[36]

Many analysts, one data set: Making transparent how variations in analytic choices affect results.Advances in methods and practices in psychological science, 1(3):337–356, 2018

Raphael Silberzahn, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results.Advances in methods and practices in psychological science, 1(3):337–356, 2018

2018
[37]

1,500 scientists lift the lid on reproducibility.Nature, 533(7604):452–454, 2016

Monya Baker. 1,500 scientists lift the lid on reproducibility.Nature, 533(7604):452–454, 2016

2016
[38]

The research reproducibility crisis and economics of science, 2017

Zacharias Maniadis and Fabio Tufano. The research reproducibility crisis and economics of science, 2017

2017
[39]

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Bang Nguyen, Dominik Soós, Qian Ma, Rochana R Obadage, Zack Ranjan, Sai Koneru, Timothy M Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, et al. Replicatorbench: Benchmarking llm agents for replicability in social and behavioral sciences.arXiv preprint arXiv:2602.11354, 2026. 17 Alizadeh et al. A Task Design Examples A.1 Examples of Tasks Excluded ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Areplication-materials/directory
[42]

id" •"RQ

A JSON file namedRQ_{folder_number}.jsoncontaining: •"id" •"RQ" •"paper_title" •"paper_authors" Process allNfolders (sequentially or in parallel). Follow the steps below exactly. Step 1 — Read Instructions •Open{folder_name}.json. •Carefully read"task_prompt". •Identify and respect any explicit restrictions. Step 2 — Inspect Replication Materials •Read al...
[43]

AReplication/directory
[44]

task_prompt

A JSON file named{folder_name}.jsoncontaining: •"task_prompt" •"tasks"
[45]

id" •"RQ

A JSON file namedRQ_{folder_number}.jsoncontaining: •"id" •"RQ" •"paper_title" •"paper_authors" Process all N folders (sequentially or in parallel). STEP 1 — Read Instructions •Open{folder_name}.json. •Carefully read"task_prompt". •Identify and respect any explicit restrictions. STEP 2 — Inspect Replication Materials
[46]

Inspect theReplication/(orreplication-materials/) directory
[47]

Read all README files and setup notes
[48]

Identify: •Entry-point scripts or notebooks •Expected outputs and locations •Data files and formats •Language/tooling used (Python, R, Stata, Julia, etc.) •Hardcoded paths or external assumptions •IDE/notebook dependencies •Missing output directories or required folder structures STEP 3 — Environment Setup (Offline Sandbox)
[49]

Create or activate an environment (virtualenv/conda if available)
[50]

•R:Rscript -e ’install.packages(...)’

Install required packages: •Python:python3 -m pip install ... •R:Rscript -e ’install.packages(...)’
[51]

Resolve version incompatibilities using closest compatible versions and document choices
[52]

Use only local files

Do not download data from the internet. Use only local files. STEP 4 — Write a New Executable Replication Script Create a new script in the current folder named: replication_code.pyorreplication_code.R Choose the dominant language in the repository. If code is a.do file, convert it to an R script and run that. The script must:
[53]

Be executable end-to-end from the command line
[54]

Reproduce the main analysis pipeline using provided code and data
[55]

Resolve executability issues, including: •Missing directories (create output folders) •Hardcoded absolute paths (replace with relative paths) •Notebook-only logic (convert to scriptable workflow) •Interactive IDE assumptions •Dependency/version mismatches •File naming inconsistencies
[56]

21 Alizadeh et al

Preserve original analytical logic whenever possible. 21 Alizadeh et al
[57]

Write all outputs into a localresults/directory
[58]

Include minimal logging statements
[59]

STEP 5 — Execute and Validate

If the entry point in{folder_number}.json is incorrect, identify the correct entry point indepen- dently. STEP 5 — Execute and Validate
[60]

Run the new replication script
[61]

Verify outputs match task requirements
[62]

If execution fails, revise only the new script and environment
[63]

Iterate until best achievable reproduction is reached
[64]

Copy the original JSON structure and insert answers inline
[65]

task_prompt

Save asresults_1.jsonwith exact schema: { "task_prompt": "<copied exactly>", "tasks": [ {"Question text 1": "Answer 1"}, {"Question text 2": "Answer 2"} ] } STEP 6 — Logging •Createlog.jsoncontaining: –Commands executed –Errors encountered –Fixes applied –Replication status (success/failure) •If replication fails: –Document issue inlog.json –Continue to n...

2025

[1] [1]

Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

2026

[2] [2]

Risks of ai scientists: prioritizing safeguarding over autonomy

Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. Risks of ai scientists: prioritizing safeguarding over autonomy. Nature Communications, 16(1):8317, 2025

2025

[3] [3]

Agent laboratory: Using llm agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

2025

[4] [4]

Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, pages 1–15, 2025

Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang. Sciscigpt: advancing human–ai collaboration in the science of science.Nature Computational Science, pages 1–15, 2025

2025

[5] [5]

Can generative ai improve social science?Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

Christopher A Bail. Can generative ai improve social science?Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

2024

[6] [6]

Synthesizing scientific literature with retrieval-augmented language models.Nature, pages 1–7, 2026

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, et al. Synthesizing scientific literature with retrieval-augmented language models.Nature, pages 1–7, 2026

2026

[7] [7]

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr BŁaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H Torr, et al. Agentslr: Au- tomating systematic literature reviews in epidemiology with agentic ai.arXiv preprint arXiv:2603.22327, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

Ai and the transformation of social science research.Science, 380(6650):1108– 1109, 2023

Igor Grossmann, Matthew Feinberg, Dawn C Parker, Nicholas A Christakis, Philip E Tetlock, and William A Cunningham. Ai and the transformation of social science research.Science, 380(6650):1108– 1109, 2023

2023

[9] [9]

Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

2023

[10] [10]

CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark

Zachary S Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. Transactions on Machine Learning Research, 2024

2024

[11] [11]

Holistic agent leaderboard: The missing infrastructure for ai agent evaluation,

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation.arXiv preprint arXiv:2510.11977, 2025

work page arXiv 2025

[12] [12]

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. Repro-bench: Can agentic ai systems assess the reproducibility of social science research? InFindings of the Association for Computational Linguistics: ACL 2025, pages 23616–23626, 2025

2025

[13] [13]

Predicting the replicability of social and behavioural science claims in covid-19 preprints.Nature human behaviour, 9(2):287–304, 2025

Alexandru Marcoci, David P Wilkinson, Ans Vercammen, Bonnie C Wintle, Anna Lou Abatayo, Ernest Baskin, Henk Berkman, Erin M Buchanan, Sara Capitán, Tabaré Capitán, et al. Predicting the replicability of social and behavioural science claims in covid-19 preprints.Nature human behaviour, 9(2):287–304, 2025

2025

[14] [14]

Reproducibility and replicability in science

National Academies of Sciences, Medicine, Policy, Global Affairs, Board on Research Data, Informa- tion, Division on Engineering, Physical Sciences, Committee on Applied, Theoretical Statistics, et al. Reproducibility and replicability in science. National Academies Press, 2019. 15 Alizadeh et al

2019

[15] [15]

Self-correction in science: The diagnostic and integrative motives for replication.Social Studies of Science, 51(4):583–605, 2021

David Peterson and Aaron Panofsky. Self-correction in science: The diagnostic and integrative motives for replication.Social Studies of Science, 51(4):583–605, 2021

2021

[16] [16]

Certify reproducibility with confidential data.Science, 365(6449):127–128, 2019

Christophe Pérignon, Kamel Gadouche, Christophe Hurlin, Roxane Silberman, and Eric Debonnel. Certify reproducibility with confidential data.Science, 365(6449):127–128, 2019

2019

[17] [17]

Markets for replication.Proceedings of the National Academy of Sciences, 112(50):15267–15268, 2015

Alec Brandon and John A List. Markets for replication.Proceedings of the National Academy of Sciences, 112(50):15267–15268, 2015

2015

[18] [18]

How to make replication the norm.Nature, 554(7693):417–419, 2018

Paul Gertler, Sebastian Galiani, and Mauricio Romero. How to make replication the norm.Nature, 554(7693):417–419, 2018

2018

[19] [19]

A manifesto for reproducible science.Nature human behaviour, 1(1):0021, 2017

Marcus R Munafò, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J Ware, and John PA Ioannidis. A manifesto for reproducible science.Nature human behaviour, 1(1):0021, 2017

2017

[20] [20]

Reproducible research in computational science.Science, 334(6060):1226–1227, 2011

Roger D Peng. Reproducible research in computational science.Science, 334(6060):1226–1227, 2011

2011

[21] [21]

State of the art: Reproducibility in artificial intelligence

Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018

[22] [22]

An empirical analysis of journal policy effectiveness for computational reproducibility.Proceedings of the National Academy of Sciences, 115(11):2584–2589, 2018

Victoria Stodden, Jennifer Seiler, and Zhaokun Ma. An empirical analysis of journal policy effectiveness for computational reproducibility.Proceedings of the National Academy of Sciences, 115(11):2584–2589, 2018

2018

[23] [23]

Improving reproducibility in machine learning research (a report from the neurips 2019reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019reproducibility program).Journal of machine learning research, 22(164):1–20, 2021

2021

[24] [24]

Reproducibility and robustness of economics and political science research.Nature, 652(8108):151–156, 2026

Abel Brodeur, Derek Mikola, Nikolai Cook, Lenka Fiala, Thomas Brailey, Ryan Briggs, Alexandra De Gendre, Yannick Dupraz, Jacopo Gabani, Romain Gauriot, et al. Reproducibility and robustness of economics and political science research.Nature, 652(8108):151–156, 2026

2026

[25] [25]

The significance of data-sharing policy.Journal of the European Economic Association, 21(3):1191–1226, 2023

Zohid Askarov, Anthony Doucouliagos, Hristos Doucouliagos, and Tom D Stanley. The significance of data-sharing policy.Journal of the European Economic Association, 21(3):1191–1226, 2023

2023

[26] [26]

P-hacking, data type and data-sharing policy.The Economic Journal, 134(659):985–1018, 2024

Abel Brodeur, Nikolai Cook, and Carina Neisser. P-hacking, data type and data-sharing policy.The Economic Journal, 134(659):985–1018, 2024

2024

[27] [27]

Science deserves better: the imperative to share complete replication files.PS: Political Science & Politics, 47(1):60–66, 2014

Allan Dafoe. Science deserves better: the imperative to share complete replication files.PS: Political Science & Politics, 47(1):60–66, 2014

2014

[28] [28]

Replicability, robustness, and reproducibility in psychological science.Annual review of psychology, 73:719–748, 2022

Brian A Nosek, Tom E Hardwicke, Hannah Moshontz, Aurélien Allard, Katherine S Corker, Anna Dreber, Fiona Fidler, Joe Hilgard, Melissa Kline Struhl, Michèle B Nuijten, et al. Replicability, robustness, and reproducibility in psychological science.Annual review of psychology, 73:719–748, 2022

2022

[29] [29]

Reproducibility in management science.Management Science, 70(3):1343–1356, 2024

Miloš Fišar, Ben Greiner, Christoph Huber, Elena Katok, Ali I Ozkes, and Management Science Repro- ducibility Collaboration. Reproducibility in management science.Management Science, 70(3):1343–1356, 2024

2024

[30] [30]

Codeocean-a versatile platform for practical programming excercises in online environments

Thomas Staubitz, Hauke Klement, Ralf Teusner, Jan Renz, and Christoph Meinel. Codeocean-a versatile platform for practical programming excercises in online environments. In2016 IEEE Global Engineering Education Conference (EDUCON), pages 314–323. IEEE, 2016

2016

[31] [31]

Time travel in llms: Tracing data contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[32] [32]

Paperbench: Evaluating ai’s ability to replicate ai research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. Paperbench: Evaluating ai’s ability to replicate ai research. InForty-second International Conference on Machine Learning, 2025

2025

[33] [33]

Do ai models perform human-like abstract reasoning across modalities?arXiv preprint arXiv:2510.02125, 2025

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W Tsai, Sivasankaran Rajamanickam, and Melanie Mitchell. Do ai models perform human-like abstract reasoning across modalities?arXiv preprint arXiv:2510.02125, 2025. 16 Alizadeh et al

work page arXiv 2025

[34] [34]

Do claude code and codex p-hack? sycophancy and statistical analysis in large language models, 2026

Samuel GZ Asher, Janet Malzahn, Jessica M Persano, Elliot J Paschal, Andrew CW Myers, and Andrew B Hall. Do claude code and codex p-hack? sycophancy and statistical analysis in large language models, 2026

2026

[35] [35]

Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty.Proceedings of the National Academy of Sciences, 119(44):e2203150119, 2022

Nate Breznau, Eike Mark Rinke, Alexander Wuttke, Hung HV Nguyen, Muna Adem, Jule Adriaans, Amalia Alvarez-Benjumea, Henrik K Andersen, Daniel Auer, Flavio Azevedo, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty.Proceedings of the National Academy of Sciences, 119(44):e2203150119, 2022

2022

[36] [36]

Many analysts, one data set: Making transparent how variations in analytic choices affect results.Advances in methods and practices in psychological science, 1(3):337–356, 2018

Raphael Silberzahn, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results.Advances in methods and practices in psychological science, 1(3):337–356, 2018

2018

[37] [37]

1,500 scientists lift the lid on reproducibility.Nature, 533(7604):452–454, 2016

Monya Baker. 1,500 scientists lift the lid on reproducibility.Nature, 533(7604):452–454, 2016

2016

[38] [38]

The research reproducibility crisis and economics of science, 2017

Zacharias Maniadis and Fabio Tufano. The research reproducibility crisis and economics of science, 2017

2017

[39] [39]

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Bang Nguyen, Dominik Soós, Qian Ma, Rochana R Obadage, Zack Ranjan, Sai Koneru, Timothy M Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, et al. Replicatorbench: Benchmarking llm agents for replicability in social and behavioral sciences.arXiv preprint arXiv:2602.11354, 2026. 17 Alizadeh et al. A Task Design Examples A.1 Examples of Tasks Excluded ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Areplication-materials/directory

[41] [42]

id" •"RQ

A JSON file namedRQ_{folder_number}.jsoncontaining: •"id" •"RQ" •"paper_title" •"paper_authors" Process allNfolders (sequentially or in parallel). Follow the steps below exactly. Step 1 — Read Instructions •Open{folder_name}.json. •Carefully read"task_prompt". •Identify and respect any explicit restrictions. Step 2 — Inspect Replication Materials •Read al...

[42] [43]

AReplication/directory

[43] [44]

task_prompt

A JSON file named{folder_name}.jsoncontaining: •"task_prompt" •"tasks"

[44] [45]

id" •"RQ

A JSON file namedRQ_{folder_number}.jsoncontaining: •"id" •"RQ" •"paper_title" •"paper_authors" Process all N folders (sequentially or in parallel). STEP 1 — Read Instructions •Open{folder_name}.json. •Carefully read"task_prompt". •Identify and respect any explicit restrictions. STEP 2 — Inspect Replication Materials

[45] [46]

Inspect theReplication/(orreplication-materials/) directory

[46] [47]

Read all README files and setup notes

[47] [48]

Identify: •Entry-point scripts or notebooks •Expected outputs and locations •Data files and formats •Language/tooling used (Python, R, Stata, Julia, etc.) •Hardcoded paths or external assumptions •IDE/notebook dependencies •Missing output directories or required folder structures STEP 3 — Environment Setup (Offline Sandbox)

[48] [49]

Create or activate an environment (virtualenv/conda if available)

[49] [50]

•R:Rscript -e ’install.packages(...)’

Install required packages: •Python:python3 -m pip install ... •R:Rscript -e ’install.packages(...)’

[50] [51]

Resolve version incompatibilities using closest compatible versions and document choices

[51] [52]

Use only local files

Do not download data from the internet. Use only local files. STEP 4 — Write a New Executable Replication Script Create a new script in the current folder named: replication_code.pyorreplication_code.R Choose the dominant language in the repository. If code is a.do file, convert it to an R script and run that. The script must:

[52] [53]

Be executable end-to-end from the command line

[53] [54]

Reproduce the main analysis pipeline using provided code and data

[54] [55]

Resolve executability issues, including: •Missing directories (create output folders) •Hardcoded absolute paths (replace with relative paths) •Notebook-only logic (convert to scriptable workflow) •Interactive IDE assumptions •Dependency/version mismatches •File naming inconsistencies

[55] [56]

21 Alizadeh et al

Preserve original analytical logic whenever possible. 21 Alizadeh et al

[56] [57]

Write all outputs into a localresults/directory

[57] [58]

Include minimal logging statements

[58] [59]

STEP 5 — Execute and Validate

If the entry point in{folder_number}.json is incorrect, identify the correct entry point indepen- dently. STEP 5 — Execute and Validate

[59] [60]

Run the new replication script

[60] [61]

Verify outputs match task requirements

[61] [62]

If execution fails, revise only the new script and environment

[62] [63]

Iterate until best achievable reproduction is reached

[63] [64]

Copy the original JSON structure and insert answers inline

[64] [65]

task_prompt

Save asresults_1.jsonwith exact schema: { "task_prompt": "<copied exactly>", "tasks": [ {"Question text 1": "Answer 1"}, {"Question text 2": "Answer 2"} ] } STEP 6 — Logging •Createlog.jsoncontaining: –Commands executed –Errors encountered –Fixes applied –Replication status (success/failure) •If replication fails: –Document issue inlog.json –Continue to n...

2025