
arXiv: 2606.11447 · v1 · pith:Q4HD2GBX · submitted 2026-06-09 · 💻 cs.CL

AI Coding Agents Can Reproduce Social Science Findings

Pith reviewed 2026-06-27 13:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords AI coding agents · social science reproducibility · reproduction benchmark · Claude Code · Codex · LLM evaluation · computational workflows · prompt effects

The pith

Claude Code and Codex reproduce a large share of social science findings from original data and code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SocSci-Repro-Bench, a set of 221 tasks drawn from studies in four disciplines whose results are known to be either fully reproducible or impossible to reproduce because materials are missing. It tests two frontier coding agents on these tasks to measure their ability to execute the original analyses correctly. Both agents succeed on many tasks, with Claude Code outperforming Codex and exceeding reproduction rates seen in earlier LLM evaluations. The agents also identify the underlying research questions of the studies at high rates, and checks suggest that success does not come mainly from memorizing the papers. The work shows that giving agents the original paper PDF can help but risks introducing bias on tasks that cannot be reproduced, and that prompt wording can steer agents toward certain analytical choices.

Core claim

We introduce SocSci-Repro-Bench with 221 tasks spanning four disciplines and 13 domains, built only from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data. On this benchmark, both Claude Code and Codex reproduce a large share of the findings, with Claude Code substantially outperforming Codex at rates higher than those reported for general-purpose LLM-based agents on comparable tasks. The agents also perform strongly on identifying the research questions behind the studies, and additional analyses indicate that results are not primarily driven by memorization. Providing the paper PDF modestly improves performance while introducing bias on tasks where reproduction is impossible.

What carries the argument

SocSci-Repro-Bench, a collection of 221 reproduction tasks that isolates agent performance by restricting the set to studies with clear, known reproducibility status based on provided materials.

If this is right

  • Frontier coding agents can serve as reliable executors of computational workflows in social science research.
  • Prompt design requires care because subtle framing can nudge agents toward confirmatory specification search.
  • Providing original papers alongside code and data can improve results but risks biasing outcomes on non-reproducible tasks.
  • Agents demonstrate capability on reasoning tasks such as identifying underlying research questions in addition to code execution.
  • Systematic benchmarking becomes necessary as AI systems take on larger roles in scientific production.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The performance gap between agents may grow or shrink with newer models, suggesting ongoing monitoring of capability changes over time.
  • Integration of such agents into research could accelerate verification of existing findings but would require safeguards against prompt-induced biases.
  • Extending the benchmark approach to other fields or to studies using newer data sources could test how broadly the reproduction capacity holds.
  • The results raise the practical question of how to combine agent outputs with human oversight in publication pipelines.

Load-bearing premise

The benchmark selects only studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, so that success or failure can be attributed to the agent rather than the materials themselves.

What would settle it

Running the same agents on an independent collection of social science studies whose reproducibility status has been verified separately from the benchmark construction, and finding substantially lower reproduction rates.

Figures

Figures reproduced from arXiv: 2606.11447 by Atoosa Kasirzadeh, Fabrizio Gilardi, Joshua Tucker, Meysam Alizadeh, Mohsen Mosleh.

Figure 1: Comparison of Claude Code and Codex across three accuracy metrics and failure rates. (Left) Accuracy for all tasks (N = 221), non-reproducible tasks (N = 10), and all papers (N = 54). Both models achieve perfect accuracy on non-reproducible tasks, while Claude Code substantially outperforms Codex at both task (93.4% vs. 62.1%) and paper level (78.0% vs. 35.8%), where a paper was considered fully reproduced…
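
The distinction between task-level and paper-level accuracy in the Figure 1 caption can be illustrated with a minimal sketch. It assumes, following the "all tasks correct" wording in the Figure 2 caption, that a paper counts as reproduced only when every one of its tasks is correct. The function name, the input layout, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def task_and_paper_accuracy(results):
    """Compute (task-level, paper-level) accuracy.

    results: iterable of (paper_id, is_correct) pairs, one per task.
    Assumption: a paper is reproduced only if all of its tasks are correct.
    """
    results = list(results)
    if not results:
        raise ValueError("no task results given")
    by_paper = defaultdict(list)
    for paper_id, is_correct in results:
        by_paper[paper_id].append(bool(is_correct))
    task_acc = sum(bool(ok) for _, ok in results) / len(results)
    paper_acc = sum(all(flags) for flags in by_paper.values()) / len(by_paper)
    return task_acc, paper_acc

# Toy data (hypothetical, not from the paper): 3 papers, 6 tasks, 1 failure.
toy = [
    ("p1", True), ("p1", True),
    ("p2", True), ("p2", False),
    ("p3", True), ("p3", True),
]
task_acc, paper_acc = task_and_paper_accuracy(toy)
print(f"task-level: {task_acc:.3f}, paper-level: {paper_acc:.3f}")
```

Because a single failed task disqualifies its whole paper, paper-level accuracy can never exceed task-level accuracy when failures are spread across papers, which is consistent with the gap between the two metrics reported in the caption.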
Figure 2: Stratified performance of Claude Code and Codex across programming languages and training-data cutoffs. a, Task-level accuracy stratified by the primary programming language of each replication package (Python, n = 49 tasks; R, n = 136; Stata, n = 36). Sample sizes differ across languages and comparisons are descriptive. b, Paper-level accuracy (all tasks correct) under the same language stratification (Py…
Figure 3: Evidence against direct memorization in AI coding agents assessed through metadata recovery from anonymized replication materials. Stacked bar charts show the percentage of papers (n = 54) for which each AI agent correctly recovered the title, authors, journal and publication year from fully anonymized replication code and data, compared against a gold-standard reference. (a) Claude Code attempted metadata…
Figure 4: Research question (RQ) extraction accuracy of Claude Code and Codex compared to the Gold standard. Three similarity metrics are shown in each panel: RQ-level semantic match rate (proportion of greedy-paired RQs that are semantically equivalent), paper-level full match rate (proportion of papers where all Gold RQs have a semantic match), and paper-level ≥60% match rate (proportion of papers where at least 6…
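
The greedy pairing and the paper-level thresholds named in the Figure 4 caption can be sketched roughly as follows. The page does not say how semantic equivalence was judged, so this sketch swaps in a token-overlap (Jaccard) score with an arbitrary 0.5 cutoff as a stand-in. The helper names, the cutoff, and the example questions are all hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity (assumption): token-set overlap, not a semantic judge."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def greedy_pairs(gold, pred, sim=jaccard):
    """Pair each gold RQ with at most one predicted RQ, best scores first."""
    scored = sorted(
        ((sim(g, p), i, j) for i, g in enumerate(gold) for j, p in enumerate(pred)),
        key=lambda t: -t[0],
    )
    used_gold, used_pred, pairs = set(), set(), {}
    for score, i, j in scored:
        if i not in used_gold and j not in used_pred:
            used_gold.add(i)
            used_pred.add(j)
            pairs[i] = score  # gold index -> score of its paired prediction
    return pairs

def paper_match_fraction(gold, pred, threshold=0.5):
    """Fraction of gold RQs whose greedy partner clears the (assumed) threshold."""
    pairs = greedy_pairs(gold, pred)
    matched = sum(1 for i in range(len(gold)) if pairs.get(i, 0.0) >= threshold)
    return matched / len(gold)

# Hypothetical example, not taken from the benchmark.
gold = ["does media exposure increase turnout", "does income affect trust"]
pred = ["does media exposure increase voter turnout", "is age related to voting"]
frac = paper_match_fraction(gold, pred)
full_match = frac == 1.0   # counts toward the paper-level full match rate
at_least_60 = frac >= 0.6  # counts toward the paper-level >=60% match rate
print(frac, full_match, at_least_60)
```

Greedy pairing is order-dependent and not guaranteed to find the best overall assignment. An optimal matching (for example, the Hungarian algorithm) would be an alternative design, but the caption specifies greedy pairing.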
Figure 5: Accuracy under confirmatory prompt nudging. Mean accuracy of Claude Code and Codex across three independent runs when presented with confirmatory prompts designed to induce result-oriented specification search. Results are faceted by evaluation granularity: task-level (left) and paper-level (right). Claude Code maintained high accuracy on all tasks (94.1%) compared with Codex (74.1%), though both models sh…
Figure 6: Claude Code and Codex performance on the CORE-Bench social science reproducibility benchmark. (a) Non-anonymized condition, where agents have access to paper titles and author names, and (b) anonymized condition, where this metadata is removed. Each panel reports task-level and paper-level accuracy (left) and failure rates (right), averaged across three independent runs. … with 0.0% failure rates (Fig. 6a). …
Original abstract

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient, either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SocSci-Repro-Bench, a benchmark of 221 tasks spanning four social science disciplines constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data. It evaluates two frontier coding agents (Claude Code and Codex) on reproduction of findings, reports that both achieve large success rates (Claude substantially outperforming Codex and exceeding prior benchmarks), shows strong performance on a research-question identification task, provides evidence against memorization as the driver, and demonstrates effects of PDF provision and prompt framing on confirmatory search.

Significance. If the benchmark tasks are correctly validated, the results would indicate that frontier AI coding agents can function as reliable executors of computational social science workflows at rates higher than previously documented, with implications for scaling reproducibility efforts; the work also supplies falsifiable predictions about prompt sensitivity and includes analyses separating memorization from genuine reproduction capacity.

major comments (2)
  1. [Abstract / benchmark construction] The claim that the tasks isolate agent reproduction capacity rests on selecting studies 'whose results are either fully reproducible with available materials', but the manuscript reports no verification step in which the authors executed the replication code and confirmed that outputs match the published coefficients, p-values, or tables (within tolerance). Without this, measured success rates may reflect execution of whatever the script produces rather than reproduction of the claimed findings, directly undermining both the isolation that is meant to distinguish this benchmark from prior ones and the central comparison.
  2. [Results / evaluation sections] Success rates and outperformance claims are presented without error bars, confidence intervals, or statistical tests comparing them with prior reproducibility benchmarks, so the robustness of the assertion that rates 'considerably exceed' previous work cannot be assessed.
minor comments (2)
  1. [Methods / evaluation setup] Clarify the precise model versions or interfaces referred to by 'Claude Code' and 'Codex' and whether they are used with default settings or custom scaffolding.
  2. [Results on PDF provision] The modest improvement from providing the original paper PDF is described qualitatively; a table or specific delta in success rates would strengthen the claim.
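
The verification step requested in major comment 1 (confirming that reproduced outputs match published values within tolerance) could look roughly like the sketch below. The function name, the quantity names, the numbers, and the tolerance of 0.005 are illustrative assumptions. The page does not state what tolerance, if any, the authors used.

```python
import math

def matches_published(reproduced, published, rel_tol=0.0, abs_tol=0.005):
    """Return a pass/fail flag per published quantity.

    reproduced, published: dicts mapping a quantity name to a float.
    A quantity missing from `reproduced` counts as a failure.
    abs_tol=0.005 is an assumed tolerance chosen only for illustration.
    """
    return {
        name: name in reproduced
        and math.isclose(reproduced[name], value, rel_tol=rel_tol, abs_tol=abs_tol)
        for name, value in published.items()
    }

# Hypothetical values, not taken from any study in the benchmark.
published = {"beta_treatment": 0.142, "p_value": 0.031}
reproduced = {"beta_treatment": 0.1424, "p_value": 0.048}
report = matches_published(reproduced, published)
print(report)
```

An absolute tolerance suits coefficients reported to a fixed number of decimals. A relative tolerance (`rel_tol`) may be a better fit when quantities span several orders of magnitude.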

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract / benchmark construction] The claim that the tasks isolate agent reproduction capacity rests on selecting studies 'whose results are either fully reproducible with available materials', but the manuscript reports no verification step in which the authors executed the replication code and confirmed that outputs match the published coefficients, p-values, or tables (within tolerance). Without this, measured success rates may reflect execution of whatever the script produces rather than reproduction of the claimed findings, directly undermining both the isolation that is meant to distinguish this benchmark from prior ones and the central comparison.

    Authors: We agree that an explicit verification step strengthens the benchmark's validity. The abstract describes selection criteria based on reproducibility with available materials, but does not report the execution and confirmation process. In the revised manuscript we will add a dedicated methods subsection describing how we ran the replication code for each task and verified that outputs matched published results (within tolerance), thereby better isolating agent reproduction capacity from material issues. revision: yes

  2. Referee: [Results / evaluation sections] Success rates and outperformance claims are presented without error bars, confidence intervals, or statistical tests comparing them with prior reproducibility benchmarks, so the robustness of the assertion that rates 'considerably exceed' previous work cannot be assessed.

    Authors: We concur that statistical reporting is needed for robust interpretation. The current version presents raw success rates without uncertainty estimates or formal comparisons. In the revision we will add confidence intervals to all reported rates and include appropriate statistical tests (e.g., proportion tests or bootstrap comparisons) against prior benchmarks to allow readers to evaluate the outperformance claims. revision: yes
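
The uncertainty estimates and proportion tests mentioned in the second response can be sketched with the standard library alone. The example below computes a Wilson score interval for a single success rate and an unpaired two-proportion z-test. The counts are hypothetical and are not the paper's data. The unpaired test fits comparisons against a separate prior benchmark. Comparing two agents on the same tasks is a paired design, for which a paired test such as McNemar's would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided z-test for equality of two independent proportions."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts for illustration only: 180/200 vs. 120/200 successes.
low, high = wilson_interval(180, 200)
z, p_value = two_proportion_z_test(180, 200, 120, 200)
print(f"95% CI: ({low:.3f}, {high:.3f}); z = {z:.2f}, p = {p_value:.2g}")
```

A bootstrap over papers rather than tasks would be one way to respect the clustering of tasks within papers, which the simple formulas above ignore.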

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivation


The paper reports direct empirical measurements of agent success rates on a newly constructed set of 221 tasks drawn from published studies. Reproduction rates are obtained by running the agents on the provided materials and counting matches to published outputs; no equations, fitted parameters, or self-referential definitions reduce these counts to the benchmark inputs themselves. No load-bearing self-citations or uniqueness theorems are invoked to justify the central performance claims. The evaluation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the 221 tasks accurately separate agent capability from material defects; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The selected studies are correctly classified as fully reproducible or non-reproducible solely on the basis of available materials.
    This classification is used to isolate agents' reproduction capacity (abstract description of benchmark).



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coding-agents can replicate scientific machine learning papers

    cs.AI · 2026-07 · unverdicted · novelty 7.0

    Paper-replication is a workflow that enables coding agents to replicate computational claims from scientific ML papers by recording targets, reconstructing methods, running experiments, and validating evidence against...
