ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
Pith reviewed 2026-05-14 18:23 UTC · model grok-4.3
The pith
ReproScore separates static readiness assessment from actual execution outcome in research software and finds near-zero correlation between them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReproScore is a two-tier framework that separates reproducibility readiness (RRS, built from 26 sub-metrics in five categories) from reproducibility outcome (ROS, measured by sandboxed execution probes). These are combined into a coverage-adaptive Composite Score (RCS) whose weights can be set by community YAML profiles. Evaluated on 423 repositories spanning five failure modes, the environment category of RRS discriminates failure types effectively, yet the overall RRS score exhibits near-zero correlation with binary execution success, thereby quantifying the readiness-outcome gap at repository scale.
What carries the argument
The ReproScore two-tier separation of static readiness metrics (RRS) from execution-based outcome probes (ROS), aggregated via a coverage-adaptive Composite Score (RCS) with external YAML weighting profiles.
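As a sketch of how such a coverage-adaptive composite might behave: the paper does not spell out the RCS formula here, so the function below, its signature, and the equal default weights are hypothetical — the point is only that the score degrades gracefully to the readiness tier when no execution probe is available.

```python
def composite_score(rrs, ros=None, w_rrs=0.5, w_ros=0.5):
    """Hypothetical coverage-adaptive composite (RCS) sketch.

    rrs: readiness score in [0, 1] from static sub-metrics.
    ros: outcome score in [0, 1] from execution probes, or None
         when no sandbox coverage exists for the repository.
    """
    if ros is None:
        # No execution coverage: the composite falls back to readiness alone.
        return rrs
    total = w_rrs + w_ros
    return (w_rrs * rrs + w_ros * ros) / total

# With execution coverage, both tiers contribute:
print(composite_score(0.8, 0.2))  # -> 0.5
# Without sandbox access, the score reduces to the readiness tier:
print(composite_score(0.8))       # -> 0.8
```

A high-readiness repository that fails to execute (0.8 vs. 0.2) lands mid-scale, which is exactly the case a conflated single score would mask.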
If this is right
- Digital libraries can now maintain separate readiness and outcome scores instead of treating completeness as a proxy for executability.
- Community-defined YAML profiles allow different weighting schemes to be versioned and reused across curation workflows.
- Static signals remain useful for detecting structural differences among failure modes even when they do not predict success.
- Assessment remains feasible at scale when full sandbox execution is unavailable, by falling back to the readiness tier.
- The architectural split demonstrates that readiness and outcome must be tracked independently for reproducibility-aware curation.
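The community-profile idea in the bullets above can be sketched in a few lines. Only the "environment" category name is taken from the paper; the other category names and all weight values below are invented for illustration, and in ReproScore itself the profile would live in a versioned YAML file rather than an inline dict.

```python
# Hypothetical weighting profile -- in practice a versioned YAML document.
# Category names other than "environment" are illustrative guesses.
PROFILE = {
    "name": "library-curation-v1",
    "weights": {
        "environment": 0.35,
        "documentation": 0.20,
        "dependencies": 0.20,
        "data_availability": 0.15,
        "licensing": 0.10,
    },
}

def weighted_rrs(category_scores, profile):
    """Combine per-category readiness scores (each in [0, 1]) under a
    profile, renormalising over the categories actually observed."""
    weights = profile["weights"]
    covered = {c: weights[c] for c in category_scores if c in weights}
    total = sum(covered.values())
    if total == 0:
        raise ValueError("profile covers none of the scored categories")
    return sum(category_scores[c] * w for c, w in covered.items()) / total

scores = {"environment": 0.9, "documentation": 0.4, "licensing": 1.0}
print(round(weighted_rrs(scores, PROFILE), 3))  # -> 0.762
```

Renormalising over covered categories keeps the score comparable when a repository lacks signals for some categories, which is the same coverage-adaptive idea the composite score uses.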
Where Pith is reading between the lines
- If the near-zero correlation persists, many existing automated reproducibility tools may systematically overstate the usability of archived code.
- The same separation could be tested on data repositories or computational notebooks to check whether a similar readiness-outcome gap appears outside GitHub software.
- Libraries could adopt the composite score as a filter that only triggers full execution checks on high-readiness items, reducing compute cost.
- Extending the five-category metric set to include citation or documentation quality might further improve discrimination without adding execution overhead.
Load-bearing premise
The 423-repository ground-truth corpus spanning five failure modes is representative enough to support general claims about the near-zero correlation between readiness and outcome.
What would settle it
Re-running the same RRS computation and execution probes on a new collection of at least 1000 research software repositories would falsify the reported gap if it yielded a statistically significant positive correlation between RRS and ROS.
Original abstract
Digital libraries curate millions of research software artefacts yet lack scalable infrastructure for assessing whether those artefacts remain executable. Existing automated assessment tools treat static repository completeness -- what a repository contains -- as a proxy for execution success -- whether it runs. We term this the readiness-outcome conflation and present ReproScore, a two-tier framework that explicitly separates reproducibility readiness (RRS) from reproducibility outcome (ROS), combining them into a coverage-adaptive Composite Score (RCS). RRS comprises 26 sub-metrics across five categories; ROS provides execution-based probes when sandbox infrastructure is available; a community rubric externalises weighting priorities as versioned YAML profiles. Evaluated on 423 GitHub repositories from a large-scale ground-truth corpus spanning five failure modes, two complementary findings emerge: the environment category strongly discriminates failure mode, confirming static signals capture meaningful structural differences; yet RRS exhibits near-zero binary success correlation, empirically quantifying the readiness-outcome gap at repository scale. Together, these findings validate the architectural separation as both necessary and non-trivial, positioning ReproScore as scalable infrastructure for reproducibility-aware curation in digital library workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ReproScore, a two-tier framework that separates reproducibility readiness (RRS, computed from 26 static sub-metrics across five categories) from reproducibility outcome (ROS, via execution-based probes). These are combined into a coverage-adaptive Composite Score (RCS) using versioned community YAML weighting profiles. Evaluated on a 423-repository ground-truth corpus spanning five failure modes, the work reports that the environment category discriminates among failure modes while RRS exhibits near-zero correlation with binary success, thereby quantifying the readiness-outcome gap at repository scale.
Significance. If the near-zero correlation generalizes beyond the curated corpus, the separation of static readiness signals from execution outcomes supplies scalable infrastructure for reproducibility-aware curation in digital libraries. The external ground-truth corpus, execution probes, and versioned YAML profiles constitute concrete strengths that make the assessment adaptable and empirically testable.
Major comments (1)
- [Evaluation] Evaluation section: the 423-repository ground-truth corpus was assembled specifically to span five failure modes. This selection process conditions both the outcome distribution and RRS score spread on failure diversity rather than drawing from an unbiased sample of GitHub research software. Consequently, the headline claim of near-zero RRS-binary success correlation requires additional support (sampling frame, inclusion criteria, or population comparison statistics) before the generalization to broader collections can be accepted.
Minor comments (1)
- [Abstract] Abstract: no information is supplied on how the 26 sub-metrics are derived, how weights are assigned, or whether any post-hoc decisions were made during the 423-repository evaluation; these details are needed to evaluate the robustness of the reported correlation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the scope and limitations of our evaluation. We address the major comment point by point below and have revised the manuscript to improve transparency regarding the corpus construction and generalizability.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the 423-repository ground-truth corpus was assembled specifically to span five failure modes. This selection process conditions both the outcome distribution and RRS score spread on failure diversity rather than drawing from an unbiased sample of GitHub research software. Consequently, the headline claim of near-zero RRS-binary success correlation requires additional support (sampling frame, inclusion criteria, or population comparison statistics) before the generalization to broader collections can be accepted.
Authors: We agree that the 423-repository ground-truth corpus was deliberately assembled to span the five failure modes, which means it is not a random or unbiased sample from the broader population of GitHub research software repositories. This curation was intentional, allowing a comprehensive assessment of ReproScore's behavior across diverse reproducibility challenges, including both successful and failed executions. The near-zero correlation between RRS and binary success is an empirical observation specific to this corpus, which includes a balanced representation of failure modes to highlight the readiness-outcome gap. We do not claim that this correlation holds universally without additional validation on other datasets. To address the referee's concern, we will revise the Evaluation section to: explicitly detail the sampling frame and inclusion criteria used in constructing the corpus; include a dedicated limitations subsection discussing the implications for generalizability; and provide any available statistics comparing the corpus to broader populations where possible. These changes will clarify that the findings are conditioned on the diverse failure-mode coverage while strengthening the manuscript's transparency.
Revision: yes
Circularity Check
No significant circularity: the empirical claim rests on an external corpus evaluation.
Full rationale
The paper defines RRS via 26 independent sub-metrics across five categories and ROS via execution probes, then reports an observed near-zero correlation on a separately curated 423-repository ground-truth corpus. No parameter is fitted to the correlation itself, no self-citation supplies the central uniqueness or weighting, and the readiness-outcome separation is not derived from the measured correlation. The evaluation chain therefore remains self-contained against the external corpus and does not reduce to its own inputs by construction.