pith. machine review for the scientific record.

arxiv: 2605.13275 · v1 · submitted 2026-05-13 · 💻 cs.SE

Recognition: unknown

ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment

Sheeba Samuel, Daniel Mietchen, Jungsan Kim, Waqas Ahmed, Martin Gaedke

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords reproducibility assessment · research software · readiness metrics · execution outcome · digital libraries · GitHub repositories · failure modes · composite score

The pith

ReproScore separates static readiness assessment from actual execution outcome in research software and finds near-zero correlation between them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current tools for checking research software treat how complete a repository looks as a stand-in for whether its code will actually run. To undo this conflation, it introduces ReproScore, a framework that scores readiness separately from outcome and combines the two only when execution data is available. On 423 real GitHub repositories chosen to cover five common failure types, the static readiness score strongly tracks structural differences such as environment setup, yet it shows almost no relation to whether execution actually succeeds. This gap means digital libraries cannot rely on completeness checks alone if they want to know which artefacts remain usable.

Core claim

ReproScore is a two-tier framework that separates reproducibility readiness (RRS, built from 26 sub-metrics in five categories) from reproducibility outcome (ROS, measured by sandboxed execution probes). These are combined into a coverage-adaptive Composite Score (RCS) whose weights can be set by community YAML profiles. Evaluated on 423 repositories spanning five failure modes, the environment category of RRS discriminates failure types effectively, yet the overall RRS score exhibits near-zero correlation with binary execution success, thereby quantifying the readiness-outcome gap at repository scale.
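The abstract does not spell out the aggregation formula, but the described behaviour — readiness-only fallback, execution tier blended in when probes ran — can be sketched. A minimal Python sketch, assuming a linear blend whose outcome weight scales with probe coverage and a hypothetical YAML profile layout; the field names, category names, and weights below are illustrative assumptions, not the paper's specification.

```python
import yaml

# Hypothetical community profile: weights for the five RRS categories
# and for the execution (ROS) tier. All names and values are assumptions.
PROFILE = yaml.safe_load("""
name: example-profile
version: 1.0.0
rrs_categories:            # five categories, weights sum to 1
  environment: 0.30
  documentation: 0.20
  code_quality: 0.20
  data: 0.15
  automation: 0.15
ros_weight: 0.5            # outcome-tier weight at full probe coverage
""")

def rrs(category_scores: dict, profile: dict) -> float:
    """Static readiness score: weighted mean of per-category scores in [0, 1]."""
    weights = profile["rrs_categories"]
    return sum(weights[c] * category_scores[c] for c in weights)

def rcs(rrs_score: float, ros_score, coverage: float, profile: dict) -> float:
    """Coverage-adaptive composite: identical to RRS when no execution
    probes ran, blending in ROS as probe coverage grows."""
    if ros_score is None or coverage == 0.0:
        return rrs_score
    alpha = profile["ros_weight"] * coverage
    return (1 - alpha) * rrs_score + alpha * ros_score

# A repository that looks ready but fails every probe: the composite
# drops well below its readiness score, which is the point of the split.
r = rrs({"environment": 0.9, "documentation": 0.8, "code_quality": 0.7,
         "data": 0.6, "automation": 0.5}, PROFILE)
print(rcs(r, ros_score=0.0, coverage=1.0, profile=PROFILE))  # 0.5 * 0.735
```

The design choice worth noting: because the outcome weight is gated on coverage, the score degrades gracefully to a pure readiness assessment wherever sandbox infrastructure is unavailable, which is exactly the fallback the paper's tiering is meant to enable.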

What carries the argument

The ReproScore two-tier separation of static readiness metrics (RRS) from execution-based outcome probes (ROS), aggregated via a coverage-adaptive Composite Score (RCS) with external YAML weighting profiles.

If this is right

  • Digital libraries can now maintain separate readiness and outcome scores instead of treating completeness as a proxy for executability.
  • Community-defined YAML profiles allow different weighting schemes to be versioned and reused across curation workflows.
  • Static signals remain useful for detecting structural differences among failure modes even when they do not predict success.
  • Assessment remains feasible at scale when full sandbox execution is unavailable, by falling back to the readiness tier.
  • The architectural split demonstrates that readiness and outcome must be tracked independently for reproducibility-aware curation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the near-zero correlation persists, many existing automated reproducibility tools may systematically overstate the usability of archived code.
  • The same separation could be tested on data repositories or computational notebooks to check whether a similar readiness-outcome gap appears outside GitHub software.
  • Libraries could adopt the composite score as a filter that only triggers full execution checks on high-readiness items, reducing compute cost (sketched after this list).
  • Extending the five-category metric set to include citation or documentation quality might further improve discrimination without adding execution overhead.
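The triage idea in the third bullet is mechanical enough to sketch. A minimal Python illustration, assuming hypothetical compute_rrs and run_sandbox_probes entry points (stubbed here with random values) and an arbitrary threshold; none of these names or numbers come from the paper.

```python
import random

def compute_rrs(repo: str) -> float:
    """Stub for the cheap static tier; a real system scores 26 sub-metrics."""
    return random.random()

def run_sandbox_probes(repo: str) -> bool:
    """Stub for the expensive execution tier; returns binary success."""
    return random.random() > 0.5

def triage(repos, readiness_threshold: float = 0.7):
    """Run sandbox probes only on high-readiness items; everything else
    keeps a readiness-only (tier-1) record with outcome left unknown."""
    for repo in repos:
        r = compute_rrs(repo)
        if r >= readiness_threshold:
            yield repo, r, run_sandbox_probes(repo)
        else:
            yield repo, r, None

for repo, r, outcome in triage(["repo-a", "repo-b", "repo-c"]):
    print(f"{repo}: RRS={r:.2f}, outcome={outcome}")
```

One caution follows from the paper's own headline result: if readiness barely correlates with outcome, such a filter saves compute but may systematically skip low-readiness repositories that would in fact have run.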

Load-bearing premise

The 423-repository ground-truth corpus spanning five failure modes is representative enough to support general claims about the near-zero correlation between readiness and outcome.

What would settle it

Re-running the same RRS computation and execution probes on a fresh collection of at least 1,000 research software repositories, and finding a statistically significant positive correlation between RRS and ROS, would falsify the reported gap.
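That test reduces to a standard significance check. A minimal sketch using SciPy's point-biserial correlation between a continuous readiness score and a binary success label; the data below are random placeholders, and the paper may report a different correlation statistic.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
rrs_scores = rng.uniform(0.0, 1.0, size=1000)   # continuous RRS per repository
success = rng.integers(0, 2, size=1000)         # binary execution outcome (ROS)

r, p = pointbiserialr(success, rrs_scores)
# A significantly positive r on a fresh >= 1000-repository sample would
# contradict the reported near-zero readiness-outcome correlation.
print(f"r = {r:.3f}, p = {p:.3g}")
```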

Figures

Figures reproduced from arXiv: 2605.13275 by Daniel Mietchen, Jungsan Kim, Martin Gaedke, Sheeba Samuel, Waqas Ahmed.

Figure 1. ReproScore two-tier architecture. Tier 1 (static analysis, always available at intake) computes RRS (cf. Section 3.1) from 26 sub-metrics (cf. … view at source ↗

Figure 2. Mean category score by failure mode (423 repositories; 84–85 per class). The E panel illustrates the detection paradox: install_dep scores highest on environment specification yet fails at install time; success scores lower. Kruskal-Wallis H and p-values per panel. success repositories score lower (E = 10.3) than install_dep, so directional cancellation across the binary label produces near-zero (slightly… view at source ↗
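The per-panel statistic named in the Figure 2 caption is standard and easy to reproduce in miniature. A minimal sketch of the Kruskal-Wallis test across five failure-mode groups, with placeholder scores; the group means and spreads are invented, and the labels other than success and install_dep are hypothetical stand-ins for the paper's failure modes (only the group size of roughly 85 per class follows the caption).

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Placeholder environment-category scores for five failure-mode groups.
groups = {mode: rng.normal(loc=center, scale=2.0, size=85)
          for mode, center in [("success", 10.3), ("install_dep", 14.0),
                               ("data_missing", 8.0), ("timeout", 9.0),
                               ("runtime_error", 9.5)]}

h, p = kruskal(*groups.values())   # one H statistic and p-value per panel
print(f"H = {h:.1f}, p = {p:.3g}")
```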
read the original abstract

Digital libraries curate millions of research software artefacts yet lack scalable infrastructure for assessing whether those artefacts remain executable. Existing automated assessment tools treat static repository completeness -- what a repository contains -- as a proxy for execution success -- whether it runs. We term this the readiness-outcome conflation and present ReproScore, a two-tier framework that explicitly separates reproducibility readiness (RRS) from reproducibility outcome (ROS), combining them into a coverage-adaptive Composite Score (RCS). RRS comprises 26 sub-metrics across five categories; ROS provides execution-based probes when sandbox infrastructure is available; a community rubric externalises weighting priorities as versioned YAML profiles. Evaluated on 423 GitHub repositories from a large-scale ground-truth corpus spanning five failure modes, two complementary findings emerge: the environment category strongly discriminates failure mode, confirming static signals capture meaningful structural differences; yet RRS exhibits near-zero binary success correlation, empirically quantifying the readiness-outcome gap at repository scale. Together, these findings validate the architectural separation as both necessary and non-trivial, positioning ReproScore as scalable infrastructure for reproducibility-aware curation in digital library workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces ReproScore, a two-tier framework that separates reproducibility readiness (RRS, computed from 26 static sub-metrics across five categories) from reproducibility outcome (ROS, via execution-based probes). These are combined into a coverage-adaptive Composite Score (RCS) using versioned community YAML weighting profiles. Evaluated on a 423-repository ground-truth corpus spanning five failure modes, the work reports that the environment category discriminates among failure modes while RRS exhibits near-zero correlation with binary success, thereby quantifying the readiness-outcome gap at repository scale.

Significance. If the near-zero correlation generalizes beyond the curated corpus, the separation of static readiness signals from execution outcomes supplies scalable infrastructure for reproducibility-aware curation in digital libraries. The external ground-truth corpus, execution probes, and versioned YAML profiles constitute concrete strengths that make the assessment adaptable and empirically testable.

major comments (1)
  1. [Evaluation] Evaluation section: the 423-repository ground-truth corpus was assembled specifically to span five failure modes. This selection process conditions both the outcome distribution and RRS score spread on failure diversity rather than drawing from an unbiased sample of GitHub research software. Consequently, the headline claim of near-zero RRS-binary success correlation requires additional support (sampling frame, inclusion criteria, or population comparison statistics) before the generalization to broader collections can be accepted.
minor comments (1)
  1. [Abstract] Abstract: no information is supplied on how the 26 sub-metrics are derived, how weights are assigned, or whether any post-hoc decisions were made during the 423-repository evaluation; these details are needed to evaluate the robustness of the reported correlation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the scope and limitations of our evaluation. We address the major comment point by point below and have revised the manuscript to improve transparency regarding the corpus construction and generalizability.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 423-repository ground-truth corpus was assembled specifically to span five failure modes. This selection process conditions both the outcome distribution and RRS score spread on failure diversity rather than drawing from an unbiased sample of GitHub research software. Consequently, the headline claim of near-zero RRS-binary success correlation requires additional support (sampling frame, inclusion criteria, or population comparison statistics) before the generalization to broader collections can be accepted.

    Authors: We agree that the 423-repository ground-truth corpus was deliberately assembled to span the five failure modes, which means it is not a random or unbiased sample from the broader population of GitHub research software repositories. This curation was intentional to allow for a comprehensive assessment of ReproScore's behavior across diverse reproducibility challenges, including both successful and failed executions. The near-zero correlation between RRS and binary success is an empirical observation specific to this corpus, which includes a balanced representation of failure modes to highlight the readiness-outcome gap. We do not claim that this correlation holds universally without additional validation on other datasets. To address the referee's concern, we will revise the Evaluation section to: explicitly detail the sampling frame and inclusion criteria used in constructing the corpus; include a dedicated limitations subsection discussing the implications for generalizability; and provide any available statistics comparing the corpus to broader populations where possible. These changes will clarify that the findings are conditioned on the diverse failure-mode coverage while strengthening the manuscript's transparency. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claim rests on external corpus evaluation

full rationale

The paper defines RRS via 26 independent sub-metrics across five categories and ROS via execution probes, then reports an observed near-zero correlation on a separately curated 423-repository ground-truth corpus. No parameter is fitted to the correlation itself, no self-citation supplies the central uniqueness or weighting, and the readiness-outcome separation is not derived from the measured correlation. The evaluation chain therefore remains self-contained against the external corpus and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the framework name itself; the 26 sub-metrics and five categories are presented as given without derivation details.

pith-pipeline@v0.9.0 · 5509 in / 1007 out tokens · 40016 ms · 2026-05-14T18:23:13.012395+00:00 · methodology

