
arxiv: 2606.16974 · v3 · pith:AD5DETZHnew · submitted 2026-06-15 · 💻 cs.AI

The Shift Toward Open and Reproducible AI Research

Pith reviewed 2026-06-29 05:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords reproducibility · AI research · documentation practices · code sharing · data sharing · open science · empirical analysis · conferences

The pith

AI papers sharing both code and data rose from 11% to 64% between 2014 and 2024, with estimated reproducibility rising in step from 28% to 64%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks documentation practices across every paper at five major AI conferences over a full decade to see whether the field has become more open. It finds steady gains in code and data release that began before any formal checklists existed. The authors then use an earlier measured link between documentation and reproducibility to estimate that reproducibility rates rose substantially as well, from 28% to 64%.

Core claim

Analysis of 56,800 papers shows that the percentage sharing both code and data grew from 11% in 2014 to 64% in 2024. Linking this trend to earlier measured reproducibility rates yields an estimated increase in reproducibility from 28% to 64% over the same period. The improvements predate the introduction of reproducibility checklists at the conferences.
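The estimation step described above can be read as a weighted average: each documentation category contributes its share of papers multiplied by a reproducibility rate borrowed from the prior study. The Python sketch below illustrates that reading under stated assumptions. The function name, the three category labels, and every number are hypothetical placeholders, not values from the paper, which tracks seven variables and may combine them differently.

```python
from typing import Mapping


def estimate_reproducibility(
    shares: Mapping[str, float],
    conditional_rates: Mapping[str, float],
) -> float:
    """Weighted average of per-category reproducibility rates.

    shares[c]            -- fraction of papers in documentation category c
    conditional_rates[c] -- reproducibility rate for category c, taken from
                            an external study and assumed stable over time
    """
    if set(shares) != set(conditional_rates):
        raise ValueError("shares and conditional_rates must cover the same categories")
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1, got {total}")
    return sum(shares[c] * conditional_rates[c] for c in shares)


# Placeholder inputs for illustration only; NOT taken from the paper.
example_shares = {"code_and_data": 0.5, "data_only": 0.3, "neither": 0.2}
example_rates = {"code_and_data": 0.8, "data_only": 0.4, "neither": 0.1}

estimate = estimate_reproducibility(example_shares, example_rates)
print(f"estimated reproducibility: {estimate:.2f}")
```

Under this reading, the yearly estimate moves only because the shares move; the conditional rates stay fixed, which is exactly the premise the review flags as load-bearing.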

What carries the argument

Measurement of seven reproducibility variables, chiefly the joint release of code and data, across all publications from five leading AI conferences.

If this is right

  • Reproducibility in published AI work has increased substantially over the decade.
  • The rise in open practices reflects a broader community shift rather than a direct effect of checklist policies.
  • Documentation metrics can serve as a practical ongoing proxy for reproducibility trends.
  • Similar documentation improvements may be visible in other fields that have moved toward open science.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the upward trend continues, independent verification of AI results will become more routine.
  • The same documentation-tracking method could be applied to measure change in other research communities.
  • Future work could test whether the gains have begun to level off after 2024.

Load-bearing premise

The correlation between documentation practices and actual reproducibility rates found in one earlier study remains stable enough to support direct estimates across ten years and five different conferences.

What would settle it

A direct reproducibility audit of random samples of papers from 2014 and 2024 that returns rates clearly different from the estimated 28% and 64%.
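One way to make "clearly different" concrete is to put a confidence interval around the audited rate and check whether the paper's estimate falls inside it. The sketch below uses a Wilson score interval, which is a choice made here and not a method the paper describes. The audit counts are hypothetical; only the 28% and 64% figures come from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def consistent_with_estimate(
    successes: int, n: int, estimated_rate: float, z: float = 1.96
) -> bool:
    """True if the estimated rate lies inside the audit's Wilson interval."""
    low, high = wilson_interval(successes, n, z)
    return low <= estimated_rate <= high


# Hypothetical audit: 30 of 100 sampled 2014 papers reproduced successfully.
audit_successes, audit_size = 30, 100
low, high = wilson_interval(audit_successes, audit_size)
print(f"audit interval: [{low:.3f}, {high:.3f}]")
print("consistent with 28%:", consistent_with_estimate(audit_successes, audit_size, 0.28))
print("consistent with 64%:", consistent_with_estimate(audit_successes, audit_size, 0.64))
```

A larger audit sample narrows the interval, so the sample size determines how small a gap between audited and estimated rates the test can detect.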

Original abstract

The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64%. Building on empirical reproducibility rates from a prior study, we estimate (inferred from documentation practices, not direct testing) that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines trends in reproducibility-related documentation practices across 56,800 papers published at five major AI conferences (2014–2024). It reports direct measurements showing that the share of papers releasing both code and data rose from 11% to 64%, with improvements in seven tracked variables. Building on empirical reproducibility rates taken from one prior study, the authors infer that overall reproducibility rose from 28% to 64% over the decade; they note that these trends largely predate the introduction of formal reproducibility checklists.

Significance. The large-scale, direct measurement of documentation-practice trends across a decade and multiple venues is a useful empirical contribution to the open-science literature in AI. If the mapping from documentation variables to actual reproducibility rates can be shown to be stable, the inferred reproducibility increase would provide a quantitative benchmark for the field's progress. The manuscript already presents the raw counts as measured rather than inferred.

major comments (1)
  1. [Abstract and reproducibility-inference section] The headline claim that reproducibility rose from 28% to 64% is obtained by applying conditional rates taken from a single external prior study to the observed documentation distributions. No sensitivity analysis, re-validation on any subset of the 56,800-paper corpus, or discussion of potential changes in confounders (evaluation practices, dataset scale, hardware) is reported. This step is load-bearing for the reproducibility conclusion.
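As a rough illustration of the sensitivity analysis this comment asks for, the sketch below shifts every borrowed conditional rate down and up by a fixed amount and reports the resulting band for the overall estimate. It assumes the estimate is a share-weighted average of per-category rates, which is an interpretation and not a formula confirmed by the manuscript. The category labels, shares, rates, and the perturbation size `delta` are all hypothetical.

```python
from typing import Mapping


def _clip(x: float) -> float:
    """Keep a perturbed rate inside the valid probability range [0, 1]."""
    return min(1.0, max(0.0, x))


def sensitivity_band(
    shares: Mapping[str, float],
    conditional_rates: Mapping[str, float],
    delta: float,
) -> tuple[float, float]:
    """Lowest and highest estimate when every rate shifts by at most delta.

    Because the estimate is linear in the rates with non-negative weights,
    the extremes occur when all rates move down (or up) together.
    """
    if delta < 0:
        raise ValueError("delta must be non-negative")
    low = sum(shares[c] * _clip(conditional_rates[c] - delta) for c in shares)
    high = sum(shares[c] * _clip(conditional_rates[c] + delta) for c in shares)
    return low, high


# Placeholder inputs for illustration only; NOT taken from the paper.
shares = {"code_and_data": 0.5, "data_only": 0.3, "neither": 0.2}
rates = {"code_and_data": 0.8, "data_only": 0.4, "neither": 0.1}

band_low, band_high = sensitivity_band(shares, rates, delta=0.05)
print(f"estimate band for delta=0.05: [{band_low:.2f}, {band_high:.2f}]")
```

A uniform shift is the simplest perturbation; a fuller analysis might instead resample the prior study's rates from their reported uncertainty, if that information is available.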
minor comments (2)
  1. [Methods] The seven reproducibility variables should be listed explicitly with their operational definitions and any inter-annotator agreement statistics in the methods section.
  2. [Data collection] Clarify whether the 56,800 papers constitute the full population or a sampled subset of the five conferences over the decade.
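For the inter-annotator agreement requested in the first minor comment, Cohen's kappa is one common statistic for two annotators labelling the same papers. The review does not say which statistic, if any, the authors used, so the sketch below is only an example of what could be reported; the labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = paper shares code, 0 = it does not.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates, as it would for code sharing in the early years of the corpus.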

Simulated Author's Rebuttal

1 response · 1 unresolved

We are grateful to the referee for their positive assessment of the empirical contribution and for highlighting the importance of the reproducibility inference. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract and reproducibility-inference section] The headline claim that reproducibility rose from 28% to 64% is obtained by applying conditional rates taken from a single external prior study to the observed documentation distributions. No sensitivity analysis, re-validation on any subset of the 56,800-paper corpus, or discussion of potential changes in confounders (evaluation practices, dataset scale, hardware) is reported. This step is load-bearing for the reproducibility conclusion.

    Authors: The inference is indeed derived from applying rates reported in a single prior study to our observed documentation distributions, as this remains the most comprehensive empirical source for such conditional probabilities. The manuscript explicitly states that the estimate is inferred from documentation practices rather than direct testing. We did not perform sensitivity analysis or re-validation because the prior study does not provide the necessary granular data for re-application to our corpus, and conducting direct reproducibility tests on even a subset of 56,800 papers is not practicable. However, we recognize the value of discussing potential confounders such as changes in evaluation practices, dataset scale, and hardware. We will revise the manuscript to add a dedicated paragraph in the limitations section addressing these issues and the assumptions of the inference method. This revision will not change the reported documentation trends or the headline inference but will provide additional context for readers.

    Revision: yes

standing simulated objections not resolved
  • Re-validation on any subset of the corpus, due to the lack of direct reproducibility test results for these papers.

Circularity Check

0 steps flagged

No significant circularity; the primary measurements are direct corpus counts, and the reproducibility estimate imports an external mapping.

full rationale

The paper directly counts seven reproducibility variables across 56,800 papers from five conferences (2014-2024), yielding observed trends such as code+data sharing rising from 11% to 64%. The reproducibility percentages (28% to 64%) are obtained by applying conditional rates taken from a single prior study rather than by any internal fit, self-definition, or renaming of the paper's own outputs. No equation or step reduces the claimed quantities to tautological inputs by construction, and the cited prior rates constitute external evidence that remains falsifiable outside this manuscript. The derivation chain therefore contains no load-bearing self-citation, ansatz smuggling, or fitted-input-as-prediction patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The reproducibility estimate depends on an external prior study's mapping between documentation and actual reproducibility; no new free parameters are introduced in the abstract itself.

axioms (1)
  • domain assumption: Documentation practices serve as a stable proxy for actual reproducibility rates across time and conferences.
    Invoked when converting observed documentation percentages into estimated reproducibility percentages using the prior study.

pith-pipeline@v0.9.1-grok · 5705 in / 1170 out tokens · 25287 ms · 2026-06-29T05:15:25.616065+00:00 · methodology

