arxiv: 2606.28120 · v1 · pith:UJUFOYP7new · submitted 2026-06-26 · 💻 cs.DL · cs.SE · cs.SI

The Reciprocal Impact of Science and Software: A Cross-Corpus Analysis of How Research Shapes Software and Software Enables Research

Pith reviewed 2026-06-29 01:43 UTC · model grok-4.3

classification 💻 cs.DL · cs.SE · cs.SI
keywords cross-corpus analysis · software reuse · scientific citations · reproducibility tools · machine learning infrastructure · impact measurement · dependency graphs · version control history

The pith

The measured correlation between software reuse and scientific citations reverses sign depending on how the two are linked.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects a near-complete archive of public version-control history to two major scientific literature databases through a typed graph of 69.8 million edges. Anchoring on 18,247 science-related repositories, it examines how papers shape code and how code shapes papers. Science reaches software primarily through reproducibility and packaging tools plus sequence-analysis packages, while software reaches science primarily through machine-learning and data-science libraries. Direct mentions of repositories in papers prove too sparse to rank impact, so dependency reuse is used as a proxy. That proxy correlates only weakly with citation counts, and the sign of the correlation itself changes depending on whether citations are taken from papers that name the repository or from DOIs the repository declares for itself.

Core claim

The two directions of influence illuminate different, complementary strata: literature reaches software mainly via a reproducibility and packaging layer and sequence-analysis tools, whereas software reaches science mainly via a largely invisible machine-learning and data-science infrastructure tier. The direct paper-names-software channel is too sparse to support ranking. Dependency reuse as a proxy shows at most weak coupling to citation count and stars. The reuse-citation correlation flips sign and statistical significance across two reasonable pairing methods, with n=137 yielding rho=0.05 (CI straddling zero) and n=1,067 yielding rho=0.13 (CI [0.07,0.19]).
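As a rough sanity check on the two intervals quoted above, the sketch below applies the standard Fisher z-transform approximation to the reported rho and n values. This is an assumption for illustration: the summary does not say how the paper computed its confidence intervals, and the function name `fisher_ci` is hypothetical.

```python
import math
from typing import Tuple


def fisher_ci(rho: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% CI for a correlation via the Fisher z-transform.

    Assumption: standard error 1/sqrt(n - 3). This is a common approximation,
    not necessarily the method used in the paper.
    """
    z = math.atanh(rho)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# rho and n exactly as printed in the summary above.
named_lo, named_hi = fisher_ci(0.05, 137)
declared_lo, declared_hi = fisher_ci(0.13, 1067)

print(f"papers that name the repo: [{named_lo:.2f}, {named_hi:.2f}]")
print(f"self-declared DOIs:        [{declared_lo:.2f}, {declared_hi:.2f}]")
```

Under this approximation the n=137 interval contains zero and the n=1,067 interval rounds to [0.07, 0.19], which is consistent with the figures quoted above.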

What carries the argument

A typed cross-corpus graph of 69.8M edges over eight relation types that links World of Code version histories to Semantic Scholar and OpenAlex records, anchored on 18,247 curated science repositories.
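To make "typed graph" concrete, here is a minimal in-memory sketch of edges tagged with a relation type. It is not the paper's schema or storage layer: the class names, identifier formats, and sample edges are all hypothetical, and only the six relation names listed in the abstract are included (the paper reports eight types in total).

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # Six relation names taken from the abstract; the remaining two of the
    # eight reported types are left out rather than guessed.
    PAPER_MENTIONS_SOFTWARE = "paper_mentions_software"
    SOFTWARE_CITES_PAPER = "software_cites_paper"
    SOFTWARE_DEPENDS_ON = "software_depends_on"
    AUTHORSHIP = "authorship"
    AFFILIATION = "affiliation"
    IDENTITY_BRIDGE = "identity_bridge"


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: Relation


class TypedGraph:
    """Adjacency lists keyed by (source node, relation type)."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, edge):
        self._out[(edge.src, edge.relation)].append(edge.dst)

    def neighbors(self, src, relation):
        return list(self._out.get((src, relation), []))


# Hypothetical identifiers, for illustration only.
g = TypedGraph()
g.add(Edge("repo:example/pipeline", "repo:example/plotlib", Relation.SOFTWARE_DEPENDS_ON))
g.add(Edge("repo:example/pipeline", "doi:10.0000/example", Relation.SOFTWARE_CITES_PAPER))
g.add(Edge("paper:P1", "repo:example/pipeline", Relation.PAPER_MENTIONS_SOFTWARE))

print(g.neighbors("repo:example/pipeline", Relation.SOFTWARE_DEPENDS_ON))
```

Keying adjacency by relation type keeps the two directions of influence separable: mentions run from papers to repositories, while declared citations and dependencies run out of repositories.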

If this is right

  • Science shapes software most visibly through reproducibility frameworks and packaging systems rather than through direct algorithmic contributions.
  • Software shapes science most visibly through data-science and machine-learning libraries that papers rarely name explicitly.
  • Dependency reuse can stand in for direct citation counts but only as a weak proxy.
  • Any headline claim about the strength or direction of science-software coupling must be tested against multiple pairing methods because the sign is not robust.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Impact studies that rely on a single linking rule should report results from at least one alternative rule to demonstrate stability.
  • The observed sparsity of explicit mentions suggests that better named-entity recognition or mandatory software citation standards could change the measured strata.
  • Separate metrics may be needed for the reproducibility layer and the machine-learning infrastructure layer rather than a single aggregate score.

Load-bearing premise

The typed linkages between papers and repositories, especially mentions and declared citations, are complete and unbiased enough to reveal the main strata of influence.

What would settle it

Recompute the reuse-citation Spearman correlations after adding a third independent matching rule, such as full-text search for repository names inside every paper, and check whether the sign stays the same across all three rules.
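The check described above can be sketched as follows: compute Spearman's rho separately for each pairing rule and test whether all signs agree. The toy numbers are invented purely to exercise the code and are not data from the paper; the rule names and helper functions are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def sign_is_stable(pairings):
    """Return per-rule rho values and whether they all share one sign."""
    rhos = {name: spearman(reuse, cites) for name, (reuse, cites) in pairings.items()}
    signs = {0 if r == 0 else math.copysign(1, r) for r in rhos.values()}
    return rhos, len(signs) == 1


# Invented toy data: (dependency reuse, citation count) per repository.
pairings = {
    "named_in_papers": ([1, 2, 3, 4, 5], [5, 3, 4, 1, 2]),
    "declared_dois": ([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]),
    "full_text_search": ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),
}
rhos, stable = sign_is_stable(pairings)
print(rhos, stable)
```

A sign check alone ignores uncertainty; in practice each rho would also be reported with its sample size and confidence interval, as the paper does for its two pairing rules.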

Original abstract

Software and scientific knowledge co-evolve, yet they are catalogued in separate corpora that rarely speak to one another. We bridge them at global scale by linking World of Code (a near-complete mirror of public version-control history) to Semantic Scholar and OpenAlex through a typed cross-corpus graph of 69.8M edges over eight relation types (paper-to-software mentions, software-to-paper citations, software dependencies, authorship, affiliation, and identity bridges). Anchoring on 18,247 curated science repositories, we ask two reciprocal questions: what is the impact of science on software, and of software on science? To test whether this Science-Software Supply Chain (S3C) view is feasible, we run basic investigations rather than claim a definitive measurement. The two directions appear to illuminate different, complementary strata: the literature's reach into software is dominated by a reproducibility and packaging layer (nf-core, Nextflow, Bioconda) and sequence-analysis tools, whereas software's reach back into science is proxied by a largely invisible machine-learning and data-science infrastructure tier (PyTorch, seaborn, NLTK). The direct paper-names-software channel is too sparse to rank: a human-curated gold benchmark links none of its 65 in-scope cases. Dependency reuse stands in as a proxy and is at most weakly coupled to citation count and to stars (Spearman rho=0.36). Our most cautionary finding is about measurement itself: the reuse-citation coupling flips sign and confidence across two reasonable ways of pairing a repository with a citation count, through papers that name it (n=137, rho=0.05, CI straddling zero) versus DOIs a repository declares for itself (n=1,067, rho=0.13, CI [0.07,0.19]). With linkage this sparse, the sign of a headline correlation depends on which gap one tolerates, so we report both and refrain from a strong decoupling claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript constructs a typed cross-corpus graph of 69.8M edges linking World of Code to Semantic Scholar and OpenAlex across eight relation types. Anchoring on 18,247 curated science repositories, it performs basic exploratory investigations into reciprocal impacts rather than definitive measurements. It reports that science-to-software influence is dominated by reproducibility/packaging layers (nf-core, Nextflow, Bioconda) and sequence-analysis tools, while software-to-science influence is proxied by ML/data-science infrastructure (PyTorch, seaborn, NLTK). Direct paper-to-software mentions are sparse (zero matches in a human-curated 65-case gold benchmark), dependency reuse is at most weakly coupled to citations/stars (rho=0.36), and the reuse-citation correlation flips sign and confidence depending on pairing method (named papers: n=137, rho=0.05, CI straddling zero; declared DOIs: n=1,067, rho=0.13, CI [0.07,0.19]).

Significance. If the linkages are representative within the acknowledged sparsity, the work demonstrates the feasibility of large-scale cross-corpus analysis for science-software co-evolution and underscores measurement sensitivity in such settings. Explicit strengths include the human-curated gold benchmark, direct reporting of both pairing methods with their differing outcomes and CIs, and consistent framing as basic investigations rather than strong claims.

minor comments (2)
  1. [Abstract] The sample sizes n=137 and n=1,067 for the two correlation analyses are reported without a brief description of how the subsets were extracted from the full 18,247-repository anchor set; adding one sentence would improve reproducibility of the comparison.
  2. The manuscript could add a short dedicated limitations subsection (perhaps after the methods) that consolidates the acknowledged sparsity of direct linkages and the proxy nature of dependency reuse, even though these points are already stated in the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of the manuscript. The recommendation for minor revision is noted; we will address any editorial or presentational suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an exploratory analysis based on direct empirical counts, Spearman correlations, and a typed cross-corpus graph constructed from external sources (World of Code, Semantic Scholar, OpenAlex). No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. All reported findings (strata of influence, correlation comparisons, sparsity observations) rest on the constructed linkages and external data without reduction to inputs by construction or load-bearing self-citation chains. The work explicitly frames itself as feasibility tests and reports both pairing methods with their differing results, confirming the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central observations rest on the assumption that the constructed graph and curated anchor set allow meaningful basic investigations into influence strata; no free parameters are fitted beyond the reported correlations, and no new entities are postulated.

axioms (1)
  • domain assumption: The 18,247 curated science repositories and the typed linkages provide a representative enough sample to identify the dominant strata of science-software influence.
    The analysis anchors all questions on these repositories and the 69.8M-edge graph.

pith-pipeline@v0.9.1-grok · 5906 in / 1343 out tokens · 72237 ms · 2026-06-29T01:43:30.676520+00:00 · methodology
