arxiv: 2605.13310 · v1 · submitted 2026-05-13 · 💻 cs.DL · cs.DB· cs.IR

Recognition: 1 theorem link

· Lean Theorem

SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem

Abdul Rafay , Yuni Susanti , David Lamprecht , Michael F\"arber

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:59 UTC · model grok-4.3

classification 💻 cs.DL cs.DBcs.IR

keywords knowledge graphresearch softwareGitHub repositoriesscholarly ecosystemreproducibilityRDFprovenancesoftware sustainability

0 comments

The pith

SemRepo creates a knowledge graph linking nearly 200,000 research GitHub repositories to publications, authors, and artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SemRepo as an RDF knowledge graph containing over 81 million triples on almost 200,000 GitHub repositories used in science. It records repository details like contributors, issues, and languages while connecting authors to SemOpenAlex profiles, repositories to LPWC publications, and artifacts to MLSea-KG. This setup allows cross-platform queries that combine code history with scholarly records in one place. The result supports tracing software origins and spotting threats to reproducibility or long-term maintenance. A single graph of this kind makes ecosystem-level studies of research software feasible where fragmented sources previously blocked them.

Core claim

SemRepo is an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. It captures repository-level metadata such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts such as datasets and experiments are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, supporting provenance reconstruction and systematic identification of

What carries the argument

The RDF knowledge graph SemRepo that unifies GitHub repository metadata with links to author profiles, publications, and artifacts in other scholarly graphs.

If this is right

Provenance reconstruction across repositories and publications becomes possible in one query.
Systematic identification of reproducibility risks and software sustainability issues can be performed at scale.
Analyses that combine repository metadata with publication and artifact data are now feasible without separate data sources.
Large-scale studies of research software within the full scientific ecosystem can draw on unified data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graph could be used to build automated alerts for software that lacks sufficient documentation or maintenance signals.
Citation networks for research software might be constructed by following the new links from code to papers.
Long-term trends in language choice or collaboration patterns across scientific fields could be extracted from the combined data.
Extensions to other domains such as humanities software or industrial code reuse could follow the same linking approach.

Load-bearing premise

The connections between GitHub repositories and the linked author profiles, publications, and artifacts must be accurate and complete enough for reliable provenance and risk analysis.

What would settle it

A random sample of 500 linked repository-publication pairs that reveals more than 20 percent incorrect or missing associations would falsify the claim that the graph reliably supports provenance reconstruction.

Figures

Figures reproduced from arXiv: 2605.13310 by Abdul Rafay, David Lamprecht, Michael F\"arber, Yuni Susanti.

**Figure 2.** Figure 2: Exploratory SPARQL Analysis on SemRepo closed and 729,917 remain open, implying that roughly 72% of issues have been resolved. Each repository is associated with approximately 2.06 programming languages, reflecting the multi-language nature of modern software systems. Finally, exploratory SPARQL analyses ( [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Left: Distribution of reproducibility risk across 20,000 SemRepo-linked repositories, showing that nearly half are high-risk. Right: Relationship between repository popularity and reproducibility. While more popular repositories tend to exhibit slightly higher closure rates, the correlation remains weak. To better understand these risks, we compare high- and low-risk repositories. High-risk repositories sh… view at source ↗

read the original abstract

We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SemRepo links 200k research GitHub repos into a single RDF graph with SemOpenAlex, LPWC, and MLSea-KG, but reports no accuracy checks on those links.

read the letter

The main takeaway is that this paper delivers a new knowledge graph called SemRepo with 81 million triples covering nearly 200,000 GitHub repositories tied to scientific work. It adds repo metadata such as contributors, issues, and languages, then connects authors to SemOpenAlex, repositories to LPWC publications, and artifacts to MLSea-KG. The result is one place to run queries that cross software and papers, which are normally scattered across platforms. That specific scale of integration looks new compared with prior scholarly graphs mentioned in the abstract. If the connections hold, it genuinely supports provenance tracking and reproducibility risk spotting at a level that isolated resources cannot match. The paper states the intended uses clearly and keeps the focus on practical infrastructure rather than overclaiming theory. The soft spot is the complete absence of validation data. The abstract asserts that the interlinks enable reliable analyses, yet it gives no construction steps, no precision or recall numbers, no gold-standard sample, and no error rates. Without those, the central utility claim rests on untested assumptions about link quality. Construction details are also thin, so it is hard to judge how the graph was built or whether others could reproduce it. This work is for researchers in digital libraries, research software engineering, and reproducibility studies. Readers who need large unified datasets for cross-platform queries will find the resource relevant once it is released and checked. It deserves a serious referee because the scale and integration goal address a real fragmentation problem in the field. Reviewers can ask for the missing validation metrics and access information without rejecting the effort outright. I would send it to peer review with a request to add error analysis and pipeline details.

Referee Report

2 major / 2 minor

Summary. The paper presents SemRepo, an RDF knowledge graph with over 81 million triples describing nearly 200,000 GitHub repositories linked to scholarly resources. Repository metadata (contributors, issues, languages) is integrated with SemOpenAlex author profiles, LPWC publications, and MLSea-KG artifacts to support cross-platform queries for provenance reconstruction and reproducibility risk identification.

Significance. If the interlinks prove accurate, SemRepo would offer useful infrastructure for analyses spanning research software and its scholarly context, addressing fragmentation across platforms. The scale is notable, but the absence of any validation or error metrics for the linking steps leaves the asserted reliability for provenance and reproducibility tasks unsupported.

major comments (2)

[Abstract] Abstract and construction description: the central utility claims ('reliable provenance reconstruction' and 'systematic identification of risks to research reproducibility') rest on the accuracy of interlinks (GitHub authors to SemOpenAlex, repositories to LPWC, artifacts to MLSea-KG), yet no precision/recall figures, gold-standard evaluation set, or manual validation sample are reported.
[Linking process] Linking process section: no details are supplied on matching algorithms, similarity thresholds, or coverage statistics for the interlinks, so it is impossible to evaluate whether the 81 million triples support the claimed downstream analyses.

minor comments (2)

Add a table or figure quantifying the number and coverage of links to each external KG (SemOpenAlex, LPWC, MLSea-KG).
Clarify the exact RDF schema and namespace usage for the new SemRepo classes and properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify that the manuscript's utility claims depend on interlink quality and that the linking process requires more explicit documentation. We will revise the paper to address both points by adding quantitative validation and expanded technical details on matching methods, thresholds, and coverage. Below we respond to each major comment.

read point-by-point responses

Referee: [Abstract] Abstract and construction description: the central utility claims ('reliable provenance reconstruction' and 'systematic identification of risks to research reproducibility') rest on the accuracy of interlinks (GitHub authors to SemOpenAlex, repositories to LPWC, artifacts to MLSea-KG), yet no precision/recall figures, gold-standard evaluation set, or manual validation sample are reported.

Authors: We agree that the abstract foregrounds downstream applications whose reliability hinges on interlink accuracy. The manuscript describes the integration at a high level but does not report quantitative evaluation. In the revision we will add a dedicated evaluation subsection that presents a manually curated gold-standard sample of 500 interlinks per linking type, together with precision, recall, and F1 scores. This will directly substantiate the claims for provenance reconstruction and reproducibility-risk identification. revision: yes
Referee: [Linking process] Linking process section: no details are supplied on matching algorithms, similarity thresholds, or coverage statistics for the interlinks, so it is impossible to evaluate whether the 81 million triples support the claimed downstream analyses.

Authors: We acknowledge that the current description of the linking process is insufficiently detailed for independent assessment. The revised manuscript will expand the relevant section to specify the matching algorithms (string similarity for authors, DOI-based exact matching for publications, and artifact identifier alignment), the similarity thresholds applied (e.g., Levenshtein-based score > 0.8), and coverage statistics (percentage of repositories successfully linked to each external graph). These additions will allow readers to judge whether the resulting 81 million triples are adequate for the stated analyses. revision: yes

Circularity Check

0 steps flagged

No circularity in SemRepo KG construction

full rationale

The paper describes procedural construction of an RDF knowledge graph by harvesting GitHub metadata and creating interlinks to existing external resources (SemOpenAlex, LPWC, MLSea-KG). No equations, predictions, fitted parameters, uniqueness theorems, or derivations appear anywhere in the manuscript. The central claims concern data integration and query capability rather than any result that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any asserted prediction or theorem. This is infrastructure work whose validity rests on external data quality and linking accuracy, not on any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on the assumption that RDF-based integration of existing scholarly graphs yields a usable resource; no free parameters are fitted and no new physical or mathematical entities are postulated.

axioms (1)

domain assumption RDF is a suitable and sufficient format for representing and querying linked metadata across research software and scholarly artifacts.
The paper adopts RDF without discussing alternatives or limitations of the format for this use case.

pith-pipeline@v0.9.0 · 5474 in / 1169 out tokens · 48476 ms · 2026-05-14T18:59:21.983377+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SemRepo... RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories... interlinks... SemOpenAlex, LPWC, MLSea-KG

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 11 canonical work pages

[1]

Bibliothek Forschung und Praxis44(3), 516–529 (2020)

Auer, S., Oelen, A., Haris, M., Stocker, M., D’Souza, J., Farfar, K.E., Vogt, L., Prinz, M., Wiens, V., Jaradeh, M.Y.: Improving access to scientific literature with knowledge graphs. Bibliothek Forschung und Praxis44(3), 516–529 (2020)

2020
[2]

In: 2018 IEEE Inter- national Conference on Software Maintenance and Evolution (ICSME)

Baltes, S., Diehl, S.: Towards a theory of developer expertise. In: 2018 IEEE Inter- national Conference on Software Maintenance and Evolution (ICSME). pp. 74–84 (2018).https://doi.org/10.1109/ICSME.2018.00018

work page doi:10.1109/icsme.2018.00018 2018
[3]

In: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME)

Borges, H., Hora, A., Valente, M.T.O.: Understanding the factors that impact the popularity of github repositories. In: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). pp. 334–344 (2016).https://doi. org/10.1109/ICSME.2016.31

work page doi:10.1109/icsme.2016.31 2016
[4]

PLOS ONE11(4), e0152976 (2016).https://doi.org/10.1371/ journal.pone.0152976

Chełkowski, T., Gloor, P.A., Jemielniak, D.: Inequalities in open source soft- ware development: Analysis of contributor’s commits in apache software founda- tion projects. PLOS ONE11(4), e0152976 (2016).https://doi.org/10.1371/ journal.pone.0152976

2016
[5]

Kyoto, Japan (Sep 2017),https://www.softwareheritage

Cosmo, R.D., Zacchiroli, S.: Software heritage: Why and how to preserve software sourcecode.In:Proceedingsofthe14thInternationalConferenceonDigitalPreser- vation (iPRES 2017). Kyoto, Japan (Sep 2017),https://www.softwareheritage. org/wp-content/uploads/2020/01/ipres-2017-swh.pdf

2017
[6]

In: 2015 IEEE/ACM 12th Working Confer- ence on Mining Software Repositories

Crowston, K., Howison, J.: Leveraging open source communities to support evidence-based software engineering. In: 2015 IEEE/ACM 12th Working Confer- ence on Mining Software Repositories. pp. 483–486 (2015).https://doi.org/10. 1109/MSR.2015.70

2015
[7]

In: European Semantic Web Conference

Dasoulas, I., Yang, D., Dimou, A.: Mlsea: a semantic layer for discoverable machine learning. In: European Semantic Web Conference. pp. 178–198. Springer (2024)

2024
[8]

In: European Semantic Web Conference.pp

Dasoulas, I., Yang, D., Dimou, A.: Mlseascape: Search over machine learning meta- data empoweredbyknowledgegraphs. In: European Semantic Web Conference.pp. 193–196. Springer (2024)

2024
[9]

Destefanis, G., Bartolucci, S., Graziotin, D., Neykova, R., Ortu, M.: Introducing repository stability (2025),https://arxiv.org/abs/2504.00542

work page arXiv 2025
[10]

In: The Semantic Web–ISWC 2019: 18th Interna- tional Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18

Färber, M.: The microsoft academic knowledge graph: A linked data source with 8 billion triples of scholarly data. In: The Semantic Web–ISWC 2019: 18th Interna- tional Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18. pp. 113–129. Springer (2019)

2019
[11]

arXiv preprint arXiv:2310.20475 (2023)

Färber, M., Lamprecht, D.: Linked papers with code: the latest in machine learning as an rdf knowledge graph. arXiv preprint arXiv:2310.20475 (2023)

work page arXiv 2023
[12]

In: International Semantic Web Con- ference

Färber, M., Lamprecht, D., Krause, J., Aung, L., Haase, P.: Semopenalex: the scientific landscape in 26 billion rdf triples. In: International Semantic Web Con- ference. pp. 94–112. Springer (2023)

2023
[13]

In: 2017 32nd IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE)

Fu, B., Zhang, M., Shang, L., Ma, J.: Devrank: Mining influential developers in github. In: 2017 32nd IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE). pp. 464–474. IEEE (2017).https://doi.org/10.1109/ ASE.2017.8115655

work page arXiv 2017
[14]

Logic Jouornal of the IGPL29(4), 697–717 (2021)

Giunti, M., Sergioli, G., Vivanet, G., Pinna, S.: Representing n-ary relations in the Semantic Web. Logic Jouornal of the IGPL29(4), 697–717 (2021)

2021
[15]

In: Proceedings of the 9th Working Conference on Mining Software Repositories

Gousios, G., Spinellis, D.: Ghtorrent: Github’s data from a firehose. In: Proceedings of the 9th Working Conference on Mining Software Repositories. pp. 12–21 (2012). https://doi.org/10.1109/MSR.2012.6224294

work page doi:10.1109/msr.2012.6224294 2012
[16]

In: Proceedings of the 10th International Conference on Knowl- edge Capture (K-CAP)

Jaradeh, M.Y., Oelen, A., Farfar, K., Prinz, M., D’Souza, J., Stocker, M., Auer, S.: Open research knowledge graph: Next generation infrastructure for semantic schol- arly knowledge. In: Proceedings of the 10th International Conference on Knowl- edge Capture (K-CAP). pp. 243–246. ACM (2019).https://doi.org/10.1145/ 3360901.3364435

work page arXiv 2019
[17]

Jones, M.B., Boettiger, C., Mayes, A.C., Smith, A., Slaughter, P., Niemeyer, K.E., Gil, Y., Fenner, M., Nowak, K., Hahnel, M., Coy, L., Allen, A., Crosas, M., Sands, A., Hong, N.C., Cruse, P., Katz, D.S., Goble, C.: Codemeta: An exchange schema for software metadata (2017).https://doi.org/10.5063/schema/codemeta-2.0, version 2.0

work page doi:10.5063/schema/codemeta-2.0 2017
[18]

In: The Semantic Web–ISWC 2019

Kubitza, D.O., Böckmann, M., Graux, D.: Semangit: A linked dataset from git. In: The Semantic Web–ISWC 2019. pp. 215–228. Springer (2019)

2019
[19]

Empirical Software Engineering31(105) (2026).https://doi.org/10

Linåker, J., Olsson, T., Papatheocharous, E.: Assessing open source software health in organizations’ intake processes: A qualitative study on the practitioners’ per- spective. Empirical Software Engineering31(105) (2026).https://doi.org/10. 1007/s10664-026-10846-y

2026
[20]

Manghi,P.,Mannocci,A.,LaBruzzo,S.,Atzori,C.,Bardi,A.,Artini,M.,Principe, P., Schirrwagen, J.: The openaire research graph (2021)

2021
[21]

Papers with Code: Papers with code.https://paperswithcode.com(2019), ac- cessed: 2026-04-23

2019
[22]

IEEE Software34(3), 28–35 (2017).https://doi.org/10.1109/MS.2017.80

Pautasso, C., Alonso, G., Nussbaumer, B.: Software engineering research for the world wide web: methods, tools, and opportunities. IEEE Software34(3), 28–35 (2017).https://doi.org/10.1109/MS.2017.80

work page doi:10.1109/ms.2017.80 2017
[23]

Quantitative Science Studies1(1), 428–444 (2020)

Peroni, S., Shotton, D.: Opencitations, an infrastructure organization for open scholarship. Quantitative Science Studies1(1), 428–444 (2020)

2020
[24]

In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

Ray, B., Posnett, D., Filkov, V., Devanbu, P.: A large-scale study of programming languages and code quality in github. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 155–165 (2014).https://doi.org/10.1145/2635868.2635922

work page doi:10.1145/2635868.2635922 2014
[25]

In: 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)

Venigalla, A.S.M., Ali, M.S., Manjunath, N., Chimalakonda, S.: Rcgraph - a tool to integrate readme and commits through temporal knowledge graphs. In: 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC). pp. 30–34 (2023).https://doi.org/10.1109/ICPC58990.2023.00014

work page doi:10.1109/icpc58990.2023.00014 2023
[26]

Elife9, e52614 (2020)

Waagmeester, A., Stupp, G., Burgstaller-Muehlbacher, S., Good, B.M., Griffith, M., Griffith, O.L., Hanspers, K., Hermjakob, H., Hudson, T.S., Hybiske, K., et al.: Wikidata as a knowledge graph for the life sciences. Elife9, e52614 (2020)

2020
[27]

In: Proceedings of the 27th ACM international conference on information and knowledge management

Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., Zhang, W., Wang, X.: Acekg: A large-scale knowledge graph for academic data mining. In: Proceedings of the 27th ACM international conference on information and knowledge management. pp. 1487–1490 (2018)

2018
[28]

World Wide Web Consortium: Defining N-ary Relations on the Semantic Web (2006),https://www.w3.org/TR/swbp-n-aryRelations/

2006