SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem
Pith reviewed 2026-05-14 18:59 UTC · model grok-4.3
The pith
SemRepo creates a knowledge graph linking nearly 200,000 research GitHub repositories to publications, authors, and artifacts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemRepo is an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. It captures repository-level metadata such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts such as datasets and experiments are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, supporting provenance reconstruction and systematic identification of
What carries the argument
The RDF knowledge graph SemRepo that unifies GitHub repository metadata with links to author profiles, publications, and artifacts in other scholarly graphs.
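To make the idea of a cross-graph query concrete, here is a minimal sketch in plain Python that follows repository, publication, and artifact links over a handful of toy triples. Every identifier, prefix, and predicate name in it (`ex:hasContributor`, `ex:implementsPaper`, `ex:usesDataset`) is hypothetical and chosen for illustration; none is taken from the SemRepo schema, which this page does not reproduce.

```python
# Toy triples. All prefixes and predicate names are hypothetical and
# serve only to illustrate the repository -> paper -> dataset join.
TRIPLES = {
    ("repo:example-tool", "ex:hasContributor", "author:alice"),
    ("repo:example-tool", "ex:implementsPaper", "paper:p1"),
    ("paper:p1", "ex:usesDataset", "dataset:d1"),
    ("repo:other-tool", "ex:hasContributor", "author:bob"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return the sorted objects o with (subject, predicate, o) in the graph."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def provenance(repo, triples=TRIPLES):
    """Collect contributors, linked papers, and the datasets those papers use."""
    papers = objects(repo, "ex:implementsPaper", triples)
    datasets = sorted(
        {d for paper in papers for d in objects(paper, "ex:usesDataset", triples)}
    )
    return {
        "contributors": objects(repo, "ex:hasContributor", triples),
        "papers": papers,
        "datasets": datasets,
    }


if __name__ == "__main__":
    print(provenance("repo:example-tool"))
```

Against the real graph the same join would presumably be written as a single SPARQL query; the sketch only shows why having all three link types in one graph removes the need to merge separate sources by hand.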
If this is right
- Provenance reconstruction across repositories and publications becomes possible in one query.
- Systematic identification of reproducibility risks and software sustainability issues can be performed at scale.
- Analyses that combine repository metadata with publication and artifact data become feasible without merging separate data sources by hand.
- Large-scale studies of research software within the full scientific ecosystem can draw on unified data.
Where Pith is reading between the lines
- The graph could be used to build automated alerts for software that lacks sufficient documentation or maintenance signals.
- Citation networks for research software might be constructed by following the new links from code to papers.
- Long-term trends in language choice or collaboration patterns across scientific fields could be extracted from the combined data.
- Extensions to other domains such as humanities software or industrial code reuse could follow the same linking approach.
Load-bearing premise
The connections between GitHub repositories and the linked author profiles, publications, and artifacts must be accurate and complete enough for reliable provenance and risk analysis.
What would settle it
A random sample of 500 linked repository-publication pairs that reveals more than 20 percent incorrect or missing associations would falsify the claim that the graph reliably supports provenance reconstruction.
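The check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a procedure from the paper: the judgements are assumed to come from human annotators who mark each sampled link as correct or not, the sketch counts only links judged incorrect (detecting missing associations would need a separate sample of unlinked repositories), and the Wilson interval is an addition here to show how much uncertainty a sample of 500 carries.

```python
import math
import random


def sample_pairs(pairs, k=500, seed=0):
    """Draw a reproducible simple random sample of up to k linked pairs."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def error_rate(judgements):
    """judgements: list of booleans, True when the link was judged correct."""
    if not judgements:
        raise ValueError("no judgements supplied")
    return sum(1 for ok in judgements if not ok) / len(judgements)


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for the error proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def is_falsified(judgements, threshold=0.20):
    """True when strictly more than `threshold` of the sampled links are wrong."""
    return error_rate(judgements) > threshold


if __name__ == "__main__":
    judgements = [False] * 120 + [True] * 380  # hypothetical annotation result
    print(error_rate(judgements), is_falsified(judgements))
    print(wilson_interval(120, 500))
```

An observed error rate close to 20 percent would leave the verdict open, since the interval around it spans several percentage points at this sample size.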
Original abstract
We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SemRepo, an RDF knowledge graph with over 81 million triples describing nearly 200,000 GitHub repositories linked to scholarly resources. Repository metadata (contributors, issues, languages) is integrated with SemOpenAlex author profiles, LPWC publications, and MLSea-KG artifacts to support cross-platform queries for provenance reconstruction and reproducibility risk identification.
Significance. If the interlinks prove accurate, SemRepo would offer useful infrastructure for analyses spanning research software and its scholarly context, addressing fragmentation across platforms. The scale is notable, but the absence of any validation or error metrics for the linking steps leaves the asserted reliability for provenance and reproducibility tasks unsupported.
major comments (2)
- [Abstract] Abstract and construction description: the central utility claims ('reliable provenance reconstruction' and 'systematic identification of risks to research reproducibility') rest on the accuracy of interlinks (GitHub authors to SemOpenAlex, repositories to LPWC, artifacts to MLSea-KG), yet no precision/recall figures, gold-standard evaluation set, or manual validation sample are reported.
- [Linking process] Linking process section: no details are supplied on matching algorithms, similarity thresholds, or coverage statistics for the interlinks, so it is impossible to evaluate whether the 81 million triples support the claimed downstream analyses.
minor comments (2)
- Add a table or figure quantifying the number and coverage of links to each external KG (SemOpenAlex, LPWC, MLSea-KG).
- Clarify the exact RDF schema and namespace usage for the new SemRepo classes and properties.
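The coverage table requested in the first minor comment could be computed along the following lines. This is a hedged sketch: the input shape (a list of repository ids plus, per external graph, a list of `(repo_id, target_id)` link pairs) is an assumption made for illustration, and measuring SemOpenAlex coverage per repository rather than per author is likewise a simplification.

```python
def coverage(repos, links):
    """Summarise link counts and repository coverage per external graph.

    repos: iterable of repository ids.
    links: dict mapping a graph name to an iterable of (repo_id, target_id).
    """
    repo_set = set(repos)
    if not repo_set:
        raise ValueError("no repositories supplied")
    report = {}
    for graph, pairs in links.items():
        pairs = list(pairs)
        # A repository counts as covered if it has at least one link.
        linked = {repo for repo, _ in pairs if repo in repo_set}
        report[graph] = {
            "links": len(pairs),
            "linked_repos": len(linked),
            "coverage_pct": round(100 * len(linked) / len(repo_set), 1),
        }
    return report


if __name__ == "__main__":
    # Hypothetical toy data; the real figures are not reported in the paper.
    repos = ["r1", "r2", "r3", "r4"]
    links = {
        "SemOpenAlex": [("r1", "a1"), ("r1", "a2"), ("r2", "a3")],
        "LPWC": [("r3", "p1")],
        "MLSea-KG": [],
    }
    for graph, row in coverage(repos, links).items():
        print(graph, row)
```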
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify that the manuscript's utility claims depend on interlink quality and that the linking process requires more explicit documentation. We will revise the paper to address both points by adding quantitative validation and expanded technical details on matching methods, thresholds, and coverage. Below we respond to each major comment.
- Referee: [Abstract] Abstract and construction description: the central utility claims ('reliable provenance reconstruction' and 'systematic identification of risks to research reproducibility') rest on the accuracy of interlinks (GitHub authors to SemOpenAlex, repositories to LPWC, artifacts to MLSea-KG), yet no precision/recall figures, gold-standard evaluation set, or manual validation sample are reported.
  Authors: We agree that the abstract foregrounds downstream applications whose reliability hinges on interlink accuracy. The manuscript describes the integration at a high level but does not report quantitative evaluation. In the revision we will add a dedicated evaluation subsection that presents a manually curated gold-standard sample of 500 interlinks per linking type, together with precision, recall, and F1 scores. This will directly substantiate the claims for provenance reconstruction and reproducibility-risk identification.
  Revision: yes
- Referee: [Linking process] Linking process section: no details are supplied on matching algorithms, similarity thresholds, or coverage statistics for the interlinks, so it is impossible to evaluate whether the 81 million triples support the claimed downstream analyses.
  Authors: We acknowledge that the current description of the linking process is insufficiently detailed for independent assessment. The revised manuscript will expand the relevant section to specify the matching algorithms (string similarity for authors, DOI-based exact matching for publications, and artifact identifier alignment), the similarity thresholds applied (e.g., Levenshtein-based score > 0.8), and coverage statistics (percentage of repositories successfully linked to each external graph). These additions will allow readers to judge whether the resulting 81 million triples are adequate for the stated analyses.
  Revision: yes
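The two commitments in the simulated rebuttal, threshold-based string matching for authors and precision, recall, and F1 against a gold-standard sample, can be illustrated together. The sketch below is not the authors' method: the rebuttal gives "Levenshtein-based score > 0.8" only as an example, so the normalisation used here (one minus edit distance divided by the longer string's length, after lowercasing) and the best-candidate matching rule are assumptions.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_authors(github_names, profile_names, threshold=0.8):
    """Link each GitHub name to its best-scoring profile if score > threshold."""
    links = set()
    for name in github_names:
        best = max(profile_names, key=lambda p: similarity(name, p), default=None)
        if best is not None and similarity(name, best) > threshold:
            links.add((name, best))
    return links


def precision_recall_f1(predicted, gold):
    """Score predicted links against a manually curated gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical names, used only to exercise the functions.
    predicted = match_authors(["Jon Smith", "A. Turing"], ["John Smith", "Jane Doe"])
    gold = {("Jon Smith", "John Smith")}
    print(predicted, precision_recall_f1(predicted, gold))
```

Name-only matching of this kind cannot separate distinct people who share a name, which is one reason the referee's request for a manually validated sample matters.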
Circularity Check
No circularity in SemRepo KG construction
The paper describes procedural construction of an RDF knowledge graph by harvesting GitHub metadata and creating interlinks to existing external resources (SemOpenAlex, LPWC, MLSea-KG). No equations, predictions, fitted parameters, uniqueness theorems, or derivations appear anywhere in the manuscript. The central claims concern data integration and query capability rather than any result that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any asserted prediction or theorem. This is infrastructure work whose validity rests on external data quality and linking accuracy, not on any self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: RDF is a suitable and sufficient format for representing and querying linked metadata across research software and scholarly artifacts.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SemRepo... RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories... interlinks... SemOpenAlex, LPWC, MLSea-KG
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.