Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions
Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3
The pith
French doctoral theses from 1985 to 2025 are now available as a linked dataset with structured records on individuals and institutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By aggregating heterogeneous national sources, correcting inconsistent identifiers, enriching person and institution records, and constructing derived variables for academic careers, jury participation, and institutional affiliations, the authors produce a linked dataset that provides structured information at the thesis, individual, and institutional levels. Additional identifiers from major repositories are integrated to support linkage with external sources and future extensions.
What carries the argument
The data production pipeline that aggregates sources, corrects identifiers, enriches records, and derives new variables describing careers, juries, and affiliations.
If this is right
- Structured data at the thesis, individual, and institutional levels enables descriptive analyses of thesis characteristics and institutional affiliations.
- The dataset supports relational analyses of academic networks, supervision practices, and jury composition.
- It facilitates longitudinal studies of how research communities evolve over time.
- The added identifiers allow linkage with external academic repositories and library catalogues for broader research.
- The paper documents data quality issues and limitations to guide responsible reuse and extension by other researchers.
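As a rough illustration of the relational analyses listed above, the sketch below turns jury participation into a weighted co-participation edge list. The record layout (a thesis id mapped to a list of jury member ids) and every identifier in it are hypothetical; they are not the dataset's actual schema.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: thesis id -> jury member ids (not the real schema).
juries = {
    "T1": ["p1", "p2", "p3"],
    "T2": ["p2", "p3"],
    "T3": ["p1", "p4"],
}

def co_jury_edges(juries):
    """Count how often each pair of people sat on the same jury."""
    edges = Counter()
    for members in juries.values():
        # sorted(set(...)) drops duplicates and gives each pair a canonical order
        for a, b in combinations(sorted(set(members)), 2):
            edges[(a, b)] += 1
    return edges

edges = co_jury_edges(juries)
```

The resulting counter can be fed to any graph library as a weighted edge list; the pair weight is simply the number of shared juries.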
Where Pith is reading between the lines
- The dataset could serve as a template for building comparable linked collections in other countries to enable cross-national comparisons of doctoral education.
- Researchers could examine whether institutional affiliations correlate with patterns in jury composition or career trajectories using the derived variables.
- Extensions that add post-PhD publication or employment data would allow studies of long-term research impact and mobility.
Load-bearing premise
That aggregation of heterogeneous national sources, correction of identifiers, and enrichment with external databases produces a sufficiently complete and accurate linked dataset without introducing substantial linking errors or selection biases.
What would settle it
An independent verification that finds a high rate of mismatched author or institution identifiers, or large gaps in coverage for specific years or fields, would show the dataset is not sufficiently accurate for reliable analyses.
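One way such a coverage check might look is sketched below. It assumes per-year thesis counts from the dataset and from an external benchmark are available as plain dictionaries; the function name, the 10% tolerance, and all numbers are made up for illustration.

```python
def coverage_gaps(dataset_counts, official_counts, tolerance=0.10):
    """Return years whose dataset/benchmark ratio deviates from 1 by more than `tolerance`."""
    gaps = {}
    for year, official in official_counts.items():
        if official <= 0:
            continue  # no usable benchmark for this year
        observed = dataset_counts.get(year, 0)
        ratio = observed / official
        if abs(ratio - 1.0) > tolerance:
            gaps[year] = round(ratio, 3)
    return gaps

# Made-up numbers for illustration only.
dataset_counts = {2000: 9500, 2001: 7000, 2002: 10100}
official_counts = {2000: 10000, 2001: 10000, 2002: 10000}
gaps = coverage_gaps(dataset_counts, official_counts)
```

The same comparison could be repeated per field or per institution, provided a benchmark exists at that granularity.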
Original abstract
This paper presents a comprehensive dataset of doctoral theses defended in France between 1985 and 2025, constructed from multiple national academic metadata sources. The dataset is primarily based on data from the French national thesis platform and is enriched using additional authority and bibliographic databases to improve data quality, completeness, and interoperability. The data production pipeline includes the aggregation of heterogeneous sources, the correction of inconsistent identifiers, the enrichment of person and institution records, and the construction of derived variables describing academic careers, jury participation, institutional affiliations, and thesis characteristics. Additional identifiers from major academic repositories and library catalogues are integrated to facilitate linkage with external data sources and future dataset extensions. The resulting dataset provides structured information at the thesis, individual, and institutional levels, enabling both descriptive and relational analyses. This resource is particularly suited for research on doctoral education, academic networks, supervision practices, jury composition, institutional collaboration, and the evolution of research communities over time. The paper documents the data sources, processing pipeline, feature construction, data quality issues, and limitations, with the objective of facilitating reuse of the dataset by other researchers and supporting future extensions and longitudinal analyses of the academic system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the construction of a linked dataset of doctoral theses defended in France from 1985 to 2025. It aggregates heterogeneous national metadata sources (primarily the French national thesis platform), corrects inconsistent identifiers, enriches person and institution records using external authority and bibliographic databases, and derives variables on academic careers, jury participation, institutional affiliations, and thesis characteristics. Multiple external identifiers are added to support linkage and reuse. The resulting resource is positioned to enable descriptive and relational analyses of doctoral education, supervision, networks, and institutional collaboration.
Significance. If the dataset achieves the claimed levels of completeness and accuracy, it would constitute a significant resource for research on higher education systems, academic mobility, jury composition, and the structure of research communities in France over four decades. The emphasis on interoperability through additional identifiers and the documentation of the full pipeline are strengths that would support reproducibility and extensions by other researchers.
major comments (1)
- [Abstract] Abstract: the description of the aggregation, identifier correction, and enrichment pipeline provides no quantitative validation metrics (precision/recall for entity resolution, error rates from manual audits, or direct comparison of thesis counts against official national totals). This is load-bearing for the central claim that the process yields a sufficiently complete and accurate linked dataset without substantial linking errors or selection biases.
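To make the requested entity-resolution metrics concrete, here is a minimal sketch of pairwise precision and recall computed against a manually audited sample. The record and entity ids are hypothetical, and nothing here reflects how the authors actually resolve or audit identities.

```python
from itertools import combinations

def same_entity_pairs(assignment):
    """Set of record pairs that an assignment places in the same entity."""
    by_entity = {}
    for record, entity in assignment.items():
        by_entity.setdefault(entity, []).append(record)
    pairs = set()
    for records in by_entity.values():
        pairs.update(combinations(sorted(records), 2))
    return pairs

def pairwise_precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted links against audited links."""
    pred_pairs = same_entity_pairs(predicted)
    gold_pairs = same_entity_pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 1.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall

# Hypothetical audit sample: record id -> entity id.
gold = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
predicted = {"r1": "x", "r2": "x", "r3": "x", "r4": "y"}
precision, recall = pairwise_precision_recall(predicted, gold)
```

In this toy case the pipeline wrongly merges `r3` into the first person and fails to link `r3` with `r4`, so both precision and recall fall below 1.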
minor comments (2)
- The limitations discussion would benefit from explicit examples of potential biases or coverage gaps introduced by source heterogeneity or linking decisions.
- Consider including a summary table or diagram of data sources, their temporal coverage, and record counts to clarify the aggregation steps.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the dataset's potential value for research on higher education and academic networks. We address the major comment on quantitative validation metrics below and will incorporate revisions to strengthen the manuscript.
Referee: [Abstract] Abstract: the description of the aggregation, identifier correction, and enrichment pipeline provides no quantitative validation metrics (precision/recall for entity resolution, error rates from manual audits, or direct comparison of thesis counts against official national totals). This is load-bearing for the central claim that the process yields a sufficiently complete and accurate linked dataset without substantial linking errors or selection biases.
Authors: We agree that explicit quantitative validation metrics are essential to substantiate claims of completeness and accuracy. The current manuscript describes data quality issues and limitations but does not report specific metrics such as precision/recall for entity resolution, audit error rates, or direct comparisons to official national thesis totals. We will revise the abstract to summarize key validation results and add a dedicated validation subsection detailing: (1) precision and recall estimates from sampled manual audits of entity resolution steps; (2) error rates observed in those audits; and (3) comparisons of thesis counts against available official national statistics for years where such benchmarks exist (noting potential gaps in official data for earlier periods). These additions will transparently address potential linking errors and selection biases while supporting the central claims.
Revision: yes
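An error rate from a sampled manual audit is only an estimate, so it would normally be reported with an interval. The sketch below uses a Wilson score interval, which is a choice made here for illustration and not something the authors commit to; the audit figures are invented.

```python
import math

def wilson_interval(errors, sample_size, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Invented audit: 12 linking errors found in 400 sampled records.
low, high = wilson_interval(errors=12, sample_size=400)
```

The Wilson form is used here because it stays within [0, 1] and behaves sensibly when the observed error count is small.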
Circularity Check
No circularity: pure data-construction description
The paper describes an aggregation pipeline that merges external national thesis records, authority files, and bibliographic databases. No equations, fitted parameters, predictions, or self-referential derivations appear. All steps are procedural mappings from named external sources; the resulting dataset is presented as the direct output of those mappings rather than as a quantity derived from itself. The central claim (utility for descriptive and relational analyses) rests on the documented sources and processing steps, not on any internal loop that reduces to its own inputs. Absence of quantitative validation metrics is a limitation of evidence strength, not a circularity defect.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: national thesis platforms and authority databases contain sufficiently accurate and linkable records of theses, authors, and institutions.