pith. machine review for the scientific record.

arxiv: 2604.08619 · v2 · submitted 2026-04-09 · 💻 cs.DL · cs.CY


Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions


Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3

classification 💻 cs.DL cs.CY
keywords doctoral theses · France · linked dataset · academic networks · PhD supervision · research communities · data enrichment · institutional collaboration

The pith

French doctoral theses from 1985 to 2025 are now available as a linked dataset with structured records on individuals and institutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a comprehensive dataset of doctoral theses defended in France between 1985 and 2025 by aggregating data from the national thesis platform and enriching it with authority and bibliographic databases. This process corrects inconsistent identifiers, builds derived variables for careers and affiliations, and adds external links for interoperability. A sympathetic reader would care because the resulting resource supports detailed studies of academic networks, supervision practices, jury composition, and how research communities have evolved over four decades. The dataset structures information at thesis, individual, and institutional levels to enable both descriptive statistics and relational analyses.

Core claim

By aggregating heterogeneous national sources, correcting inconsistent identifiers, enriching person and institution records, and constructing derived variables for academic careers, jury participation, and institutional affiliations, the authors produce a linked dataset that provides structured information at the thesis, individual, and institutional levels. Additional identifiers from major repositories are integrated to support linkage with external sources and future extensions.

What carries the argument

The data production pipeline that aggregates sources, corrects identifiers, enriches records, and derives new variables describing careers, juries, and affiliations.
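As a rough illustration of the aggregation and identifier-correction steps, the Python sketch below normalizes person identifiers and groups records that share one. The record fields (`person_id`, `name`, `thesis_id`), the normalization rule, and the sample values are all assumptions made for illustration; this page does not describe the paper's actual matching rules.

```python
from collections import defaultdict

def normalize_id(raw):
    """Return a canonical identifier (no whitespace, upper-cased), or None if empty."""
    if raw is None:
        return None
    cleaned = "".join(raw.split()).upper()
    return cleaned or None

def merge_person_records(records):
    """Group records by normalized identifier; records without one stay unlinked."""
    merged = defaultdict(lambda: {"names": set(), "thesis_ids": set()})
    unlinked = []
    for rec in records:
        key = normalize_id(rec.get("person_id"))
        if key is None:
            unlinked.append(rec)
            continue
        merged[key]["names"].add(rec["name"])
        merged[key]["thesis_ids"].add(rec["thesis_id"])
    return dict(merged), unlinked

# Hypothetical input: the same person appears with inconsistent identifier formatting.
records = [
    {"person_id": " ab123 ", "name": "Dupont, Marie", "thesis_id": "T1"},
    {"person_id": "AB123", "name": "Marie Dupont", "thesis_id": "T2"},
    {"person_id": "", "name": "Jean Martin", "thesis_id": "T3"},
]

merged, unlinked = merge_person_records(records)
print(sorted(merged["AB123"]["thesis_ids"]))
print(len(unlinked))
```

Keeping unlinked records in a separate list, rather than dropping them, makes the share of unresolved people measurable later on.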

If this is right

  • Enables descriptive analyses of thesis characteristics and institutional affiliations through structured data at multiple levels.
  • Supports relational analyses of academic networks, supervision practices, and jury composition.
  • Facilitates longitudinal studies on the evolution of research communities over time.
  • Allows linkage with external academic repositories and library catalogues for broader research.
  • Documents data quality issues and limitations to guide responsible reuse and extensions by other researchers.
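The relational analyses listed above could start from something like the following Python sketch, which counts how often two people take part in the same defence and how many theses each person supervises. The record layout (`supervisors`, `jury`) and the identifiers `P1` to `P4` are hypothetical and are not taken from the dataset's schema.

```python
from collections import Counter
from itertools import combinations

# Hypothetical thesis records; field names and identifiers are illustrative only.
theses = [
    {"id": "T1", "year": 1998, "supervisors": ["P1"], "jury": ["P2", "P3"]},
    {"id": "T2", "year": 2005, "supervisors": ["P1", "P2"], "jury": ["P3"]},
    {"id": "T3", "year": 2016, "supervisors": ["P3"], "jury": ["P1", "P4"]},
]

def co_participation_edges(records):
    """Count, for each pair of people, the defences they both took part in (any role)."""
    edges = Counter()
    for rec in records:
        people = sorted(set(rec["supervisors"]) | set(rec["jury"]))
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges

def supervision_counts(records):
    """Count the theses supervised by each person."""
    counts = Counter()
    for rec in records:
        counts.update(rec["supervisors"])
    return counts

edges = co_participation_edges(theses)
supervised = supervision_counts(theses)
print(edges.most_common(3))
print(supervised.most_common())
```

Filtering `theses` by `year` before calling either function gives a per-decade view, which is one way to approach the longitudinal questions.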

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a template for building comparable linked collections in other countries to enable cross-national comparisons of doctoral education.
  • Researchers could examine whether institutional affiliations correlate with patterns in jury composition or career trajectories using the derived variables.
  • Extensions that add post-PhD publication or employment data would allow studies of long-term research impact and mobility.

Load-bearing premise

That aggregation of heterogeneous national sources, correction of identifiers, and enrichment with external databases produces a sufficiently complete and accurate linked dataset without introducing substantial linking errors or selection biases.

What would settle it

An independent verification that finds a high rate of mismatched author or institution identifiers, or large gaps in coverage for specific years or fields, would show the dataset is not sufficiently accurate for reliable analyses.
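A minimal sketch of such a check, in Python, is shown below. It computes the share of wrong links in a manually audited sample and flags years where the dataset holds clearly fewer theses than a reference count. The audit outcomes, the yearly counts, and the 0.9 threshold are invented for illustration and do not come from the paper.

```python
def mismatch_rate(audited):
    """Share of audited links judged wrong; `audited` holds True for each correct link."""
    if not audited:
        raise ValueError("empty audit sample")
    return sum(1 for ok in audited if not ok) / len(audited)

def coverage_gaps(dataset_counts, reference_counts, min_ratio=0.9):
    """Return {year: ratio} for years where dataset/reference falls below min_ratio."""
    gaps = {}
    for year, ref in sorted(reference_counts.items()):
        if ref <= 0:
            continue  # no usable benchmark for this year
        ratio = dataset_counts.get(year, 0) / ref
        if ratio < min_ratio:
            gaps[year] = ratio
    return gaps

# Illustrative numbers only: 50 audited links, 3 of them wrong.
audit_sample = [True] * 47 + [False] * 3
# Illustrative yearly thesis counts (dataset vs. a hypothetical official total).
dataset_counts = {1990: 800, 2000: 990, 2010: 1000}
reference_counts = {1990: 1000, 2000: 1000, 2010: 1000}

rate = mismatch_rate(audit_sample)
gaps = coverage_gaps(dataset_counts, reference_counts)
print(f"mismatch rate: {rate:.2%}")
print(gaps)
```

What counts as a "high" mismatch rate or a "large" gap depends on the intended analysis, so the threshold is left as a parameter.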

Figures

Figures reproduced from arXiv: 2604.08619 by Dastan Jasim, William Aboucaya.

Figure 1: Number of theses in the dataset per year
Figure 2: Distribution of research topics (translated from French)
Figure 3: Distribution of the number of Ph.D. supervisors, jury members and thesis rapporteurs for each decade.

Original abstract

This paper presents a comprehensive dataset of doctoral theses defended in France between 1985 and 2025, constructed from multiple national academic metadata sources. The dataset is primarily based on data from the French national thesis platform and is enriched using additional authority and bibliographic databases to improve data quality, completeness, and interoperability. The data production pipeline includes the aggregation of heterogeneous sources, the correction of inconsistent identifiers, the enrichment of person and institution records, and the construction of derived variables describing academic careers, jury participation, institutional affiliations, and thesis characteristics. Additional identifiers from major academic repositories and library catalogues are integrated to facilitate linkage with external data sources and future dataset extensions. The resulting dataset provides structured information at the thesis, individual, and institutional levels, enabling both descriptive and relational analyses. This resource is particularly suited for research on doctoral education, academic networks, supervision practices, jury composition, institutional collaboration, and the evolution of research communities over time. The paper documents the data sources, processing pipeline, feature construction, data quality issues, and limitations, with the objective of facilitating reuse of the dataset by other researchers and supporting future extensions and longitudinal analyses of the academic system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes the construction of a linked dataset of doctoral theses defended in France from 1985 to 2025. It aggregates heterogeneous national metadata sources (primarily the French national thesis platform), corrects inconsistent identifiers, enriches person and institution records using external authority and bibliographic databases, and derives variables on academic careers, jury participation, institutional affiliations, and thesis characteristics. Multiple external identifiers are added to support linkage and reuse. The resulting resource is positioned to enable descriptive and relational analyses of doctoral education, supervision, networks, and institutional collaboration.

Significance. If the dataset achieves the claimed levels of completeness and accuracy, it would constitute a significant resource for research on higher education systems, academic mobility, jury composition, and the structure of research communities in France over four decades. The emphasis on interoperability through additional identifiers and the documentation of the full pipeline are strengths that would support reproducibility and extensions by other researchers.

major comments (1)
  1. [Abstract] The description of the aggregation, identifier correction, and enrichment pipeline provides no quantitative validation metrics (precision/recall for entity resolution, error rates from manual audits, or direct comparison of thesis counts against official national totals). This validation is load-bearing for the central claim that the process yields a sufficiently complete and accurate linked dataset without substantial linking errors or selection biases.
minor comments (2)
  1. The limitations discussion would benefit from explicit examples of potential biases or coverage gaps introduced by source heterogeneity or linking decisions.
  2. Consider including a summary table or diagram of data sources, their temporal coverage, and record counts to clarify the aggregation steps.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the dataset's potential value for research on higher education and academic networks. We address the major comment on quantitative validation metrics below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The description of the aggregation, identifier correction, and enrichment pipeline provides no quantitative validation metrics (precision/recall for entity resolution, error rates from manual audits, or direct comparison of thesis counts against official national totals). This validation is load-bearing for the central claim that the process yields a sufficiently complete and accurate linked dataset without substantial linking errors or selection biases.

    Authors: We agree that explicit quantitative validation metrics are essential to substantiate claims of completeness and accuracy. The current manuscript describes data quality issues and limitations but does not report specific metrics such as precision/recall for entity resolution, audit error rates, or direct comparisons to official national thesis totals. We will revise the abstract to summarize key validation results and add a dedicated validation subsection detailing: (1) precision and recall estimates from sampled manual audits of entity resolution steps; (2) error rates observed in those audits; and (3) comparisons of thesis counts against available official national statistics for years where such benchmarks exist (noting potential gaps in official data for earlier periods). These additions will transparently address potential linking errors and selection biases while supporting the central claims.

    Revision: yes
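For reference, the precision and recall estimates mentioned in item (1) of the response could be computed from an audited sample along the lines of the Python sketch below. The link pairs (record identifier, person identifier) are made up for illustration, and treating the manual audit as ground truth is itself an assumption.

```python
def precision_recall(predicted_links, true_links):
    """Precision and recall of predicted (record, entity) links against audited links."""
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical links produced by an entity-resolution step.
predicted = [("a1", "P1"), ("a2", "P1"), ("a3", "P2"), ("a4", "P3")]
# Hypothetical links established by a manual audit of the same sample.
audited = [("a1", "P1"), ("a2", "P1"), ("a3", "P2"), ("a5", "P3"), ("a6", "P4")]

precision, recall = precision_recall(predicted, audited)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both values matters here: a conservative linker can reach high precision while still missing many true links, which would show up as low recall.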

Circularity Check

0 steps flagged

No circularity: pure data-construction description


The paper describes an aggregation pipeline that merges external national thesis records, authority files, and bibliographic databases. No equations, fitted parameters, predictions, or self-referential derivations appear. All steps are procedural mappings from named external sources; the resulting dataset is presented as the direct output of those mappings rather than as a quantity derived from itself. The central claim (utility for descriptive and relational analyses) rests on the documented sources and processing steps, not on any internal loop that reduces to its own inputs. Absence of quantitative validation metrics is a limitation of evidence strength, not a circularity defect.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that the input national databases are sufficiently accurate and linkable, and that the described enrichment procedures add value without creating new inconsistencies.

axioms (1)
  • domain assumption National thesis platforms and authority databases contain sufficiently accurate and linkable records of theses, authors, and institutions.
    All enrichment and linking steps presuppose the basic reliability of the source metadata.

pith-pipeline@v0.9.0 · 5510 in / 1211 out tokens · 36934 ms · 2026-05-10T17:52:35.158914+00:00 · methodology

