NIH-MPINet: A Large-Scale Feature-Rich Network Dataset for Mapping the Frontiers of Team Science
Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3
The pith
NIH-MPINet provides a network dataset of 30,127 researchers linked by 86,743 multi-PI grants to map biomedical team science.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents NIH-MPINet as a large-scale feature-rich network dataset derived from NIH RePORTER and PubMed that characterizes 30,127 principal investigators as nodes connected by 86,743 grants as edges across 888 organizations and 40 institutes. Analysis identifies 19 communities with thematic specializations including cardiovascular health, cancer immunotherapy, neuroscience, and microbiome research, while genetics appears across multiple communities. Temporal examination of topics shows increasing prominence of healthcare and outcomes research, cognitive health, and Alzheimer's disease in recent years, with a relative decline in molecular and cellular biology.
What carries the argument
The PI collaboration network, with PIs as nodes, shared NIH grants as edges, and attached metadata on affiliations, organizations, grant details, and derived research topics.
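As a rough sketch of how such a network can be assembled (the record schema, grant IDs, and PI names below are hypothetical illustrations, not the paper's actual pipeline), each multi-PI grant contributes a clique among its listed PIs, with grant identifiers retained as edge metadata:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical grant records: each lists its PIs and minimal metadata.
grants = [
    {"grant_id": "R01-A", "year": 2010, "pis": ["PI1", "PI2"]},
    {"grant_id": "R01-B", "year": 2015, "pis": ["PI2", "PI3", "PI4"]},
]

def build_network(grants):
    """Return an adjacency map (PI -> set of co-PIs) and per-edge
    metadata (sorted PI pair -> list of shared grant IDs)."""
    adjacency = defaultdict(set)
    edge_meta = defaultdict(list)
    for g in grants:
        # Every pair of PIs on the same grant becomes an edge.
        for a, b in combinations(sorted(g["pis"]), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
            edge_meta[(a, b)].append(g["grant_id"])
    return adjacency, edge_meta

adj, meta = build_network(grants)
```

Note that a PI pair sharing several grants accumulates multiple entries in `edge_meta`, which is one natural way to carry edge weights and grant-level features (years, titles, abstracts) alongside the topology.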
If this is right
- The 19 communities allow examination of how different thematic groups organize their collaborations.
- Temporal topic data can be used to model how research priorities evolve over nearly two decades.
- Node and edge metadata support training of statistical learning methods on real biomedical collaboration patterns.
- Coverage across 40 NIH institutes enables comparison of collaboration structures by funding source.
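How communities like the 19 reported ones might be recovered can be illustrated with a minimal label-propagation pass over such an adjacency map. The paper's actual community-detection method is not specified here, so this is only one simple stand-in, run on a hypothetical toy graph:

```python
import random

def label_propagation(adjacency, seed=0, max_iter=100):
    """Repeatedly assign each node the most common label among its
    neighbors until no label changes (a simple community heuristic)."""
    rng = random.Random(seed)
    labels = {n: n for n in adjacency}  # start: each node is its own label
    nodes = list(adjacency)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for n in nodes:
            counts = {}
            for nb in adjacency[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts, key=lambda l: (counts[l], l))  # deterministic tie-break
            if counts[best] > counts.get(labels[n], 0):
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels

# Toy graph: two disjoint triangles should yield two communities.
adjacency = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
communities = label_propagation(adjacency)
```

At the dataset's scale one would use a modularity-based method (e.g. Louvain or Leiden) from a graph library rather than this sketch, but the input contract is the same: the adjacency structure plus optional edge weights.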
Where Pith is reading between the lines
- Connecting the network to publication citation counts could test whether community structure predicts research impact.
- The temporal shifts might be compared against external events to identify drivers of topic change.
- Similar networks built from other agencies' records could reveal whether NIH patterns generalize to other funding systems.
Load-bearing premise
The data pulled from NIH RePORTER and PubMed accurately and completely records all multi-PI collaborations, affiliations, and grant details without significant linking errors or missing records.
What would settle it
Discovery of a large set of verified multi-PI R01 grants from 2006–2023 that are absent from the network or have incorrect PI or affiliation links.
Figures
original abstract
This study presents a large-scale network dataset, NIH-MPINet, curated from NIH RePORTER and PubMed, characterizing collaboration among multiple Principal Investigators (multi-PIs) on NIH R01-equivalent grants from 2006 to 2023. The network characterizes 30,127 PIs as nodes and their collaborations on 86,743 NIH R01-equivalent grants as edges, spanning 888 recipient organizations and supported by 40 NIH Institutes and Centers. We also curated comprehensive metadata, including node-level features such as PI affiliation, alongside edge-level features comprising grant years, titles, and abstracts. Using these data, we constructed a PI collaboration network and identified 19 communities as well as 20 major research topics. Several collaboration communities showed distinct thematic profiles, such as cardiovascular health, cancer immunotherapy, neuroscience, and microbiome research, while genetics and genomics were broadly represented across communities. By incorporating temporal analysis, we observed shifts in research topics and collaboration patterns over time. Topics like healthcare and outcomes research, cognitive health, and Alzheimer's disease have become more prominent in recent years, whereas molecular and cellular biology has seen a relative decline. Overall, this work provides a high-fidelity, feature-rich resource for advancing statistical learning methods and network analysis-based discoveries in the study of long-term biomedical collaboration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NIH-MPINet, a large-scale collaboration network dataset derived from NIH RePORTER and PubMed covering 30,127 PIs and 86,743 R01-equivalent grants (2006–2023) across 888 organizations and 40 NIH institutes. Nodes carry affiliation features; edges carry grant-year, title, and abstract metadata. The authors apply community detection to recover 19 communities, perform topic modeling to identify 20 major research areas, and report thematic specialization (e.g., cardiovascular, cancer immunotherapy, neuroscience) together with temporal shifts (rising prominence of Alzheimer’s/cognitive-health topics and decline in molecular biology).
Significance. If the underlying extraction and linking pipeline proves reliable, the dataset would constitute a valuable, feature-rich public resource for team-science research. Its scale, temporal depth, and inclusion of both structural and textual metadata would support reproducible network analyses, statistical learning on collaboration dynamics, and longitudinal studies of biomedical research frontiers.
major comments (3)
- [Data-construction / Methods] Data-construction / Methods section: The headline claim that NIH-MPINet is a “high-fidelity” resource rests on the unvalidated assumption that RePORTER-to-PubMed linking, multi-PI extraction, and PI/organization disambiguation introduce negligible errors. No precision/recall figures, duplicate-resolution statistics, or gold-standard validation subset are reported, rendering downstream community and topic findings vulnerable to curation artifacts.
- [Results on communities and topics] Community-detection and topic-modeling results: The reported 19 communities and their distinct thematic profiles (cardiovascular health, cancer immunotherapy, etc.) are presented as substantive findings, yet no robustness checks (e.g., sensitivity to edge-weight thresholds, alternative community-detection algorithms, or subsampling) are supplied to demonstrate that these profiles survive plausible linking or disambiguation noise.
- [Temporal analysis] Temporal-shift analysis: Claims of increasing prominence for healthcare/outcomes research, cognitive health, and Alzheimer’s disease (and relative decline in molecular biology) are derived from the same unvalidated network; without year-by-year validation of grant coverage or topic-assignment stability, the observed shifts cannot be confidently distinguished from changes in NIH reporting practices or indexing coverage.
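One way to run the kind of robustness check the second comment asks for is to subsample edges (simulating linking or disambiguation noise) and measure how often node pairs stay grouped together. The sketch below uses connected components as the partition purely for brevity; it is a hypothetical stand-in for whichever community-detection method the paper actually used:

```python
import random
from itertools import combinations

def subsample_edges(edges, keep_frac, seed=0):
    """Randomly drop edges to simulate curation noise."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < keep_frac]

def connected_components(nodes, edges):
    """Union-find partition of nodes into connected components."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return {n: find(n) for n in nodes}

def pairwise_agreement(p1, p2, nodes):
    """Fraction of node pairs grouped identically in both partitions."""
    same = total = 0
    for a, b in combinations(sorted(nodes), 2):
        total += 1
        if (p1[a] == p1[b]) == (p2[a] == p2[b]):
            same += 1
    return same / total

nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F")]
full = connected_components(nodes, edges)
perturbed = connected_components(nodes, subsample_edges(edges, 0.8, seed=1))
stability = pairwise_agreement(full, perturbed, nodes)
```

Repeating this over many seeds and `keep_frac` values yields a stability curve; thematic profiles that survive only at high `keep_frac` would be exactly the fragile findings the referee is worried about.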
minor comments (2)
- [Abstract and Introduction] The abstract and introduction repeatedly use the term “high-fidelity” without a preceding quantitative definition or later empirical support; this phrasing should be replaced by a neutral description of the curation steps.
- [Methods] No table or supplementary file enumerates the exact NIH grant mechanisms included under “R01-equivalent,” the precise PubMed query used for abstract retrieval, or the topic-model hyperparameters; these details are needed for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the presentation of NIH-MPINet. We address each major comment below and have revised the manuscript to incorporate the suggested validations, robustness checks, and qualifications.
point-by-point responses
Referee: [Data-construction / Methods] Data-construction / Methods section: The headline claim that NIH-MPINet is a “high-fidelity” resource rests on the unvalidated assumption that RePORTER-to-PubMed linking, multi-PI extraction, and PI/organization disambiguation introduce negligible errors. No precision/recall figures, duplicate-resolution statistics, or gold-standard validation subset are reported, rendering downstream community and topic findings vulnerable to curation artifacts.
Authors: We agree that quantitative validation metrics are necessary to support the dataset's reliability claims. In the revised manuscript, we will expand the Methods section with a dedicated validation subsection. This will include precision and recall estimates derived from a manually annotated gold-standard sample of 1,000 randomly selected grants, along with statistics on duplicate resolution and disambiguation accuracy. These additions will enable readers to assess potential curation artifacts directly. revision: yes
Referee: [Results on communities and topics] Community-detection and topic-modeling results: The reported 19 communities and their distinct thematic profiles (cardiovascular health, cancer immunotherapy, etc.) are presented as substantive findings, yet no robustness checks (e.g., sensitivity to edge-weight thresholds, alternative community-detection algorithms, or subsampling) are supplied to demonstrate that these profiles survive plausible linking or disambiguation noise.
Authors: We concur that robustness checks are required to substantiate the community and topic findings. The revised manuscript will add a new subsection presenting sensitivity analyses, including comparisons with the Leiden algorithm, variations in edge-weight thresholds, and network subsampling experiments. We will report the stability of the 19 communities and their thematic profiles under these conditions, clarifying that the analyses serve as illustrative applications of the dataset rather than exhaustive claims. revision: yes
Referee: [Temporal analysis] Temporal-shift analysis: Claims of increasing prominence for healthcare/outcomes research, cognitive health, and Alzheimer’s disease (and relative decline in molecular biology) are derived from the same unvalidated network; without year-by-year validation of grant coverage or topic-assignment stability, the observed shifts cannot be confidently distinguished from changes in NIH reporting practices or indexing coverage.
Authors: We recognize the risk that observed temporal shifts could partly reflect changes in reporting or indexing practices. The revised version will include an expanded limitations paragraph discussing this possibility. We will also add year-by-year comparisons of topic distributions against publicly available NIH funding trend reports to provide external corroboration, and we will qualify the shift claims accordingly while retaining the descriptive value of the observed patterns in the data. revision: yes
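For reference, the per-year normalization that such year-by-year comparisons rely on can be sketched as follows (the topic labels and counts here are hypothetical, not the paper's data):

```python
from collections import Counter, defaultdict

# Hypothetical (year, topic) assignments, one per grant.
assignments = [
    (2008, "alzheimers"), (2008, "molecular"), (2008, "molecular"),
    (2023, "alzheimers"), (2023, "alzheimers"), (2023, "molecular"),
]

def topic_prevalence(assignments):
    """Per-year topic proportions, so years with different grant
    volumes can be compared on the same scale."""
    by_year = defaultdict(Counter)
    for year, topic in assignments:
        by_year[year][topic] += 1
    prevalence = {}
    for year, counts in by_year.items():
        total = sum(counts.values())
        prevalence[year] = {t: c / total for t, c in counts.items()}
    return prevalence

prev = topic_prevalence(assignments)
```

Because each year's proportions sum to one, a rise in one topic forces an apparent decline elsewhere; distinguishing a genuine decline in molecular and cellular biology from this compositional effect is part of what the requested validation would address.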
Circularity Check
No circularity: dataset curation and exploratory analysis only
full rationale
The manuscript describes curation of NIH-MPINet from RePORTER and PubMed, followed by standard network construction, community detection (19 communities), and topic modeling (20 topics) with temporal observations. No equations, predictions, fitted parameters, or first-principles derivations are present that could reduce to inputs by construction. Community and topic results are outputs of off-the-shelf algorithms applied to the curated graph; they are not defined in terms of themselves or smuggled via self-citation. The data-fidelity assumption is an empirical claim about extraction quality, not a definitional or self-referential step. The work is therefore self-contained as a descriptive resource paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: NIH RePORTER and PubMed records accurately identify multi-PI grants, PI affiliations, and grant metadata without significant omissions or errors