arXiv: 2604.22802 · v1 · submitted 2026-04-13 · 💻 cs.DL · cs.SI · stat.AP


NIH-MPINet: A Large-Scale Feature-Rich Network Dataset for Mapping the Frontiers of Team Science

Cuiran Shi, Shuying Han, Shreya Kusumanchi, Mia Zhou, Didong Li


Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification: 💻 cs.DL · cs.SI · stat.AP

keywords: NIH-MPINet, multi-PI collaboration, team science, biomedical network, research communities, temporal analysis, NIH grants, collaboration dataset


The pith

NIH-MPINet provides a network dataset of 30,127 researchers linked by 86,743 multi-PI grants to map biomedical team science.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper constructs NIH-MPINet by linking NIH RePORTER grant records with PubMed data to represent collaborations among multiple principal investigators on R01-equivalent awards spanning 2006 to 2023. Nodes are the 30,127 PIs with affiliation metadata, while edges capture the 86,743 grants along with their years, titles, and abstracts. The authors extract 19 collaboration communities and 20 research topics from the network, documenting distinct profiles such as cardiovascular health or cancer immunotherapy in some groups and broad genetics coverage overall. Temporal tracking reveals rising emphasis on Alzheimer's disease, cognitive health, and healthcare outcomes research alongside a decline in molecular and cellular biology. A reader would care because the dataset supplies structured, time-stamped collaboration data that can support quantitative models of how teams form and how research priorities shift in biomedicine.

Core claim

The paper presents NIH-MPINet as a large-scale, feature-rich network dataset derived from NIH RePORTER and PubMed. It represents 30,127 principal investigators as nodes connected by 86,743 grants as edges, spanning 888 recipient organizations and 40 NIH Institutes and Centers. Analysis identifies 19 communities with thematic specializations including cardiovascular health, cancer immunotherapy, neuroscience, and microbiome research, while genetics appears across multiple communities. Temporal examination of topics shows increasing prominence of healthcare and outcomes research, cognitive health, and Alzheimer's disease in recent years, with a relative decline in molecular and cellular biology.

What carries the argument

The PI collaboration network, with PIs as nodes, shared NIH grants as edges, and attached metadata on affiliations, organizations, grant details, and derived research topics.
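
As a rough illustration of this structure (not the authors' pipeline), the sketch below builds a small PI collaboration graph from grant records using only the Python standard library. The record fields (`grant_id`, `year`, `pis`) and the PI identifiers are hypothetical, since the dataset's actual schema is not given here. It also assumes that a grant with several PIs contributes one edge per pair of PIs, which is one plausible reading of "shared NIH grants as edges".

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names are assumptions, not the dataset's schema.
grants = [
    {"grant_id": "G1", "year": 2015, "pis": ["pi_a", "pi_b"]},
    {"grant_id": "G2", "year": 2018, "pis": ["pi_b", "pi_c", "pi_d"]},
    {"grant_id": "G3", "year": 2020, "pis": ["pi_a", "pi_b"]},
]


def build_network(records):
    """Map each unordered PI pair to the list of grants the pair shares."""
    edges = defaultdict(list)
    for rec in records:
        # Assumption: a grant with k PIs yields k*(k-1)/2 pairwise edges.
        for u, v in combinations(sorted(set(rec["pis"])), 2):
            edges[(u, v)].append({"grant_id": rec["grant_id"], "year": rec["year"]})
    return dict(edges)


def degree(edges):
    """Count distinct collaborators per PI."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {pi: len(ns) for pi, ns in neighbors.items()}


network = build_network(grants)
pi_degree = degree(network)
```

Keeping the grant list on each edge preserves the edge-level metadata (here only the year), so repeated collaborations between the same two PIs remain distinguishable.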

If this is right

The 19 communities allow examination of how different thematic groups organize their collaborations.
Temporal topic data can be used to model how research priorities evolve over nearly two decades.
Node and edge metadata support training of statistical learning methods on real biomedical collaboration patterns.
Coverage across 40 NIH institutes enables comparison of collaboration structures by funding source.
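
Modeling how priorities evolve would typically start from per-year topic shares. A minimal sketch, assuming each grant carries a year and a single topic label (the records and labels below are invented for illustration), normalizes topic counts within each year:

```python
from collections import Counter, defaultdict

# Invented (year, topic) assignments; real labels would come from a topic model.
assignments = [
    (2010, "molecular biology"), (2010, "molecular biology"),
    (2010, "alzheimers"), (2010, "outcomes"),
    (2020, "molecular biology"), (2020, "alzheimers"),
    (2020, "alzheimers"), (2020, "outcomes"),
]


def yearly_topic_share(pairs):
    """Return {year: {topic: share of that year's grants}}."""
    counts = defaultdict(Counter)
    for year, topic in pairs:
        counts[year][topic] += 1
    shares = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[year] = {t: n / total for t, n in topic_counts.items()}
    return shares


shares = yearly_topic_share(assignments)
# Change in the relative share of one topic between two years.
change = shares[2020]["alzheimers"] - shares[2010]["alzheimers"]
```

Because the shares are relative, a falling share for a topic does not by itself mean fewer grants on that topic in absolute terms.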

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Connecting the network to publication citation counts could test whether community structure predicts research impact.
The temporal shifts might be compared against external events to identify drivers of topic change.
Similar networks built from other agencies' records could reveal whether NIH patterns generalize to other funding systems.

Load-bearing premise

The data pulled from NIH RePORTER and PubMed accurately and completely records all multi-PI collaborations, affiliations, and grant details without significant linking errors or missing records.

What would settle it

Discovery of a large set of verified multi-PI R01 grants from 2006-2023 that are absent from the network or have incorrect PI or affiliation links.

Figures

Figures reproduced from arXiv: 2604.22802 by Cuiran Shi, Didong Li, Mia Zhou, Shreya Kusumanchi, Shuying Han.

**Figure 2.** Flow diagram illustrating the overall analysis workflow. The left branch describes the construc…

**Figure 3.** (a) The full network of NIH-MPINet; (b) the largest connected component, labeled by repre…

**Figure 4.** Topic distribution across four selected NIH-MPINet clusters.

**Figure 5.** Temporal dynamics and structural evolution of research topics. (a) Temporal trends in five…

Original abstract

This study presents a large-scale network dataset, NIH-MPINet, curated from NIH RePORTER and PubMed, characterizing collaboration among multiple Principal Investigators (multi-PIs) on NIH R01-equivalent grants from 2006 to 2023. The network characterizes 30,127 PIs as nodes and their collaborations on 86,743 NIH R01-equivalent grants as edges, spanning 888 recipient organizations and supported by 40 NIH Institutes and Centers. We also curated comprehensive metadata, including node-level features such as PI affiliation, alongside edge-level features comprising grant years, titles, and abstracts. Using these data, we constructed a PI collaboration network and identified 19 communities as well as 20 major research topics. Several collaboration communities showed distinct thematic profiles, such as cardiovascular health, cancer immunotherapy, neuroscience, and microbiome research, while genetics and genomics were broadly represented across communities. By incorporating temporal analysis, we observed shifts in research topics and collaboration patterns over time. Topics like healthcare and outcomes research, cognitive health, and Alzheimer's disease have become more prominent in recent years, whereas molecular and cellular biology has seen a relative decline. Overall, this work provides a high-fidelity, feature-rich resource for advancing statistical learning methods and network analysis-based discoveries in the study of long-term biomedical collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces NIH-MPINet, a large-scale collaboration network dataset derived from NIH RePORTER and PubMed covering 30,127 PIs and 86,743 R01-equivalent grants (2006–2023) across 888 organizations and 40 NIH institutes. Nodes carry affiliation features; edges carry grant-year, title, and abstract metadata. The authors apply community detection to recover 19 communities, perform topic modeling to identify 20 major research areas, and report thematic specialization (e.g., cardiovascular, cancer immunotherapy, neuroscience) together with temporal shifts (rising prominence of Alzheimer’s/cognitive-health topics and decline in molecular biology).

Significance. If the underlying extraction and linking pipeline proves reliable, the dataset would constitute a valuable, feature-rich public resource for team-science research. Its scale, temporal depth, and inclusion of both structural and textual metadata would support reproducible network analyses, statistical learning on collaboration dynamics, and longitudinal studies of biomedical research frontiers.

major comments (3)

[Data-construction / Methods] Data-construction / Methods section: The headline claim that NIH-MPINet is a “high-fidelity” resource rests on the unvalidated assumption that RePORTER-to-PubMed linking, multi-PI extraction, and PI/organization disambiguation introduce negligible errors. No precision/recall figures, duplicate-resolution statistics, or gold-standard validation subset are reported, rendering downstream community and topic findings vulnerable to curation artifacts.
[Results on communities and topics] Community-detection and topic-modeling results: The reported 19 communities and their distinct thematic profiles (cardiovascular health, cancer immunotherapy, etc.) are presented as substantive findings, yet no robustness checks (e.g., sensitivity to edge-weight thresholds, alternative community-detection algorithms, or subsampling) are supplied to demonstrate that these profiles survive plausible linking or disambiguation noise.
[Temporal analysis] Temporal-shift analysis: Claims of increasing prominence for healthcare/outcomes research, cognitive health, and Alzheimer’s disease (and relative decline in molecular biology) are derived from the same unvalidated network; without year-by-year validation of grant coverage or topic-assignment stability, the observed shifts cannot be confidently distinguished from changes in NIH reporting practices or indexing coverage.
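
One way to quantify the kind of stability the second comment asks for is to compare two community assignments of the same PIs, for example from two algorithms or from the full and a subsampled network. The sketch below computes the Rand index, the fraction of PI pairs on which two partitions agree. The partitions are toy data, and the metric is an illustrative choice rather than one the paper or the referee specifies.

```python
from itertools import combinations


def rand_index(part_a, part_b):
    """Fraction of node pairs that both partitions treat the same way.

    part_a, part_b: dicts mapping node -> community label over the same nodes.
    """
    nodes = sorted(part_a)
    if set(nodes) != set(part_b):
        raise ValueError("partitions must cover the same nodes")
    agree = total = 0
    for u, v in combinations(nodes, 2):
        same_a = part_a[u] == part_a[v]
        same_b = part_b[u] == part_b[v]
        agree += same_a == same_b  # both together, or both apart
        total += 1
    return agree / total if total else 1.0


# Toy partitions of five hypothetical PIs; label names need not match.
run_1 = {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 1}
run_2 = {"p1": "x", "p2": "x", "p3": "y", "p4": "y", "p5": "x"}

score = rand_index(run_1, run_2)
```

The plain Rand index is not corrected for chance agreement, so with many small communities a chance-adjusted variant would likely be the more informative choice.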

minor comments (2)

[Abstract and Introduction] The abstract and introduction repeatedly use the term “high-fidelity” without a preceding quantitative definition or later empirical support; this phrasing should be replaced by a neutral description of the curation steps.
[Methods] No table or supplementary file enumerates the exact NIH grant mechanisms included under “R01-equivalent,” the precise PubMed query used for abstract retrieval, or the topic-model hyperparameters; these details are needed for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the presentation of NIH-MPINet. We address each major comment below and have revised the manuscript to incorporate the suggested validations, robustness checks, and qualifications.

Point-by-point responses

Referee: [Data-construction / Methods] Data-construction / Methods section: The headline claim that NIH-MPINet is a “high-fidelity” resource rests on the unvalidated assumption that RePORTER-to-PubMed linking, multi-PI extraction, and PI/organization disambiguation introduce negligible errors. No precision/recall figures, duplicate-resolution statistics, or gold-standard validation subset are reported, rendering downstream community and topic findings vulnerable to curation artifacts.

Authors: We agree that quantitative validation metrics are necessary to support the dataset's reliability claims. In the revised manuscript, we will expand the Methods section with a dedicated validation subsection. This will include precision and recall estimates derived from a manually annotated gold-standard sample of 1,000 randomly selected grants, along with statistics on duplicate resolution and disambiguation accuracy. These additions will enable readers to assess potential curation artifacts directly.

Revision: yes

Referee: [Results on communities and topics] Community-detection and topic-modeling results: The reported 19 communities and their distinct thematic profiles (cardiovascular health, cancer immunotherapy, etc.) are presented as substantive findings, yet no robustness checks (e.g., sensitivity to edge-weight thresholds, alternative community-detection algorithms, or subsampling) are supplied to demonstrate that these profiles survive plausible linking or disambiguation noise.

Authors: We concur that robustness checks are required to substantiate the community and topic findings. The revised manuscript will add a new subsection presenting sensitivity analyses, including comparisons with the Leiden algorithm, variations in edge-weight thresholds, and network subsampling experiments. We will report the stability of the 19 communities and their thematic profiles under these conditions, clarifying that the analyses serve as illustrative applications of the dataset rather than exhaustive claims.

Revision: yes

Referee: [Temporal analysis] Temporal-shift analysis: Claims of increasing prominence for healthcare/outcomes research, cognitive health, and Alzheimer’s disease (and relative decline in molecular biology) are derived from the same unvalidated network; without year-by-year validation of grant coverage or topic-assignment stability, the observed shifts cannot be confidently distinguished from changes in NIH reporting practices or indexing coverage.

Authors: We recognize the risk that observed temporal shifts could partly reflect changes in reporting or indexing practices. The revised version will include an expanded limitations paragraph discussing this possibility. We will also add year-by-year comparisons of topic distributions against publicly available NIH funding trend reports to provide external corroboration, and we will qualify the shift claims accordingly while retaining the descriptive value of the observed patterns in the data.

Revision: yes
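
The validation proposed in the first response reduces to comparing extracted links against manually verified ones. A minimal sketch, assuming links are represented as (grant ID, PI ID) pairs and using invented identifiers throughout:

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted links against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Invented (grant_id, pi_id) links for illustration only.
gold_links = {("G1", "pi_a"), ("G1", "pi_b"), ("G2", "pi_c"), ("G2", "pi_d")}
extracted_links = {("G1", "pi_a"), ("G1", "pi_b"), ("G2", "pi_c"), ("G2", "pi_x")}

precision, recall = precision_recall(extracted_links, gold_links)
# Verified links that the extraction missed.
missing = gold_links - extracted_links
```

Listing the missed links alongside the two ratios makes it easier to see whether errors cluster in particular years or organizations.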

Circularity Check

0 steps flagged

No circularity: dataset curation and exploratory analysis only


The manuscript describes curation of NIH-MPINet from RePORTER and PubMed, followed by standard network construction, community detection (19 communities), and topic modeling (20 topics) with temporal observations. No equations, predictions, fitted parameters, or first-principles derivations are present that could reduce to inputs by construction. Community and topic results are outputs of off-the-shelf algorithms applied to the curated graph; they are not defined in terms of themselves or smuggled via self-citation. The data-fidelity assumption is an empirical claim about extraction quality, not a definitional or self-referential step. The work is therefore self-contained as a descriptive resource paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on two things: the assumption that the source databases are comprehensive and error-free, and a standard application of community detection and topic modeling. The paper introduces no new entities and no heavy parameter fitting.

axioms (1)

Domain assumption: NIH RePORTER and PubMed records accurately identify multi-PI grants, PI affiliations, and grant metadata without significant omissions or errors.
The entire network and all downstream analyses depend on faithful extraction and linking from these two public databases.

