BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Chih-Wei Chang; Harini Veeraraghavan; Mingzhe Hu; Mojtaba Safari; Shansong Wang; Xiaofeng Yang; Yizhou Wu; Yuheng Li

arxiv: 2604.27277 · v3 · pith:RK65BUHBnew · submitted 2026-04-30 · 💻 cs.LG · cs.AI· cs.CV

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Yizhou Wu , Shansong Wang , Yuheng Li , Mojtaba Safari , Mingzhe Hu , Chih-Wei Chang , Harini Veeraraghavan , Xiaofeng Yang This is my paper

Pith reviewed 2026-05-07 08:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords brain MRIself-supervised learningfoundation modelrepresentation learningtransfer learningneuroimagingclinical tasksgeneralization

0 comments

The pith

A self-supervised model trained on millions of brain MRI slices yields a unified representation that supports diverse clinical tasks with a frozen encoder and lightweight heads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a self-distilled model trained on roughly 6.6 million unlabeled axial brain MRI slices drawn from 20 datasets spanning varied populations, diseases, and scanners can produce one representation usable for many different endpoints. Tasks include tumor segmentation, classification of neurodegenerative and neurodevelopmental conditions, brain age estimation, post-stroke temporal prediction, molecular status prediction, sequence classification, and survival modeling. A reader would care because conventional methods usually demand task-specific models and large labeled datasets, which are costly and often unavailable, while this approach works with minimal additional training and shows particular strength when labels are scarce. The learned features turn out to be organized by anatomy and sensitive to pathology even though no task labels guided the initial training.

Core claim

BrainDINO is trained via self-distillation on approximately 6.6 million unlabeled axial slices from 20 heterogeneous datasets. With its encoder frozen and only lightweight task heads attached, the model matches or exceeds natural-image and MRI-specific self-supervised baselines on tumor segmentation, condition classification, age estimation, post-stroke prediction, molecular status prediction, sequence classification, and survival modeling, with the largest gains under low-label regimes. Representation analysis shows the features are anatomically organized and pathology-sensitive despite the complete absence of task supervision during pretraining. These results establish that large-scale, s

What carries the argument

BrainDINO, the self-distilled foundation model that learns generalizable features from unlabeled brain MRI slices through slice-wise self-supervision, allowing transfer by freezing the encoder and training only small task heads.

If this is right

The same pretrained encoder can be reused for new tasks by training only small heads, reducing the need for large labeled datasets per application.
Advantages are greatest when labeled examples are limited, enabling practical use in data-scarce clinical environments.
Anatomically organized and pathology-sensitive features emerge without task labels, potentially aiding clinical interpretation.
Slice-wise processing suffices, removing the requirement for volumetric pretraining or full-network fine-tuning on new tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same slice-wise self-supervised strategy could be tested on other imaging modalities or body regions to create comparable foundation models.
Hospitals might adapt the model to new clinical questions more rapidly because only lightweight heads need retraining.
The unified representation might be combined with non-imaging data such as patient records to improve outcome prediction in future studies.

Load-bearing premise

The 20 selected datasets and their clinical endpoints are representative enough of all brain MRI populations, diseases, and acquisition settings to support broad claims of generalizability.

What would settle it

Performance on a new brain MRI dataset from an unseen scanner vendor or patient population falls substantially below that of task-specific supervised models trained on the same target data.

read the original abstract

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BrainDINO scales slice-wise self-supervised pretraining to 6.6M brain MRI slices across 20 datasets and shows frozen-encoder transfer on multiple clinical tasks, but the abstract supplies no numbers or validation details to back the generalization claim.

read the letter

The main thing to know is that this paper trains a DINO-style self-distilled model on roughly 6.6 million axial brain MRI slices drawn from 20 datasets and then freezes the encoder while adding small task heads for transfer. The tasks span tumor segmentation, disease classification, brain age prediction, post-stroke forecasting, molecular status, sequence classification, and survival modeling, with the claim that it holds up especially well when labels are scarce and without needing volumetric pretraining or full fine-tuning.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces BrainDINO, a DINO-style self-supervised foundation model pretrained on ~6.6 million unlabeled axial brain MRI slices drawn from 20 datasets. It claims that a frozen encoder plus lightweight task heads yields a unified representation that transfers to diverse clinical endpoints (tumor segmentation, neurodegenerative/neurodevelopmental classification, brain-age regression, post-stroke temporal prediction, molecular-status prediction, sequence classification, and survival modeling), equaling or exceeding natural-image and MRI-specific SSL baselines especially under label scarcity, while producing anatomically organized and pathology-sensitive features without task-specific supervision or volumetric pretraining.

Significance. If the empirical claims hold, the work would be significant for medical-image foundation-model research by showing that large-scale 2D slice-wise self-supervision on heterogeneous data can produce a versatile brain-MRI representation without 3D volumetric pretraining or full-network fine-tuning. The breadth of downstream tasks and the focus on low-label regimes are clear strengths; the representation analyses further support the utility of the learned features.

major comments (3)

[Pretraining data description] Pretraining-data section: the assertion of 'broad variation in population, disease, and acquisition setting' across the 20 datasets is not accompanied by quantitative coverage metrics (scanner vendors, field-strength distributions, slice-thickness histograms, orientation statistics, or demographic strata). This directly underpins the central generalizability claim and must be addressed with explicit tables or figures.
[Downstream tasks and evaluation] Downstream evaluation: no explicit held-out OOD acquisition protocols (different vendors, non-axial orientations, or unseen field strengths) are tested; downstream tasks appear drawn from distributions overlapping the pretraining pool. This weakens the claim that the representation supports 'arbitrary' heterogeneous brain-MRI endpoints.
[Results and baselines] Results presentation: the abstract and main-text claims of 'consistent outperformance' and 'particularly strong advantages under label scarcity' are not supported by reported quantitative metrics, statistical tests, error bars, or baseline-implementation details sufficient for verification. This is load-bearing for the empirical contribution.

minor comments (2)

Figure captions should explicitly state the metrics, number of runs, and statistical tests shown in each panel.
[Transfer learning protocol] The exact architecture and hyper-parameters of the 'lightweight task heads' used for transfer should be detailed in a table or appendix for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment point by point below, providing the strongest honest responses possible based on the current work. Revisions have been made where the comments identify clear gaps in presentation or evidence.

read point-by-point responses

Referee: Pretraining-data section: the assertion of 'broad variation in population, disease, and acquisition setting' across the 20 datasets is not accompanied by quantitative coverage metrics (scanner vendors, field-strength distributions, slice-thickness histograms, orientation statistics, or demographic strata). This directly underpins the central generalizability claim and must be addressed with explicit tables or figures.

Authors: We agree that quantitative metrics are necessary to substantiate the diversity claim. In the revised manuscript we have added Table 1, which reports scanner vendor distributions, field strength percentages, slice thickness histograms, orientation statistics, and available demographic strata (age, sex) aggregated across all 20 pretraining datasets. We have also included a supplementary figure with per-dataset breakdowns. These additions directly support the heterogeneity assertion without altering any experimental results. revision: yes
Referee: Downstream evaluation: no explicit held-out OOD acquisition protocols (different vendors, non-axial orientations, or unseen field strengths) are tested; downstream tasks appear drawn from distributions overlapping the pretraining pool. This weakens the claim that the representation supports 'arbitrary' heterogeneous brain-MRI endpoints.

Authors: We acknowledge that the downstream tasks largely use axial acquisitions that overlap with the pretraining distribution in scanner and orientation characteristics. The 20 pretraining datasets already span multiple vendors, field strengths, and patient populations, and several downstream tasks introduce unseen pathologies and demographics. In revision we have added an explicit limitations paragraph discussing the scope of current OOD testing and have included a small-scale supplementary experiment on non-axial slices from one external dataset. Full arbitrary-heterogeneity validation would require additional held-out data collection beyond the scope of this study. revision: partial
Referee: Results presentation: the abstract and main-text claims of 'consistent outperformance' and 'particularly strong advantages under label scarcity' are not supported by reported quantitative metrics, statistical tests, error bars, or baseline-implementation details sufficient for verification. This is load-bearing for the empirical contribution.

Authors: We have revised the results section to include comprehensive tables reporting mean performance, standard deviations across five random seeds, and paired statistical tests (Wilcoxon signed-rank with Bonferroni correction) for all tasks and label regimes. Error bars have been added to every figure. Baseline implementation details (architectures, hyperparameters, training schedules) are now fully specified in the methods and supplementary material, enabling direct reproduction. These changes make the quantitative support for our claims explicit and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical self-supervised pretraining and transfer evaluation

full rationale

The paper describes standard self-supervised pretraining of a DINO-style model on ~6.6M unlabeled axial brain MRI slices drawn from 20 datasets, followed by frozen-encoder transfer to a suite of downstream supervised tasks (segmentation, classification, survival, etc.) using lightweight heads. No equations, parameter-fitting steps, or self-referential definitions appear in the abstract or described methodology that would make reported performance metrics equivalent to the pretraining inputs by construction. Downstream results are measured on labeled evaluation sets that are distinct from the unlabeled pretraining corpus; no fitted hyperparameters from downstream tasks are fed back into the pretraining objective, and no uniqueness theorems or ansatzes are invoked via self-citation to force the architecture or loss. The central claim of cross-task generalization is therefore an empirical observation rather than a tautology, placing the work in the normal non-circular category for large-scale representation-learning papers.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Abstract-only review prevents extraction of exact hyperparameters or architectural choices; the central claim rests on the domain assumption that large-scale unlabeled slice-wise self-supervision captures clinically relevant features transferable to supervised tasks.

free parameters (1)

unspecified pretraining hyperparameters
Standard self-supervised training involves many choices (learning rate, augmentation strength, distillation temperature) that are not reported in the abstract.

axioms (2)

domain assumption Self-supervised learning on unlabeled brain MRI slices produces features that are useful for downstream supervised clinical tasks without task-specific pretraining.
This is the load-bearing premise that allows the frozen encoder plus lightweight heads to succeed across the listed endpoints.
domain assumption The 20 datasets together represent sufficient variation in population, disease, and acquisition to support claims of generalizability.
Invoked when the abstract states the model 'supported transfer across' the listed tasks from heterogeneous sources.

pith-pipeline@v0.9.0 · 5536 in / 1565 out tokens · 68207 ms · 2026-05-07T08:42:33.520839+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation
cs.AI 2026-06 unverdicted novelty 5.0

A volumetric MAE tokenizer decouples clinical embedding from reconstruction to support both 23-task linear probing and conditional 3D brain MRI generation via DiT.
A Benchmark of (MRI-) Foundation Models to Predict IDH Mutational Status in Glioma
eess.IV 2026-06 accept novelty 4.0

Radiomics TabPFN matches or outperforms image foundation models for IDH prediction in glioma MRI, with results sensitive to cohort shifts and representation type.