
arxiv: 2606.23172 · v1 · pith:N2PILTKKnew · submitted 2026-06-22 · 📡 eess.IV

A Benchmark of (MRI-) Foundation Models to Predict IDH Mutational Status in Glioma

Pith reviewed 2026-06-26 06:28 UTC · model grok-4.3

classification 📡 eess.IV
keywords glioma · IDH mutation · MRI · foundation models · radiomics · TabPFN · molecular prediction · model benchmarking

The pith

Tabular foundation models on radiomic features match or exceed image foundation models for IDH prediction from glioma MRI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks four image-based foundation models against radiomics-based tabular models for non-invasive prediction of IDH mutational status in glioma from FLAIR and post-contrast T1 MRI. Within individual cohorts, TabPFN on radiomic features matched or outperformed every image encoder, reaching a mean AUROC of 0.92. Among image encoders, BiomedCLIP performed best at 0.85 AUROC, while the MRI-specific encoders BrainIAC and MRI-CORE lagged. Cross-cohort transfer degraded AUROC moderately and AUPRC more strongly under prevalence shifts. On the external post-treatment cohort, BiomedCLIP achieved the highest AUROC (0.74), whereas TabPFN remained the best calibrated (Expected Calibration Error 0.07). The results indicate that representation type and clinical context together determine which approach works best for molecular status prediction from routine scans.

Core claim

Representation modality and evaluation context critically influence foundation-model performance in MRI-based molecular prediction. Tabular foundation models on radiomic features provide a strong, well-calibrated baseline, while image foundation models may offer complementary value under clinically distinct distribution shifts.

What carries the argument

Benchmark of image foundation models (BrainIAC, MRI-CORE, BiomedCLIP, BrainDINO) versus radiomics TabPFN and logistic regression for IDH mutation prediction across four public glioma cohorts plus one external post-treatment cohort, measuring AUROC, AUPRC, and calibration error.

If this is right

  • TabPFN on radiomics delivers 0.92 mean AUROC within cohorts and an Expected Calibration Error (ECE) of 0.07 on the external cohort, establishing it as the strongest baseline.
  • BiomedCLIP reaches the highest external-cohort AUROC of 0.74, suggesting image encoders can retain utility when prevalence or treatment status changes.
  • AUPRC degrades more than AUROC under cross-cohort prevalence shifts, indicating prevalence-aware evaluation is required.
  • The MRI-specific pre-trained encoders BrainIAC and MRI-CORE consistently underperform BiomedCLIP, a biomedical vision-language model, on this task.
  • Calibration remains superior for the tabular model even when AUROC is comparable, affecting downstream clinical probability use.
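The point about AUPRC degrading more than AUROC under prevalence shift can be illustrated with a small sketch. The scores below are hypothetical toy values, not results from the paper, and the two metric functions are plain reference implementations written for this illustration. The sketch keeps the score distributions fixed and only multiplies the number of negatives, which lowers prevalence.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def average_precision(pos_scores, neg_scores):
    """Mean of the precision values measured at the rank of each positive.

    Assumes no positive shares a score with a negative (true for the toy data).
    """
    ranked = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda item: item[0],
        reverse=True,
    )
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / len(pos_scores)


# Hypothetical toy scores (not taken from the paper).
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2, 0.1]

# Same score distributions, four times as many negatives: lower prevalence.
neg_low_prev = neg * 4

auroc_high, auroc_low = auroc(pos, neg), auroc(pos, neg_low_prev)
ap_high, ap_low = average_precision(pos, neg), average_precision(pos, neg_low_prev)

print(f"AUROC: {auroc_high:.3f} -> {auroc_low:.3f}")
print(f"AUPRC: {ap_high:.3f} -> {ap_low:.3f}")
```

In this toy setting AUROC is unchanged because it depends only on how positives rank against negatives, while average precision falls because each positive now sits behind more negatives. Real cohort shifts also change the score distributions, so this isolates only the prevalence effect.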

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hybrid models that fuse radiomic tabular features with image embeddings could capture both the calibration strength and shift robustness observed here.
  • The observed underperformance of MRI-specific encoders may indicate that current pre-training objectives or data scales are insufficient for molecular-status tasks and warrant targeted re-examination.
  • Because calibration differs markedly by modality, clinical deployment pipelines may need modality-specific uncertainty thresholds rather than a single model selection rule.
  • Extending the benchmark to include longitudinal or multi-sequence inputs could test whether the current modality ranking persists when more imaging context is available.
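As a rough illustration of the first extension above, one simple way to fuse the two representations is early fusion: standardize each feature block and concatenate them per patient before fitting a single classifier. This is a hypothetical sketch, not something the paper evaluates; the function names `zscore_columns` and `fuse` and the toy numbers are assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit variance.

    Constant columns are left at zero instead of dividing by zero.
    In a real pipeline the mean and std would be fit on training data only.
    """
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def fuse(radiomic_rows, embedding_rows):
    """Early fusion: concatenate standardized radiomic and embedding features per patient."""
    if len(radiomic_rows) != len(embedding_rows):
        raise ValueError("both blocks must describe the same patients, in the same order")
    radiomic = zscore_columns(radiomic_rows)
    embedding = zscore_columns(embedding_rows)
    return [r + e for r, e in zip(radiomic, embedding)]


# Two hypothetical patients: two radiomic features and a one-dimensional "embedding".
fused = fuse([[1.0, 5.0], [3.0, 5.0]], [[10.0], [20.0]])
print(fused)
```

Standardizing each block first keeps a high-dimensional embedding from dominating a handful of radiomic features purely through scale. Whether such a fused model keeps the calibration of the tabular baseline would have to be tested.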

Load-bearing premise

The public glioma cohorts and external post-treatment cohort are representative enough of clinical distributions to support conclusions about generalization and the relative value of different model types.

What would settle it

On a new large multi-center prospective clinical dataset, if image foundation models show no AUROC advantage or complementary value over TabPFN under any measured distribution shift, the claim that they can offer value in distinct clinical contexts would be refuted.

Figures

Figures reproduced from arXiv: 2606.23172 by Efthymios Georgiou, Ekin Ermis, Elise Robinson, Nathan Hollet, Sarah Brüningk, Uri Nahum.

Figure 1. Schematic overview. Each scan (FLAIR and post-contrast T1) feeds three parallel feature extraction pipelines. (A) 2D image foundation models (MRI-CORE, BiomedCLIP, BrainDINO): three axial slices at the 25th/50th/75th percentiles of the tumor mask pass independently through a frozen encoder, with CLS tokens concatenated. (B) 3D image foundation model (BrainIAC): the full volume passes through a frozen encod…
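The 2D pipeline in panel (A) can be sketched as follows. The caption does not say how the percentiles of the tumor mask are computed, so this sketch assumes they are taken over the ordered list of axial slices that contain tumor. The encoder is a stand-in callable, and `select_tumor_slices` and `embed_scan` are hypothetical names, not functions from the authors' repository.

```python
def select_tumor_slices(tumor_slice_indices, quantiles=(0.25, 0.50, 0.75)):
    """Pick axial slice indices at the given quantiles of the tumor-bearing slices."""
    ordered = sorted(tumor_slice_indices)
    if not ordered:
        raise ValueError("tumor mask is empty")
    last = len(ordered) - 1
    return [ordered[round(q * last)] for q in quantiles]


def embed_scan(volume, tumor_slice_indices, encode_slice):
    """Encode each selected slice independently and concatenate the embeddings."""
    features = []
    for z in select_tumor_slices(tumor_slice_indices):
        features.extend(encode_slice(volume[z]))
    return features


# Toy data: 40 "slices" holding one value each, tumor visible on slices 10..30.
volume = [[float(z)] for z in range(40)]
tumor_slices = range(10, 31)

# Stand-in encoder (assumption): a real frozen encoder would return its CLS token.
picked = select_tumor_slices(tumor_slices)
embedding = embed_scan(volume, tumor_slices, lambda s: [s[0], 0.0])
print(picked, embedding)
```

Concatenating the three per-slice vectors gives one fixed-length feature vector per scan, which a downstream classifier can consume without fine-tuning the encoder.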
Figure 2. Cross-cohort AUROC (mean ± std). Rows are training cohorts, columns evaluation cohorts; the boxed diagonal cells are the within-cohort numbers. We report performance across the six approaches using K = 5 paired runs per cohort, as mean ± std throughout. We report AUROC throughout; AUPRC was evaluated identically and follows the same model ordering except where noted. Within-cohort performance. The boxed d…
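The train-by-evaluation layout of Figure 2 can be reproduced from raw run scores with a small aggregation step. The sketch below uses hypothetical cohort names and AUROC values, and it assumes the sample standard deviation; the excerpt does not state which estimator the authors used.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Map each (train_cohort, eval_cohort) pair to (mean, sample std) over its runs."""
    return {pair: (mean(scores), stdev(scores)) for pair, scores in runs.items()}


# Hypothetical AUROC values for K = 5 runs per cell (not taken from the paper).
runs = {
    ("cohort_A", "cohort_A"): [0.90, 0.92, 0.91, 0.93, 0.89],
    ("cohort_A", "cohort_B"): [0.80, 0.84, 0.82, 0.78, 0.81],
}

summary = summarize_runs(runs)

# Diagonal cells (train cohort equals evaluation cohort) are the within-cohort numbers.
within = {pair: stats for pair, stats in summary.items() if pair[0] == pair[1]}

for (train, evaluation), (m, s) in summary.items():
    print(f"{train} -> {evaluation}: {m:.2f} ± {s:.2f}")
```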
Original abstract

Non-invasive prediction of glioma molecular status from routine magnetic resonance imaging (MRI) has shown promising performance, but model generalization remains challenging given small-scale matched imaging-genomic datasets. Foundation models may address this bottleneck, but a comprehensive benchmark is needed to establish the impact of diverse architectures, pre-training domains, and objectives. Given the use case of isocitrate dehydrogenase (IDH) mutation prediction from FLAIR and post-contrast T1 MRIs, we compared four image-based foundation models, BrainIAC, MRI-CORE, BiomedCLIP, and BrainDINO, against radiomics-based TabPFN and logistic regression baselines. Prediction performance and calibration were assessed across four public adult glioma cohorts and an external post-treatment cohort. Within-cohort, TabPFN matched or outperformed all visual encoders, achieving 0.92 (0.03) AUROC and 0.74 (0.17) AUPRC (mean (SD) across all datasets). Among visual encoders, BiomedCLIP performed best (0.85 (0.08) AUROC), with BrainDINO competitive (0.82 (0.09) AUROC), while MRI-specific encoders (BrainIAC, MRI-CORE) consistently underperformed. Cross-cohort transfer showed moderate AUROC degradation but stronger AUPRC sensitivity to prevalence shifts. On the external cohort, BiomedCLIP achieved the highest AUROC (0.74 (0.07)), whereas TabPFN provided superior calibration (Expected Calibration Error 0.07 (0.01)). These results indicate that representation modality and evaluation context critically influence foundation-model performance in MRI-based molecular prediction. Tabular foundation models on radiomic features provide a strong, well-calibrated baseline, while image foundation models may offer complementary value under clinically distinct distribution shifts. Code available at https://github.com/nathanhollet/idh-status-prediction

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper benchmarks four image foundation models (BrainIAC, MRI-CORE, BiomedCLIP, BrainDINO) against radiomics-based TabPFN and logistic regression baselines for predicting IDH mutational status from FLAIR and post-contrast T1 MRI. It reports within-cohort performance (TabPFN AUROC 0.92, BiomedCLIP 0.85), cross-cohort transfer degradation, and external post-treatment cohort results (BiomedCLIP AUROC 0.74, TabPFN better calibration), concluding that tabular models provide a strong baseline while image models may complement under distribution shifts. Code is released.

Significance. If the tabulated results and calibration metrics hold after verification of methods, the benchmark supplies concrete empirical comparisons of modality-specific foundation models on a clinically relevant molecular prediction task. The public code release and focus on both AUROC/AUPRC and calibration are strengths that allow direct reuse and extension by the community.

major comments (1)
  1. [Abstract] The claim that image foundation models 'may offer complementary value under clinically distinct distribution shifts' is not supported by any quantitative characterization of how the external post-treatment cohort differs from the four public cohorts or from routine clinical distributions (scanner vendor, field strength, slice thickness, treatment timing, or demographics). Without this, the reversal in ranking (BiomedCLIP AUROC advantage vs. TabPFN calibration) cannot be interpreted as evidence of complementary value under shifts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the opportunity to strengthen the manuscript. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that image foundation models 'may offer complementary value under clinically distinct distribution shifts' is not supported by any quantitative characterization of how the external post-treatment cohort differs from the four public cohorts or from routine clinical distributions (scanner vendor, field strength, slice thickness, treatment timing, or demographics). Without this, the reversal in ranking (BiomedCLIP AUROC advantage vs. TabPFN calibration) cannot be interpreted as evidence of complementary value under shifts.

    Authors: We agree that the abstract claim would be more robust with explicit cohort characterization. The external cohort is explicitly described as post-treatment (distinct in treatment timing from the primarily pre-treatment public cohorts), and the observed reversal (BiomedCLIP AUROC 0.74 vs. TabPFN superior calibration) is presented as suggestive rather than definitive evidence. However, we did not include a consolidated table of scanner, field strength, slice thickness, or demographic metadata across cohorts. In revision we will (1) add a table or paragraph in Methods summarizing all available cohort metadata and (2) revise the abstract sentence to read: 'On the external post-treatment cohort, image encoders showed an AUROC advantage while the tabular baseline remained better calibrated, indicating that representation modality and evaluation context influence performance under distribution shift.' This directly incorporates the referee's point without overstating the evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential predictions.

Full rationale

This paper is a data-driven benchmark study that trains and evaluates multiple models (image foundation models, TabPFN, logistic regression) on public glioma cohorts and one external set, reporting AUROC, AUPRC, and calibration metrics. No equations, derivations, or fitted parameters are presented that reduce any reported performance number to an input by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on direct empirical comparisons rather than any closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmark relying on pre-existing foundation models, public datasets, and standard performance metrics without introducing new free parameters, axioms beyond basic statistical assumptions, or invented entities.

axioms (1)
  • [standard math] Standard assumptions underlying AUROC, AUPRC, and expected calibration error calculations hold for the evaluated datasets and models.
    Performance reporting depends on these metric definitions being applicable without violation.
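For readers unfamiliar with the calibration metric named in this axiom, a minimal sketch of Expected Calibration Error for a binary classifier follows. It assumes ten equal-width bins over the predicted probability of the positive class and compares the mean prediction in each bin with the observed positive fraction. The excerpt does not state the paper's binning scheme, so the bin count and variant here are assumptions, and the inputs are toy values.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(frac_pos - mean_prob)
    return ece


# Toy inputs (hypothetical): predictions of 0.25 with one positive in four are calibrated.
calibrated = expected_calibration_error([0.25] * 4, [1, 0, 0, 0])

# Predictions of 0.9 with only half the cases positive are overconfident.
overconfident = expected_calibration_error([0.9] * 4, [1, 1, 0, 0])

print(calibrated, overconfident)
```

A model can rank cases well (high AUROC) and still score poorly here, which is why the benchmark reports both.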


