Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices
Pith reviewed 2026-07-02 14:01 UTC · model grok-4.3
The pith
Segmentation drives volume and stage tasks while classifier choice drives survival and histology in lung CT phenotyping.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study demonstrates that in controlled two-stage pipelines evaluated on worst-case cross-cohort performance, segmentation choice primarily governs accuracy for tumor volume and stage classification while classification head selection primarily governs accuracy for 2-year survival, histology classification, and age prediction. Curia with tumor segmentation and CatBoost reaches the best average rank across the three main clinical tasks, though per-task selection consistently beats any single default; radiomics performs competitively on volume and stage tasks partly due to label-derivation effects, DINOv3 trails slightly, and patch or slice aggregation adds little value.
What carries the argument
The two-stage pipeline that decouples feature extraction from the classification head, tested across five extractors, seven heads, and three segmentation regimes with worst-case cross-cohort performance as the primary metric.
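The decoupling described above can be pictured with a minimal sketch. It assumes a full factorial grid over the extractors and heads named in the abstract; the third segmentation regime is not named in this summary, so it appears as a placeholder, and `extract_features` is a hypothetical stand-in rather than any real extractor API.

```python
from itertools import product

# Extractor and head names come from the abstract. The third segmentation
# regime is not named in this summary, so it stays an explicit placeholder.
EXTRACTORS = ["Curia", "Curia-2", "DINOv3", "Radiomics2D", "Radiomics3D"]
HEADS = ["TabPFN", "TabICL", "XGBoost", "CatBoost", "Random Forest",
         "logistic regression", "Ridge"]
SEGMENTATIONS = ["tumor", "lung", "third_regime_not_named"]


def extract_features(extractor, segmentation, scan_id):
    """Hypothetical stage 1: stand-in for running an extractor on one scan."""
    return (extractor, segmentation, scan_id)  # placeholder feature vector


def build_feature_cache(scan_ids):
    """Run stage 1 once per (extractor, segmentation) pair, not once per head."""
    return {
        (ext, seg): [extract_features(ext, seg, s) for s in scan_ids]
        for ext, seg in product(EXTRACTORS, SEGMENTATIONS)
    }


def enumerate_pipelines():
    """Stage 2 grid: every head is paired with every cached feature set."""
    return list(product(EXTRACTORS, SEGMENTATIONS, HEADS))


cache = build_feature_cache(scan_ids=["scan-001", "scan-002"])
pipelines = enumerate_pipelines()
print(len(cache), "feature sets,", len(pipelines), "pipelines")
# prints: 15 feature sets, 105 pipelines
```

Because features are computed once and reused, swapping a head changes only the cheap second stage, which is what lets the effect of each factor be read off separately.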
If this is right
- Tumor segmentation should be prioritized for volume and stage tasks regardless of extractor type.
- Gradient-boosting heads such as CatBoost improve survival and histology results over linear heads or non-boosted tree ensembles.
- Curia features reach peak scores comparable to radiomics on survival while remaining competitive elsewhere.
- Lung segmentation plus logistic regression supplies a practical fallback when tumor delineations are unavailable.
- Task-specific selection of head and segmentation outperforms any cross-task default pipeline.
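The last point follows almost by construction, which a small sketch can show. The scores, task names, and configuration names below are invented for illustration and are not results from the paper; the ranking scheme (rank 1 is best, mean over tasks) is an assumed reading of "mean rank".

```python
# Illustrative scores only (higher is better). These are NOT results from the paper.
SCORES = {
    "config_A": {"task_a": 0.65, "task_b": 0.70, "task_c": 0.63},
    "config_B": {"task_a": 0.66, "task_b": 0.64, "task_c": 0.61},
    "config_C": {"task_a": 0.60, "task_b": 0.68, "task_c": 0.66},
}
TASKS = ["task_a", "task_b", "task_c"]


def ranks_for_task(scores, task):
    """Rank configurations on one task, 1 = best (no tie handling needed here)."""
    ordered = sorted(scores, key=lambda c: scores[c][task], reverse=True)
    return {config: position + 1 for position, config in enumerate(ordered)}


def mean_ranks(scores, tasks):
    """Average each configuration's rank over all tasks."""
    per_task = [ranks_for_task(scores, t) for t in tasks]
    return {c: sum(r[c] for r in per_task) / len(tasks) for c in scores}


def best_default(scores, tasks):
    """Single cross-task default: the configuration with the lowest mean rank."""
    ranks = mean_ranks(scores, tasks)
    return min(ranks, key=ranks.get)


def best_per_task(scores, tasks):
    """Task-specific selection: the top-scoring configuration for each task."""
    return {t: max(scores, key=lambda c: scores[c][t]) for t in tasks}


default = best_default(SCORES, TASKS)
per_task = best_per_task(SCORES, TASKS)
for task in TASKS:
    print(task, "default:", SCORES[default][task],
          "per-task best:", SCORES[per_task[task]][task])
```

In this toy table the default is `config_A`, yet it is the top scorer on only one of the three tasks. A per-task pick can never score below the default on the data used for selection, so the informative question is how large the gap is and whether it survives on held-out cohorts.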
Where Pith is reading between the lines
- Medical imaging pipelines may benefit from endpoint-specific tuning rather than a single recommended configuration.
- The benchmark design of isolating extractor, head, and segmentation could be repeated on other cancer types to check whether the same task-dependent pattern appears.
- The limited value of aggregation steps suggests that simpler preprocessing pipelines can be used without harming cross-cohort robustness.
Load-bearing premise
Performance differences between design choices mainly reflect the isolated effects of extractor, head, and segmentation rather than unmeasured differences in imaging protocols or label derivation between the two cohorts.
What would settle it
A controlled re-run on cohorts matched for imaging protocol and label source that finds the relative importance of segmentation versus classifier choice reverses or disappears.
Original abstract
Radiomics is the established approach for CT-based lung cancer phenotyping, yet comparisons with foundation models rarely isolate contributions of feature extractor, classification head, and segmentation choice, or test cross-cohort robustness. We benchmark five feature extractors (Curia, Curia-2, DINOv3, Radiomics2D, Radiomics3D), seven classification heads (TabPFN, TabICL, XGBoost, CatBoost, Random Forest, logistic regression, Ridge), and three segmentation regimes on five tasks: tumor volume and stage classification, 2-year survival prediction, histology classification, and age prediction. Models are trained on LUNG1 (n=338) and evaluated on an internal test set (n=84) and the external LUNG2 cohort (n=211), with worst-case cross-cohort performance as the primary metric. The dominant design factor is task-dependent: segmentation drives volume and stage classification, while classifier choice drives survival, histology, and age prediction. Radiomics is competitive for tumor volume, tumor stage and survival (partly due to label-derivation effects for the former); Curia variants reach comparable peak scores for survival; DINOv3 falls slightly short across tasks. Patch and slice aggregation have negligible impact. We recommend Curia with tumor segmentation and a CatBoost head as a safe default, achieving the best mean rank across the three primary clinical tasks, though task-specific selection consistently outperforms any cross-task default. When tumor delineations are unavailable, Curia-2 with lung segmentation and logistic regression offers a competitive alternative. All pipelines use a two-stage design suited to small cohort sizes where end-to-end fine-tuning would risk overfitting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks five feature extractors (Curia, Curia-2, DINOv3, Radiomics2D, Radiomics3D), seven classification heads, and three segmentation regimes on five lung CT tasks (tumor volume, stage, 2-year survival, histology, age). Models are trained on LUNG1 (n=338) and evaluated on an internal hold-out plus external LUNG2 (n=211) using worst-case cross-cohort performance as the primary metric. It concludes that segmentation dominates volume/stage tasks while classifier choice dominates the others, recommends Curia + tumor segmentation + CatBoost as a safe default (best mean rank on primary tasks), and notes that task-specific selection outperforms any single default.
Significance. If the empirical patterns hold after controlling for cohort effects, the study supplies a practical, task-aware benchmark for small-cohort medical imaging pipelines. The explicit two-stage design, external validation, and worst-case metric are strengths that directly address overfitting risks common in this domain.
Major comments (2)
- [Abstract and Results] The central claim that 'the dominant design factor is task-dependent' and the resulting recommendation rest on the assumption that observed performance gaps isolate the effects of extractor/head/segmentation. The manuscript notes label-derivation effects for volume/stage but provides no quantitative sensitivity analysis or ablation for other cohort-level confounds (imaging protocol, reconstruction parameters) between LUNG1 and LUNG2; this directly affects attribution of the task-specific patterns.
- [Results] Performance tables: the paper reports mean ranks and 'best' configurations without mentioning statistical significance testing (paired tests, confidence intervals, or multiple-comparison correction) on the cross-cohort differences. Given that the recommendation and 'dominant factor' statements rely on these rankings, absence of such tests weakens the strength of the comparative claims.
Minor comments (2)
- [Methods] The description of how worst-case cross-cohort performance is exactly computed (e.g., whether it is min over the two test sets or a different aggregation) should be stated explicitly with a formula.
- [Figures/Tables] Ensure all axes and row/column labels in figure captions and tables explicitly indicate the metric (e.g., AUC, accuracy) and the exact cohort split used for each entry.
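To make the first minor comment concrete, here is one possible reading of the worst-case metric, sketched under the assumption that it is the less favorable of the internal and external scores. The paper summary does not confirm this definition; the function name and the `higher_is_better` flag are hypothetical.

```python
def worst_case(internal_score, external_score, higher_is_better=True):
    """Return the less favorable of the two test-set scores.

    Assumed definition, not confirmed by the paper: for metrics where higher
    is better (e.g. AUC) this is the minimum; for error metrics where lower
    is better it is the maximum.
    """
    if higher_is_better:
        return min(internal_score, external_score)
    return max(internal_score, external_score)


# A score-type metric: the weaker cohort result is reported.
print(worst_case(0.71, 0.64))                          # prints 0.64
# An error-type metric: the larger error is reported.
print(worst_case(8.2, 9.5, higher_is_better=False))    # prints 9.5
```

Under this reading a configuration cannot rank well by excelling on one cohort alone, which matches the stated motivation of testing cross-cohort robustness.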
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing robustness to cohort effects and the need for statistical rigor in comparative claims. We address each major comment below.
Point-by-point responses
- Referee: [Abstract and Results] The central claim that 'the dominant design factor is task-dependent' and the resulting recommendation rest on the assumption that observed performance gaps isolate the effects of extractor/head/segmentation. The manuscript notes label-derivation effects for volume/stage but provides no quantitative sensitivity analysis or ablation for other cohort-level confounds (imaging protocol, reconstruction parameters) between LUNG1 and LUNG2; this directly affects attribution of the task-specific patterns.
  Authors: We agree that attributing performance patterns specifically to design choices requires careful consideration of cohort confounds. The manuscript already flags label-derivation effects for volume and stage. The worst-case cross-cohort metric was chosen precisely to reduce sensitivity to cohort-specific artifacts. However, a dedicated quantitative sensitivity analysis for imaging protocol or reconstruction parameters is not present. In revision we will add an explicit limitations paragraph to the Discussion that addresses these potential confounds, notes the absence of detailed protocol metadata in the public LUNG1/LUNG2 releases, and qualifies the strength of the task-dependence claim accordingly. We maintain that the observed patterns are still informative under the two-cohort evaluation design.
  Revision: partial
- Referee: [Results] Performance tables: the paper reports mean ranks and 'best' configurations without mentioning statistical significance testing (paired tests, confidence intervals, or multiple-comparison correction) on the cross-cohort differences. Given that the recommendation and 'dominant factor' statements rely on these rankings, absence of such tests weakens the strength of the comparative claims.
  Authors: We accept that the absence of statistical testing weakens the comparative statements. In the revised manuscript we will augment the performance tables with paired Wilcoxon signed-rank tests (or equivalent non-parametric tests) on the cross-cohort differences, apply multiple-comparison correction, and report p-values and confidence intervals where appropriate. This will allow readers to evaluate the reliability of the mean-rank differences that underpin the task-specific dominance claims and the recommended default configuration.
  Revision: yes
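The testing plan in the second response can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' procedure: it computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign patterns (practical only for small samples) and applies Holm's step-down correction, which is an assumed choice since the rebuttal does not name a correction method. All function names are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending (1-based), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided p-value for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    at_most = at_least = 0
    # Under the null, each rank is equally likely to carry a plus or minus sign.
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus <= observed + 1e-12:
            at_most += 1
        if w_plus >= observed - 1e-12:
            at_least += 1
    return min(1.0, 2 * min(at_most, at_least) / 2 ** len(ranks))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        running_max = max(running_max, (m - step) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Six paired scores where the first configuration always wins by a distinct margin.
p = wilcoxon_signed_rank_exact([5, 6, 7, 8, 9, 10], [0, 0, 0, 0, 0, 0])
print(p)  # prints 0.03125
```

With only two test cohorts, the pairing unit matters: the pairs would have to be something like tasks, resampling folds, or bootstrap replicates, and the choice changes what the p-value means. For larger samples a library routine with a normal approximation would be the practical option.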
Circularity Check
No circularity: purely empirical benchmark with direct cross-cohort measurements
Full rationale
The paper reports an empirical benchmark of feature extractors, heads, and segmentation choices on LUNG1 training and held-out LUNG2 evaluation cohorts. All claims (task-dependent dominance, recommendations for Curia+CatBoost) rest on tabulated performance metrics and mean ranks computed from those measurements. No equations, fitted parameters, uniqueness theorems, or self-citation chains appear in the derivation of results. The design is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The LUNG1 and LUNG2 cohorts are sufficiently representative for worst-case cross-cohort performance to indicate real-world robustness.
- Domain assumption: The five tasks are sufficiently distinct to reveal independent effects of segmentation versus classifier choice.