CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification
Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3
The pith
Quantitative phenotypes extracted from organ segmentations in abdominal CT scans classify multiple diseases with AUCs matching or exceeding a vision transformer baseline while remaining inspectable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CT-IDP generates organ- and compartment-level descriptors spanning morphometry, attenuation, and contextual burden from TotalSegmentator segmentations, then applies elastic-net regularized logistic regression under a frozen specification to produce disease-specific predictions. On the MERLIN benchmark the method records a macro-AUC of 0.897 versus 0.880 for the vision-transformer baseline; the same frozen model yields 0.877 versus 0.857 on Duke-Abdomen and 0.780 versus 0.756 on AMOS. Coefficient inspection and phenotype-stratified audits confirm that the performance edge arises from explicit, human-readable features rather than learned embeddings.
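The modeling step described here, sparse elastic-net logistic regression over standardized phenotype features with directly readable coefficients, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code; the feature layout, hyperparameters (`C`, `l1_ratio`), and data shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 300, 40  # studies x phenotype descriptors (toy sizes)
X_train = rng.normal(size=(n, p))
# Disease signal concentrated in a few phenotypes, mimicking the sparse,
# interpretable structure the paper reports.
logits = 2.0 * X_train[:, 0] - 1.5 * X_train[:, 3]
y_train = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.05, max_iter=5000),
)
model.fit(X_train, y_train)

# Coefficient inspection: each entry is one phenotype's contribution to
# the disease log-odds; elastic net drives uninformative ones toward zero.
coef = model.named_steps["logisticregression"].coef_.ravel()
print(int(np.count_nonzero(coef)), "nonzero coefficients of", p)
```

A "frozen specification" in this setting would mean fixing the fitted coefficients, scaler statistics, and hyperparameters after MERLIN training and applying them unchanged to external cohorts.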
What carries the argument
CT-IDP, the pipeline that converts multi-organ segmentations into sparse, elastic-net logistic regression models whose coefficients directly indicate the contribution of each measurable phenotype to disease probability.
Load-bearing premise
Automated segmentations remain accurate and unbiased enough across institutions that the derived numerical phenotypes retain their disease-discriminating power without systematic distortion.
What would settle it
Re-run the identical frozen models on a new multi-center cohort in which expert manual segmentations replace the automated ones; if performance then falls below the vision-transformer baseline, the derived phenotypes would be shown to be unreliable.
Original abstract
In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets of 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000 studies) and AMOS (1,107 studies). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ- and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on Duke-Abdomen, and 0.780 versus 0.756 on AMOS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CT-IDP, a framework deriving over 900 quantitative phenotypes (morphometry, attenuation, and burden descriptors) from TotalSegmentator multi-organ segmentations of abdominal CTs. Sparse elastic-net logistic regression models are trained on the MERLIN dataset (15,175/5,018/5,082 train/val/test studies) for disease classification and externally validated under frozen specification on Duke-Abdomen (2,000 studies) and AMOS (1,107 studies). It reports macro-AUC improvements over a DINOv3 vision-transformer baseline (0.897 vs 0.880 on MERLIN; 0.877 vs 0.857 on Duke; 0.780 vs 0.756 on AMOS), supported by phenotype-stratified audits and coefficient inspection for interpretability.
Significance. If the segmentation-derived phenotypes remain disease-discriminative after accounting for tool errors, the work supplies a reproducible, interpretable alternative to end-to-end deep models for multi-institutional CT classification. External validation with frozen models, plus explicit phenotype audits, strengthens reproducibility and offers a path toward clinically auditable predictions; the modest but consistent AUC gains across three datasets indicate practical utility if bias is ruled out.
major comments (2)
- [Abstract and Methods] All 900+ phenotypes are derived directly from TotalSegmentator outputs, yet no per-organ Dice, Hausdorff, or volume-error metrics are reported on the diseased subsets of MERLIN, Duke-Abdomen, or AMOS. Pathologies (tumors, ascites, inflammation) routinely distort boundaries and densities; without cohort-specific validation, it is unclear whether the modest AUC gains (e.g., +0.017 on MERLIN) reflect true signal or systematic segmentation bias propagated into the elastic-net models.
- [Results] Phenotype definition: The manuscript states that full phenotype definitions and any post-hoc selection or filtering steps are provided, but these are not visible in the supplied description; without an exhaustive, reproducible list (including exact formulas for attenuation statistics and burden ratios), independent replication and assessment of potential circularity in phenotype construction cannot be performed.
minor comments (2)
- [Abstract] The abstract should explicitly list the disease labels and number of classes underlying the macro-AUC computation to allow immediate assessment of task difficulty.
- Tables or supplementary material reporting the top phenotype coefficients per disease would benefit from standardized formatting and confidence intervals to facilitate direct comparison with the DINOv3 baseline.
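For orientation on the first minor comment: macro-AUC, as used throughout the report, is the unweighted mean of per-disease one-vs-rest AUCs. A minimal sketch on made-up labels and scores (the actual disease label set is exactly what the comment asks the authors to state):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical multi-label setup: each column of y_true is one disease,
# each column of y_score is that disease model's predicted probability.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.7, 0.6],
                    [0.2, 0.1], [0.8, 0.7], [0.1, 0.9]])

per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
             for j in range(y_true.shape[1])]
macro_auc = float(np.mean(per_label))  # unweighted mean over diseases
print(round(macro_auc, 3))  # → 0.944
```

Because the mean is unweighted, macro-AUC weights rare and common diseases equally, which is why knowing the label list matters for judging task difficulty.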
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which highlight important aspects of reproducibility and validation. We provide point-by-point responses below and will revise the manuscript accordingly to address the concerns raised.
Point-by-point responses
-
Referee: [Abstract and Methods] All 900+ phenotypes are derived directly from TotalSegmentator outputs, yet no per-organ Dice, Hausdorff, or volume-error metrics are reported on the diseased subsets of MERLIN, Duke-Abdomen, or AMOS. Pathologies (tumors, ascites, inflammation) routinely distort boundaries and densities; without cohort-specific validation, it is unclear whether the modest AUC gains (e.g., +0.017 on MERLIN) reflect true signal or systematic segmentation bias propagated into the elastic-net models.
Authors: We appreciate the referee's emphasis on potential segmentation inaccuracies in pathological cases. TotalSegmentator has been validated on diverse CT datasets including pathologies in its original work and follow-up studies, but we acknowledge that explicit per-organ Dice, Hausdorff, and volume-error metrics on the diseased subsets of our specific cohorts are not reported in the current manuscript. This is a valid limitation that could affect interpretation of the modest AUC improvements. In the revised version, we will add a new subsection in Methods discussing segmentation performance expectations based on published benchmarks, along with a small-scale manual audit of segmentation quality on a random sample of diseased cases from MERLIN. We will also expand the Discussion to address how any residual errors might influence phenotype derivation and model performance. While the consistent gains across external datasets and the use of sparse, interpretable models provide some reassurance against systematic bias, we agree these additions will strengthen the manuscript. revision: yes
-
Referee: [Results] Phenotype definition: The manuscript states that full phenotype definitions and any post-hoc selection or filtering steps are provided, but these are not visible in the supplied description; without an exhaustive, reproducible list (including exact formulas for attenuation statistics and burden ratios), independent replication and assessment of potential circularity in phenotype construction cannot be performed.
Authors: We apologize for the lack of immediate visibility of the full phenotype details in the review materials. The exhaustive list of over 900 phenotypes—including exact formulas for morphometry (e.g., volumes, surface areas), attenuation statistics (mean, standard deviation, percentiles of Hounsfield units within each organ mask), and burden ratios (e.g., compartment involvement fractions)—along with all post-hoc filtering steps, is provided in the Supplementary Materials and the linked public code repository. Phenotypes are constructed solely from segmentation outputs without any use of disease labels, avoiding circularity. To improve accessibility, we will revise the Methods section to include a summary table of phenotype categories with representative formulas and ensure the supplementary file is explicitly referenced in the main text. This will facilitate independent replication and allow direct assessment of the phenotype construction process. revision: yes
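The kinds of formulas the rebuttal cites (volumes, Hounsfield-unit statistics within an organ mask, percentile attenuation descriptors) can be sketched as follows. This is an illustration on a toy volume with names of our own choosing, standing in for a TotalSegmentator output, not the authors' implementation:

```python
import numpy as np

# Toy 3-D CT volume (HU values) and a binary organ mask; both hypothetical.
rng = np.random.default_rng(1)
ct_hu = rng.normal(loc=50.0, scale=15.0, size=(40, 40, 40))
mask = np.zeros(ct_hu.shape, dtype=bool)
mask[10:30, 10:30, 10:30] = True          # 20 x 20 x 20 voxel "organ"
voxel_spacing_mm = (1.0, 1.0, 2.5)        # (x, y, z) spacing in mm

organ_hu = ct_hu[mask]
voxel_volume_ml = float(np.prod(voxel_spacing_mm)) / 1000.0
phenotypes = {
    # Morphometry: voxel count times per-voxel volume
    # (8,000 voxels x 2.5 mm^3 = 20,000 mm^3 = 20 ml here).
    "volume_ml": float(mask.sum() * voxel_volume_ml),
    # Attenuation statistics over HU values inside the mask.
    "hu_mean": float(organ_hu.mean()),
    "hu_std": float(organ_hu.std()),
    "hu_p10": float(np.percentile(organ_hu, 10)),
    "hu_p90": float(np.percentile(organ_hu, 90)),
}
print(sorted(phenotypes))
```

Because every quantity is computed from the segmentation mask and raw HU values alone, no disease label enters the construction, which is the basis of the authors' no-circularity argument.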
Circularity Check
No significant circularity; external validation and label-free phenotype derivation performed before training keep the claims independent
full rationale
The paper generates multi-organ segmentations via TotalSegmentator, derives >900 phenotypes (morphometry, attenuation, burden) from those outputs, trains elastic-net logistic regression on the MERLIN training split, and reports AUC on held-out MERLIN test plus two fully external datasets (Duke-Abdomen, AMOS). No equation or claim reduces a reported performance number to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled via prior work. The central results are therefore falsifiable on independent data and do not collapse to the inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- elastic-net regularization parameters
axioms (1)
- domain assumption: TotalSegmentator produces sufficiently accurate multi-organ segmentations for phenotype derivation
Forward citations
Cited by 1 Pith paper
-
JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift
JANUS conditions Vision Transformer embeddings on macro-radiomic priors via anatomically guided gating, reaching macro-AUROC 0.88 on an internal test set of 5082 cases and 0.87 on an external set of 2000 cases while i...