Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Haoyu Dong; James E. Baciak; Mojtaba Safari; Shansong Wang; Xiaofeng Yang; Yuan Gao; Yuheng Li; Yuxiang Lai

arxiv: 2605.21906 · v2 · pith:OIVZEORGnew · submitted 2026-05-21 · 💻 cs.CV

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

Pith reviewed 2026-05-22 07:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords CT · foundation model · pretraining · medical imaging · segmentation · classification · vision-language

The pith

A single CT foundation model trained in three agglomerative stages matches or exceeds task-specific models across five clinical task families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexiCT, a family of models pretrained on 266,227 CT volumes drawn from 56 public datasets. It uses a three-stage process that starts with 2D axial slices, moves to 3D anatomical volumes, and finishes with report-guided semantic alignment. The goal is to replace the current patchwork of separate models for different CT tasks with one set of general representations. If the approach holds, clinicians and researchers could use the same embeddings for segmentation, classification, registration, report interpretation, and retrieval while also reading off disease progression signals directly from the learned space.

Core claim

FlexiCT is trained by agglomerative continual pretraining in three stages—two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment—on 266,227 CT volumes from 56 publicly available datasets. The resulting representations match or exceed prior task-specific approaches on benchmarks spanning segmentation, classification, registration, vision-language understanding, and clinical retrieval. The same embeddings further organize scans along gradients linked to tumor stage progression.

What carries the argument

Three-stage agglomerative continual pretraining that progressively builds slice-level, volume-level, and vision-language representations from the same data pool.

If this is right

One model family supports slice-level, volume-level, and vision-language analysis without retraining from scratch.
Embeddings capture disease phenotype information such as tumor stage gradients even without explicit supervision for those labels.
A public checkpoint and code release creates a shared starting point for new CT tasks instead of training each one separately.
Clinical retrieval and report alignment become feasible using the same representation space built for imaging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged pretraining recipe could be tried on MRI or PET to test whether modality-specific foundations emerge without starting from scratch each time.
If the tumor-stage organization generalizes, the embeddings might support longitudinal tracking of individual patients across multiple scans.
Evaluating the model on private multi-center data with scanner and demographic shifts would directly test whether the public-data pretraining is sufficient for broad deployment.

Load-bearing premise

The 56 public datasets are representative of clinical practice and free of leakage so that the learned representations transfer to new patient populations and scanners.

What would settle it

A clear drop in performance relative to task-specific baselines when the model is tested on CT data from a previously unseen hospital network or scanner vendor would show the representations are not yet universal.

Figures

Figures reproduced from arXiv: 2605.21906 by Haoyu Dong, James E. Baciak, Mojtaba Safari, Shansong Wang, Xiaofeng Yang, Yuan Gao, Yuheng Li, Yuxiang Lai.

**Figure 1.** Dataset statistics and three-stage pretraining strategy of FlexiCT. a, Composition of the FlexiCT pretraining dataset. Four donut charts summarise body region (top left; n = 266,227 volumes), geographic distribution (top right; n = 266,227), disease family (bottom left; n = 186,700 volumes with case- or cohort-level labels) and anatomical system (bottom right; same n). b, Frequency of the top 20 clinical c…

**Figure 2.** FlexiCT outperforms foundation models across 3D and 2D segmentation benchmarks. a, Volumetric segmentation Dice coefficient on six abdominal, thoracic and whole-body benchmarks (KiTS23, WORD, MSD Liver, MSD Lung, MSD Pancreas, and AutoPET), comparing nnU-Net, Primus-M, VoCo, CT-FM and FlexiCT-3D (red). b, Slice-level segmentation Dice coefficient on TotalSegmentator (104 anatomical classes partitioned into…

**Figure 3.** FlexiCT-2D enables training-free intra- and cross-modal abdominal registration. a, Per-organ Dice similarity coefficient on the Learn2Reg abdominal CT–CT task across 13 organs (n = 45 registration pairs across 5-fold cross-validation), comparing VoxelMorph, Curia, DINO-Reg and FlexiCT-2D (red). Curia, DINO-Reg and FlexiCT-2D share the same ConvexAdam optimisation framework and differ only in the feature ba…

**Figure 4.** FlexiCT-2D enables label-efficient disease classification from frozen features. a–d, Label-efficiency curves for frozen pretrained encoders trained for: renal tumor subtyping (KiTS; a), universal lesion classification (Deep-Lesion; b), pulmonary nodule detection (Luna16; c) and COVID-19 identification (Covidx-CT; d). X-axis labels give training-sample counts; dashed lines mark each model’s full-data AUC (n…

**Figure 5.** FlexiCT-3D embeddings organize tumors along clinical severity gradients without staging supervision. a, Zero-shot tumor retrieval (Recall@1, Recall@3) for T-stage (NSCLCRadiogenomics) and ISUP grade (C4KC-KiTS), comparing CT-FM, VoCo, SPECTRE and FlexiCT-3D. b, Linear probing (AUC, balanced accuracy) on frozen embeddings for T-stage and ISUP grade, including a tumor-diameter-only clinical baseline (grey).…

**Figure 6.** FlexiCT-3D-VLM supports zero-shot disease classification and report retrieval across chest and abdominal CT. a, Zero-shot multi-label disease classification on CT-RATE (left) and Merlin (right), reporting macro-averaged precision, F1, accuracy (ACC) and area under the ROC curve (AUC). Baselines are CT-CLIP, COLIPRI and SPECTRE on CT-RATE; Merlin, COLIPRI and SPECTRE on the Merlin benchmark. b, Semantic rep…
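
Several of the captions above report the Dice coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name `dice_coefficient`, the flat-list mask layout, and the empty-mask convention are illustrative assumptions; they are not taken from the paper or its code release.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction that shares one foreground voxel with the target, where each mask has two foreground voxels, scores 2·1 / (2 + 2) = 0.5. Multi-class benchmarks typically compute this per class and average, though how the paper aggregates is not stated in the captions.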

read the original abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlexiCT gives a workable three-stage pretraining recipe on a big public CT collection and releases code, but the generalization claims need tighter checks on dataset overlap before they land.

Hey, the core of this paper is FlexiCT, a CT model pretrained in three stages (2D axial, then 3D anatomical, then report-guided alignment) on 266k volumes from 56 public datasets. It reports matching or beating task-specific baselines on segmentation, classification, registration, vision-language, and retrieval, and the embeddings appear to line up with tumor-stage gradients. The scale and the code release at the GitHub link are the practical wins; anyone who needs a starting CT backbone can actually use it without starting from scratch. The staged pipeline is a clear extension of earlier single-stage foundation work, and the phenotype-organization observation is worth following up if the numbers hold.

The soft spots are straightforward. The abstract gives no concrete metrics, baselines, or statistical details, so the performance edge is hard to judge from the summary alone. The bigger issue is the 56-dataset mix: without explicit confirmation that downstream test splits are free of patient or scan overlap, the results could partly reflect leakage rather than true transfer. That assumption is load-bearing for the universal-representation story. Methods look standard for this area, no obvious circularity or invented steps.

This is for medical-imaging groups that want a reusable CT encoder or are exploring direct phenotype readout from volumes. A reader who needs code and a broad starting point will get immediate value; someone chasing the strongest possible generalization claims will want the full evaluation tables first. It deserves a serious referee to examine the data splits and downstream protocols. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining on 266,227 volumes drawn from 56 public datasets. The stages consist of 2D axial pretraining, 3D anatomical pretraining, and report-guided semantic alignment. The resulting embeddings are evaluated across five downstream task families (segmentation, classification, registration, vision-language understanding, and clinical retrieval) and are reported to match or exceed prior task-specific models on multiple benchmarks while also organizing scans along gradients associated with tumor stages.

Significance. If the central claims hold after addressing evaluation details, this would constitute a meaningful contribution to medical imaging by demonstrating that a single set of representations can span anatomy-to-phenotype tasks without task-specific retraining. The scale of the public data collection and the staged pretraining approach are notable strengths, as is the public release of code.

major comments (2)

[§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.
[Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.
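
The overlap check requested in the first major comment can be illustrated with a small sketch. It assumes each scan carries a `(dataset_name, patient_id)` pair; the record layout, `normalize_id`, and `find_overlap` are hypothetical names introduced here and do not describe the authors' audit. Because public archives often re-anonymize identifiers, an ID match of this kind can only give a lower bound on true overlap.

```python
def normalize_id(raw_id):
    """Strip whitespace and lowercase so trivially different spellings match."""
    return raw_id.strip().lower()


def find_overlap(pretrain_records, test_records):
    """Return test patients whose normalized ID also appears in pretraining.

    Both arguments are iterables of (dataset_name, patient_id) pairs.
    Each result is (normalized_id, test_dataset, sorted pretraining datasets).
    """
    # Index pretraining patients by normalized ID, remembering source datasets.
    pretrain_index = {}
    for dataset, patient_id in pretrain_records:
        pretrain_index.setdefault(normalize_id(patient_id), set()).add(dataset)

    overlaps = []
    for dataset, patient_id in test_records:
        key = normalize_id(patient_id)
        if key in pretrain_index:
            overlaps.append((key, dataset, sorted(pretrain_index[key])))
    return overlaps
```

A non-empty result would flag cases to exclude or report. An empty result does not by itself rule out leakage, since the same scan can appear under different identifiers in different collections.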

minor comments (2)

[§3] Clarify the precise definition and weighting of the three pretraining stages (e.g., loss functions and data sampling ratios) to improve reproducibility.
[Figures 4-5] Figure captions should explicitly state the number of samples and any exclusion criteria used for the phenotype organization visualizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, with revisions made to strengthen the presentation of our methods and results.

Referee: [§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.

Authors: We agree that explicit verification of patient and scan overlaps is critical to support the generalization claims. The pretraining corpus was assembled exclusively from public datasets, and downstream evaluations followed the official published splits and protocols for each benchmark. However, the initial submission did not include a dedicated overlap audit. We have now performed this analysis using available metadata (patient identifiers, acquisition dates, and institutional tags where present across the public releases). The audit results, including any minimal overlaps detected and mitigation steps, have been added to §4 along with a new supplementary table. This revision directly bolsters the validity of the reported transfer performance. revision: yes
Referee: [Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.

Authors: We appreciate the need for greater transparency in the quantitative results. The original manuscript summarized key outcomes in the main text while directing readers to supplementary materials for full tables. To address this concern, we have expanded the Results section with comprehensive tables for all five task families. These now report exact metric values, error bars or confidence intervals, statistical significance tests (e.g., paired comparisons against baselines), and complete baseline details with both originally reported and reproduced scores. Updated figures accompany the tables to facilitate assessment of effect sizes. revision: yes
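
One way to produce the confidence intervals and paired comparisons mentioned in the second response is a paired bootstrap over per-case scores. The sketch below is a generic illustration under stated assumptions: the rebuttal does not say which test was used, and the function name, resample count, and percentile-interval choice are all introduced here for illustration.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (A minus B).

    scores_a and scores_b must be aligned: index i is the same test case
    scored by model A and by model B.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # Resample cases with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval excludes zero, the per-case advantage is unlikely to be a resampling artifact at the chosen level. Pairing matters because it removes case-to-case difficulty from the comparison, which an unpaired test would treat as noise.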

Circularity Check

0 steps flagged

No circularity: empirical pretraining evaluated on external benchmarks

full rationale

The paper describes an empirical agglomerative pretraining pipeline on 266,227 CT volumes from 56 public datasets, followed by evaluation across standard downstream task families. No equations, fitted parameters, or derivations are presented that reduce reported performance or embeddings to definitional equivalence with the inputs. The approach relies on external public data and benchmarks rather than self-referential steps, self-citation chains, or ansatzes smuggled via prior work. This is a self-contained experimental result against external validation sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the aggregated public datasets plus standard deep-learning transfer assumptions; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption The 56 public datasets collectively provide unbiased coverage of anatomical and pathological variation sufficient for learning universal representations.
Invoked implicitly by the scale and diversity claims in the abstract; if violated, transfer performance would degrade.

pith-pipeline@v0.9.0 · 5736 in / 1350 out tokens · 50089 ms · 2026-05-22T07:47:38.467584+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We train using a DINOv3 self-supervised framework ... iBOT masked patch prediction loss ... contrastive loss.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.