arxiv: 2606.11606 · v1 · pith:C7ZEXKSAnew · submitted 2026-06-10 · 💻 cs.CV

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Pith reviewed 2026-06-27 10:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords chest radiography · vision transformers · foundation models · small lesions · embeddings · pooling · signal retention

The pith

Frozen ViT embeddings for chest X-rays suppress small-lesion signal during global pooling, but the signal can be recovered from patch tokens given a region of interest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five frozen vision-transformer foundation models on large chest-radiography collections. On small-scale perturbations, classification-token embeddings perform at chance, and whole-image patch-mean embeddings are indistinguishable from them on the finest perturbations. On real lesions, image-level classification-token performance shows within-class gaps of up to 0.243 AUC between small and large lesions. The same forward pass yields AUC near 1.0 on the perturbation panel, and at least 0.899 on real lesions, when pooling is restricted to patches inside the region of interest. A frozen ResNet-50 control reproduces the chance floor. The work therefore isolates the global-aggregation step as the point where small-scale, low-contrast signal is discarded.

Core claim

Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.

What carries the argument

Comparison of three pooling modes (CLS token, patch-mean, bounding-box-restricted patch-local) extracted from the identical frozen forward pass on real and perturbed CXR images.
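
As a rough illustration of the three pooling modes, the sketch below reads all three from one token array. It is a minimal sketch, not the paper's implementation: the function name `pool_embeddings`, the token layout (CLS in row 0, patch tokens in row-major order, no register tokens), and the pixel-to-patch box mapping are all assumptions made for the example.

```python
import numpy as np

def pool_embeddings(tokens, grid_hw, patch_px, box_xyxy):
    """Return (cls, patch_mean, patch_local) from one set of output tokens.

    tokens   : array of shape (1 + H*W, D); row 0 is assumed to be CLS and
               the remaining rows patch tokens in row-major order.
    grid_hw  : (H, W) size of the patch grid.
    patch_px : side length of one patch in input pixels.
    box_xyxy : (x0, y0, x1, y1) box in input pixels, x1 and y1 exclusive.
    """
    h, w = grid_hw
    cls = tokens[0]
    patches = tokens[1:].reshape(h, w, -1)
    patch_mean = patches.reshape(h * w, -1).mean(axis=0)

    # Map the pixel box to the inclusive range of patches it touches.
    x0, y0, x1, y1 = box_xyxy
    c0 = max(int(x0 // patch_px), 0)
    r0 = max(int(y0 // patch_px), 0)
    c1 = min(int((x1 - 1) // patch_px), w - 1)
    r1 = min(int((y1 - 1) // patch_px), h - 1)
    local = patches[r0:r1 + 1, c0:c1 + 1].reshape(-1, patches.shape[-1])
    return cls, patch_mean, local.mean(axis=0)

# Toy input: a 16 x 16 grid of zero tokens with one patch set to ones.
toy = np.zeros((1 + 16 * 16, 4))
toy[1 + 2 * 16 + 3] = 1.0  # patch at grid row 2, column 3
cls, patch_mean, patch_local = pool_embeddings(toy, (16, 16), 14, (42, 28, 56, 42))
print(patch_mean[0], patch_local[0])
```

In this toy case the single shifted patch moves the patch-mean by only 1/256 of its value, while the box-restricted mean keeps it in full. Real CLS tokens are produced by attention rather than averaging, so the toy only illustrates the dilution arithmetic for patch-mean pooling.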

If this is right

  • Any downstream CXR classifier that ingests only CLS or global-mean embeddings will under-detect small lesions.
  • Patch-local extraction from the same frozen ViT can be inserted into existing pipelines without retraining the backbone.
  • The ResNet-50 architectural control exhibits the same global-pooling loss, indicating the effect is not ViT-specific.
  • Model selection for pre-deployment CXR screening must include small-lesion stratified evaluation rather than image-level AUC alone.
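
The last point can be made concrete with a small sketch of size-stratified AUC. This is an illustrative sketch only: the helper names `auc` and `stratified_auc`, the stratum labels, and the choice to share all negatives across strata are assumptions, and the scores are made-up toy numbers rather than results from the paper.

```python
def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_auc(scores, labels, strata):
    """AUC per lesion-size stratum.

    strata holds a stratum name for each positive and None for each
    negative; negatives are shared by every stratum (an assumption).
    """
    out = {}
    for name in sorted({s for s in strata if s is not None}):
        keep = [i for i, s in enumerate(strata) if s == name or s is None]
        out[name] = auc([scores[i] for i in keep], [labels[i] for i in keep])
    return out

# Toy scores: large lesions score high, small lesions score like negatives.
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2]
labels = [1, 1, 1, 1, 0, 0]
strata = ["large", "large", "small", "small", None, None]
print(auc(scores, labels))                     # 0.75
print(stratified_auc(scores, labels, strata))  # {'large': 1.0, 'small': 0.5}
```

The pooled AUC of 0.75 in this toy example hides a small-lesion stratum sitting at chance, which is the kind of gap an image-level AUC alone would not reveal.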

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment pipelines could add an ROI-guided patch readout stage without changing the frozen encoder.
  • The same global-pooling loss may appear in other small-object medical imaging tasks that rely on frozen ViT backbones.
  • Future foundation-model training objectives could add explicit small-scale contrast preservation losses.

Load-bearing premise

The small-scale perturbation panel and bounding-box-stratified probe isolate embedding-level signal retention without being confounded by dataset labeling noise or pretraining objectives.

What would settle it

Re-running the same perturbation panel and bounding-box probe on a new CXR cohort with independently verified small-lesion annotations and obtaining the same AUC gap between CLS and patch-local pooling.

Figures

Figures reproduced from arXiv: 2606.11606 by Alekhya Jilla, Bardia Khosravi, Frank Li, Judy Gichoya, Mohammadreza Chavoshi, Raajitha Muthyala, Saptarshi Purkayastha, Theo Dapamede, Zhenan Yin.

Figure 1: Patch-pooling waterfall for representative perturbations. Per (model, perturbation, dataset) cell, three bars: CLS, patch-mean, patch-local. The patch-local bar reaches AUC ≥ 0.99 on every cell with a chance-level CLS bar, directly localizing the loss of small-scale signal to the global-aggregation step of the frozen forward pass. The patch-mean bar is indistinguishable from CLS at the chance floor for eve…
Figure 2: Summary: CLS vs patch-local AUC across all experimental conditions. Heat-map matrix of linear-probe AUC across the 90 controlled-stimulus cells (5 foundation models × 6 representative perturbations × 3 datasets) and the 15 natural-lesion cells (5 FMs × 3 ChestX-Det10 classes). Top row: CLS-pool AUC clusters around 0.50 across the full controlled-stimulus grid; the only cells lifting off the chance floor ar…
Original abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that frozen ViT foundation-model embeddings for chest radiography suppress small-scale, low-contrast lesion signal at the global-aggregation step (CLS token or full-image patch-mean pooling). CLS AUCs sit near chance (0.500-0.524) on a synthetic perturbation panel, and image-level CLS classification shows small-versus-large stratum gaps of up to 0.243 AUC on ChestX-Det10. The suppressed signal is recoverable from the same forward pass when patch tokens are restricted to the region of interest: AUC ~1.0 on the perturbation panel (per-model mean gains +0.412 to +0.488) and AUC >= 0.899 with ground-truth bounding boxes on ChestX-Det10. The pattern holds across five ViTs, with a ResNet-50 control reproducing the chance floor, on an aggregate pool of n=492,724 CXR images plus ChestX-Det10 (n=3,543).

Significance. If the central empirical distinction between global aggregation loss and patch-local recovery holds after addressing potential confounds, the result is significant for pre-deployment evaluation of foundation models in medical imaging: it quantifies a previously unmeasured failure mode for small-lesion tasks and supplies a concrete diagnostic (bounding-box-restricted patch probes) that could guide ROI-aware fine-tuning or hybrid pooling strategies. The large aggregate sample, directional consistency across five ViTs, and dual synthetic-plus-real-lesion design are strengths that make the observation reproducible and falsifiable.

major comments (2)
  1. [Abstract / ChestX-Det10 results] Abstract and ChestX-Det10 bounding-box probe (n=1,462 small-lesion boxes): the claim that the observed +0.243 AUC small-vs-large gaps and +0.412–0.488 patch-local recovery isolate signal loss specifically at the global-aggregation step is load-bearing for the central thesis, yet the manuscript provides no sensitivity analysis to bounding-box jitter or localization precision; if small-lesion boxes contain typical annotation noise, the CLS depression could arise from mislocalized examples while patch-local AUC remains high simply because selected patches still capture residual signal.
  2. [Abstract / perturbation panel] Perturbation panel results (CLS AUC 0.500-0.524, patch-mean indistinguishable on iso-blur/reticular-fine cells): the reported values are presented without error bars, exact statistical tests, or full exclusion criteria, leaving open the possibility that unstated post-hoc choices affect the directional findings; this weakens attribution of the chance-floor behavior solely to aggregation mechanics rather than dataset or model-specific factors.
minor comments (2)
  1. [Methods / control model] The ResNet-50 architectural control is stated to reproduce the chance floor, but the exact pooling implementation (CLS-equivalent vs. global average) and layer from which tokens are taken should be specified to confirm equivalence with the ViT forward-pass setup.
  2. [Abstract / model list] Model naming (DINOv3 ViT-7B) and pretraining-domain details for each of the five ViTs could be clarified with a table for reproducibility.
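
Major comment 1 asks for a sensitivity analysis to bounding-box jitter. A minimal sketch of such a check is given below, assuming pure translation jitter with Gaussian offsets and boxes that fit inside the image; the helper names, the example boxes, the 1024-pixel image size, and the sigma values are illustrative choices, not details taken from the manuscript.

```python
import random

def jitter_box(box, sigma, rng, img_wh):
    """Translate a box (x0, y0, x1, y1) by a Gaussian offset, kept inside the image."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    dx, dy = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
    nx0 = min(max(x0 + dx, 0.0), img_wh[0] - w)
    ny0 = min(max(y0 + dy, 0.0), img_wh[1] - h)
    return (nx0, ny0, nx0 + w, ny0 + h)

def overlap_fraction(box, jittered):
    """Fraction of the original box area still covered by the jittered box."""
    ix = max(0.0, min(box[2], jittered[2]) - max(box[0], jittered[0]))
    iy = max(0.0, min(box[3], jittered[3]) - max(box[1], jittered[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return ix * iy / area

# Hypothetical small-lesion boxes in pixel coordinates.
rng = random.Random(0)
boxes = [(100, 100, 130, 130), (400, 250, 420, 265)]
for sigma in (5, 10, 20):
    hits = [
        overlap_fraction(b, jitter_box(b, sigma, rng, (1024, 1024))) > 0
        for b in boxes
        for _ in range(200)
    ]
    print(sigma, sum(hits) / len(hits))
```

Re-running a patch-local probe on the jittered boxes and tracking AUC against the overlap rate would show how quickly the recovered signal degrades as localization gets noisier.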

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major comment below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract / ChestX-Det10 results] Abstract and ChestX-Det10 bounding-box probe (n=1,462 small-lesion boxes): the claim that the observed +0.243 AUC small-vs-large gaps and +0.412–0.488 patch-local recovery isolate signal loss specifically at the global-aggregation step is load-bearing for the central thesis, yet the manuscript provides no sensitivity analysis to bounding-box jitter or localization precision; if small-lesion boxes contain typical annotation noise, the CLS depression could arise from mislocalized examples while patch-local AUC remains high simply because selected patches still capture residual signal.

    Authors: We agree a sensitivity analysis to bounding-box jitter would further isolate aggregation effects from annotation noise. In revision we will add experiments that apply controlled Gaussian perturbations (sigma = 5–20 pixels) to the ground-truth boxes on ChestX-Det10 and recompute patch-local AUCs; we will also report the fraction of boxes whose perturbed regions still overlap the original lesion. The synthetic perturbation panel supplies orthogonal evidence that does not depend on any bounding-box annotations and already shows CLS AUC at the chance floor across five ViTs. revision: yes

  2. Referee: [Abstract / perturbation panel] Perturbation panel results (CLS AUC 0.500-0.524, patch-mean indistinguishable on iso-blur/reticular-fine cells): the reported values are presented without error bars, exact statistical tests, or full exclusion criteria, leaving open the possibility that unstated post-hoc choices affect the directional findings; this weakens attribution of the chance-floor behavior solely to aggregation mechanics rather than dataset or model-specific factors.

    Authors: We will add bootstrap 95% confidence intervals for every AUC, state the use of DeLong’s test for AUC comparisons, and include the complete exclusion criteria, cohort definitions, and image-level filtering steps in the Methods. These changes will allow readers to evaluate the statistical support for the reported chance-floor behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical forward-pass measurements

Full rationale

The paper consists entirely of empirical evaluations: forward passes of five frozen ViTs plus a ResNet control on public CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR, ChestX-Det10), followed by direct AUC comparisons across three pooling modes (CLS, patch-mean, bounding-box-restricted patch-local) on a synthetic perturbation panel and real-lesion bounding-box stratification. No equations, fitted parameters, derivations, or predictions appear; the reported AUC gaps (+0.243, +0.412–0.488) are raw measurement outputs, not reductions of any claimed derivation. No self-citations are invoked as load-bearing premises. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical measurement study; it relies on standard assumptions of machine-learning evaluation rather than new mathematical derivations or postulated entities.

axioms (1)
  • domain assumption: The perturbation panel and bounding-box probes isolate embedding signal retention independent of downstream classifier choice or labeling noise.
    Central to interpreting AUC differences as evidence of signal loss at the aggregation step.

pith-pipeline@v0.9.1-grok · 5983 in / 1165 out tokens · 31316 ms · 2026-06-27T10:34:47.569949+00:00 · methodology

