Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Alekhya Jilla; Bardia Khosravi; Frank Li; Judy Gichoya; Mohammadreza Chavoshi; Raajitha Muthyala; Saptarshi Purkayastha; Theo Dapamede; Zhenan Yin

arxiv: 2606.11606 · v1 · pith:C7ZEXKSAnew · submitted 2026-06-10 · 💻 cs.CV

Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha

Pith reviewed 2026-06-27 10:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: chest radiography, vision transformers, foundation models, small lesions, embeddings, pooling, signal retention

The pith

Frozen ViT embeddings for chest X-rays suppress small-lesion signal during global pooling but recover it from patch tokens given a region of interest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five frozen vision-transformer foundation models on large chest-radiography collections and shows that classification-token and whole-image patch-mean embeddings perform at chance on small-scale perturbations and on small lesions. The same forward pass yields near-perfect detection when pooling is restricted to patches inside lesion bounding boxes. A ResNet control reproduces the same global-pooling failure. The work therefore isolates the global-aggregation step as the point where small-scale, low-contrast signal is discarded.
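
A back-of-envelope dilution estimate may help show why a whole-image mean can wash out a small footprint. This is an illustrative idealization, not a derivation from the paper: it assumes the perturbation shifts only k of the N patch tokens by a common vector δ and leaves the rest untouched (self-attention will not satisfy this exactly), it says nothing about the CLS token, and the 224-pixel input size is an assumed resolution.

```latex
\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i,
\qquad
\bar{z}' - \bar{z} = \frac{k}{N}\,\delta,
\qquad
\bar{z}'_{\text{local}} - \bar{z}_{\text{local}} = \delta .

% Example: patch size 14 on an assumed 224 x 224 input gives N = (224/14)^2 = 256.
% A footprint covering k = 4 patches moves the global mean by 4/256 = 1/64 of \delta,
% while a mean restricted to those 4 patches moves by the full \delta.
```

Under these assumptions the global mean carries the perturbation at roughly 1.6% of its patch-level magnitude, which is the qualitative pattern the pooling comparison reports.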

Core claim

Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.

What carries the argument

Comparison of three pooling modes (CLS token, patch-mean, bounding-box-restricted patch-local) extracted from the identical frozen forward pass on real and perturbed CXR images.
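
The three pooling modes can be sketched on a plain list of final-layer tokens. This is a minimal illustration, not the authors' code: the function name `pool_three_ways`, the token layout (CLS at index 0, patch tokens in row-major order, no register tokens), and the rule that a patch counts as "inside" when its pixel extent overlaps the box are all assumptions; real checkpoints may order tokens differently.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pool_three_ways(
    tokens: Sequence[Sequence[float]],
    grid_hw: Tuple[int, int],
    patch_px: int,
    box_xyxy: Tuple[float, float, float, float],
) -> Tuple[Vector, Vector, Vector]:
    """Return (cls, patch_mean, patch_local) from one set of final-layer tokens.

    Assumes tokens[0] is the CLS token and tokens[1:] are patch tokens in
    row-major order (hypothetical layout, see the lead-in).
    """
    rows, cols = grid_hw
    cls_vec = list(tokens[0])
    patches = tokens[1:]
    if len(patches) != rows * cols:
        raise ValueError("token count does not match the patch grid")

    x0, y0, x1, y1 = box_xyxy
    inside = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch_px, r * patch_px
            px1, py1 = px0 + patch_px, py0 + patch_px
            # Keep every patch whose pixel extent overlaps the box.
            if px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0:
                inside.append(patches[r * cols + c])
    if not inside:
        raise ValueError("box does not overlap any patch")

    return cls_vec, mean_vector(patches), mean_vector(inside)


# Toy 2 x 2 grid with 1-D tokens: only the bottom-right patch carries signal.
toy_tokens = [[9.0], [0.0], [0.0], [0.0], [4.0]]
cls_vec, patch_mean, patch_local = pool_three_ways(
    toy_tokens, grid_hw=(2, 2), patch_px=14, box_xyxy=(15, 15, 27, 27)
)
print(cls_vec, patch_mean, patch_local)  # [9.0] [1.0] [4.0]
```

All three vectors come from the same token list, so the comparison needs no second forward pass; only the readout changes.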

If this is right

Any downstream CXR classifier that ingests only CLS or global-mean embeddings will under-detect small lesions.
Patch-local extraction from the same frozen ViT can be inserted into existing pipelines without retraining the backbone.
Architectural controls such as ResNet-50 exhibit the same global-pooling loss, indicating the effect is not ViT-specific.
Model selection for pre-deployment CXR screening must include small-lesion stratified evaluation rather than image-level AUC alone.
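
A size-stratified evaluation of the kind the last point calls for could look like the sketch below. It is an illustration under stated assumptions, not the paper's protocol: the helper names `auc` and `stratified_auc` are hypothetical, each positive image is assumed to carry a size label such as "small" or "large", and negatives (label 0, stratum `None`) are shared across strata. The pairwise AUC is quadratic in sample count and is meant for small examples only.

```python
from typing import Dict, Optional, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the Mann-Whitney statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def stratified_auc(
    scores: Sequence[float],
    labels: Sequence[int],
    strata: Sequence[Optional[str]],
) -> Dict[str, float]:
    """AUC per lesion-size stratum, scoring each stratum against all negatives."""
    names = sorted({s for s, y in zip(strata, labels) if y == 1 and s is not None})
    out = {}
    for name in names:
        keep = [i for i, y in enumerate(labels) if y == 0 or strata[i] == name]
        out[name] = auc([scores[i] for i in keep], [labels[i] for i in keep])
    return out


# Toy scores: large lesions are ranked well, small lesions are at chance.
scores = [0.9, 0.8, 0.3, 0.2, 0.4, 0.1]
labels = [1, 1, 1, 1, 0, 0]
strata = ["large", "large", "small", "small", None, None]
print(auc(scores, labels))                     # 0.75
print(stratified_auc(scores, labels, strata))  # {'large': 1.0, 'small': 0.5}
```

In this toy case the pooled image-level AUC of 0.75 hides a small-lesion stratum sitting exactly at chance, which is the failure an image-level metric alone would miss.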

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment pipelines could add an ROI-guided patch readout stage without changing the frozen encoder.
The same global-pooling loss may appear in other small-object medical imaging tasks that rely on frozen ViT backbones.
Future foundation-model training objectives could add explicit small-scale contrast preservation losses.

Load-bearing premise

The small-scale perturbation panel and bounding-box-stratified probe isolate embedding-level signal retention without being confounded by dataset labeling noise or pretraining objectives.

What would settle it

Re-running the same perturbation panel and bounding-box probe on a new CXR cohort with independently verified small-lesion annotations and obtaining the same AUC gap between CLS and patch-local pooling.

Figures

Figures reproduced from arXiv: 2606.11606 by Alekhya Jilla, Bardia Khosravi, Frank Li, Judy Gichoya, Mohammadreza Chavoshi, Raajitha Muthyala, Saptarshi Purkayastha, Theo Dapamede, Zhenan Yin.

**Figure 1.** Patch-pooling waterfall for representative perturbations. Per (model, perturbation, dataset) cell, three bars: CLS, patch-mean, patch-local. The patch-local bar reaches AUC ≥ 0.99 on every cell with a chance-level CLS bar, directly localizing the loss of small-scale signal to the global-aggregation step of the frozen forward pass. The patch-mean bar is indistinguishable from CLS at the chance floor for eve…

**Figure 2.** Summary: CLS vs patch-local AUC across all experimental conditions. Heat-map matrix of linear-probe AUC across the 90 controlled-stimulus cells (5 foundation models × 6 representative perturbations × 3 datasets) and the 15 natural-lesion cells (5 FMs × 3 ChestX-Det10 classes). Top row: CLS-pool AUC clusters around 0.50 across the full controlled-stimulus grid; the only cells lifting off the chance floor ar…

Original abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
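
To make "small-scale perturbation" concrete, the sketch below blurs a small square footprint of an image and leaves every other pixel untouched. It is a hypothetical stand-in, not the paper's panel: the abstract names the perturbation families (iso-blur, reticular-fine, directional-blur) but not their parameters, so the function name `local_box_blur`, the mean filter, and the footprint and radius arguments are all assumptions.

```python
from typing import List

Image = List[List[float]]


def local_box_blur(img: Image, top: int, left: int, size: int, radius: int = 1) -> Image:
    """Blur a size x size footprint with a (2 * radius + 1)^2 mean filter.

    Pixels outside the footprint are copied unchanged. The filter reads from
    the original image and clamps its window at the image border.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(max(0, top), min(h, top + size)):
        for x in range(max(0, left), min(w, left + size)):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


# Toy 5 x 5 image with one bright pixel; blur only that pixel's location.
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 9.0
blurred = local_box_blur(toy, top=2, left=2, size=1, radius=1)
print(blurred[2][2])  # 1.0
```

A probe is then asked to separate original from perturbed images; because the edit touches only a few pixels, any embedding that separates the two must have retained signal at that scale.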

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Frozen ViTs suppress small-lesion signal in global CXR embeddings but recover it from patch tokens; useful empirical warning with one clear methodological soft spot.

The paper's core observation is that five frozen ViT foundation models drop small-scale, low-contrast lesion signal to chance levels in the CLS token or global mean pool, while the same forward pass keeps the signal in patch tokens when pooling is restricted to the lesion region. The authors show this both with synthetic perturbations and on real bounding boxes from ChestX-Det10, with consistent direction across models and a ResNet-50 control that reproduces the same chance-floor behavior.

What stands out is the scale of the test: nearly 500k images across three cohorts plus the stratified lesion set, plus the clean within-forward-pass comparison of pooling modes. The perturbation panel isolates scale dependence without relying on annotations, and the recovery gaps (+0.4 AUC range) are large enough to matter for anyone freezing these embeddings for downstream CXR tasks.

The main soft spot is the bounding-box probe. Small-lesion annotations are often imprecise; if the boxes have localization noise or partial coverage, the image-level CLS AUC for the small stratum will be pulled down while the patch-local probe can still hit high numbers simply because it samples inside the noisy box. The synthetic panel sidesteps this, but the real-lesion claim would be stronger with a sensitivity check on box jitter or inter-annotator agreement. No error bars or exact tests appear in the abstract, though the directional consistency across models reduces the chance this is noise.

This is for groups building or auditing frozen-embedding pipelines in medical imaging. It does not overturn the models but flags a deployment risk that is easy to miss in standard benchmarks. It deserves peer review; the empirical pattern is reproducible enough to warrant scrutiny on the annotation side and statistical details.

Referee Report

2 major / 2 minor

Summary. The paper claims that frozen ViT foundation-model embeddings for chest radiography suppress small-scale, low-contrast lesion signal at the global-aggregation step (CLS token or full-image patch-mean pooling): CLS AUCs sit near chance (0.500-0.524) on a synthetic perturbation panel, and small-vs-large stratum gaps reach 0.243 on ChestX-Det10. The suppressed signal is recoverable from the same forward pass (AUC >=0.899, mean gains +0.412 to +0.488) when patch tokens are restricted to ground-truth bounding boxes. The pattern is consistent across five ViTs and a ResNet-50 control, on an aggregate of n=492,724 CXR images plus ChestX-Det10 (n=3,543).

Significance. If the central empirical distinction between global aggregation loss and patch-local recovery holds after addressing potential confounds, the result is significant for pre-deployment evaluation of foundation models in medical imaging: it quantifies a previously unmeasured failure mode for small-lesion tasks and supplies a concrete diagnostic (bounding-box-restricted patch probes) that could guide ROI-aware fine-tuning or hybrid pooling strategies. The large aggregate sample, directional consistency across five ViTs, and dual synthetic-plus-real-lesion design are strengths that make the observation reproducible and falsifiable.

major comments (2)

[Abstract / ChestX-Det10 results] Abstract and ChestX-Det10 bounding-box probe (n=1,462 small-lesion boxes): the claim that the observed +0.243 AUC small-vs-large gaps and +0.412–0.488 patch-local recovery isolate signal loss specifically at the global-aggregation step is load-bearing for the central thesis, yet the manuscript provides no sensitivity analysis to bounding-box jitter or localization precision; if small-lesion boxes contain typical annotation noise, the CLS depression could arise from mislocalized examples while patch-local AUC remains high simply because selected patches still capture residual signal.
[Abstract / perturbation panel] Perturbation panel results (CLS AUC 0.500-0.524, patch-mean indistinguishable on iso-blur/reticular-fine cells): the reported values are presented without error bars, exact statistical tests, or full exclusion criteria, leaving open the possibility that unstated post-hoc choices affect the directional findings; this weakens attribution of the chance-floor behavior solely to aggregation mechanics rather than dataset or model-specific factors.

minor comments (2)

[Methods / control model] The ResNet-50 architectural control is stated to reproduce the chance floor, but the exact pooling implementation (CLS-equivalent vs. global average) and layer from which tokens are taken should be specified to confirm equivalence with the ViT forward-pass setup.
[Abstract / model list] Model naming (DINOv3 ViT-7B) and pretraining-domain details for each of the five ViTs could be clarified with a table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major comment below and will revise accordingly.

Referee: [Abstract / ChestX-Det10 results] Abstract and ChestX-Det10 bounding-box probe (n=1,462 small-lesion boxes): the claim that the observed +0.243 AUC small-vs-large gaps and +0.412–0.488 patch-local recovery isolate signal loss specifically at the global-aggregation step is load-bearing for the central thesis, yet the manuscript provides no sensitivity analysis to bounding-box jitter or localization precision; if small-lesion boxes contain typical annotation noise, the CLS depression could arise from mislocalized examples while patch-local AUC remains high simply because selected patches still capture residual signal.

Authors: We agree a sensitivity analysis to bounding-box jitter would further isolate aggregation effects from annotation noise. In revision we will add experiments that apply controlled Gaussian perturbations (sigma = 5–20 pixels) to the ground-truth boxes on ChestX-Det10 and recompute patch-local AUCs; we will also report the fraction of boxes whose perturbed regions still overlap the original lesion. The synthetic perturbation panel supplies orthogonal evidence that does not depend on any bounding-box annotations and already shows CLS AUC at the chance floor across five ViTs. revision: yes
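
The jitter check described in this response could be prototyped as below. This is a sketch under assumptions, not the authors' planned experiment: the response gives only the sigma range (5 to 20 pixels), so treating the jitter as a rigid Gaussian translation of the whole box, counting "still overlaps" as IoU greater than zero, and the names `jitter_box`, `iou`, and `overlap_fraction` are all illustrative choices.

```python
import random
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def jitter_box(box: Box, sigma: float, rng: random.Random) -> Box:
    """Shift the whole box by a Gaussian offset; box size is unchanged."""
    dx, dy = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
    x0, y0, x1, y1 = box
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def overlap_fraction(boxes: Sequence[Box], sigma: float, seed: int = 0) -> float:
    """Fraction of boxes whose jittered copy still overlaps the original."""
    if not boxes:
        raise ValueError("need at least one box")
    rng = random.Random(seed)
    hits = sum(iou(b, jitter_box(b, sigma, rng)) > 0.0 for b in boxes)
    return hits / len(boxes)


# Hypothetical small-lesion boxes, 20 pixels on a side.
boxes = [(100.0, 100.0, 120.0, 120.0), (300.0, 50.0, 320.0, 70.0)]
for sigma in (5.0, 10.0, 20.0):
    print(sigma, overlap_fraction(boxes, sigma))
```

Recomputing patch-local AUC on the jittered boxes at each sigma, alongside this overlap fraction, would show how quickly the recovery degrades as localization gets worse.
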
Referee: [Abstract / perturbation panel] Perturbation panel results (CLS AUC 0.500-0.524, patch-mean indistinguishable on iso-blur/reticular-fine cells): the reported values are presented without error bars, exact statistical tests, or full exclusion criteria, leaving open the possibility that unstated post-hoc choices affect the directional findings; this weakens attribution of the chance-floor behavior solely to aggregation mechanics rather than dataset or model-specific factors.

Authors: We will add bootstrap 95% confidence intervals for every AUC, state the use of DeLong’s test for AUC comparisons, and include the complete exclusion criteria, cohort definitions, and image-level filtering steps in the Methods. These changes will allow readers to evaluate the statistical support for the reported chance-floor behavior. revision: yes
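
A percentile bootstrap for an AUC confidence interval can be sketched as follows. This is one common recipe, not necessarily the one the authors will use: the function name `bootstrap_auc_ci`, the image-level resampling unit, the 1,000 resamples, and the decision to redraw resamples that lack one class are assumptions. Cohorts with several images per patient would likely need patient-level resampling instead. DeLong's test is not shown.

```python
import random
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the Mann-Whitney statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUC, resampling images with replacement."""
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need both classes in the original sample")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # redraw resamples that are missing one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy scores with partial overlap between classes.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.3, 0.9, 0.2]
labels = [0, 0, 1, 1, 1, 0, 1, 0]
print(auc(scores, labels), bootstrap_auc_ci(scores, labels, n_boot=500))
```

An interval that contains 0.5 for a CLS probe and excludes it for the patch-local probe would give the chance-floor claim the statistical footing the referee asks for.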

Circularity Check

0 steps flagged

No significant circularity; purely empirical forward-pass measurements

The paper consists entirely of empirical evaluations: forward passes of five frozen ViTs plus a ResNet control on public CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR, ChestX-Det10), followed by direct AUC comparisons across three pooling modes (CLS, patch-mean, bounding-box-restricted patch-local) on a synthetic perturbation panel and real-lesion bounding-box stratification. No equations, fitted parameters, derivations, or predictions appear; the reported AUC gaps (+0.243, +0.412–0.488) are raw measurement outputs, not reductions of any claimed derivation. No self-citations are invoked as load-bearing premises. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical measurement study; it relies on standard assumptions of machine-learning evaluation rather than new mathematical derivations or postulated entities.

axioms (1)

Domain assumption: the perturbation panel and bounding-box probes isolate embedding signal retention independently of downstream classifier choice or labeling noise.
This assumption is central to interpreting AUC differences as evidence of signal loss at the aggregation step.

pith-pipeline@v0.9.1-grok · 5983 in / 1165 out tokens · 31316 ms · 2026-06-27T10:34:47.569949+00:00 · methodology

