arxiv: 2604.18444 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI· cs.CV

Recognition: unknown

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

Florian Kittler , Sheethal Bhat , Andreas Maier

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords zero-shot classificationchest X-rayvision-language modelsprototype alignmentdistillationmedical imagingdomain shiftco-occurrence bias

0 comments

The pith

ProtoCLIP refines zero-shot chest X-ray classification through pathology-focused curation and prototype-aligned distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ProtoCLIP to fix common failures in CLIP-style vision-language models when applied to chest radiographs without task-specific training. It builds pathology-centered data subsets that include deliberately chosen negative examples and adds a distillation step that keeps the original semantic structure while sharpening distinctions for key findings. If correct, this shows a practical route to more stable zero-shot medical image classification by controlling data exposure and adaptation rather than retraining the entire model from scratch.

Core claim

ProtoCLIP improves zero-shot discrimination in CLIP-style VLMs by constructing pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and by introducing a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure. On the unseen VinDr-CXR dataset this produces AUC gains of 2-10 percentage points over a strong baseline across multiple findings and reaches a state-of-the-art AUC of 0.94 for pneumothorax.

What carries the argument

Prototype-aligned latent refinement: targeted curation of pathology subsets plus representation-preserving distillation that aligns latents to pathology anchors without destroying the pre-trained semantic geometry.

If this is right

Better separation of clinically co-occurring pathologies becomes possible without full retraining.
Controlled adaptation reduces instability when moving the same model to new scanner or hospital data.
The same refinement pattern can be applied to other zero-shot medical VLM tasks that suffer from label co-occurrence.
Gains appear on external validation sets, indicating the method transfers beyond the curation source.
Anchor-guided refinement lowers the data volume needed for usable medical zero-shot performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curation-plus-distillation pattern may extend directly to other medical imaging modalities such as CT or MRI where label co-occurrence is also common.
Negative-sample selection could become a standard preprocessing step for any zero-shot medical VLM to limit spurious correlations.
Combining ProtoCLIP with prompt engineering or test-time adaptation might produce further gains without additional training data.

Load-bearing premise

The specific choices of pathology subsets and negative samples reduce co-occurrence bias and transfer instability without introducing new selection artifacts or overfitting to those curation decisions.

What would settle it

Applying the identical curation and distillation procedure to a second, independent unseen chest X-ray collection and observing no AUC improvement or outright degradation would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.18444 by Andreas Maier, Florian Kittler, Sheethal Bhat.

**Figure 1.** Figure 1: Overview of the ProtoCLIP method. a, Data curation from MIMIC-CXR includes a set of pneumothorax-positive images Dt, and single-pathology background examples Db. b, ProtoCLIP training schema depicts a trainable student visual encoder, refined using a dual-objective loss. Ldist performs feature distillation from a frozen CheXZero teacher and LBCE aligns student image embeddings with frozen text-defined path… view at source ↗

**Figure 2.** Figure 2: Comparison of TSNE clustering of pneumothorax (n=18) vs. healthy (n=18) samples between CheXZero [5] and ProtoCLIP. analysis over equal number of True Positives (TP) for pneumothorax cases and healthy findings. This shows a modest improvement in inter and intra-class clustering [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Attention map comparison of TP pneumothorax from VinDR with bounding box annotation. Attention maps: Zero-shot pneumothorax classification with VLMs does not inherently guarantee anatomically meaningful attention maps. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProtoCLIP pairs curated negative samples with distillation to lift zero-shot CLIP AUC on external chest X-ray data, but the abstract leaves the relative contributions unclear.

read the letter

The main point is that ProtoCLIP builds pathology-focused subsets with hand-curated negatives and adds a representation-preserving distillation step to adapt CLIP for zero-shot chest X-ray work. It reports 2-10 point AUC gains over a strong baseline on the unseen VinDr-CXR set and reaches 0.94 AUC on pneumothorax, which it calls state-of-the-art. That external validation is the concrete result worth noting first. The approach targets label co-occurrence and domain-shift problems that actually matter for clinical use, and it tries to do so without full retraining, which is a practical angle. The distillation objective is meant to keep semantic structure while sharpening discrimination on co-occurring findings, and the curation step aims to cut bias from the training distribution. Those are reasonable extensions of existing VLM adaptation ideas rather than a wholly new framework. The paper does a clean job laying out the clinical motivation and the high-level recipe. The soft spots sit in the missing controls. The abstract gives no ablation that holds the distillation fixed while changing the negative-sample strategy, or the reverse, so it is still possible the reported lifts come mostly from the curation choices rather than the prototype alignment itself. There are also no details on baseline definitions, run-to-run variance, or statistical tests, which makes the 2-10 point range hard to interpret precisely. The stress-test note is on target here: the link between the specific method and the bias reduction is not yet isolated. This work is for people already working on medical VLM adaptation or radiology zero-shot tasks. They would get usable ideas on subset construction and anchor-guided refinement, but they would need the full methods and code to replicate or extend it. The paper deserves peer review because the external dataset result and the focus on co-occurrence are relevant enough to warrant referee time, even though the current version will need clearer experiments and ablations to stand on its own.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ProtoCLIP, a refinement strategy for CLIP-style vision-language models aimed at robust zero-shot chest X-ray classification. It constructs pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and introduces a representation-preserving distillation objective to stabilize adaptation while preserving semantic structure. Evaluated on the unseen VinDr-CXR dataset, the method reports AUC gains of 2-10 percentage points over a CLIP baseline across findings, with a state-of-the-art AUC of 0.94 for pneumothorax.

Significance. If the reported gains are shown to arise specifically from the proposed components rather than data curation artifacts, this could offer a practical, low-cost way to improve zero-shot VLM reliability in medical imaging by mitigating label co-occurrence and domain-shift issues without full retraining.

major comments (2)

[Section 4 (Experiments)] Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.
[Section 3 (Method)] Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.

minor comments (2)

[Abstract] Abstract: The phrase 'strong CLIP-based baseline' is used without naming the exact model variant, pre-training corpus, or zero-shot prompt template employed for comparison.
[Abstract] Abstract: The 'state-of-the-art' claim for pneumothorax AUC lacks reference to other recent non-CLIP methods on VinDr-CXR, limiting context for the 0.94 result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of isolating component contributions and providing supporting analyses for the distillation objective. We address each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.

Authors: We agree that explicit isolation of the two components strengthens the central claim. The original experiments compared ProtoCLIP (curated subsets + distillation) against a standard CLIP baseline, but did not include a controlled ablation holding the pathology-focused subset fixed while varying only the distillation objective. In the revision, we will add a dedicated ablation table that reports performance for: (1) standard CLIP, (2) CLIP fine-tuned on the curated subsets without distillation, and (3) full ProtoCLIP. This will directly quantify the incremental benefit of the prototype-aligned objective. revision: yes
Referee: Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.

Authors: We acknowledge that the method section presents the objective at a high level without empirical verification of its representation-preserving properties. In the revised manuscript, we will include: (i) t-SNE visualizations of the latent space before and after distillation for a subset of co-occurring findings (e.g., pneumothorax and pleural effusion), (ii) quantitative metrics such as average intra-class compactness and inter-class separation for clinically related pairs, and (iii) a controlled experiment measuring zero-shot transfer stability on a held-out domain-shift split. These additions will substantiate that the objective improves discrimination while preserving semantic structure. revision: yes

Circularity Check

0 steps flagged

No circularity; results rest on external unseen dataset evaluation

full rationale

The paper describes a method of pathology-focused subset curation plus representation-preserving distillation, then reports empirical AUC gains (2-10 pp) and a specific 0.94 pneumothorax result on the held-out VinDr-CXR dataset. No equations, fitted-parameter renamings, or self-citation chains appear in the text that would make any claimed prediction equivalent to its inputs by construction. The evaluation uses an external benchmark independent of the training curation choices, satisfying the self-contained criterion against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, training objectives, or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5485 in / 1137 out tokens · 54606 ms · 2026-05-10T04:38:27.808100+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · 2 internal anchors

[1]

Journal of Medical Imaging and Radiation Oncology65(5), 538–544 (2021)

Jones, C.M., Buchlak, Q.D., Oakden-Rayner, L., Milne, M., Seah, J., Esmaili, N., Hachey, B.: Chest radiographs and machine learning – Past, present and fu- ture. Journal of Medical Imaging and Radiation Oncology65(5), 538–544 (2021). https://doi.org/10.1111/1754-9485.13274

work page doi:10.1111/1754-9485.13274 2021
[2]

Mandating Limits on Workload, Duty, and Speed in Radiology

Alexander, R., Waite, S., Bruno, M.A., Krupinski, E.A., Berlin, L., Macknik, S., Martinez-Conde, S.: Mandating limits on workload, duty, and speed in radiology. Radiology304(2), 274–282 (2022). https://doi.org/10.1148/radiol.212631

work page doi:10.1148/radiol.212631 2022
[3]

Systematic Reviews3, 3 (2014)

Roberts, D.J., Leigh-Smith, S., Faris, P.D., et al.: Clinical manifestations of tension pneumothorax: protocol for a systematic review and meta-analysis. Systematic Reviews3, 3 (2014). https://doi.org/10.1186/2046-4053-3-3

work page doi:10.1186/2046-4053-3-3 2014
[4]

URLhttps://www.nature.com/articles/s41597-019-0322-0

Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly avail- able database of chest radiographs with free-text reports. Scientific Data6(1), 317 (2019). https://doi.org/10.1038/s41597-019-0322-0

work page doi:10.1038/s41597-019-0322-0 2019
[5]

Mingxing Tan and Quoc Le

Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert- level detection of pathologies from unannotated chest X-ray images via self- supervised learning. Nature Biomedical Engineering6(12), 1399–1406 (2022). https://doi.org/10.1038/s41551-022-00936-9

work page doi:10.1038/s41551-022-00936-9 2022
[6]

In: International Conference on Med- ical Image Computing and Computer-Assisted Intervention (pp

Madhipati, R., Maier, A.: CXR-CML: Improved zero-shot classification of long- tailed multi-label diseases in Chest X-Rays. In: International Conference on Med- ical Image Computing and Computer-Assisted Intervention (pp. 119-129). Cham: Springer Nature Switzerland 2025

2025
[7]

2023 , url =

Wu, C., Zhang, X., Zhang, Y., Wang, Y., et al.: MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–12 (2023). https://doi.org/10.1109/ICCV51070.2023.01954

work page doi:10.1109/iccv51070.2023.01954 2023
[8]

JunnanLi, DongxuLi, CaimingXiong, andStevenHoi

Du, Y., Liu, Z., Li, J., Zhao, W.X.: A Survey of Vision-Language Pre- Trained Models. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), pp. 5436–5443 (2022). https://doi.org/10.24963/ijcai.2022/762

work page doi:10.24963/ijcai.2022/762 2022
[9]

Learning Transferable Visual Models From Natural Language Supervision

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, vol. 139, pp. 8748– 8763 (2021). https://doi.org/10....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.00020 2021
[10]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp

Huang, S.-C., Shen, L., Lungren, M.P., Yeung, S.: GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Im- age Recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3942–3951 (2021)

2021
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Bannur, S., Hyland, S., Liu, Q., Pérez-García, F., Ilse, M., Castro, D.C., Boeck- ing, B., Sharma, H., Bouzid, K., Thieme, A., Schwaighofer, A., Wetscherek, M., Lungren, M.P., Nori, A., Alvarez-Valle, J., Oktay, O.: Learning To Exploit Tem- poral Structure for Biomedical Vision-Language Processing. In: Proceedings of the IEEE/CVF Conference on Computer Vi...

2023
[12]

Learning to prompt for vision-language models.Int

Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to Prompt for Vision-Language Models. arXiv preprint arXiv:2109.01134 (2021). https://doi.org/10.48550/arXiv.2109.01134 10 Anonymized Author et al

work page doi:10.48550/arxiv.2109.01134 2021
[13]

In: European Conference on Computer Vision (ECCV), pp

Zhang, R., Wei, Z., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip- Adapter: Training-Free Adaption of CLIP for Few-shot Classification. In: European Conference on Computer Vision (ECCV), pp. 1–17 (2022)

2022
[14]

Radiology: Artificial Intel- ligence3(4), e200190 (2021)

Thian, Y.L., et al.: Deep Learning Systems for Pneumothorax Detection on Chest Radiographs: A Multicenter External Validation Study. Radiology: Artificial Intel- ligence3(4), e200190 (2021). https://doi.org/10.1148/ryai.2021200190

work page doi:10.1148/ryai.2021200190 2021
[15]

In: Greenspan, H

You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E.K., Baek, W., Roh, B.: CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. In: Greenspan, H. et al. (eds.) MICCAI 2023, LNCS, vol. 14221, pp. 101–111. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-43895-0_10

work page doi:10.1007/978-3-031-43895-0_10 2023
[16]

Springer, New York (2006)

Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)

2006
[17]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015). https://doi.org/10.48550/arXiv.1503.02531

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531 2015
[18]

Scientific Data9, 429 (2022)

Nguyen, H.Q., Lam, K., Le, L.T., et al.: VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. Scientific Data9, 429 (2022). https://doi.org/10.1038/s41597-022-01498-w

work page doi:10.1038/s41597-022-01498-w 2022
[19]

In: Scientific Report (2023)

Bhat, S., Maier, A.: AUCReshaping: improved sensitiv- ity at high-specificity. In: Scientific Report (2023). Springer, https://doi.org/https://doi.org/10.1038/s41598-023-48482-x

work page doi:10.1038/s41598-023-48482-x 2023
[20]

Alistair E

Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D.A., Halabi, S.S., Sand- berg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Pro...

work page doi:10.1609/aaai.v33i01.3301590 2019
[21]

In: BVM Workshop (2025) (pp

Bhat, S., Maier, A.: Towards robust zero-shot chest x-ray classification: exploring data distribution bias in chest x-ray datasets. In: BVM Workshop (2025) (pp. 191-196). Wiesbaden: Springer Fachmedien Wiesbaden

2025
[22]

Medical Image Analysis

Lin, M., Peng, Y.: CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray. Medical Image Analysis. In: Medical Image Analysis, 2025, 103739

2024