ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3
The pith
ProtoCLIP refines zero-shot chest X-ray classification through pathology-focused curation and prototype-aligned distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProtoCLIP improves zero-shot discrimination in CLIP-style VLMs by constructing pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and by introducing a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure. On the unseen VinDr-CXR dataset this produces AUC gains of 2-10 percentage points over a strong baseline across multiple findings and reaches a state-of-the-art AUC of 0.94 for pneumothorax.
What carries the argument
Prototype-aligned latent refinement: targeted curation of pathology subsets plus representation-preserving distillation that aligns latents to pathology anchors without destroying the pre-trained semantic geometry.
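The review does not reproduce the loss in closed form; a minimal NumPy sketch of what such an objective could look like, assuming an anchor-alignment cross-entropy term plus a cosine-distance penalty toward the frozen pre-trained encoder (all names and the weighting below are hypothetical, not the authors' implementation):

```python
import numpy as np

def proto_distill_loss(student, teacher, anchors, labels, lam=0.5, tau=0.07):
    """Hypothetical prototype-aligned, representation-preserving loss.

    student: refined image embeddings, shape (N, D)
    teacher: frozen pre-trained embeddings of the same images, shape (N, D)
    anchors: pathology prototype (text) embeddings, shape (C, D)
    labels:  index of the target pathology per image, shape (N,)
    """
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    s, t, a = norm(student), norm(teacher), norm(anchors)

    # Anchor alignment: cross-entropy over cosine similarities to prototypes.
    logits = s @ a.T / tau
    m = logits.max(axis=1, keepdims=True)
    logp = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    align = -logp[np.arange(len(labels)), labels].mean()

    # Representation preservation: mean cosine distance to the frozen teacher,
    # discouraging drift away from the pre-trained semantic geometry.
    preserve = (1.0 - (s * t).sum(axis=1)).mean()
    return align + lam * preserve
```

Under this reading, `lam` trades discrimination against stability: the distillation term vanishes only when the refined encoder reproduces the teacher's (normalized) embeddings exactly, which is one way to cash out "maintaining semantic structure."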
If this is right
- Better separation of clinically co-occurring pathologies becomes possible without full retraining.
- Controlled adaptation reduces instability when moving the same model to new scanner or hospital data.
- The same refinement pattern can be applied to other zero-shot medical VLM tasks that suffer from label co-occurrence.
- Gains appear on external validation sets, indicating the method transfers beyond the curation source.
- Anchor-guided refinement lowers the data volume needed for usable medical zero-shot performance.
Where Pith is reading between the lines
- The curation-plus-distillation pattern may extend directly to other medical imaging modalities such as CT or MRI where label co-occurrence is also common.
- Negative-sample selection could become a standard preprocessing step for any zero-shot medical VLM to limit spurious correlations.
- Combining ProtoCLIP with prompt engineering or test-time adaptation might produce further gains without additional training data.
Load-bearing premise
The specific choices of pathology subsets and negative samples reduce co-occurrence bias and transfer instability without introducing new selection artifacts or overfitting to those curation decisions.
What would settle it
Applying the identical curation and distillation procedure to a second, independent unseen chest X-ray collection and observing no AUC improvement or outright degradation would falsify the claim.
Original abstract
Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ProtoCLIP, a refinement strategy for CLIP-style vision-language models aimed at robust zero-shot chest X-ray classification. It constructs pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and introduces a representation-preserving distillation objective to stabilize adaptation while preserving semantic structure. Evaluated on the unseen VinDr-CXR dataset, the method reports AUC gains of 2-10 percentage points over a CLIP baseline across findings, with a state-of-the-art AUC of 0.94 for pneumothorax.
Significance. If the reported gains are shown to arise specifically from the proposed components rather than data curation artifacts, this could offer a practical, low-cost way to improve zero-shot VLM reliability in medical imaging by mitigating label co-occurrence and domain-shift issues without full retraining.
Major comments (2)
- Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.
- Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.
Minor comments (2)
- Abstract: The phrase 'strong CLIP-based baseline' is used without naming the exact model variant, pre-training corpus, or zero-shot prompt template employed for comparison.
- Abstract: The 'state-of-the-art' claim for pneumothorax AUC lacks reference to other recent non-CLIP methods on VinDr-CXR, limiting context for the 0.94 result.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of isolating component contributions and providing supporting analyses for the distillation objective. We address each major comment below and outline revisions to strengthen the manuscript.
Point-by-point responses
Referee: Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.
Authors: We agree that explicit isolation of the two components strengthens the central claim. The original experiments compared ProtoCLIP (curated subsets + distillation) against a standard CLIP baseline, but did not include a controlled ablation holding the pathology-focused subset fixed while varying only the distillation objective. In the revision, we will add a dedicated ablation table that reports performance for: (1) standard CLIP, (2) CLIP fine-tuned on the curated subsets without distillation, and (3) full ProtoCLIP. This will directly quantify the incremental benefit of the prototype-aligned objective. Revision: yes.
Referee: Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.
Authors: We acknowledge that the method section presents the objective at a high level without empirical verification of its representation-preserving properties. In the revised manuscript, we will include: (i) t-SNE visualizations of the latent space before and after distillation for a subset of co-occurring findings (e.g., pneumothorax and pleural effusion), (ii) quantitative metrics such as average intra-class compactness and inter-class separation for clinically related pairs, and (iii) a controlled experiment measuring zero-shot transfer stability on a held-out domain-shift split. These additions will substantiate that the objective improves discrimination while preserving semantic structure. Revision: yes.
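As a concrete illustration of metric (ii), intra-class compactness and inter-class separation for a labeled embedding set could be computed as follows (an illustrative sketch in cosine distance, not the authors' code; the function name and metric definitions are assumptions):

```python
import numpy as np

def class_geometry(embeddings, labels):
    """Mean intra-class compactness and inter-class separation (cosine distance).

    embeddings: array of shape (N, D); labels: integer class index per row.
    Returns (intra, inter): lower intra and higher inter indicate that
    same-finding samples cluster tightly while class centroids stay apart.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels)
    cents = np.stack([x[labels == c].mean(axis=0) for c in classes])
    cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)

    # Intra-class: mean cosine distance of samples to their own class centroid.
    intra = np.mean([np.mean(1.0 - x[labels == c] @ cents[i])
                     for i, c in enumerate(classes)])
    # Inter-class: mean pairwise cosine distance between class centroids.
    pair = 1.0 - cents @ cents.T
    inter = pair[np.triu_indices(len(classes), k=1)].mean()
    return intra, inter
```

Tracking these two numbers before and after refinement would quantify whether discrimination improves (larger inter) without collapsing the pre-trained structure (intra not degenerating).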
Circularity Check
No circularity; results rest on external unseen dataset evaluation
Full rationale
The paper describes a method of pathology-focused subset curation plus representation-preserving distillation, then reports empirical AUC gains (2-10 pp) and a specific 0.94 pneumothorax result on the held-out VinDr-CXR dataset. No equations, fitted-parameter renamings, or self-citation chains appear in the text that would make any claimed prediction equivalent to its inputs by construction. The evaluation uses an external benchmark independent of the training curation choices, satisfying the self-contained criterion against external data.