pith. machine review for the scientific record. sign in

arxiv: 2604.18444 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI· cs.CV

Recognition: unknown

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV
keywords zero-shot classificationchest X-rayvision-language modelsprototype alignmentdistillationmedical imagingdomain shiftco-occurrence bias
0
0 comments X

The pith

ProtoCLIP refines zero-shot chest X-ray classification through pathology-focused curation and prototype-aligned distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ProtoCLIP to fix common failures in CLIP-style vision-language models when applied to chest radiographs without task-specific training. It builds pathology-centered data subsets that include deliberately chosen negative examples and adds a distillation step that keeps the original semantic structure while sharpening distinctions for key findings. If correct, this shows a practical route to more stable zero-shot medical image classification by controlling data exposure and adaptation rather than retraining the entire model from scratch.

Core claim

ProtoCLIP improves zero-shot discrimination in CLIP-style VLMs by constructing pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and by introducing a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure. On the unseen VinDr-CXR dataset this produces AUC gains of 2-10 percentage points over a strong baseline across multiple findings and reaches a state-of-the-art AUC of 0.94 for pneumothorax.

What carries the argument

Prototype-aligned latent refinement: targeted curation of pathology subsets plus representation-preserving distillation that aligns latents to pathology anchors without destroying the pre-trained semantic geometry.

If this is right

  • Better separation of clinically co-occurring pathologies becomes possible without full retraining.
  • Controlled adaptation reduces instability when moving the same model to new scanner or hospital data.
  • The same refinement pattern can be applied to other zero-shot medical VLM tasks that suffer from label co-occurrence.
  • Gains appear on external validation sets, indicating the method transfers beyond the curation source.
  • Anchor-guided refinement lowers the data volume needed for usable medical zero-shot performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curation-plus-distillation pattern may extend directly to other medical imaging modalities such as CT or MRI where label co-occurrence is also common.
  • Negative-sample selection could become a standard preprocessing step for any zero-shot medical VLM to limit spurious correlations.
  • Combining ProtoCLIP with prompt engineering or test-time adaptation might produce further gains without additional training data.

Load-bearing premise

The specific choices of pathology subsets and negative samples reduce co-occurrence bias and transfer instability without introducing new selection artifacts or overfitting to those curation decisions.

What would settle it

Applying the identical curation and distillation procedure to a second, independent unseen chest X-ray collection and observing no AUC improvement or outright degradation would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.18444 by Andreas Maier, Florian Kittler, Sheethal Bhat.

Figure 1
Figure 1. Figure 1: Overview of the ProtoCLIP method. a, Data curation from MIMIC-CXR includes a set of pneumothorax-positive images Dt, and single-pathology background examples Db. b, ProtoCLIP training schema depicts a trainable student visual encoder, refined using a dual-objective loss. Ldist performs feature distillation from a frozen CheXZero teacher and LBCE aligns student image embeddings with frozen text-defined path… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of TSNE clustering of pneumothorax (n=18) vs. healthy (n=18) samples between CheXZero [5] and ProtoCLIP. analysis over equal number of True Positives (TP) for pneumothorax cases and healthy findings. This shows a modest improvement in inter and intra-class clus￾tering [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Attention map comparison of TP pneumothorax from VinDR with bounding box annotation. Attention maps: Zero-shot pneumothorax classification with VLMs does not inherently guarantee anatomically meaningful attention maps. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ProtoCLIP, a refinement strategy for CLIP-style vision-language models aimed at robust zero-shot chest X-ray classification. It constructs pathology-focused training subsets with curated negative samples to reduce co-occurrence bias and introduces a representation-preserving distillation objective to stabilize adaptation while preserving semantic structure. Evaluated on the unseen VinDr-CXR dataset, the method reports AUC gains of 2-10 percentage points over a CLIP baseline across findings, with a state-of-the-art AUC of 0.94 for pneumothorax.

Significance. If the reported gains are shown to arise specifically from the proposed components rather than data curation artifacts, this could offer a practical, low-cost way to improve zero-shot VLM reliability in medical imaging by mitigating label co-occurrence and domain-shift issues without full retraining.

major comments (2)
  1. [Section 4 (Experiments)] Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.
  2. [Section 3 (Method)] Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'strong CLIP-based baseline' is used without naming the exact model variant, pre-training corpus, or zero-shot prompt template employed for comparison.
  2. [Abstract] Abstract: The 'state-of-the-art' claim for pneumothorax AUC lacks reference to other recent non-CLIP methods on VinDr-CXR, limiting context for the 0.94 result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of isolating component contributions and providing supporting analyses for the distillation objective. We address each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Section 4 (Experiments): No ablation studies isolate the distillation objective while holding the pathology-focused subset construction fixed (or vice versa). This is load-bearing for the central claim that the prototype-aligned refinement specifically mitigates co-occurrence bias and transfer instability, since the 2-10 pp AUC lift and 0.94 pneumothorax result on VinDr-CXR could arise primarily from favorable negative-sample selection rather than the proposed objective.

    Authors: We agree that explicit isolation of the two components strengthens the central claim. The original experiments compared ProtoCLIP (curated subsets + distillation) against a standard CLIP baseline, but did not include a controlled ablation holding the pathology-focused subset fixed while varying only the distillation objective. In the revision, we will add a dedicated ablation table that reports performance for: (1) standard CLIP, (2) CLIP fine-tuned on the curated subsets without distillation, and (3) full ProtoCLIP. This will directly quantify the incremental benefit of the prototype-aligned objective. revision: yes

  2. Referee: Section 3 (Method): The representation-preserving distillation objective is defined at a high level with no accompanying analysis (e.g., t-SNE visualizations, distance metrics, or controlled experiments) demonstrating that it maintains semantic structure while improving discrimination of clinically co-occurring pathologies.

    Authors: We acknowledge that the method section presents the objective at a high level without empirical verification of its representation-preserving properties. In the revised manuscript, we will include: (i) t-SNE visualizations of the latent space before and after distillation for a subset of co-occurring findings (e.g., pneumothorax and pleural effusion), (ii) quantitative metrics such as average intra-class compactness and inter-class separation for clinically related pairs, and (iii) a controlled experiment measuring zero-shot transfer stability on a held-out domain-shift split. These additions will substantiate that the objective improves discrimination while preserving semantic structure. revision: yes

Circularity Check

0 steps flagged

No circularity; results rest on external unseen dataset evaluation

full rationale

The paper describes a method of pathology-focused subset curation plus representation-preserving distillation, then reports empirical AUC gains (2-10 pp) and a specific 0.94 pneumothorax result on the held-out VinDr-CXR dataset. No equations, fitted-parameter renamings, or self-citation chains appear in the text that would make any claimed prediction equivalent to its inputs by construction. The evaluation uses an external benchmark independent of the training curation choices, satisfying the self-contained criterion against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, training objectives, or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5485 in / 1137 out tokens · 54606 ms · 2026-05-10T04:38:27.808100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Journal of Medical Imaging and Radiation Oncology65(5), 538–544 (2021)

    Jones, C.M., Buchlak, Q.D., Oakden-Rayner, L., Milne, M., Seah, J., Esmaili, N., Hachey, B.: Chest radiographs and machine learning – Past, present and fu- ture. Journal of Medical Imaging and Radiation Oncology65(5), 538–544 (2021). https://doi.org/10.1111/1754-9485.13274

  2. [2]

    Mandating Limits on Workload, Duty, and Speed in Radiology

    Alexander, R., Waite, S., Bruno, M.A., Krupinski, E.A., Berlin, L., Macknik, S., Martinez-Conde, S.: Mandating limits on workload, duty, and speed in radiology. Radiology304(2), 274–282 (2022). https://doi.org/10.1148/radiol.212631

  3. [3]

    Systematic Reviews3, 3 (2014)

    Roberts, D.J., Leigh-Smith, S., Faris, P.D., et al.: Clinical manifestations of tension pneumothorax: protocol for a systematic review and meta-analysis. Systematic Reviews3, 3 (2014). https://doi.org/10.1186/2046-4053-3-3

  4. [4]

    URLhttps://www.nature.com/articles/s41597-019-0322-0

    Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly avail- able database of chest radiographs with free-text reports. Scientific Data6(1), 317 (2019). https://doi.org/10.1038/s41597-019-0322-0

  5. [5]

    Mingxing Tan and Quoc Le

    Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert- level detection of pathologies from unannotated chest X-ray images via self- supervised learning. Nature Biomedical Engineering6(12), 1399–1406 (2022). https://doi.org/10.1038/s41551-022-00936-9

  6. [6]

    In: International Conference on Med- ical Image Computing and Computer-Assisted Intervention (pp

    Madhipati, R., Maier, A.: CXR-CML: Improved zero-shot classification of long- tailed multi-label diseases in Chest X-Rays. In: International Conference on Med- ical Image Computing and Computer-Assisted Intervention (pp. 119-129). Cham: Springer Nature Switzerland 2025

  7. [7]

    2023 , url =

    Wu, C., Zhang, X., Zhang, Y., Wang, Y., et al.: MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–12 (2023). https://doi.org/10.1109/ICCV51070.2023.01954

  8. [8]

    JunnanLi, DongxuLi, CaimingXiong, andStevenHoi

    Du, Y., Liu, Z., Li, J., Zhao, W.X.: A Survey of Vision-Language Pre- Trained Models. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), pp. 5436–5443 (2022). https://doi.org/10.24963/ijcai.2022/762

  9. [9]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, vol. 139, pp. 8748– 8763 (2021). https://doi.org/10....

  10. [10]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp

    Huang, S.-C., Shen, L., Lungren, M.P., Yeung, S.: GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Im- age Recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3942–3951 (2021)

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Bannur, S., Hyland, S., Liu, Q., Pérez-García, F., Ilse, M., Castro, D.C., Boeck- ing, B., Sharma, H., Bouzid, K., Thieme, A., Schwaighofer, A., Wetscherek, M., Lungren, M.P., Nori, A., Alvarez-Valle, J., Oktay, O.: Learning To Exploit Tem- poral Structure for Biomedical Vision-Language Processing. In: Proceedings of the IEEE/CVF Conference on Computer Vi...

  12. [12]

    Learning to prompt for vision-language models.Int

    Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to Prompt for Vision-Language Models. arXiv preprint arXiv:2109.01134 (2021). https://doi.org/10.48550/arXiv.2109.01134 10 Anonymized Author et al

  13. [13]

    In: European Conference on Computer Vision (ECCV), pp

    Zhang, R., Wei, Z., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip- Adapter: Training-Free Adaption of CLIP for Few-shot Classification. In: European Conference on Computer Vision (ECCV), pp. 1–17 (2022)

  14. [14]

    Radiology: Artificial Intel- ligence3(4), e200190 (2021)

    Thian, Y.L., et al.: Deep Learning Systems for Pneumothorax Detection on Chest Radiographs: A Multicenter External Validation Study. Radiology: Artificial Intel- ligence3(4), e200190 (2021). https://doi.org/10.1148/ryai.2021200190

  15. [15]

    In: Greenspan, H

    You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E.K., Baek, W., Roh, B.: CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. In: Greenspan, H. et al. (eds.) MICCAI 2023, LNCS, vol. 14221, pp. 101–111. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-43895-0_10

  16. [16]

    Springer, New York (2006)

    Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)

  17. [17]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015). https://doi.org/10.48550/arXiv.1503.02531

  18. [18]

    Scientific Data9, 429 (2022)

    Nguyen, H.Q., Lam, K., Le, L.T., et al.: VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. Scientific Data9, 429 (2022). https://doi.org/10.1038/s41597-022-01498-w

  19. [19]

    In: Scientific Report (2023)

    Bhat, S., Maier, A.: AUCReshaping: improved sensitiv- ity at high-specificity. In: Scientific Report (2023). Springer, https://doi.org/https://doi.org/10.1038/s41598-023-48482-x

  20. [20]

    Alistair E

    Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D.A., Halabi, S.S., Sand- berg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Pro...

  21. [21]

    In: BVM Workshop (2025) (pp

    Bhat, S., Maier, A.: Towards robust zero-shot chest x-ray classification: exploring data distribution bias in chest x-ray datasets. In: BVM Workshop (2025) (pp. 191-196). Wiesbaden: Springer Fachmedien Wiesbaden

  22. [22]

    Medical Image Analysis

    Lin, M., Peng, Y.: CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray. Medical Image Analysis. In: Medical Image Analysis, 2025, 103739