Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

Christian Wachinger; Morteza Ghahremani; Yitong Li

arxiv: 2505.21698 · v3 · pith:DPSKYSMBnew · submitted 2025-05-27 · 💻 cs.CV

Pith reviewed 2026-05-22 01:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords MedBridge · vision-language models · medical image diagnosis · domain adaptation · mixture of experts · chest radiographs · multi-label classification · query tokens

The pith

MedBridge adapts vision-language models to medical diagnosis by injecting learnable query tokens that align domains and route expert models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedBridge, a lightweight way to adapt existing vision-language foundation models to medical tasks such as multi-label diagnosis of thoracic diseases on chest radiographs. It places a small set of learnable query tokens in the intermediate layers of pretrained models. The tokens absorb the domain shift, multi-view sampling keeps high-resolution detail, and the same tokens route between different expert models. This setup avoids the high cost of building medical models from scratch and removes the need for a single representation space shared across models. A reader would care because the paper reports 6-15% AUC gains on five chest radiograph benchmarks and applies the method to eight different starting models.

Core claim

MedBridge transforms pretrained VLMs into multi-view query encoders that inject a compact set of learnable query tokens into intermediate layers, enabling non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling. These query tokens further act as routing signals for a mixture-of-experts, dynamically integrating heterogeneous foundation models for multi-label reasoning without requiring a shared representation space.
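The query-injection step can be made concrete with a small sketch. Elsewhere this page quotes the paper's layer update as Z_i = [M_i; Q_i] + MHSA(Norm([M_i; Q_i])), where M_i are the frozen tokens and Q_i the learnable queries. The code below follows that formula under loud simplifications that are assumptions, not the authors' implementation: a single attention head, identity projections, and a layer norm without learned scale or shift. The function names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Assumption: a real MHSA layer has several heads and learned projections;
    they are left out to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def qencoder_layer(frozen_tokens, query_tokens):
    """Z = [M; Q] + Attention(Norm([M; Q])): concatenate, normalize, attend, add residual."""
    z_in = frozen_tokens + query_tokens  # list concatenation plays the role of [M; Q]
    attended = self_attention([layer_norm(t) for t in z_in])
    return [[a + b for a, b in zip(t, u)] for t, u in zip(z_in, attended)]


random.seed(0)
M, Q, D = 6, 2, 4  # frozen tokens, learnable queries, embedding width (toy sizes)
frozen = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(Q)]
z = qencoder_layer(frozen, queries)
print(len(z), len(z[0]))  # M + Q tokens, each of width D
```

In an actual adaptation setup only the query tokens (and whatever small heads sit on top of them) would receive gradients, which is what makes the alignment non-destructive for the pretrained weights.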

What carries the argument

Compact set of learnable query tokens injected into intermediate VLM layers that perform domain alignment and serve as dynamic routing signals for a mixture-of-experts.
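The routing role of the query tokens can be sketched as a softmax gate. The page quotes the gate as g_k(Q_G) = exp(w_k^T Q_G + b_k) / sum ... and the mixture as v = sum g_k(Q_G) * a_k. The sketch below assumes the elided denominator is the usual sum of the same exponential over all experts, and that each a_k is a vector of per-label scores from expert k, so mixing happens at the prediction level rather than in a shared feature space. Both points are assumptions, and the function names are hypothetical.

```python
import math


def gate_weights(query_summary, gate_w, gate_b):
    """Softmax over experts: g_k = exp(w_k . q + b_k) / sum_j exp(w_j . q + b_j)."""
    logits = [
        sum(w * q for w, q in zip(w_k, query_summary)) + b_k
        for w_k, b_k in zip(gate_w, gate_b)
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(gates, expert_outputs):
    """v = sum_k g_k * a_k, computed per output dimension."""
    dim = len(expert_outputs[0])
    return [sum(g * a[j] for g, a in zip(gates, expert_outputs)) for j in range(dim)]


# Toy numbers, made up for illustration: 3 experts, 2 labels.
query_summary = [0.5, -1.0, 2.0]
gate_w = [[0.1, 0.0, 0.3], [0.0, 0.2, -0.1], [0.4, 0.4, 0.0]]
gate_b = [0.0, 0.1, -0.2]
expert_scores = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.7]]

gates = gate_weights(query_summary, gate_w, gate_b)
mixed = mix_experts(gates, expert_scores)
print(gates, mixed)
```

Because the gates are non-negative and sum to one, each mixed score stays between the smallest and largest score the experts gave for that label.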

If this is right

MedBridge yields 6-15% AUC gains over prior adaptation methods on five chest radiograph datasets for multi-label thoracic disease diagnosis.
The gains appear in both cross-domain generalization and same-distribution fine-tuning settings.
The framework is applied to eight different pretrained VLMs without modification to the core approach.
It removes the requirement to train large medical-specific models from scratch on limited clinical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same query-token routing might allow combining models trained on entirely different imaging modalities more readily than methods that force a common embedding space.
If the number of query tokens can be kept small while retaining performance, the method could scale to even larger foundation models with minimal added compute.
The multi-view sampling step suggests that resolution handling might be separable from the alignment step in future adaptations of other vision tasks.
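To make the multi-view idea in the last point concrete, here is a minimal sketch of one way to turn a single high-resolution image into several views: one full-image view plus a regular grid of tiles that keep native resolution. The paper's focal sampling module is not specified on this page, so the grid layout and the function name `multi_view_boxes` are assumptions chosen only for illustration.

```python
def multi_view_boxes(width, height, grid=2):
    """Return (left, top, right, bottom) boxes: the full image, then grid x grid tiles.

    Assumption: a plain grid stands in for the paper's focal sampling module.
    """
    boxes = [(0, 0, width, height)]  # global view, to be downsampled by the encoder
    for row in range(grid):
        for col in range(grid):
            left = col * width // grid
            top = row * height // grid
            right = (col + 1) * width // grid
            bottom = (row + 1) * height // grid
            boxes.append((left, top, right, bottom))
    return boxes


boxes = multi_view_boxes(1024, 768, grid=2)
for box in boxes:
    print(box)
```

Each box can then be cropped and resized to the encoder's input size independently, which is why resolution handling looks separable from the alignment step.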

Load-bearing premise

A small number of learnable query tokens placed in intermediate layers can align domains and preserve pathological details without destroying original model knowledge or requiring a shared space across experts.

What would settle it

Evaluating MedBridge on the same five chest radiograph benchmarks and obtaining AUC gains below 6% or no gain over strong baseline adaptation methods would show the performance claim does not hold.
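Checking this criterion requires a fixed definition of the AUC gain. The sketch below computes a macro-averaged AUC for multi-label predictions and reports the gain both in absolute AUC points and relative to the baseline, since this page does not say which reading of "6-15%" the paper intends. The macro averaging and the toy numbers are assumptions for illustration, not the paper's evaluation code.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Mean of per-label AUCs; rows are images, columns are findings."""
    n_labels = len(label_matrix[0])
    per_label = [
        auc([row[j] for row in label_matrix], [row[j] for row in score_matrix])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data, made up for illustration: 4 images, 2 findings.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
baseline = [[0.6, 0.4], [0.7, 0.5], [0.8, 0.3], [0.2, 0.6]]
adapted = [[0.9, 0.2], [0.3, 0.8], [0.7, 0.6], [0.1, 0.4]]

base_auc = macro_auc(labels, baseline)
new_auc = macro_auc(labels, adapted)
absolute_gain = new_auc - base_auc        # difference in AUC points
relative_gain = absolute_gain / base_auc  # difference as a fraction of the baseline
print(base_auc, new_auc, absolute_gain, relative_gain)
```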

Figures

Figures reproduced from arXiv: 2505.21698 by Christian Wachinger, Morteza Ghahremani, Yitong Li.

**Figure 1.** (a) To predict the label(s) of an image x, i.e., p(y|x), MedBridge operates in three stages. In the first stage, it expands the visual observations from a single image x to N images x through a focal sampling module. We then introduce a mixture of experts (MoE), where experts are frozen foundation VLMs. Each expert produces a set of M tokens for the input image, and simultaneously generates a set of Q lear…

**Figure 2.** MedBridge framework: Focal sampling extracts fine-grained regions from the high-resolution input image, encoded by QEncoders into frozen tokens and learnable queries for lightweight adaptation. A Mixture of Experts (MoE) module routes these tokens through the most relevant encoders, and the final prediction combines soft labels from both query and frozen tokens with weight α. the learnable queries first i…

**Figure 3.** We evaluate MedBridge in three key adaptation tasks: …

**Figure 4.** Ablation on (a) the Focal Sampling module and (b) the …
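The Figure 2 caption says the final prediction combines soft labels from the query tokens and the frozen tokens with a weight α. A minimal sketch of one plausible reading follows, assuming a convex combination per label and a fixed decision threshold for the multi-label output. The caption does not give the exact formula, so both choices and the function names are assumptions.

```python
def fuse_soft_labels(query_probs, frozen_probs, alpha):
    """Convex combination of two per-label probability vectors (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * q + (1.0 - alpha) * f for q, f in zip(query_probs, frozen_probs)]


def predict_findings(probs, threshold=0.5):
    """Multi-label decision: return the indices of labels at or above the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Toy numbers, made up for illustration: 3 findings.
query_probs = [0.8, 0.2, 0.7]
frozen_probs = [0.4, 0.6, 0.9]
fused = fuse_soft_labels(query_probs, frozen_probs, alpha=0.5)
print(fused, predict_findings(fused))
```

With α = 1 the prediction relies on the adapted query branch alone, and with α = 0 it falls back to the frozen branch, so α controls how much the adaptation is allowed to override the pretrained model.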

Original abstract

Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch, however, is costly and data-intensive. Here, we propose MedBridge, a lightweight adaptation framework that opens a new direction in domain-gap mitigation by jointly combining domain alignment, resolution preservation, and multi-label reasoning via complementary VLM experts for medical image diagnosis. Specifically, MedBridge transforms pretrained VLMs into multi-view query encoders that inject a compact set of learnable query tokens into intermediate layers, enabling non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling. These query tokens further act as routing signals for a mixture-of-experts, dynamically integrating heterogeneous foundation models for multi-label reasoning without requiring a shared representation space. We evaluated MedBridge on five chest radiograph benchmarks in three key adaptation tasks. MedBridge demonstrates superior performance in both cross-domain generalization (out-of-distribution transfer) and in-domain specialization (same-distribution tuning) settings, yielding a significant 6-15% AUC improvement over state-of-the-art adaptation methods for multi-label thoracic disease diagnosis. Furthermore, MedBridge is model-agnostic and demonstrates broad extensibility across eight diverse VLMs (e.g., CLIP, LLaVA, Qwen-VL, MedGemma), highlighting its ability to flexibly adapt arbitrary foundation models into a powerful medical diagnostic tool. Our code will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MedBridge adapts VLMs to medical multi-label tasks via query token injection and MoE routing, with reported 6-15% AUC gains across models and benchmarks that hold up under the given experimental design.

The main takeaway is that MedBridge adapts pretrained vision-language models to chest X-ray diagnosis by injecting a small set of learnable query tokens into intermediate layers and routing them through a mixture of experts. This setup targets domain shift, resolution differences, and multi-label reasoning at once, without forcing all experts into one shared space, and the paper reports 6-15% AUC lifts over prior adaptation methods in both in-domain and cross-domain tests.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes MedBridge, a lightweight adaptation framework that transforms pretrained vision-language models into multi-view query encoders by injecting a compact set of learnable query tokens into intermediate layers. These tokens enable non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling, and additionally serve as routing signals for a mixture-of-experts that dynamically integrates heterogeneous VLMs for multi-label reasoning without requiring a shared representation space. The framework is evaluated on five chest radiograph benchmarks across cross-domain generalization and in-domain specialization tasks, reporting 6-15% AUC gains over state-of-the-art adaptation methods, with extensibility demonstrated across eight VLMs including CLIP, LLaVA, Qwen-VL, and MedGemma.

Significance. If the reported results hold, this work offers a practical and scalable approach to bridging domain gaps in medical imaging by adapting existing foundation VLMs rather than training new models from scratch. The model-agnostic design and ability to handle both out-of-distribution transfer and in-distribution tuning while supporting multi-label diagnosis represent a meaningful advance. Explicit credit is due for the planned code release, which directly supports reproducibility and follow-on research.

minor comments (2)

Abstract: the phrase 'three key adaptation tasks' is used without enumeration; explicitly listing these tasks (e.g., cross-domain, in-domain, and perhaps a third) would improve reader orientation from the outset.
Experimental section: while the abstract and skeptic note indicate ablations and baselines are present, ensure that all comparison methods are referenced with citations and that any statistical significance testing for the 6-15% AUC deltas is clearly reported with p-values or confidence intervals.
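The second comment asks for confidence intervals on the AUC deltas. One common way to obtain them is a paired percentile bootstrap over test images, sketched below for a single label. This is a generic statistical recipe, not something the paper is known to use; the function name `bootstrap_auc_gain`, the 95% level, and the toy scores are assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_gain(labels, base_scores, new_scores, n_boot=2000, seed=0):
    """95% percentile interval for AUC(new) - AUC(base), resampling images in pairs."""
    if len(set(labels)) < 2:
        raise ValueError("need both positive and negative examples")
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lacks a class, so AUC is undefined; draw again
        new_auc = auc(ys, [new_scores[i] for i in idx])
        base_auc = auc(ys, [base_scores[i] for i in idx])
        deltas.append(new_auc - base_auc)
    deltas.sort()
    low = deltas[int(0.025 * n_boot)]
    high = deltas[min(n_boot - 1, int(0.975 * n_boot))]
    return low, high


# Toy data, made up for illustration: 10 images, one finding.
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
base = [0.6, 0.5, 0.4, 0.6, 0.7, 0.3, 0.5, 0.5, 0.8, 0.4]
new = [0.8, 0.3, 0.7, 0.4, 0.9, 0.2, 0.6, 0.5, 0.9, 0.1]
low, high = bootstrap_auc_gain(labels, base, new, n_boot=500)
print(low, high)
```

An interval that excludes zero would support the claim that the adapted model improves on the baseline for that label; the same images are resampled for both models so the comparison stays paired.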

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive review and recommendation for minor revision. We appreciate the recognition of MedBridge's practical contributions to VLM adaptation for medical diagnosis, including its model-agnostic nature and extensibility. No specific major comments were provided in the report, so we will incorporate minor improvements for clarity and presentation in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces MedBridge as an empirical adaptation framework relying on learnable query tokens injected into VLMs and dynamic expert routing for medical diagnosis. All performance claims (6-15% AUC gains) are grounded in direct experimental evaluation across five external benchmarks, multiple VLMs, and both in-domain and cross-domain splits. No derivation chain, equations, or first-principles predictions are presented that reduce outputs to fitted parameters, self-defined quantities, or self-citation chains by construction. The architecture and results remain self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of learnable query tokens as both alignment and routing signals; these are introduced as part of the method rather than derived from prior literature.

free parameters (1)

learnable query tokens
Compact set of adjustable tokens injected into intermediate layers to enable domain alignment and expert routing.

axioms (1)

domain assumption: Pretrained VLMs contain transferable features that can be aligned to medical domains via lightweight query injection without destructive modification.
Invoked in the description of transforming VLMs into multi-view query encoders.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Query-Encoder (QEncoder): The QEncoder adds Q learnable queries Q = {q_i} to interact with the M frozen tokens M = {m_i} from foundation VLMs... Z_i = [M_i; Q_i] + MHSA(Norm([M_i; Q_i]))

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction · unclear

Mixture of Experts (MoE): ... gating network G ... g_k(Q_G) = exp(w_k^T Q_G + b_k) / sum ... v = sum g_k(Q_G) * a_k

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 8 internal anchors

[1]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models.arXiv preprint arXiv:2308.01390, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. Layer normalization.arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Chexpert plus: Hundreds of thousands of aligned radiology texts, im- ages and patients.arXiv preprint arXiv:2405.19538, 2024

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Augmenting a large chest x-ray dataset with text ra- diology reports, patient demographics and additional image formats.arXiv preprint arXiv:2405.19538, 2024. 5, 6, 8

work page arXiv 2024
[4]

Adapting pretrained vision-language foundational models to medical imaging do- mains

Pierre Joseph Marcel Chambon, Christian Bluethgen, Cur- tis Langlotz, and Akshay Chaudhari. Adapting pretrained vision-language foundational models to medical imaging do- mains. InNeurIPS 2022 Foundation Models for Decision Making Workshop. 3

work page 2022
[5]

Meta-adapter: an online few-shot learner for vision-language model

Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, and Ying Shan. Meta-adapter: an online few-shot learner for vision-language model. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 55361–55374, 2023. 1

work page 2023
[6]

Reproducible scal- ing laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023. 1, 2

work page 2023
[7]

Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2): 581–595, 2024

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2): 581–595, 2024. 1, 2, 5, 6, 7

work page 2024
[8]

Gemini: A Family of Highly Capable Multimodal Models

Team Gemini. Gemini: A family of highly capable multi- modal models.arXiv preprint arXiv:2312.11805, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Mmrl: Multi-modal rep- resentation learning for vision-language models

Yuncheng Guo and Xiaodong Gu. Mmrl: Multi-modal rep- resentation learning for vision-language models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 5, 6, 7

work page 2025
[10]

Towards long-tailed, multi-label disease classification from chest x- ray: Overview of the cxr-lt challenge.Medical Image Anal- ysis, page 103224, 2024

Gregory Holste, Yiliang Zhou, Song Wang, Ajay Jaiswal, Mingquan Lin, Sherry Zhuge, Yuzhe Yang, Dongkyun Kim, Trong-Hieu Nguyen-Mau, Minh-Triet Tran, et al. Towards long-tailed, multi-label disease classification from chest x- ray: Overview of the cxr-lt challenge.Medical Image Anal- ysis, page 103224, 2024. 1

work page 2024
[11]

Adapting visual-language models for generalizable anomaly detection in medical im- ages

Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xin- chao Wang, and Yanfeng Wang. Adapting visual-language models for generalizable anomaly detection in medical im- ages. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 11375–11385,

work page
[12]

Gloria: A multimodal global-local represen- tation learning framework for label-efficient medical image recognition

Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local represen- tation learning framework for label-efficient medical image recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 3942–3951, 2021. 3

work page 2021
[13]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InProceedings of the 38th International Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 1

work page 2021
[14]

Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 5, 6, 7, 8

work page 2019
[15]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

work page 2023
[16]

Less could be better: Parameter-efficient fine-tuning advances medical vision foundation models

Chenyu Lian, Hong-Yu Zhou, Yizhou Yu, and Liansheng Wang. Less could be better: Parameter-efficient fine-tuning advances medical vision foundation models. InMedical Imaging with Deep Learning, 2024. 3

work page 2024
[17]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuhang Li, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Dinov2: Learning robust visual features without super- vision.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without super- vision.Transactions on Machine Learning Research, 2024. 5

work page 2024
[20]

Radia- log: Large vision-language models for x-ray reporting and dialog-driven assistance

Chantal Pellegrini, Ege ¨Ozsoy, Benjamin Busam, Benedikt Wiestler, Nassir Navab, and Matthias Keicher. Radia- log: Large vision-language models for x-ray reporting and dialog-driven assistance. InMedical Imaging with Deep Learning, 2025. 3

work page 2025
[21]

Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, pages 1–12, 2025

Fernando P ´erez-Garc´ıa, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, pages 1–12, 2025. 3

work page 2025
[22]

Decomposing disease de- scriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework

Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh- Son To, and Johan W Verjans. Decomposing disease de- scriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11492...

work page 2024
[23]

Ex- 10 ploring transfer learning in medical image segmentation us- ing vision-language models

Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Ra- bin Adhikari, Safal Thapaliya, and Bishesh Khanal. Ex- 10 ploring transfer learning in medical image segmentation us- ing vision-language models. InMedical Imaging with Deep Learning, pages 1142–1165. PMLR, 2024. 1, 3

work page 2024
[24]

Medical image understanding with pretrained vision language mod- els: A comprehensive study

Ziyuan Qin, Huahui Yi, Qicheng Lao, and Kang Li. Medical image understanding with pretrained vision language mod- els: A comprehensive study. InThe Eleventh International Conference on Learning Representations. 1, 3

work page
[25]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 5, 6, 7

work page 2021
[26]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2

work page 2022
[27]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroen- sri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C ´ıan Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025. 2, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Under- diagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations.Na- ture medicine, 27(12):2176–2182, 2021

Laleh Seyyed-Kalantari, Haoran Zhang, Matthew BA Mc- Dermott, Irene Y Chen, and Marzyeh Ghassemi. Under- diagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations.Na- ture medicine, 27(12):2176–2182, 2021. 5

work page 2021
[29]

Few-shot adaptation of medical vision-language models

Fereshteh Shakeri, Yunshi Huang, Julio Silva-Rodr ´ıguez, Houda Bahig, An Tang, Jose Dolz, and Ismail Ben Ayed. Few-shot adaptation of medical vision-language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 553–563. Springer,

work page
[30]

Augmenting the national institutes of health chest radiograph dataset with expert annotations of possi- ble pneumonia.Radiology: Artificial Intelligence, 1(1): e180041, 2019

George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin- Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possi- ble pneumonia.Radiology: Artificial Intelligence, 1(1): e180041, 2019. 5, 6, 8

work page 2019
[31]

Vision-language model selection and reuse for downstream adaptation

Hao-Zhe Tan, Zhi Zhou, Yu-feng Li, and Lan-Zhe Guo. Vision-language model selection and reuse for downstream adaptation. InForty-second International Conference on Machine Learning, 2025. 1

work page 2025
[32]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi `ere, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and tech- nology.arXiv preprint arXiv:2403.08295, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Expert-level detection of pathologies from unannotated chest x-ray images via self- supervised learning.Nature biomedical engineering, 6(12): 1399–1406, 2022

Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, An- drew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self- supervised learning.Nature biomedical engineering, 6(12): 1399–1406, 2022. 2, 3, 5, 7, 8

work page 2022
[34]

Multi-granularity cross-modal align- ment for generalized medical visual representation learn- ing.Advances in Neural Information Processing Systems, 35:33536–33549, 2022

Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanab- huti, and Lequan Yu. Multi-granularity cross-modal align- ment for generalized medical visual representation learn- ing.Advances in Neural Information Processing Systems, 35:33536–33549, 2022. 3

work page 2022
[35]

Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mo- hammadhadi Bagheri, and Ronald M Summers. Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106,

work page 2097
[36]

Medclip: Contrastive learning from unpaired medi- cal images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medi- cal images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing. Con- ference on Empirical Methods in Natural Language Process- ing, page 3876, 2022. 3

work page 2022
[37]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 21372–21383, 2023. 3

work page 2023
[38]

Covidx cxr-4: An expanded multi-institutional open-source benchmark dataset for chest x-ray image- based computer-aided covid-19 diagnostics.arXiv preprint arXiv:2311.17677, 2023

Yifan Wu, Hayden Gunraj, Chi-en Amy Tai, and Alexan- der Wong. Covidx cxr-4: An expanded multi-institutional open-source benchmark dataset for chest x-ray image- based computer-aided covid-19 diagnostics.arXiv preprint arXiv:2311.17677, 2023. 5, 6

work page arXiv 2023
[39]

Post-pre-training for modality alignment in vision-language foundation models

Shin’ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, and Daiki Chijiwa. Post-pre-training for modality alignment in vision-language foundation models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1

work page 2025
[40]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Mma: Multi-modal adapter for vision-language models

Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiao- hua Xie. Mma: Multi-modal adapter for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23826– 23837, 2024. 1, 2, 5, 6, 7

work page 2024
[42]

Visual- language prompt tuning with knowledge-guided context op- timization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual- language prompt tuning with knowledge-guided context op- timization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767,

work page
[43]

Low-rank few-shot adaptation of vision-language models

Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024. 2, 5, 6, 7

work page 2024
[44]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 1, 2, 5

work page 2023
[45]

Disease-informed adaptation of vision-language mod- els.IEEE Transactions on Medical Imaging, 2024

Jiajin Zhang, Ge Wang, Mannudeep K Kalra, and Pingkun Yan. Disease-informed adaptation of vision-language mod- els.IEEE Transactions on Medical Imaging, 2024. 3

work page 2024
[46]

Tip- 11 adapter: Training-free adaption of clip for few-shot classi- fication

Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kun- chang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip- 11 adapter: Training-free adaption of clip for few-shot classi- fication. InComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceed- ings, Part XXXV, page 493–510. Springer-Verlag, 2022. 2

work page 2022
[47]

Knowledge-enhanced visual-language pre- training on chest radiology images.Nature Communications, 14(1):4542, 2023

Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. Knowledge-enhanced visual-language pre- training on chest radiology images.Nature Communications, 14(1):4542, 2023. 3

work page 2023
[48]

Mediclip: Adapting clip for few-shot medical image anomaly detection

Ximiao Zhang, Min Xu, Dehui Qiu, Ruixin Yan, Ning Lang, and Xiuzhuang Zhou. Mediclip: Adapting clip for few-shot medical image anomaly detection. InInternational Confer- ence on Medical Image Computing and Computer-Assisted Intervention, pages 458–468. Springer, 2024. 1, 3

work page 2024
[49]

Adapting pre- trained vision transformers from 2d to 3d through weight in- flation improves medical image segmentation

Yuhui Zhang, Shih-Cheng Huang, Zhengping Zhou, Matthew P Lungren, and Serena Yeung. Adapting pre- trained vision transformers from 2d to 3d through weight in- flation improves medical image segmentation. InMachine Learning for Health, pages 391–404. PMLR, 2022. 1, 3

[50]

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine learning for healthcare conference, pages 2–25. PMLR, 2022. 3

[51]

Zihao Zhao, Yuxiao Liu, Han Wu, Mei Wang, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Zhiming Cui, Qian Wang, et al. Clip in medical imaging: A survey. Medical Image Analysis, page 103551, 2025. 2

[52]

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825,

[53]

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348,

[54]

Yang Zhou, Tan Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, and Rick Siow Mong Goh. Benchx: A unified benchmark framework for medical vision-language pretraining on chest x-rays. Advances in Neural Information Processing Systems, 37:6625–6647, 2024. 3

[55]

Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15659–15669, 2023. 1, 2, 5, 6, 7

