Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Tran Dinh Tien; Zhiqiang Shen

arxiv: 2606.04922 · v1 · pith:D3G5BJ5Lnew · submitted 2026-06-03 · 💻 cs.CV · cs.AI· cs.LG

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Tran Dinh Tien , Zhiqiang Shen This is my paper

Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords prompt tuningvision-language modelsknowledge distillationmedical imagingclass relationsfew-shot adaptationbiomedical VLMsgeometry-aware distillation

0 comments

The pith

Directional geometry targets from class relations improve prompt tuning of biomedical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard prompt tuning for vision-language models in medicine optimizes only the correct class label and treats every other class as equally wrong. This approach ignores relations that exist between classes, such as similar pathologies, and produces unstable decision boundaries when training data is limited. The paper introduces a distillation method that builds directional targets inside the teacher model; these targets keep the ground-truth label intact while adding geometric structure among classes. Two losses then transfer the structure: one on the global image token and one on attentive patch tokens. Experiments across eleven medical datasets show consistent accuracy gains and better handling of unseen classes under base-to-novel and few-shot protocols.

Core claim

We propose Omni-Geometry Knowledge Distillation (OGKD) that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop Global Geometry-Aware Distillation (GAD) on the global image token and Label-Guided Geometry Distillation (LGD) on attentive patch tokens. Across comprehensive experiments on 11 medical datasets for base-to-novel and few-shot evaluations, OGKD achieves average absolute accuracy gains of 1.7%-2.8% over prior state-of-the-art VLM adaptation methods, with robust generalization to unseen classes and more reliable predictions.

What carries the argument

Omni-Geometry Knowledge Distillation (OGKD) framework that generates directional targets encoding inter-class geometry while preserving ground truth, applied via Global Geometry-Aware Distillation (GAD) and Label-Guided Geometry Distillation (LGD) losses.

If this is right

Prompt-tuned models reach higher accuracy on medical image classification under limited supervision.
Gains appear in both base-to-novel class splits and few-shot regimes.
Generalization to unseen classes improves compared with prior adaptation techniques.
Prediction reliability increases as measured across the evaluated datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometry-injection idea could be tested on natural-image datasets that contain clear semantic hierarchies between classes.
If directional targets can be constructed from class labels alone without a separate teacher, dependence on large pre-trained VLMs might decrease.
Applying the method to tasks with conflicting or noisy class relations would reveal when geometry injection helps versus harms performance.

Load-bearing premise

Clinically meaningful relations between classes can be captured as directional geometry targets in the teacher model without distorting the ground-truth signal.

What would settle it

Measure whether the reported accuracy gains disappear when the same method is applied to a medical dataset in which all inter-class relations have been randomly permuted or removed.

Figures

Figures reproduced from arXiv: 2606.04922 by Tran Dinh Tien, Zhiqiang Shen.

**Figure 2.** Figure 2: Risk–coverage across four biomedical datasets. Our method yields consistently lower curves (smaller AURC) than BiomedCoOp (Koleilat et al., 2025), indicating fewer high-confidence errors at the same coverage and improved selective reliability. Biomedical imaging poses additional challenges due to complex visual features, diverse modalities, and heterogeneous acquisition devices. BiomedCoOp (Koleilat et al.… view at source ↗

**Figure 3.** Figure 3: From biomedical text prototypes, we build a class graph [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Wins analysis via KDE curves on the RETINA dataset. We compare BiomedCoOp and our method on challenging test samples that our method classifies correctly while BiomedCoOp is wrong (“wins”). Left: KDE of the logit margin z(y) − maxj̸=y z(j). Right: KDE of the probability assigned to the ground-truth class’s top-k neighbors in W (here k=3). The legend in each panel also reports Σ, the sum over all win sample… view at source ↗

**Figure 5.** Figure 5: Probability mass on disease classes (excluded GT) on Retina dataset. Each curve is a kernel density estimate over per-image quantity s(x) = P c∈D pθ(y = c | x), where D are the disease classes (Normal excluded). Legends report P x s(x) (Σ): total probability mass placed on disease classes across all images in the subset. Left: subset where both methods misclassify and the ground-truth class is a disease. R… view at source ↗

**Figure 6.** Figure 6: Few-shot evaluation averaged over 11 datasets under different teacher budgets (number of LLM prompt templates per class) with shots K ∈ {1, 2, 4, 8, 16}. The left y-axis shows absolute accuracies (%) for BiomedCoOp (Koleilat et al., 2025) and our method; the right y-axis and the green strip show the per-budget improvement ∆ = Ours − BiomedCoOp (%), annotated at each budget. Analysis on misclassified sample… view at source ↗

**Figure 7.** Figure 7: Effect of geometry strength γ on accuracy. As shown in [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

read the original abstract

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OGKD frames class-relation geometry as directional distillation targets for medical VLM prompt tuning, but the abstract supplies no construction details or checks that the targets stay faithful to ground truth.

read the letter

The main thing here is a new pair of distillation losses that try to inject inter-class geometry into teacher targets for prompt-tuned VLMs on medical images. Global Geometry-Aware Distillation works on the image token while Label-Guided Geometry Distillation applies the same structure to attentive patches. The claim is that this respects clinically meaningful relations without treating every wrong class as identical, and the paper reports 1.7-2.8% average gains over prior adaptation methods across 11 datasets in base-to-novel and few-shot setups, plus better unseen-class generalization.

What is actually new is the explicit separation into global and patch-level geometry losses with directional targets derived from class relations. Standard prompt tuning and adapter work optimizes only the ground-truth logit; this one adds a geometry term on top. That framing is not directly in the cited prior literature, and the medical imaging setting gives it a clear use case where class relations (e.g., different pathologies) are not arbitrary.

The soft spots are exactly where the stress-test note flags them. The abstract never says how the geometry targets are computed, whether they keep the ground-truth class as the unique optimum, or any verification that the directional encoding does not push an incorrect class closer than the true one. No error bars, no ablation that isolates the geometry component, and no teacher construction details appear. Those omissions make the performance numbers difficult to evaluate on their own.

The paper is aimed at researchers adapting VLMs to medical imaging under data constraints. Someone already running prompt tuning experiments could pick up the loss formulation and test it quickly. It is coherent on its own terms and shows honest engagement with the limitation of ignoring class relations, so it clears the bar for serious refereeing even though the current evidence is thin. I would send it out for review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces Omni-Geometry Knowledge Distillation (OGKD) for prompt tuning of biomedical vision-language models. It constructs directional geometry targets in the teacher that are claimed to preserve ground-truth labels while encoding inter-class relations, then applies two losses—Global Geometry-Aware Distillation (GAD) on the global token and Label-Guided Geometry Distillation (LGD) on attentive patches—to improve adaptation under limited supervision. Experiments across 11 medical datasets report average absolute accuracy gains of 1.7–2.8% over prior SOTA VLM adaptation methods in base-to-novel and few-shot settings, plus improved generalization to unseen classes.

Significance. If the geometry targets can be shown to preserve ground truth without distortion, the framework would address a genuine limitation of standard distillation (equal treatment of incorrect classes) in data-scarce medical domains where class relations carry clinical meaning. The public code release supports reproducibility and is a clear strength.

major comments (3)

[§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.
[§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.
[Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.

minor comments (2)

[§3] Notation for the geometry targets and the two losses (GAD, LGD) should be introduced with explicit equations rather than descriptive text only.
[§4] Table captions and axis labels in the result figures should explicitly state whether the reported numbers are means over multiple runs or single-run values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying aspects of the method and experiments while committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.

Authors: We agree that the current description in §3 would benefit from greater explicitness to allow full verification. The directional geometry targets are formed from the frozen teacher's class-level embeddings by computing a similarity-based geometry matrix (via normalized dot products) and then shifting the target logits such that the ground-truth class is fixed at the maximum value while incorrect-class entries are modulated proportionally to their embedding distances. The teacher is the unmodified pre-trained VLM. We will insert explicit pseudocode, a step-by-step encoding procedure, and a short argument showing that the ground-truth class remains the unique optimum under this construction. revision: yes
Referee: [§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.

Authors: We concur that these elements would make the experimental claims more robust. In the revision we will report mean accuracy with standard error bars over multiple random seeds, add an ablation that removes only the geometry injection (keeping all other loss terms and hyperparameters fixed), and include a counter-example audit confirming that the constructed targets never assign a higher value to an incorrect class than to the ground-truth class on any sample from the 11 datasets. revision: yes
Referee: [Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.

Authors: The consistent gains and improved generalization to unseen classes across the 11 datasets provide supporting empirical evidence that the targets remain faithful in practice. To make this explicit, we will add a dedicated verification paragraph (with a table) showing that, for every test image in all evaluated datasets, the ground-truth class retains the highest target value after geometry injection. This directly addresses the request for verification. revision: yes

Circularity Check

0 steps flagged

No circularity; new loss construction presented as independent of fitted inputs

full rationale

The provided abstract and description introduce OGKD, GAD, and LGD as novel distillation losses that construct directional geometry targets from class relations. No equations, self-citations, or derivation steps are exhibited that reduce the targets, losses, or reported gains (1.7-2.8%) to quantities defined by the paper's own fitted parameters or prior self-referential results. The central claim rests on empirical evaluation across datasets rather than any self-definitional or fitted-input reduction. This is the expected non-finding for a methods paper whose contribution is a new loss formulation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the existence of extractable inter-class geometric structure that can be turned into directional teacher targets. No free parameters, axioms, or invented physical entities are explicitly introduced in the provided text.

pith-pipeline@v0.9.1-grok · 5763 in / 1233 out tokens · 30388 ms · 2026-06-28T06:31:46.240363+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 7 canonical work pages · 3 internal anchors

[1]

Dataset of breast ultrasound images

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28: 0 104863, 2020

2020
[2]

Learning to exploit temporal structure for biomedical vision-language processing

Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15016--15027, 2023

2023
[3]

Making the most of text semantics to improve biomedical vision--language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision--language processing. In European conference on computer vision, pp.\ 1--21. Springer, 2022

2022
[4]

Lung and colon cancer histopathological image dataset (lc25000)

Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142, 2019

work page arXiv 1912
[5]

Knee osteoarthritis severity grading dataset

Pingjun Chen. Knee osteoarthritis severity grading dataset. Mendeley Data, 1 0 (10.17632): 0 30784984, 2018

2018
[6]

Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

Sedigheh Eslami, Gerard De Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

work page arXiv 2021
[8]

Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling

Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, and Weidong Cai. Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11524--11534, 2024

2024
[9]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017

2017
[10]

Clip-adapter: Better vision-language models with feature adapters

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132 0 (2): 0 581--595, 2024

2024
[11]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017

2017
[12]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition

Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3942--3951, 2021

2021
[14]

Lp++: A surprisingly strong linear probe for few-shot clip

Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 23773--23782, 2024

2024
[15]

A visual--language foundation model for pathology image analysis using medical twitter

Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual--language foundation model for pathology image analysis using medical twitter. Nature medicine, 29 0 (9): 0 2307--2316, 2023

2023
[16]

Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography

Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography. Scientific Reports, 12 0 (1): 0 11440, 2022

2022
[17]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

2021
[18]

Learning generalizable prompt for clip with class similarity knowledge

Sehun Jung and Hyang-won Lee. Learning generalizable prompt for clip with class similarity knowledge. arXiv preprint arXiv:2502.11969, 2025

work page arXiv 2025
[19]

Multi-class texture analysis in colorectal cancer histology

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Z \"o llner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6 0 (1): 0 1--11, 2016

2016
[20]

Identifying medical diagnoses and treatable diseases by image-based deep learning

Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172 0 (5): 0 1122--1131, 2018

2018
[21]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 19113--19122, 2023 a

2023
[22]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15190--15200, 2023 b

2023
[23]

Automatic no-reference quality assessment for retinal fundus images using vessel segmentation

Thomas K \"o hler, Attila Budai, Martin F Kraus, Jan Odstr c ilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation. In Proceedings of the 26th IEEE international symposium on computer-based medical systems, pp.\ 95--100. IEEE, 2013

2013
[24]

Biomedcoop: Learning to prompt for biomedical vision-language models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 14766--14776, 2025

2025
[25]

Current perspectives in medical image perception

Elizabeth A Krupinski. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 72 0 (5): 0 1205--1217, 2010

2010
[26]

Graphadapter: Tuning vision-language models with dual knowledge graph

Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang. Graphadapter: Tuning vision-language models with dual knowledge graph. Advances in Neural Information Processing Systems, 36: 0 13448--13466, 2023

2023
[27]

Promptkd: Unsupervised prompt distillation for vision-language models

Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26617--26626, 2024

2024
[28]

A survey on deep learning in medical image analysis

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I S \'a nchez. A survey on deep learning in medical image analysis. Medical image analysis, 42: 0 60--88, 2017

2017
[29]

A visual-language foundation model for computational pathology

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature medicine, 30 0 (3): 0 863--874, 2024

2024
[30]

Joint learning of localized representations from medical images and reports

Philip M \"u ller, Georgios Kaissis, Congyu Zou, and Daniel Rueckert. Joint learning of localized representations from medical images and reports. In European conference on computer vision, pp.\ 685--701. Springer, 2022

2022
[31]

Brain tumor mri dataset

Msoud Nickparvar. Brain tumor mri dataset. Kaggle, 2021

2021
[32]

Quilt-1m: One million image-text pairs for histopathology

Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. arXiv e-prints, pp.\ arXiv--2306, 2023

2023
[33]

Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference,...

2017
[34]

Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research

Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data, 3 0 (3): 0 25, 2018

2018
[35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

2021
[36]

A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision

Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, and Ismail Ben Ayed. A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision. Medical Image Analysis, 99: 0 103357, 2025

2025
[37]

Prototypical networks for few-shot learning

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

2017
[38]

Covid-19 infection localization and severity grading from chest x-ray images

Anas M Tahir, Muhammad EH Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M Sohel Rahman, Somaya Al-Maadeed, et al. Covid-19 infection localization and severity grading from chest x-ray images. Computers in biology and medicine, 139: 0 105002, 2021

2021
[39]

The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5 0 (1): 0 1--9, 2018

2018
[40]

Matching networks for one shot learning

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016

2016
[41]

Medclip: Contrastive learning from unpaired medical images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, pp.\ 3876, 2022

2022
[42]

Clip-kd: An empirical study of clip model distillation

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15952--15962, 2024

2024
[43]

Visual-language prompt tuning with knowledge-guided context optimization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6757--6767, 2023

2023
[44]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022

2022
[45]

Tip-adapter: Training-free clip-adapter for better vision-language modeling

Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021

work page arXiv 2021
[46]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16816--16825, 2022 a

2022
[48]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 b

2022
[49]

Prompt-aligned gradient for prompt tuning

Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15659--15669, 2023

2023

[1] [1]

Dataset of breast ultrasound images

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28: 0 104863, 2020

2020

[2] [2]

Learning to exploit temporal structure for biomedical vision-language processing

Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15016--15027, 2023

2023

[3] [3]

Making the most of text semantics to improve biomedical vision--language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision--language processing. In European conference on computer vision, pp.\ 1--21. Springer, 2022

2022

[4] [4]

Lung and colon cancer histopathological image dataset (lc25000)

Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142, 2019

work page arXiv 1912

[5] [5]

Knee osteoarthritis severity grading dataset

Pingjun Chen. Knee osteoarthritis severity grading dataset. Mendeley Data, 1 0 (10.17632): 0 30784984, 2018

2018

[6] [6]

Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

Sedigheh Eslami, Gerard De Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

work page arXiv 2021

[8] [8]

Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling

Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, and Weidong Cai. Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11524--11534, 2024

2024

[9] [9]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017

2017

[10] [10]

Clip-adapter: Better vision-language models with feature adapters

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132 0 (2): 0 581--595, 2024

2024

[11] [11]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017

2017

[12] [12]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition

Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3942--3951, 2021

2021

[14] [14]

Lp++: A surprisingly strong linear probe for few-shot clip

Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 23773--23782, 2024

2024

[15] [15]

A visual--language foundation model for pathology image analysis using medical twitter

Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual--language foundation model for pathology image analysis using medical twitter. Nature medicine, 29 0 (9): 0 2307--2316, 2023

2023

[16] [16]

Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography

Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography. Scientific Reports, 12 0 (1): 0 11440, 2022

2022

[17] [17]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

2021

[18] [18]

Learning generalizable prompt for clip with class similarity knowledge

Sehun Jung and Hyang-won Lee. Learning generalizable prompt for clip with class similarity knowledge. arXiv preprint arXiv:2502.11969, 2025

work page arXiv 2025

[19] [19]

Multi-class texture analysis in colorectal cancer histology

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Z \"o llner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6 0 (1): 0 1--11, 2016

2016

[20] [20]

Identifying medical diagnoses and treatable diseases by image-based deep learning

Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172 0 (5): 0 1122--1131, 2018

2018

[21] [21]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 19113--19122, 2023 a

2023

[22] [22]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15190--15200, 2023 b

2023

[23] [23]

Automatic no-reference quality assessment for retinal fundus images using vessel segmentation

Thomas K \"o hler, Attila Budai, Martin F Kraus, Jan Odstr c ilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation. In Proceedings of the 26th IEEE international symposium on computer-based medical systems, pp.\ 95--100. IEEE, 2013

2013

[24] [24]

Biomedcoop: Learning to prompt for biomedical vision-language models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 14766--14776, 2025

2025

[25] [25]

Current perspectives in medical image perception

Elizabeth A Krupinski. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 72 0 (5): 0 1205--1217, 2010

2010

[26] [26]

Graphadapter: Tuning vision-language models with dual knowledge graph

Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang. Graphadapter: Tuning vision-language models with dual knowledge graph. Advances in Neural Information Processing Systems, 36: 0 13448--13466, 2023

2023

[27] [27]

Promptkd: Unsupervised prompt distillation for vision-language models

Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26617--26626, 2024

2024

[28] [28]

A survey on deep learning in medical image analysis

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I S \'a nchez. A survey on deep learning in medical image analysis. Medical image analysis, 42: 0 60--88, 2017

2017

[29] [29]

A visual-language foundation model for computational pathology

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature medicine, 30 0 (3): 0 863--874, 2024

2024

[30] [30]

Joint learning of localized representations from medical images and reports

Philip M \"u ller, Georgios Kaissis, Congyu Zou, and Daniel Rueckert. Joint learning of localized representations from medical images and reports. In European conference on computer vision, pp.\ 685--701. Springer, 2022

2022

[31] [31]

Brain tumor mri dataset

Msoud Nickparvar. Brain tumor mri dataset. Kaggle, 2021

2021

[32] [32]

Quilt-1m: One million image-text pairs for histopathology

Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. arXiv e-prints, pp.\ arXiv--2306, 2023

2023

[33] [33]

Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference,...

2017

[34] [34]

Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research

Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data, 3 0 (3): 0 25, 2018

2018

[35] [35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

2021

[36] [36]

A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision

Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, and Ismail Ben Ayed. A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision. Medical Image Analysis, 99: 0 103357, 2025

2025

[37] [37]

Prototypical networks for few-shot learning

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

2017

[38] [38]

Covid-19 infection localization and severity grading from chest x-ray images

Anas M Tahir, Muhammad EH Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M Sohel Rahman, Somaya Al-Maadeed, et al. Covid-19 infection localization and severity grading from chest x-ray images. Computers in biology and medicine, 139: 0 105002, 2021

2021

[39] [39]

The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5 0 (1): 0 1--9, 2018

2018

[40] [40]

Matching networks for one shot learning

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016

2016

[41] [41]

Medclip: Contrastive learning from unpaired medical images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, pp.\ 3876, 2022

2022

[42] [42]

Clip-kd: An empirical study of clip model distillation

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15952--15962, 2024

2024

[43] [43]

Visual-language prompt tuning with knowledge-guided context optimization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6757--6767, 2023

2023

[44] [44]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022

2022

[45] [45]

Tip-adapter: Training-free clip-adapter for better vision-language modeling

Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021

work page arXiv 2021

[46] [46]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16816--16825, 2022 a

2022

[48] [48]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 b

2022

[49] [49]

Prompt-aligned gradient for prompt tuning

Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15659--15669, 2023

2023