Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models
Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3
The pith
Directional geometry targets from class relations improve prompt tuning of biomedical vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Omni-Geometry Knowledge Distillation (OGKD) that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop Global Geometry-Aware Distillation (GAD) on the global image token and Label-Guided Geometry Distillation (LGD) on attentive patch tokens. Across comprehensive experiments on 11 medical datasets for base-to-novel and few-shot evaluations, OGKD achieves average absolute accuracy gains of 1.7%-2.8% over prior state-of-the-art VLM adaptation methods, with robust generalization to unseen classes and more reliable predictions.
What carries the argument
Omni-Geometry Knowledge Distillation (OGKD) framework that generates directional targets encoding inter-class geometry while preserving ground truth, applied via Global Geometry-Aware Distillation (GAD) and Label-Guided Geometry Distillation (LGD) losses.
If this is right
- Prompt-tuned models reach higher accuracy on medical image classification under limited supervision.
- Gains appear in both base-to-novel class splits and few-shot regimes.
- Generalization to unseen classes improves compared with prior adaptation techniques.
- Prediction reliability increases as measured across the evaluated datasets.
Where Pith is reading between the lines
- The same geometry-injection idea could be tested on natural-image datasets that contain clear semantic hierarchies between classes.
- If directional targets can be constructed from class labels alone without a separate teacher, dependence on large pre-trained VLMs might decrease.
- Applying the method to tasks with conflicting or noisy class relations would reveal when geometry injection helps versus harms performance.
Load-bearing premise
Clinically meaningful relations between classes can be captured as directional geometry targets in the teacher model without distorting the ground-truth signal.
What would settle it
Measure whether the reported accuracy gains disappear when the same method is applied to a medical dataset in which all inter-class relations have been randomly permuted or removed.
Figures
read the original abstract
Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni-Geometry Knowledge Distillation (OGKD) for prompt tuning of biomedical vision-language models. It constructs directional geometry targets in the teacher that are claimed to preserve ground-truth labels while encoding inter-class relations, then applies two losses—Global Geometry-Aware Distillation (GAD) on the global token and Label-Guided Geometry Distillation (LGD) on attentive patches—to improve adaptation under limited supervision. Experiments across 11 medical datasets report average absolute accuracy gains of 1.7–2.8% over prior SOTA VLM adaptation methods in base-to-novel and few-shot settings, plus improved generalization to unseen classes.
Significance. If the geometry targets can be shown to preserve ground truth without distortion, the framework would address a genuine limitation of standard distillation (equal treatment of incorrect classes) in data-scarce medical domains where class relations carry clinical meaning. The public code release supports reproducibility and is a clear strength.
major comments (3)
- [§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.
- [§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.
- [Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.
minor comments (2)
- [§3] Notation for the geometry targets and the two losses (GAD, LGD) should be introduced with explicit equations rather than descriptive text only.
- [§4] Table captions and axis labels in the result figures should explicitly state whether the reported numbers are means over multiple runs or single-run values.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying aspects of the method and experiments while committing to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.
Authors: We agree that the current description in §3 would benefit from greater explicitness to allow full verification. The directional geometry targets are formed from the frozen teacher's class-level embeddings by computing a similarity-based geometry matrix (via normalized dot products) and then shifting the target logits such that the ground-truth class is fixed at the maximum value while incorrect-class entries are modulated proportionally to their embedding distances. The teacher is the unmodified pre-trained VLM. We will insert explicit pseudocode, a step-by-step encoding procedure, and a short argument showing that the ground-truth class remains the unique optimum under this construction. revision: yes
-
Referee: [§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.
Authors: We concur that these elements would make the experimental claims more robust. In the revision we will report mean accuracy with standard error bars over multiple random seeds, add an ablation that removes only the geometry injection (keeping all other loss terms and hyperparameters fixed), and include a counter-example audit confirming that the constructed targets never assign a higher value to an incorrect class than to the ground-truth class on any sample from the 11 datasets. revision: yes
-
Referee: [Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.
Authors: The consistent gains and improved generalization to unseen classes across the 11 datasets provide supporting empirical evidence that the targets remain faithful in practice. To make this explicit, we will add a dedicated verification paragraph (with a table) showing that, for every test image in all evaluated datasets, the ground-truth class retains the highest target value after geometry injection. This directly addresses the request for verification. revision: yes
Circularity Check
No circularity; new loss construction presented as independent of fitted inputs
full rationale
The provided abstract and description introduce OGKD, GAD, and LGD as novel distillation losses that construct directional geometry targets from class relations. No equations, self-citations, or derivation steps are exhibited that reduce the targets, losses, or reported gains (1.7-2.8%) to quantities defined by the paper's own fitted parameters or prior self-referential results. The central claim rests on empirical evaluation across datasets rather than any self-definitional or fitted-input reduction. This is the expected non-finding for a methods paper whose contribution is a new loss formulation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Dataset of breast ultrasound images
Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28: 0 104863, 2020
2020
-
[2]
Learning to exploit temporal structure for biomedical vision-language processing
Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15016--15027, 2023
2023
-
[3]
Making the most of text semantics to improve biomedical vision--language processing
Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision--language processing. In European conference on computer vision, pp.\ 1--21. Springer, 2022
2022
-
[4]
Lung and colon cancer histopathological image dataset (lc25000)
Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142, 2019
-
[5]
Knee osteoarthritis severity grading dataset
Pingjun Chen. Knee osteoarthritis severity grading dataset. Mendeley Data, 1 0 (10.17632): 0 30784984, 2018
2018
-
[6]
Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
Sedigheh Eslami, Gerard De Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021
-
[8]
Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling
Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, and Weidong Cai. Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11524--11534, 2024
2024
-
[9]
Model-agnostic meta-learning for fast adaptation of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017
2017
-
[10]
Clip-adapter: Better vision-language models with feature adapters
Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132 0 (2): 0 581--595, 2024
2024
-
[11]
Selective classification for deep neural networks
Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017
2017
-
[12]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition
Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3942--3951, 2021
2021
-
[14]
Lp++: A surprisingly strong linear probe for few-shot clip
Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 23773--23782, 2024
2024
-
[15]
A visual--language foundation model for pathology image analysis using medical twitter
Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual--language foundation model for pathology image analysis using medical twitter. Nature medicine, 29 0 (9): 0 2307--2316, 2023
2023
-
[16]
Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography
Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography. Scientific Reports, 12 0 (1): 0 11440, 2022
2022
-
[17]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021
2021
-
[18]
Learning generalizable prompt for clip with class similarity knowledge
Sehun Jung and Hyang-won Lee. Learning generalizable prompt for clip with class similarity knowledge. arXiv preprint arXiv:2502.11969, 2025
-
[19]
Multi-class texture analysis in colorectal cancer histology
Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Z \"o llner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6 0 (1): 0 1--11, 2016
2016
-
[20]
Identifying medical diagnoses and treatable diseases by image-based deep learning
Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172 0 (5): 0 1122--1131, 2018
2018
-
[21]
Maple: Multi-modal prompt learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 19113--19122, 2023 a
2023
-
[22]
Self-regulating prompts: Foundational model adaptation without forgetting
Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15190--15200, 2023 b
2023
-
[23]
Automatic no-reference quality assessment for retinal fundus images using vessel segmentation
Thomas K \"o hler, Attila Budai, Martin F Kraus, Jan Odstr c ilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation. In Proceedings of the 26th IEEE international symposium on computer-based medical systems, pp.\ 95--100. IEEE, 2013
2013
-
[24]
Biomedcoop: Learning to prompt for biomedical vision-language models
Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 14766--14776, 2025
2025
-
[25]
Current perspectives in medical image perception
Elizabeth A Krupinski. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 72 0 (5): 0 1205--1217, 2010
2010
-
[26]
Graphadapter: Tuning vision-language models with dual knowledge graph
Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang. Graphadapter: Tuning vision-language models with dual knowledge graph. Advances in Neural Information Processing Systems, 36: 0 13448--13466, 2023
2023
-
[27]
Promptkd: Unsupervised prompt distillation for vision-language models
Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26617--26626, 2024
2024
-
[28]
A survey on deep learning in medical image analysis
Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I S \'a nchez. A survey on deep learning in medical image analysis. Medical image analysis, 42: 0 60--88, 2017
2017
-
[29]
A visual-language foundation model for computational pathology
Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature medicine, 30 0 (3): 0 863--874, 2024
2024
-
[30]
Joint learning of localized representations from medical images and reports
Philip M \"u ller, Georgios Kaissis, Congyu Zou, and Daniel Rueckert. Joint learning of localized representations from medical images and reports. In European conference on computer vision, pp.\ 685--701. Springer, 2022
2022
-
[31]
Brain tumor mri dataset
Msoud Nickparvar. Brain tumor mri dataset. Kaggle, 2021
2021
-
[32]
Quilt-1m: One million image-text pairs for histopathology
Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. arXiv e-prints, pp.\ arXiv--2306, 2023
2023
-
[33]
Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection
Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference,...
2017
-
[34]
Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research
Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data, 3 0 (3): 0 25, 2018
2018
-
[35]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021
2021
-
[36]
A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision
Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, and Ismail Ben Ayed. A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision. Medical Image Analysis, 99: 0 103357, 2025
2025
-
[37]
Prototypical networks for few-shot learning
Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017
2017
-
[38]
Covid-19 infection localization and severity grading from chest x-ray images
Anas M Tahir, Muhammad EH Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M Sohel Rahman, Somaya Al-Maadeed, et al. Covid-19 infection localization and severity grading from chest x-ray images. Computers in biology and medicine, 139: 0 105002, 2021
2021
-
[39]
The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5 0 (1): 0 1--9, 2018
2018
-
[40]
Matching networks for one shot learning
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016
2016
-
[41]
Medclip: Contrastive learning from unpaired medical images and text
Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, pp.\ 3876, 2022
2022
-
[42]
Clip-kd: An empirical study of clip model distillation
Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15952--15962, 2024
2024
-
[43]
Visual-language prompt tuning with knowledge-guided context optimization
Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6757--6767, 2023
2023
-
[44]
Lit: Zero-shot transfer with locked-image text tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022
2022
-
[45]
Tip-adapter: Training-free clip-adapter for better vision-language modeling
Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021
-
[46]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Conditional prompt learning for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16816--16825, 2022 a
2022
-
[48]
Learning to prompt for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 b
2022
-
[49]
Prompt-aligned gradient for prompt tuning
Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15659--15669, 2023
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.