The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models
Pith reviewed 2026-06-26 08:40 UTC · model grok-4.3
The pith
A two-teacher ensemble of domain-finetuned and zero-shot models improves unsupervised prompt distillation for vision-language models, with the biggest lift on domain-shifted data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that prompt distillation from a fixed two-teacher ensemble raises average harmonic mean (HM) from 87.52 to 89.28 across Caltech-101, DTD, UCF101, and EuroSAT, with the largest gain (+5.78 points) on the domain-shifted EuroSAT dataset. The ensemble applies confidence-weighted averaging to the logits of a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher.
What carries the argument
The two-teacher ensemble that pre-computes logits from a domain-finetuned PromptSRC teacher and a zero-shot EVA-CLIP teacher, then applies confidence-weighted averaging to form the distillation target for each unlabeled image.
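The weighting step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that a teacher's confidence is its maximum softmax probability, that weights are those confidences normalized across teachers, and that averaging happens in probability space. The function names and the example logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_weighted_target(teacher_logits, temperature=1.0):
    """Blend per-teacher class distributions for a single image.

    Assumption: confidence = max softmax probability of each teacher,
    and the weights are the confidences normalized to sum to 1.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    confidences = [max(p) for p in probs]
    total_conf = sum(confidences)
    weights = [c / total_conf for c in confidences]
    num_classes = len(probs[0])
    target = [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(num_classes)
    ]
    return target, weights

def equal_average_target(teacher_logits, temperature=1.0):
    """Baseline: every teacher gets the same weight."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

# Hypothetical logits for one image over three classes.
finetuned_logits = [4.0, 1.0, 0.5]  # stands in for the PromptSRC teacher
zero_shot_logits = [1.0, 1.2, 0.9]  # stands in for the EVA-CLIP teacher
target, weights = confidence_weighted_target([finetuned_logits, zero_shot_logits])
equal_target = equal_average_target([finetuned_logits, zero_shot_logits])
```

Under these assumptions the more peaked teacher dominates the target on a given image, whereas equal averaging lets an uncertain teacher dilute it. The function accepts any number of teachers, so the same sketch covers a larger ensemble.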
If this is right
- Confidence-weighted ensembling outperforms equal averaging on the tested datasets.
- Multi-teacher distillation produces the largest accuracy lift precisely when the second teacher is applied to domain-shifted data.
- The method requires no additional training of the teachers, only pre-computation of their logits per dataset.
- Single-teacher results from prior work are improved by adding the zero-shot teacher under domain shift.
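The pre-computation step mentioned in the list above amounts to running each frozen teacher once per dataset and caching its outputs. A rough sketch, in which `dummy_teacher`, the image identifiers, and the JSON cache format are all placeholders rather than anything the paper specifies:

```python
import json

def precompute_logits(teacher_fn, images):
    """Run a frozen teacher once over unlabeled images.

    Returns a dict mapping image id to that teacher's logits.
    """
    return {image_id: teacher_fn(pixels) for image_id, pixels in images.items()}

def dummy_teacher(pixels):
    # Placeholder for a real frozen model's forward pass.
    return [float(sum(pixels)), float(len(pixels))]

# Hypothetical unlabeled images, keyed by id.
images = {"img_0": [1, 2, 3], "img_1": [4, 5]}

cache = precompute_logits(dummy_teacher, images)
serialized = json.dumps(cache)     # what would be written to disk per dataset
restored = json.loads(serialized)  # what a distillation loop would read back
```

Because the teachers are never updated, the cache only needs to be built once per dataset and teacher, and the distillation loop never has to hold a teacher in memory.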
Where Pith is reading between the lines
- The approach could be tested with three or more teachers to check whether further complementary signals continue to add value.
- If the zero-shot teacher already captures most of the needed signal, domain-specific fine-tuning of the first teacher might become unnecessary in some settings.
- The pre-computed logits make the ensemble cheap to apply at distillation time, suggesting it could scale to larger numbers of teachers without extra cost.
Load-bearing premise
The zero-shot EVA-CLIP teacher supplies genuinely complementary supervision that the domain-finetuned PromptSRC teacher does not already capture, especially on domain-shifted data.
What would settle it
A multi-seed re-run on EuroSAT, using the same teachers but different random seeds, in which the confidence-weighted ensemble shows no gain over the single PromptSRC teacher.
Original abstract
Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
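For readers unfamiliar with the metric, HM in base-to-novel evaluation is conventionally the harmonic mean of base-class and novel-class accuracy. The sketch below assumes that convention; the accuracies passed in are hypothetical, and only the 87.52, 88.88, and 89.28 averages come from the abstract.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Hypothetical accuracies, chosen only to show how HM behaves.
hm = harmonic_mean(90.0, 80.0)

# Differences between the rounded averages quoted in the abstract.
confidence_delta = round(89.28 - 87.52, 2)
equal_delta = round(88.88 - 87.52, 2)
```

HM sits below the arithmetic mean whenever the two accuracies differ, so it penalizes a student that trades novel-class accuracy for base-class accuracy. The rounded averages differ by 1.76 and 1.36, while the abstract states +1.77 and +1.37; the gap is presumably because the deltas were computed from unrounded values.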
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes The Professor, a multi-teacher extension of PromptKD for unsupervised prompt distillation in VLMs. It distills from an ensemble of a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher (with pre-computed logits), comparing single-teacher, equal-probability, and confidence-weighted ensembling. On Caltech-101, DTD, UCF101, and EuroSAT, a 12-run single-seed sweep shows confidence-weighted ensembling raises average HM from 87.52 to 89.28 (+1.77), with equal averaging at 88.88 (+1.37) and the largest gain on domain-shifted EuroSAT (+5.78 HM). The work claims the second teacher supplies complementary supervision under domain shift.
Significance. If the directional gains hold under proper variance estimation, the result indicates that multi-teacher logit ensembling can extract complementary zero-shot and domain-finetuned signals, extending the single-teacher PromptKD paradigm most effectively on domain-shifted data. The concrete HM deltas and dataset-specific pattern provide a falsifiable empirical claim, though the single-seed protocol and missing reproducibility details weaken the strength of the reported improvements.
major comments (2)
- [Abstract] The reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the central claim of complementary supervision is load-bearing or could be noise.
- [Abstract] No verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.
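The variance estimate requested in the first comment could look like the sketch below: repeat each configuration over several seeds, then report the mean and sample standard deviation of the per-seed HM gain. The numbers are placeholders chosen for easy arithmetic and are NOT results from the paper; the function names are likewise hypothetical.

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation) of a list of floats."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std

def paired_gain(baseline, ensemble):
    """Mean and sample std of per-seed differences (ensemble minus baseline)."""
    diffs = [e - b for b, e in zip(baseline, ensemble)]
    return summarize(diffs)

# Placeholder HM values for three seeds; not taken from the paper.
baseline_hm = [80.0, 81.0, 79.0]
ensemble_hm = [84.0, 86.0, 85.0]

gain_mean, gain_std = paired_gain(baseline_hm, ensemble_hm)
```

Pairing by seed removes seed-level variation shared by both configurations. A gain whose mean is small relative to its standard deviation would not support the complementary-supervision claim.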
minor comments (1)
- [Abstract] The abstract states the results 'update our earlier Caltech-only analysis' but provides no citation or pointer to that prior work, which would help readers assess novelty.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive suggestions. We address the concerns about the statistical presentation of results and the reproducibility of the experimental protocol below.
- Referee: [Abstract] The reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the central claim of complementary supervision is load-bearing or could be noise.
  Authors: We acknowledge that the lack of error bars and statistical tests in the reported results limits the strength of the evidence for complementary supervision. Although the experiments consist of a 12-run single-seed sweep, which provides some measure of consistency, we agree that reporting standard deviations would be valuable. In the revised version, we will update the abstract and main text to include standard deviations across the 12 runs and discuss the implications of the single-seed protocol. We believe the directional gains, particularly the larger improvement on EuroSAT, still support the claim, but we will strengthen the presentation accordingly.
  Revision: yes
- Referee: [Abstract] No verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.
  Authors: We agree that full details are necessary for reproducibility. The evaluation follows the exact base-to-novel protocol and HM metric as in PromptKD. In the revised manuscript, we will add explicit verification of protocol matching and include a comprehensive appendix or section with all hyperparameters, data preprocessing procedures, and details on pre-computing the logits for the zero-shot EVA-CLIP-L/14 teacher. This will allow readers to fully reproduce and attribute the gains to the multi-teacher setup.
  Revision: yes
Circularity Check
No significant circularity
The paper reports an empirical comparison of single-teacher vs. multi-teacher prompt distillation on four public datasets using fixed teacher models and standard HM metrics. No derivation, ansatz, fitted parameter, or uniqueness theorem is invoked; the claimed gains are measured directly from 12-run experiments on held-out splits. The reference to an earlier Caltech-only analysis is a minor self-citation that does not support any load-bearing premise. All reported deltas are externally falsifiable on the same public benchmarks.
Reference graph
Works this paper leans on
- [1] Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. In CVPR, 2022.
- [2] Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. BAM! Born-again multi-task networks for natural language understanding. In ACL, 2019.
- [3] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
- [4] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples. In CVPR Workshop, 2004.
- [5] Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017.
- [6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [7] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In CVPR, 2023.
- [8] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, 2023.
- [9] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- [10] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- [11] Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. PromptKD: Unsupervised prompt distillation for vision-language models. In CVPR, 2024.
- [12] Yuang Liu, Wei Zhang, and Jun Wang. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 2020.
- [13] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In ICLR, 2020.
- [14] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.
- [15] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [16] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
- [17] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
- [18] Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In ICCV, 2023.
- [19] Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, and Yongjun Xu. CLIP-KD: An empirical study of distilling CLIP models. arXiv preprint arXiv:2307.12732, 2023.
- [20] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In KDD, 2017.
- [21] Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In CVPR, 2020.
- [22] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In CVPR, 2022.
- [23] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337-2348, 2022.
- [24] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022.
Additional Implementation Details
PromptKD configuration. We use the official PromptKD ViT-B/16 configuration vit_b16_c2_ep20_batch8_4+4ctx.yaml: 4 vision context tokens, 4 text context tokens, prompt...