arxiv: 2606.23897 · v1 · pith:XHUSUDRCnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

Pith reviewed 2026-06-26 08:40 UTC · model grok-4.3

keywords: prompt distillation · vision-language models · multi-teacher ensembling · unsupervised distillation · CLIP · domain shift · base-to-novel generalization · harmonic mean

The pith

A two-teacher ensemble of domain-finetuned and zero-shot models improves unsupervised prompt distillation for vision-language models, with the biggest lift on domain-shifted data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends single-teacher prompt distillation by distilling from two fixed teachers at once: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 whose logits are pre-computed per dataset. It compares single-teacher PromptKD against equal-probability and confidence-weighted ensembling on four base-to-novel datasets. Confidence-weighted ensembling raises average harmonic mean from 87.52 to 89.28, with equal averaging reaching 88.88; the improvement is negligible on Caltech-101 but reaches 5.78 points on EuroSAT. The results indicate that the second teacher supplies useful complementary supervision mainly when the test distribution shifts from the fine-tuning domain.

Core claim

The paper establishes that prompt distillation from a fixed two-teacher ensemble, using confidence-weighted averaging of logits from a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher, raises average harmonic mean performance from 87.52 to 89.28 across Caltech-101, DTD, UCF101, and EuroSAT, with the largest gain of 5.78 points on the domain-shifted EuroSAT dataset.
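For readers unfamiliar with the metric: in the base-to-novel protocol, HM is conventionally the harmonic mean of base-class accuracy and novel-class accuracy. A minimal Python sketch of that computation follows. The function name and the example accuracies are illustrative assumptions, not figures from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies, chosen only to illustrate the behaviour.
balanced = harmonic_mean(85.0, 85.0)   # same arithmetic mean as below
lopsided = harmonic_mean(95.0, 75.0)   # strong on base, weak on novel
```

Both pairs have the same arithmetic mean, but the lopsided pair gets the lower HM. This is why the metric rewards students that generalize to novel classes rather than only fitting base classes.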

What carries the argument

The two-teacher ensemble that pre-computes logits from a domain-finetuned PromptSRC teacher and a zero-shot EVA-CLIP teacher, then applies confidence-weighted averaging to form the distillation target for each unlabeled image.
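The two ensembling variants can be sketched roughly as follows. This summary does not spell out the paper's confidence-weighting rule, so the sketch assumes one plausible choice: each teacher is weighted by its maximum class probability, normalized across the two teachers. All function names and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equal_average(p_a, p_b):
    """Equal-probability ensembling: plain mean of the two distributions."""
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]


def confidence_weighted(p_a, p_b):
    """Assumed rule: weight each teacher by its max class probability."""
    c_a, c_b = max(p_a), max(p_b)
    w_a = c_a / (c_a + c_b)
    w_b = 1.0 - w_a
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]


# Hypothetical logits: teacher A is confident about class 0, teacher B is unsure.
p_a = softmax([4.0, 0.0, 0.0])
p_b = softmax([0.0, 0.5, 0.0])
target_eq = equal_average(p_a, p_b)
target_cw = confidence_weighted(p_a, p_b)
```

Under this assumed rule the confident teacher pulls the target further toward its preferred class than equal averaging does, while the target remains a valid probability distribution.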

If this is right

  • Confidence-weighted ensembling outperforms equal averaging on average across the four tested datasets (89.28 vs. 88.88 HM).
  • Multi-teacher distillation produces the largest accuracy lift precisely when the second teacher is applied to domain-shifted data.
  • The method requires no additional training of the teachers, only pre-computation of their logits per dataset.
  • Single-teacher results from prior work are improved by adding the zero-shot teacher under domain shift.
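The pre-computation step mentioned in the list above could look roughly like the sketch below: the teacher is run once per image and its logits are stored for reuse during distillation. The JSON storage format, the function names, and the `teacher_fn` callable are all assumptions made for illustration. The summary does not describe how the paper actually stores its cached logits.

```python
import json
from pathlib import Path


def cache_teacher_logits(image_ids, teacher_fn, cache_path):
    """Run a (hypothetical) teacher once per image and store its logits on disk."""
    cache = {image_id: list(teacher_fn(image_id)) for image_id in image_ids}
    Path(cache_path).write_text(json.dumps(cache))
    return cache


def load_cached_logits(cache_path):
    """Read the cached logits back; no teacher forward pass is needed."""
    return json.loads(Path(cache_path).read_text())
```

With a cache like this, the zero-shot teacher never has to be loaded during student training. Only a dictionary lookup per image is required.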

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested with three or more teachers to check whether further complementary signals continue to add value.
  • If the zero-shot teacher already captures most of the needed signal, domain-specific fine-tuning of the first teacher might become unnecessary in some settings.
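  • The pre-computed logits make the ensemble cheap to apply at distillation time, suggesting it could scale to more teachers with little added distillation-time compute, although each extra teacher would still need its own one-time logit pre-computation pass.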

Load-bearing premise

The zero-shot EVA-CLIP teacher supplies genuinely complementary supervision that the domain-finetuned PromptSRC teacher does not already capture, especially on domain-shifted data.

What would settle it

Re-run the EuroSAT experiments with the same two teachers across several random seeds. If the confidence-weighted ensemble then shows no consistent gain over the single PromptSRC teacher, the complementary-supervision claim fails.

Figures

Figures reproduced from arXiv: 2606.23897 by Ahmad Algadhi, Ahmed Alzuhair, Muzammil Behzad, Omar Alkhulaif.

Figure 1: TheProfessor architecture. Stage I pre-trains one PromptSRC teacher per source dataset using labeled images. Stage II trains a ViT-B/16 student on unlabeled images from the same dataset using a frozen PromptSRC teacher and cached EVA-CLIP-L/14 logits. The ensemble target is formed by equal averaging or confidence weighting, and the student is optimized with KL divergence. At inference time, only the traine…
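The KL-divergence objective named in the caption can be illustrated with a small sketch. It assumes the loss is KL(target ‖ student) over class probabilities, which is the usual direction in knowledge distillation. The temperature handling, the function names, and the example numbers are assumptions rather than details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, student, eps=1e-12):
    """KL(target || student) for two probability vectors of equal length."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target, student)
        if t > 0
    )


# Hypothetical ensemble target and student logits for a 3-class problem.
ensemble_target = [0.7, 0.2, 0.1]
student_probs = softmax([2.0, 1.0, 0.5])
loss = kl_divergence(ensemble_target, student_probs)
```

The loss is zero when the student reproduces the target distribution exactly and grows as the two distributions diverge.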
Original abstract

Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes The Professor, a multi-teacher extension of PromptKD for unsupervised prompt distillation in VLMs. It distills from an ensemble of a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher (with pre-computed logits), comparing single-teacher, equal-probability, and confidence-weighted ensembling. On Caltech-101, DTD, UCF101, and EuroSAT, a 12-run single-seed sweep shows confidence-weighted ensembling raises average HM from 87.52 to 89.28 (+1.77), with equal averaging at 88.88 (+1.37) and the largest gain on domain-shifted EuroSAT (+5.78 HM). The work claims the second teacher supplies complementary supervision under domain shift.

Significance. If the directional gains hold under proper variance estimation, the result indicates that multi-teacher logit ensembling can extract complementary zero-shot and domain-finetuned signals, extending the single-teacher PromptKD paradigm most effectively on domain-shifted data. The concrete HM deltas and dataset-specific pattern provide a falsifiable empirical claim, though the single-seed protocol and missing reproducibility details weaken the strength of the reported improvements.

major comments (2)
  1. [Abstract] Abstract: the reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the central claim of complementary supervision is load-bearing or could be noise.
  2. [Abstract] Abstract: no verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.
minor comments (1)
  1. [Abstract] The abstract states the results 'update our earlier Caltech-only analysis' but provides no citation or pointer to that prior work, which would help readers assess novelty.
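The variance estimate requested in the first major comment could be computed along these lines once paired multi-seed results exist. This is a minimal sketch: the function name is hypothetical, and the numbers are placeholders for three imaginary seeds, not results from the paper.

```python
import statistics


def summarize_gain(baseline_hm, ensemble_hm):
    """Mean and sample standard deviation of per-seed HM gains."""
    if len(baseline_hm) != len(ensemble_hm) or len(baseline_hm) < 2:
        raise ValueError("need paired results from at least two seeds")
    gains = [e - b for b, e in zip(baseline_hm, ensemble_hm)]
    return statistics.mean(gains), statistics.stdev(gains)


# Placeholder HM values for three hypothetical seeds; NOT results from the paper.
mean_gain, std_gain = summarize_gain([80.0, 81.0, 79.0], [84.0, 86.0, 85.0])
```

Reporting the gain as mean ± standard deviation over seeds would show whether an improvement such as the EuroSAT one is larger than run-to-run noise.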

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive suggestions. We address the concerns about the statistical presentation of results and the reproducibility of the experimental protocol below.

  1. Referee: [Abstract] Abstract: the reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the central claim of complementary supervision is load-bearing or could be noise.

    Authors: We acknowledge that the lack of error bars and statistical tests in the reported results limits the strength of the evidence for complementary supervision. Although the experiments consist of a 12-run single-seed sweep, which provides some measure of consistency, we agree that reporting standard deviations would be valuable. In the revised version, we will update the abstract and main text to include standard deviations across the 12 runs and discuss the implications of the single-seed protocol. We believe the directional gains, particularly the larger improvement on EuroSAT, still support the claim, but we will strengthen the presentation accordingly. revision: yes

  2. Referee: [Abstract] Abstract: no verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.

    Authors: We agree that full details are necessary for reproducibility. The evaluation follows the exact base-to-novel protocol and HM metric as in PromptKD. In the revised manuscript, we will add explicit verification of protocol matching and include a comprehensive appendix or section with all hyperparameters, data preprocessing procedures, and details on pre-computing the logits for the zero-shot EVA-CLIP-L/14 teacher. This will allow readers to fully reproduce and attribute the gains to the multi-teacher setup. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper reports an empirical comparison of single-teacher vs. multi-teacher prompt distillation on four public datasets using fixed teacher models and standard HM metrics. No derivation, ansatz, fitted parameter, or uniqueness theorem is invoked; the claimed gains are measured directly from 12-run experiments on held-out splits. The reference to an earlier Caltech-only analysis is a minor self-citation that does not support any load-bearing premise. All reported deltas are externally falsifiable on the same public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the choice of two fixed teachers and the confidence-weighting rule.



Reference graph

Works this paper leans on

24 extracted references · 4 linked inside Pith

  1. [1]

    Knowledge distillation with the reused teacher classifier

    Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. In CVPR, 2022

  2. [2]

    BAM! Born-again multi-task networks for natural language understanding

    Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. BAM! Born-again multi-task networks for natural language understanding. In ACL, 2019

  3. [3]

    EVA: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023

  4. [4]

    Learning generative visual models from few training examples

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples. In CVPR Workshop, 2004

  5. [5]

    Efficient knowledge distillation from an ensemble of teachers

    Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017

  6. [6]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  7. [7]

    MaPLe: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In CVPR, 2023

  8. [8]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, 2023

  9. [9]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  10. [10]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021

  11. [11]

    PromptKD: Unsupervised prompt distillation for vision-language models

    Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. PromptKD: Unsupervised prompt distillation for vision-language models. In CVPR, 2024

  12. [12]

    Adaptive multi-teacher multi-level knowledge distillation

    Yuang Liu, Wei Zhang, and Jun Wang. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 2020

  13. [13]

    Ensemble distribution distillation

    Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In ICLR, 2020

  14. [14]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019

  15. [15]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  16. [16]

    FitNets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015

  17. [17]

    EVA-CLIP: Improved training techniques for CLIP at scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023

  18. [18]

    TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance

    Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In ICCV, 2023

  19. [19]

    CLIP-KD: An empirical study of distilling CLIP models

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, and Yongjun Xu. CLIP-KD: An empirical study of distilling CLIP models. arXiv preprint arXiv:2307.12732, 2023

  20. [20]

    Learning from multiple teacher networks

    Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In KDD, 2017

  21. [21]

    Revisiting knowledge distillation via label smoothing regularization

    Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In CVPR, 2020

  22. [22]

    Decoupled knowledge distillation

    Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In CVPR, 2022

  23. [23]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022

  24. [24]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022

Additional Implementation Details

PromptKD configuration. We use the official PromptKD ViT-B/16 configuration vit_b16_c2_ep20_batch8_4+4ctx.yaml: 4 vision context tokens, 4 text context tokens, prompt...