The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

Ahmad Algadhi; Ahmed Alzuhair; Muzammil Behzad; Omar Alkhulaif

arxiv: 2606.23897 · v1 · pith:XHUSUDRCnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 08:40 UTC · model grok-4.3

keywords: prompt distillation, vision-language models, multi-teacher ensembling, unsupervised distillation, CLIP, domain shift, base-to-novel generalization, harmonic mean

The pith

A two-teacher ensemble of domain-finetuned and zero-shot models improves unsupervised prompt distillation for vision-language models, with the biggest lift on domain-shifted data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends single-teacher prompt distillation by distilling from two fixed teachers at once: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 whose logits are pre-computed per dataset. It compares single-teacher PromptKD against equal-probability and confidence-weighted ensembling on four base-to-novel datasets. Confidence-weighted ensembling raises average harmonic mean from 87.52 to 89.28, with equal averaging reaching 88.88; the improvement is negligible on Caltech-101 but reaches 5.78 points on EuroSAT. The results indicate that the second teacher supplies useful complementary supervision mainly when the test distribution shifts from the fine-tuning domain.

Core claim

The paper establishes that prompt distillation from a fixed two-teacher ensemble, using confidence-weighted averaging of logits from a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher, raises average harmonic mean performance from 87.52 to 89.28 across Caltech-101, DTD, UCF101, and EuroSAT, with the largest gain of 5.78 points on the domain-shifted EuroSAT dataset.

What carries the argument

The two-teacher ensemble that pre-computes logits from a domain-finetuned PromptSRC teacher and a zero-shot EVA-CLIP teacher, then applies confidence-weighted averaging to form the distillation target for each unlabeled image.
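A minimal sketch of how the two ensembling rules could form a distillation target, in plain Python for illustration only. The exact confidence-weighting rule is not stated in this summary, so weighting each teacher by its maximum softmax probability is an assumption, as is averaging in probability space rather than logit space. The function names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def equal_average(p_a, p_b):
    """Equal-probability ensembling: plain average of two distributions."""
    return [0.5 * (a + b) for a, b in zip(p_a, p_b)]

def confidence_weighted(p_a, p_b):
    """Weight each teacher by its own max probability (ASSUMED rule,
    not confirmed by the source)."""
    c_a, c_b = max(p_a), max(p_b)
    w_a = c_a / (c_a + c_b)
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(p_a, p_b)]

# Hypothetical logits for one unlabeled image over three classes.
finetuned_logits = [2.0, 1.8, 0.1]  # unsure between class 0 and class 1
zero_shot_logits = [0.5, 4.0, 0.2]  # confident in class 1

p_ft = softmax(finetuned_logits)
p_zs = softmax(zero_shot_logits)
target_eq = equal_average(p_ft, p_zs)
target_cw = confidence_weighted(p_ft, p_zs)

print("equal average      :", [round(p, 3) for p in target_eq])
print("confidence weighted:", [round(p, 3) for p in target_cw])
```

With these made-up logits the zero-shot teacher is the more confident one, so the confidence-weighted target puts more mass on class 1 than the equal average does. Both targets remain valid probability distributions.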

If this is right

Confidence-weighted ensembling outperforms equal averaging on average across the four tested datasets (89.28 vs. 88.88 HM).
Multi-teacher distillation produces the largest accuracy lift precisely when the second teacher is applied to domain-shifted data.
The method requires no additional training of the teachers, only pre-computation of their logits per dataset.
Single-teacher results from prior work are improved by adding the zero-shot teacher under domain shift.
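The pre-computation step in the list above can be pictured as a one-time caching pass over the unlabeled images. The sketch below is only an illustration under stated assumptions: `fake_zero_shot_teacher` is a hypothetical stand-in for a frozen teacher's forward pass, and the JSON layout and file name are invented for the example, not taken from the paper.

```python
import json
import os
import tempfile

def fake_zero_shot_teacher(image_id):
    """Hypothetical stand-in for a frozen teacher; returns three 'logits'."""
    seed = sum(ord(ch) for ch in image_id)
    return [float((seed * k) % 7) for k in (1, 2, 3)]

def build_logit_cache(image_ids, teacher_fn, cache_path):
    """Run the teacher once per image and store the logits on disk."""
    cache = {image_id: teacher_fn(image_id) for image_id in image_ids}
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    return cache

def load_logit_cache(cache_path):
    """Read cached logits back; no teacher forward pass is needed."""
    with open(cache_path, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical image identifiers and cache location.
image_ids = ["img_0001", "img_0002", "img_0003"]
cache_path = os.path.join(tempfile.mkdtemp(), "zero_shot_logits.json")

written = build_logit_cache(image_ids, fake_zero_shot_teacher, cache_path)
cached = load_logit_cache(cache_path)
print(cached["img_0001"])
```

During distillation the student loop would look up `cached[image_id]` instead of calling the second teacher, which is what keeps that teacher's per-step cost low.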

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested with three or more teachers to check whether further complementary signals continue to add value.
If the zero-shot teacher already captures most of the needed signal, domain-specific fine-tuning of the first teacher might become unnecessary in some settings.
The pre-computed logits make the ensemble cheap to apply at distillation time, suggesting it could scale to more teachers at little extra per-step cost, although each added teacher would still need its own one-time pre-computation pass.

Load-bearing premise

The zero-shot EVA-CLIP teacher supplies genuinely complementary supervision that the domain-finetuned PromptSRC teacher does not already capture, especially on domain-shifted data.

What would settle it

If re-running the EuroSAT experiments with the same teachers across several random seeds showed no consistent gain for the confidence-weighted ensemble over the single PromptSRC teacher, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2606.23897 by Ahmad Algadhi, Ahmed Alzuhair, Muzammil Behzad, Omar Alkhulaif.

**Figure 1.** TheProfessor architecture. Stage I pre-trains one PromptSRC teacher per source dataset using labeled images. Stage II trains a ViT-B/16 student on unlabeled images from the same dataset using a frozen PromptSRC teacher and cached EVA-CLIP-L/14 logits. The ensemble target is formed by equal averaging or confidence weighting, and the student is optimized with KL divergence. At inference time, only the traine…
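The Stage II objective in the caption can be illustrated with a small KL-divergence computation. This is a sketch under assumptions: the direction KL(target || student) and the temperature parameter follow common distillation practice and are not confirmed by this summary, and all numbers are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, student, eps=1e-12):
    """KL(target || student) for two discrete distributions."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(target, student)
        if t > 0.0
    )

# Hypothetical ensemble target and student logits for one image.
ensemble_target = [0.10, 0.85, 0.05]
student_logits = [0.2, 1.5, -0.3]

student_probs = softmax(student_logits, temperature=1.0)
loss = kl_divergence(ensemble_target, student_probs)
print(f"KL loss: {loss:.4f}")
```

The loss is zero only when the student distribution matches the ensemble target, so minimizing it pulls the student toward the combined teacher prediction.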

Original abstract

Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
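For readers unfamiliar with the metric, HM in base-to-novel evaluation is the harmonic mean of base-class and novel-class accuracy. The sketch below computes it for hypothetical accuracies (not values from the paper) and recomputes the deltas from the rounded averages quoted in the abstract. The rounded averages differ by 1.76 and 1.36, while the abstract reports +1.77 and +1.37; one plausible explanation, which is an assumption here, is that the reported deltas were computed from unrounded averages.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Hypothetical accuracies, chosen only to illustrate the formula.
hm_example = harmonic_mean(90.0, 80.0)

# Average HM values as quoted in the abstract (already rounded).
baseline_hm = 87.52
confidence_weighted_hm = 89.28
equal_average_hm = 88.88

delta_cw = round(confidence_weighted_hm - baseline_hm, 2)
delta_eq = round(equal_average_hm - baseline_hm, 2)

print(f"example HM: {hm_example:.2f}")
print(f"confidence-weighted delta from rounded averages: {delta_cw:.2f}")
print(f"equal-average delta from rounded averages: {delta_eq:.2f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot score a high HM by trading novel-class accuracy for base-class accuracy.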

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Incremental multi-teacher extension to PromptKD that shows the biggest lift on domain-shifted data but stays within the existing empirical paradigm.

The new piece is the two-teacher setup: a domain-finetuned PromptSRC ViT-L/14 plus a fixed zero-shot EVA-CLIP-L/14, with logits combined either by equal averaging or by confidence weighting before distilling into the ViT-B/16 student. They run this on Caltech-101, DTD, UCF101, and EuroSAT and report that confidence weighting raises average HM by 1.77 points over the single-teacher PromptKD baseline (equal averaging gets 1.37), with the largest move on EuroSAT (+5.78).

The paper does the straightforward thing of testing whether the second teacher supplies useful signal the first one misses, and the numbers line up with that story on the domain-shifted set. The protocol is a 12-run single-seed sweep on public data, which keeps the comparison clean and avoids obvious self-citation loops.

The soft spots are exactly what you would expect from an abstract-level report: no error bars, single seed, and no breakdown of how much the teachers actually disagree on each dataset. That makes the +1.77 average harder to interpret as a stable improvement rather than run-to-run noise. The claim that the zero-shot teacher is complementary under shift is consistent with the results but would be tighter if the paper showed logit correlation or per-class disagreement stats.
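The disagreement statistic suggested above could be computed along the following lines. This is a hedged sketch: the helper names are hypothetical, the logits are made up, and top-1 disagreement is only one of several possible ways to quantify how complementary two teachers are.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def disagreement_stats(logits_a, logits_b):
    """Return the top-1 disagreement rate and a count of disagreements
    keyed by teacher A's predicted class."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both teachers must score the same images")
    per_class = Counter()
    for a, b in zip(logits_a, logits_b):
        pred_a, pred_b = argmax(a), argmax(b)
        if pred_a != pred_b:
            per_class[pred_a] += 1
    rate = sum(per_class.values()) / len(logits_a)
    return rate, per_class

# Hypothetical logits for four images and three classes.
teacher_a_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1], [0.1, 0.2, 3.0]]
teacher_b_logits = [[1.5, 0.2, 0.1], [0.1, 2.2, 0.3], [0.2, 1.9, 0.4], [0.3, 0.1, 2.5]]

rate, per_class = disagreement_stats(teacher_a_logits, teacher_b_logits)
print(f"top-1 disagreement rate: {rate:.2f}")
print("disagreements by teacher A class:", dict(per_class))
```

A disagreement rate near zero on a dataset would suggest the second teacher adds little new signal there, while a high rate would be consistent with, though not proof of, complementary supervision.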

This is for the small group already working on prompt distillation and VLM compression. A reader who needs another data point on multi-teacher ensembling for domain shift could use the numbers, but it does not change the broader method. The work is coherent on its own terms and cites the relevant baseline, so it clears the bar for peer review even though the scope is narrow.

Referee Report

2 major / 1 minor

Summary. The paper proposes The Professor, a multi-teacher extension of PromptKD for unsupervised prompt distillation in VLMs. It distills from an ensemble of a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher (with pre-computed logits), comparing single-teacher, equal-probability, and confidence-weighted ensembling. On Caltech-101, DTD, UCF101, and EuroSAT, a 12-run single-seed sweep shows confidence-weighted ensembling raises average HM from 87.52 to 89.28 (+1.77), with equal averaging at 88.88 (+1.37) and the largest gain on domain-shifted EuroSAT (+5.78 HM). The work claims the second teacher supplies complementary supervision under domain shift.

Significance. If the directional gains hold under proper variance estimation, the result indicates that multi-teacher logit ensembling can extract complementary zero-shot and domain-finetuned signals, extending the single-teacher PromptKD paradigm most effectively on domain-shifted data. The concrete HM deltas and dataset-specific pattern provide a falsifiable empirical claim, though the single-seed protocol and missing reproducibility details weaken the strength of the reported improvements.

major comments (2)

[Abstract] The reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the gains behind the central claim of complementary supervision are real or could be noise.
[Abstract] No verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.
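The variance reporting requested in the first comment could look like the sketch below, which summarizes HM across seeds and computes paired per-seed deltas. All numbers are placeholders invented for illustration, not results from the paper, and the three-seed setup is an assumption.

```python
from statistics import mean, stdev

# PLACEHOLDER HM values per seed (not from the paper).
hm_by_seed = {
    "single_teacher": [70.0, 71.0, 72.0],
    "confidence_weighted": [74.0, 74.0, 77.0],
}

def summarize(values):
    """Return (mean, sample standard deviation) for a list of scores."""
    return mean(values), stdev(values)

summary = {name: summarize(vals) for name, vals in hm_by_seed.items()}

# Paired deltas: same seed, ensemble minus single-teacher baseline.
paired_deltas = [
    ens - base
    for base, ens in zip(hm_by_seed["single_teacher"],
                         hm_by_seed["confidence_weighted"])
]
delta_mean, delta_std = summarize(paired_deltas)

for name, (m, s) in summary.items():
    print(f"{name}: {m:.2f} +/- {s:.2f}")
print(f"paired delta: {delta_mean:.2f} +/- {delta_std:.2f}")
```

Reporting the paired delta with its spread, rather than only two means, makes it easier to judge whether a gain exceeds run-to-run variation.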

minor comments (1)

[Abstract] The abstract states the results 'update our earlier Caltech-only analysis' but provides no citation or pointer to that prior work, which would help readers assess novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive suggestions. We address the concerns about the statistical presentation of results and the reproducibility of the experimental protocol below.

Referee: [Abstract] The reported HM improvements (+1.77 for confidence-weighted ensembling, +5.78 on EuroSAT) come from a 12-run single-seed sweep with no error bars, standard deviations, or statistical tests; this directly affects whether the central claim of complementary supervision is load-bearing or could be noise.

Authors: We acknowledge that the lack of error bars and statistical tests in the reported results limits the strength of the evidence for complementary supervision. Although the experiments consist of a 12-run single-seed sweep, which provides some measure of consistency, we agree that reporting standard deviations would be valuable. In the revised version, we will update the abstract and main text to include standard deviations across the 12 runs and discuss the implications of the single-seed protocol. We believe the directional gains, particularly the larger improvement on EuroSAT, still support the claim, but we will strengthen the presentation accordingly.

revision: yes

Referee: [Abstract] No verification is given that the HM metric and evaluation protocol match the PromptKD baseline exactly, and full training details (hyperparameters, data preprocessing, logit pre-computation) are absent; these omissions are load-bearing for reproducing the claimed gains and attributing them to the second teacher.

Authors: We agree that full details are necessary for reproducibility. The evaluation follows the exact base-to-novel protocol and HM metric as in PromptKD. In the revised manuscript, we will add explicit verification of protocol matching and include a comprehensive appendix or section with all hyperparameters, data preprocessing procedures, and details on pre-computing the logits for the zero-shot EVA-CLIP-L/14 teacher. This will allow readers to fully reproduce and attribute the gains to the multi-teacher setup.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper reports an empirical comparison of single-teacher vs. multi-teacher prompt distillation on four public datasets using fixed teacher models and standard HM metrics. No derivation, ansatz, fitted parameter, or uniqueness theorem is invoked; the claimed gains are measured directly from 12-run experiments on held-out splits. The reference to an earlier Caltech-only analysis is a minor self-citation that does not support any load-bearing premise. All reported deltas are externally falsifiable on the same public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the choice of two fixed teachers and the confidence-weighting rule.

Reference graph

Works this paper leans on

24 extracted references · 4 linked inside Pith

[1] Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. In CVPR, 2022.
[2] Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. BAM! Born-again multi-task networks for natural language understanding. In ACL, 2019.
[3] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
[4] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples. In CVPR Workshop, 2004.
[5] Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017.
[6] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[7] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In CVPR, 2023.
[8] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, 2023.
[9] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[10] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[11] Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. PromptKD: Unsupervised prompt distillation for vision-language models. In CVPR, 2024.
[12] Yuang Liu, Wei Zhang, and Jun Wang. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 2020.
[13] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In ICLR, 2020.
[14] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.
[15] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[16] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[17] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
[18] Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, et al. TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In ICCV, 2023.
[19] Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, and Yongjun Xu. CLIP-KD: An empirical study of distilling CLIP models. arXiv preprint arXiv:2307.12732, 2023.
[20] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In KDD, 2017.
[21] Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In CVPR, 2020.
[22] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In CVPR, 2022.
[23] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022.
[24] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022.

Additional Implementation Details

PromptKD configuration. We use the official PromptKD ViT-B/16 configuration vit_b16_c2_ep20_batch8_4+4ctx.yaml: 4 vision context tokens, 4 text context tokens, prompt...
