CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 00:16 UTC · model grok-4.3
The pith
CLIP-RD aligns student embeddings to the teacher's geometry by enforcing vertical consistency and cross-modal symmetry in relational distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.
What carries the argument
Vertical Relational Distillation (VRD), which enforces distribution-level consistency of teacher-student distillation strength across modalities, and Cross Relational Distillation (XRD), which imposes bidirectional symmetry on cross-modal teacher-student similarity distributions.
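The review reproduces no equations for these two terms, so as a reading aid, here is a minimal pure-Python sketch of what a distribution-level consistency term (VRD-like) and a bidirectional cross-modal symmetry term (XRD-like) could look like. The function names, the symmetric-KL formulation, and the temperature `tau` are assumptions for illustration, not the paper's definitions.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-8):
    # KL divergence between two discrete distributions, eps-smoothed.
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def sim_dist(t, s, tau=0.07):
    # Row-wise softmax over teacher-student similarity logits (t @ s.T / tau).
    logits = [[sum(x * y for x, y in zip(rt, rs)) / tau for rs in s] for rt in t]
    return [softmax(row) for row in logits]

def sym_kl(P, Q):
    # Mean symmetric KL between two batches of distributions.
    return sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in zip(P, Q)) / len(P)

def vrd_loss(t_img, s_img, t_txt, s_txt, tau=0.07):
    # VRD-like term (assumed form): the teacher-student similarity
    # distribution on the image side should match the one on the text side,
    # i.e. distillation strength is consistent across modalities.
    return sym_kl(sim_dist(t_img, s_img, tau), sim_dist(t_txt, s_txt, tau))

def xrd_loss(t_img, s_txt, t_txt, s_img, tau=0.07):
    # XRD-like term (assumed form): cross-modal teacher-student similarity
    # distributions should agree in both directions
    # (teacher image -> student text, teacher text -> student image).
    return sym_kl(sim_dist(t_img, s_txt, tau), sim_dist(t_txt, s_img, tau))
```

When all four embedding sets coincide, both terms vanish exactly; any asymmetry between the two compared directions makes them strictly positive, which is the behavior a relational consistency penalty needs.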
If this is right
- Distilled students retain more of the teacher's structural relationships and therefore achieve higher zero-shot performance.
- The same relational constraints can be added on top of existing distillation losses without needing new labels or modalities.
- Embedding geometry alignment improves without increasing model size or inference cost.
- The method scales to different student architectures while keeping the same teacher.
Where Pith is reading between the lines
- The symmetry and consistency ideas could transfer to distilling other contrastive multimodal models such as those trained on audio or video.
- Similar relational terms might help single-modality distillation by preserving neighborhood structures in the embedding space.
- If the relational terms prove robust, they could be combined with quantization or pruning for even smaller deployable CLIP variants.
Load-bearing premise
Enforcing distribution-level consistency of distillation strength across modalities and bidirectional symmetry on cross-modal similarities is both necessary and sufficient to preserve the teacher's embedding geometry without introducing compensating distortions or requiring extra supervision.
What would settle it
A side-by-side evaluation in which the CLIP-RD student shows no measurable gain, or an actual loss, in zero-shot image-text retrieval or classification accuracy relative to a standard distillation baseline on common benchmarks.
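Such a side-by-side comparison would typically be scored with retrieval recall. A minimal sketch of Recall@K over a similarity matrix, assuming the i-th image matches the i-th caption (the metric and pairing convention are standard, but the helper below is illustrative, not from the paper):

```python
def recall_at_k(sim, k=1):
    # sim[i][j]: similarity of image i to caption j; ground-truth pairs are (i, i).
    # Recall@K: fraction of images whose matching caption ranks in the top K.
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)
```

Running the same function on the student's and the baseline's similarity matrices over a shared test set gives the head-to-head numbers the report asks for.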
Original abstract
CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CLIP-RD, a relational knowledge distillation framework for compressing CLIP models. It introduces Vertical Relational Distillation (VRD) to enforce distribution-level consistency of teacher-student distillation strength across modalities and Cross Relational Distillation (XRD) to impose bidirectional symmetry on cross-modal teacher-student similarity distributions. The central claim is that jointly modeling these multi-directional relational structures produces faithful alignment of student embedding geometry with the teacher, yielding a 0.8 percentage point improvement over prior distillation methods.
Significance. If the reported gains are robust and the VRD/XRD terms are shown to be the operative mechanism, the work would offer a targeted improvement to CLIP distillation by explicitly preserving relational structure rather than relying solely on standard contrastive losses. This could be useful for deploying smaller CLIP variants while retaining zero-shot capabilities. The absence of ablations, dataset details, and baseline comparisons in the manuscript prevents a full assessment of whether the contribution is incremental or substantive.
Major comments (2)
- [Abstract] The claim that CLIP-RD 'outperforms existing methods by 0.8%p' is presented without any description of the datasets, evaluation metrics, baseline methods, training hyperparameters, or experimental protocol. This information is required to determine whether the central performance claim is supported.
- [Method, VRD and XRD definitions] The paper asserts that VRD and XRD together promote faithful geometry alignment, yet no ablation results are supplied that isolate the contribution of these two terms (e.g., performance when VRD or XRD is removed while holding all other loss components and training details fixed). Without such controls, it is impossible to verify that the multi-directional relational modeling is the load-bearing factor behind the reported gain rather than other unmentioned changes in training.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that additional details and controls are needed to strengthen the presentation of our results and will revise the manuscript accordingly.
Point-by-point responses
- Referee [Abstract]: The claim that CLIP-RD 'outperforms existing methods by 0.8%p' is presented without any description of the datasets, evaluation metrics, baseline methods, training hyperparameters, or experimental protocol. This information is required to determine whether the central performance claim is supported.
  Authors: We agree that the abstract should supply sufficient context for the reported gain. In the revised version we will expand the abstract with a concise statement of the primary datasets (ImageNet and COCO for zero-shot evaluation), the metric (top-1 accuracy), the main baselines (standard KD, Distill-CLIP, and related relational methods), and a reference to the training protocol described in Section 4. This change will make the 0.8 percentage point improvement claim directly interpretable. Revision: yes
- Referee [Method, VRD and XRD definitions]: The paper asserts that VRD and XRD together promote faithful geometry alignment, yet no ablation results are supplied that isolate the contribution of these two terms (e.g., performance when VRD or XRD is removed while holding all other loss components and training details fixed). Without such controls, it is impossible to verify that the multi-directional relational modeling is the load-bearing factor behind the reported gain rather than other unmentioned changes in training.
  Authors: We acknowledge the necessity of isolating the contributions of VRD and XRD. The revised manuscript will include a new ablation table that reports performance when VRD is removed, when XRD is removed, and when both are removed, while keeping the base contrastive loss, optimizer, and all other hyperparameters identical. These controlled experiments will confirm that the multi-directional relational terms are responsible for the observed improvement in embedding geometry alignment. Revision: yes
Circularity Check
No significant circularity; loss terms introduced independently
Full rationale
The paper defines VRD and XRD as explicit additive terms in the distillation objective (distribution-level consistency across modalities and bidirectional symmetry on cross-modal similarities). These are not defined in terms of the final performance metric, nor do they reduce to a fitted parameter renamed as prediction. No self-citation chain is invoked to justify uniqueness or to smuggle an ansatz; the central claim rests on the empirical outcome of the joint loss rather than on a definitional equivalence. The derivation chain is therefore self-contained against external benchmarks.
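Under this additive reading, the overall objective would take the schematic form below. The weights $\lambda$ are hypothetical placeholders, not values taken from the paper:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{base}}
  + \lambda_{\mathrm{VRD}}\,\mathcal{L}_{\mathrm{VRD}}
  + \lambda_{\mathrm{XRD}}\,\mathcal{L}_{\mathrm{XRD}}
```

The circularity check above holds precisely because each term is defined on embeddings, not on the benchmark metric being reported.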
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation: washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level... XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions"
- IndisputableMonolith/Foundation/RealityFromDistinction: reality_from_one_distinction (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Proceedings of the European conference on computer vision (ECCV). pp. 384–400 (2018)
- [2] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision (2014)
- [3] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3558–3568 (2021)
- [4] Chen, K., Wu, X.: Vtqa: Visual text question answering via entity alignment and cross-media reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27218–27227 (2024)
- [5] Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
- [6] Csizmadia, D., Codreanu, A., Sim, V., Prabhu, V., Lu, M., Zhu, K., O'Brien, S., Sharma, V.: Distill clip (dclip): Enhancing image-text retrieval via cross-modal transformer distillation. arXiv preprint arXiv:2505.21549 (2025)
- [7] Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., Fung, P.: Enabling multimodal generation on clip via vision-language knowledge distillation. In: Findings of the Association for Computational Linguistics: ACL 2022. pp. 2383–2395 (2022)
- [8] Dancette, C., Whitehead, S., Maheshwary, R., Vedantam, R., Scherer, S., Chen, X., Cord, M., Rohrbach, M.: Improving selective visual question answering by learning from your peers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24049–24059 (2023)
- [9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
- [10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [11] Eom, S., Ho, N., Oh, J., Yun, S.Y.: Cross-modal retrieval meets inference: Improving zero-shot classification with cross-modal retrieval. arXiv preprint arXiv:2308.15273 (2023)
- [12] Eslami, S., de Melo, G.: Mitigate the gap: Investigating approaches for improving cross-modal alignment in clip. arXiv preprint arXiv:2406.17639 (2024)
- [13] Fang, Z., Wang, J., Hu, X., Wang, L., Yang, Y., Liu, Z.: Compressing visual-linguistic model via knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1428–1438 (2021)
- [14] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding 106(1), 59–70 (2007)
- [15] Ge, Y., Ren, J., Gallagher, A., Wang, Y., Yang, M.H., Adam, H., Itti, L., Lakshminarayanan, B., Zhao, J.: Improving zero-shot generalization and robustness of multi-modal models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11093–11101 (2023)
- [16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [17] Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(7), 2217–2226 (2019)
- [18] Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al.: The many faces of robustness: A critical analysis of out-of-distribution generalization. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8340–8349 (2021)
- [19] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
- [20] Huang, C., Seto, S., Abnar, S., Grangier, D., Jaitly, N., Susskind, J.: Aggregate-and-adapt natural language prompts for downstream generalization of clip. Advances in Neural Information Processing Systems 37, 81077–81104 (2024)
- [21] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: Tinybert: Distilling bert for natural language understanding. In: Findings of the association for computational linguistics: EMNLP. pp. 4163–4174 (2020)
- [23] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
- [24] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
- [25] Kim, J., Chang, S., Kwak, N.: Pqk: model compression via pruning, quantization, and knowledge distillation. arXiv preprint arXiv:2106.14681 (2021)
- [26] Kim, T., Oh, J., Kim, N., Cho, S., Yun, S.Y.: Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation. arXiv preprint arXiv:2105.08919 (2021)
- [27] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Tech. rep., Department of Computer Science, University of Toronto (2009)
- [28] Kuo, W., Cui, Y., Gu, X., Piergiovanni, A., Angelova, A.: F-vlm: Open-vocabulary object detection upon frozen vision and language models. arXiv preprint arXiv:2209.15639 (2022)
- [29] Li, Z., Li, X., Fu, X., Zhang, X., Wang, W., Chen, S., Yang, J.: Promptkd: Unsupervised prompt distillation for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26617–26626 (2024)
- [30] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
- [31] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [32] Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 5191–5198 (2020)
- [33] Peng, Z., Dong, L., Bao, H., Ye, Q., Wei, F.: Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366 (2022)
- [34] Pham, H., Dai, Z., Ghiasi, G., Kawaguchi, K., Liu, H., Yu, A.W., Yu, J., Chen, Y.T., Luong, M.T., Wu, Y., et al.: Combined scaling for zero-shot transfer learning. Neurocomputing 555, 126658 (2023)
- [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [36] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: International conference on machine learning. pp. 5389–5400. PMLR (2019)
- [37] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
- [38] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.W., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S.R., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: LAION-5b: An open large-scale dataset for training next generation image-text models. In: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022)
- [39] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
- [40] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)
- [41] Tan, C., Xu, X., Shen, F.: A survey of zero shot detection: Methods and applications. Cognitive Robotics 1, 159–167 (2021)
- [42] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)
- [43] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357. PMLR (2021)
- [44] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
- [45] Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al.: Efficient large language models: A survey. arXiv preprint arXiv:2312.03863 (2023)
- [46] Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. Advances in neural information processing systems 32 (2019)
- [47] Wang, Z., Codella, N., Chen, Y.C., Zhou, L., Yang, J., Dai, X., Xiao, B., You, H., Chang, S.F., Yuan, L.: Clip-td: Clip targeted distillation for vision-language tasks. arXiv preprint arXiv:2201.05729 (2022)
- [48] Wang, Z., Liang, J., He, R., Xu, N., Wang, Z., Tan, T.: Improving zero-shot generalization for clip with synthesized prompts. arXiv preprint arXiv:2307.07397 (2023)
- [49] Wei, L., Xie, L., Zhou, W., Li, H., Tian, Q.: Mvp: Multimodality-guided visual pre-training. In: European conference on computer vision. pp. 337–353. Springer (2022)
- [50] Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., et al.: Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21970–21980 (2023)
- [51] Wu, K., Zhang, J., Peng, H., Liu, M., Xiao, B., Fu, J., Yuan, L.: Tinyvit: Fast pretraining distillation for small vision transformers. In: European conference on computer vision. pp. 68–85. Springer (2022)
- [52] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3485–3492. IEEE (2010)
- [53] Yang, C., An, Z., Huang, L., Bi, J., Yu, X., Yang, H., Diao, B., Xu, Y.: Clip-kd: An empirical study of clip model distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15952–15962 (2024)
- [54] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics 2, 67–78 (2014)
- [55] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)
- [56] Zhang, S.H., Tang, W.C., Wu, C., Hu, P., Li, N., Zhang, L.J., Zhang, Q., Zhang, S.Q.: Ternaryclip: Efficiently compressing vision-language models with ternary weights and distilled knowledge. arXiv preprint arXiv:2510.21879 (2025)
- [57] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International journal of computer vision 130(9), 2337–2348 (2022)