pith. machine review for the scientific record.

arxiv: 2604.03841 · v1 · submitted 2026-04-04 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Training a Student Expert via Semi-Supervised Foundation Model Distillation

Pardis Taghavi, Renjie Li, Reza Langari, Tian Liu, Zhengzhong Tu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords semi-supervised knowledge distillation · vision foundation models · instance segmentation · contrastive loss · model compression · self-training · domain adaptation · pseudo-labeling

The pith

Semi-supervised distillation compresses vision foundation models into compact experts that surpass their teachers on instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a three-stage semi-supervised knowledge distillation framework that adapts vision foundation models via self-training with contrastive calibration, transfers knowledge through a unified multi-objective loss, and refines the student to correct residual bias. The central mechanism is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to select informative negatives and enforce inter-instance margins, preserving this signal from adaptation through distillation. On Cityscapes and ADE20K the resulting student, roughly 11 times smaller, exceeds both zero-shot and adapted teachers while beating prior semi-supervised distillation baselines. A sympathetic reader would care because the method reduces the annotation burden for per-pixel tasks and produces deployable models without sacrificing accuracy.
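As a rough orientation, the three stages could be wired together along the lines below. This is a hedged sketch, not the authors' code: the object interface (predict, embed, supervised_loss, step), the loss weights, and every function name are illustrative assumptions.

```python
# Minimal sketch of the three-stage pipeline, assuming `teacher` and `student`
# expose predict / embed / supervised_loss / step. Names are illustrative.
import torch.nn.functional as F

def adapt_teacher(teacher, labeled, unlabeled, pxl_loss, lam_pxl=0.1):
    """Stage 1: self-train the VFM teacher with pixel-level contrastive calibration."""
    for images, labels in labeled:
        teacher.supervised_loss(images, labels).backward()
        teacher.step()
    for images in unlabeled:
        pseudo = teacher.predict(images)                      # masks + class scores
        loss = (teacher.supervised_loss(images, pseudo)
                + lam_pxl * pxl_loss(teacher.embed(images), pseudo))
        loss.backward()
        teacher.step()
    return teacher

def distill_student(teacher, student, unlabeled, pxl_loss, lam_kd=1.0, lam_pxl=0.1):
    """Stage 2: transfer knowledge through a unified multi-objective loss."""
    for images in unlabeled:
        pseudo = teacher.predict(images)
        kd = F.mse_loss(student.embed(images), teacher.embed(images).detach())
        loss = (student.supervised_loss(images, pseudo)
                + lam_kd * kd
                + lam_pxl * pxl_loss(student.embed(images), pseudo))
        loss.backward()
        student.step()
    return student

def refine_student(student, labeled):
    """Stage 3: fine-tune on labeled data to correct residual pseudo-label bias."""
    for images, labels in labeled:
        student.supervised_loss(images, labels).backward()
        student.step()
    return student
```

The detail the review keeps returning to is that the same pxl_loss term appears in both adapt_teacher and distill_student, which is what "preserving this signal from adaptation through distillation" means operationally.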

Core claim

The authors establish that maintaining an instance-aware pixel-wise contrastive loss across self-training adaptation and distillation stages aligns teacher and student embeddings, enabling a compact student to leverage abundant unlabeled images and achieve higher instance segmentation accuracy than its larger vision foundation model teachers on Cityscapes and ADE20K.

What carries the argument

An instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins, applied consistently in both adaptation and distillation.
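One plausible PyTorch rendering of such a loss is sketched below. How mask and class scores are fused, how hard negatives are selected, and how the margin enters are assumptions about the general shape of the mechanism, not the paper's equations.

```python
import torch
import torch.nn.functional as F

def instance_pixel_contrastive_loss(emb, inst_id, mask_score, cls_score,
                                    tau=0.1, margin=0.2, num_neg=64):
    """Sketch of an instance-aware pixel-wise contrastive loss.

    emb:        [N, D] pixel embeddings sampled from one image
    inst_id:    [N]    instance index owning each pixel
    mask_score: [N]    mask confidence of that instance
    cls_score:  [N]    class confidence of that instance
    """
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / tau                          # [N, N] scaled cosine similarity
    same = inst_id.unsqueeze(0) == inst_id.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)

    # Fused confidence: pixels backed by confident masks AND classes
    # make more informative negatives.
    weight = mask_score * cls_score                    # [N]
    pos_mask, neg_mask = same & ~eye, ~same

    losses = []
    for i in range(len(emb)):
        pos = sim[i][pos_mask[i]]
        neg_sim, neg_w = sim[i][neg_mask[i]], weight[neg_mask[i]]
        k = min(num_neg, neg_sim.numel())
        if pos.numel() == 0 or k == 0:
            continue
        # informative negatives: high similarity weighted by fused confidence
        idx = torch.topk(neg_sim * neg_w, k).indices
        neg = neg_sim[idx] + margin / tau              # enforce an inter-instance margin
        logits = torch.cat([pos.unsqueeze(1),
                            neg.unsqueeze(0).expand(pos.numel(), -1)], dim=1)
        target = torch.zeros(pos.numel(), dtype=torch.long, device=emb.device)
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean() if losses else emb.sum() * 0.0
```

In this reading, "informative negatives" means cross-instance pixels whose fused mask-and-class confidence is high, and the margin added to the negative logits is what enforces the inter-instance separation the claim refers to.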

If this is right

  • The approximately 11 times smaller student improves over zero-shot VFM teachers by +11.9 AP on Cityscapes and +8.6 AP on ADE20K.
  • It surpasses the adapted teachers by +3.4 AP on Cityscapes and +1.5 AP on ADE20K.
  • It outperforms state-of-the-art semi-supervised knowledge distillation methods on the evaluated benchmarks.
  • The framework enables effective exploitation of unlabeled images to offset the high cost of per-pixel annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive calibration and distillation stages could be tested on other dense prediction tasks such as semantic segmentation or monocular depth estimation.
  • Consistent embedding alignment might reduce sensitivity to domain shifts when the student is deployed on data distributions differing from the adaptation set.
  • The refinement stage could be examined for its effect on even smaller model scales or on foundation models with different backbone architectures.

Load-bearing premise

The pseudo-labels generated during self-training with contrastive calibration remain sufficiently accurate and unbiased across domains so that distillation and refinement can improve upon them rather than amplify errors.
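That premise is easiest to see at the gate where pseudo-labels are admitted into training. The sketch below is a generic confidence gate of the kind self-training pipelines typically use, with a fused mask-and-class score standing in for the paper's calibration; the threshold and the dict layout are assumptions, not values from the paper.

```python
# Generic confidence gate for pseudo-instances, fusing mask and class scores.
# Threshold and field names are illustrative assumptions.
from typing import Any, Dict, List

def keep_pseudo_instances(instances: List[Dict[str, Any]],
                          score_thresh: float = 0.5) -> List[Dict[str, Any]]:
    """Keep only pseudo-instances whose fused confidence clears the threshold.

    Each instance dict is assumed to carry 'mask_score' and 'cls_score' in [0, 1].
    """
    kept = []
    for inst in instances:
        fused = inst["mask_score"] * inst["cls_score"]
        if fused >= score_thresh:
            kept.append({**inst, "fused_score": fused})
    return kept
```

If this gate admits too much noise, or filters so aggressively that the class distribution skews, the distillation and refinement stages inherit exactly the bias the premise rules out.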

What would settle it

Running the full pipeline on a new dataset where contrastive calibration produces sharply inaccurate pseudo-labels and checking whether the final student accuracy falls below the teacher's zero-shot performance.
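The test reduces to three measured numbers per dataset plus a sanity condition that the stress was real; a toy decision rule (placeholder names, no results implied) makes the interpretation explicit.

```python
# Toy decision rule for the stress test above; all inputs are measured AP
# values to be filled in, nothing here is a result from the paper.
def stress_test_verdict(student_ap: float, teacher_zeroshot_ap: float,
                        pseudo_label_ap: float, noise_ceiling: float) -> str:
    """Interpret the proposed falsification experiment."""
    if pseudo_label_ap >= noise_ceiling:
        return "inconclusive: pseudo-labels were not actually degraded"
    if student_ap < teacher_zeroshot_ap:
        return "premise falsified: the pipeline amplified pseudo-label errors"
    return "premise supported: the student still beat the zero-shot teacher"
```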

Figures

Figures reproduced from arXiv: 2604.03841 by Pardis Taghavi, Renjie Li, Reza Langari, Tian Liu, Zhengzhong Tu.

Figure 1. Framework overview. Top: three-stage pipeline: (1) adapt a pre-trained VFM teacher to the target domain via self-training with pixel-level contrastive calibration; (2) distill knowledge into a compact student using instance-aware contrastive sampling; (3) fine-tune the student on labeled data to correct residual pseudo-label bias. Bottom: detailed view of stage (2): fused mask and class score maps produce … view at source ↗

Figure 2. Efficiency comparison of knowledge distillation approaches (log scale). view at source ↗

Figure 3. Left: empirical margin (NegMean − PosMean) measured every 10k iterations for different values of λpxl. Center: false negative rate (FNR) for λpxl = 0.1, with the dashed line marking p = 0.5. Right: contrastive loss for λpxl = 0.1. view at source ↗

Figure 4. Qualitative results on Cityscapes. Guided distillation [3] (top) versus our method (bottom). view at source ↗

Figure 5. Qualitative bias reduction in stage-wise distillation. view at source ↗

Figure 6. Qualitative results on ADE20K. view at source ↗

Figure 7. Performance–complexity radar chart (normalized). view at source ↗

Figure 8. Attention maps of the teacher model before and after adaptation. Left: zero-shot VFM teacher before adaptation. Right: teacher after adaptation with self-training and contrastive supervision. The adapted teacher exhibits more localized attention on target objects (person, bus, car, truck, rider) and reduced background activation, indicating improved spatial discrimination that leads to higher-quality pse… view at source ↗

Figure 9. More qualitative results. view at source ↗
read the original abstract

Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a three-stage semi-supervised knowledge distillation framework to compress vision foundation models into compact instance segmentation experts. Stage 1 performs domain adaptation of the teacher(s) via self-training with an instance-aware pixel-wise contrastive loss that fuses mask and class scores; stage 2 transfers knowledge through a unified multi-objective loss; stage 3 refines the student to mitigate residual pseudo-label bias. On Cityscapes and ADE20K the ~11× smaller student is reported to improve over zero-shot teachers by +11.9 and +8.6 AP, over adapted teachers by +3.4 and +1.5 AP, and to surpass prior SSKD methods.

Significance. If the central assumption holds, the work would demonstrate a practical route to adapting and distilling large VFMs for dense prediction under limited annotation budgets, with the cross-stage contrastive signal as a potentially reusable mechanism for controlling label noise.

major comments (2)
  1. [§3.1] Domain Adaptation: the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.
  2. [§4] Experiments: the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.
minor comments (2)
  1. [Abstract] The phrase '≈11× smaller student' should be accompanied by explicit parameter counts or FLOPs for both teacher and student to permit immediate assessment of the compression ratio.
  2. [§2] Related Work: the discussion of prior SSKD methods could more explicitly contrast the proposed cross-stage contrastive calibration against existing pixel-wise or mask-aware contrastive losses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [§3.1] Domain Adaptation: the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.

    Authors: We agree that direct quantitative validation of pseudo-label quality would make the contribution of the contrastive loss more transparent. In the revised manuscript we will add a new table reporting pseudo-label mAP (and per-class precision/recall) on a held-out portion of the labeled training data for the adapted teacher, comparing the full instance-aware contrastive loss against an ablation that removes the contrastive term. This will directly quantify the reduction in label noise and support the claim that the loss keeps pseudo-labels within the recovery range of the later stages. revision: yes

  2. Referee: [§4] Experiments: the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.

    Authors: We acknowledge that the current experimental section lacks statistical robustness indicators. In the revision we will rerun the main Cityscapes and ADE20K experiments across five random seeds and report mean AP together with standard deviation. We will also add a dedicated ablation subsection that varies the loss-weighting coefficients, contrastive temperature, and margin over reasonable ranges, showing that the reported gains remain stable and that the chosen operating point is not an outlier. revision: yes
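Both promised revisions boil down to short measurement scripts. The pseudo-label quality table, for instance, could be produced with COCO-style tooling; the sketch below assumes pycocotools and COCO-format annotation/result files, with placeholder file names.

```python
# Sketch of the pseudo-label quality check: score the adapted teacher's
# pseudo-labels against held-out ground truth with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def pseudo_label_map(gt_json: str, pseudo_json: str) -> float:
    """Return mask mAP of pseudo-labels measured against held-out ground truth."""
    coco_gt = COCO(gt_json)                    # held-out labeled split
    coco_dt = coco_gt.loadRes(pseudo_json)     # teacher pseudo-labels, COCO result format
    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]                         # AP @ IoU=0.50:0.95

# e.g. compare calibration on versus off (placeholder paths):
# pseudo_label_map("heldout_gt.json", "pseudo_with_contrastive.json")
# pseudo_label_map("heldout_gt.json", "pseudo_without_contrastive.json")
```

The robustness reporting is likewise an aggregation over seeds and sweep points; train_and_eval below is a stand-in for a full training plus evaluation run, and the seed list and λpxl grid are illustrative choices, not the authors' protocol.

```python
# Minimal aggregation for the promised multi-seed and hyper-parameter study.
import statistics

def report(train_and_eval, seeds=(0, 1, 2, 3, 4), lam_pxl_grid=(0.05, 0.1, 0.2)):
    for lam_pxl in lam_pxl_grid:
        aps = [train_and_eval(seed=s, lam_pxl=lam_pxl) for s in seeds]
        mean, std = statistics.mean(aps), statistics.stdev(aps)
        print(f"lambda_pxl={lam_pxl}: AP = {mean:.2f} ± {std:.2f} over {len(seeds)} seeds")
```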

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains are measured outcomes, not reductions to fitted inputs or self-citations

full rationale

The paper outlines a three-stage SSKD framework (domain adaptation via self-training with contrastive calibration, unified multi-objective distillation, and refinement) centered on an instance-aware pixel-wise contrastive loss. However, the headline claims consist of measured AP improvements on held-out Cityscapes and ADE20K test sets (+11.9, +8.6, +3.4, +1.5 AP), presented as experimental results rather than quantities derived by construction from parameters fitted inside the same chain. No equations, self-citations, or uniqueness theorems are invoked that would reduce the reported gains to tautological redefinitions of the inputs. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that self-training pseudo-labels can be made reliable enough via contrastive calibration and that the multi-objective loss can transfer useful knowledge without explicit supervision on most images. No new physical entities are postulated.

free parameters (2)
  • loss weighting coefficients
    The unified multi-objective loss requires coefficients that balance the contrastive term against segmentation and classification losses; these are chosen during training (a schematic form of the combined objective is sketched just after this ledger).
  • contrastive temperature and margin
    Hyperparameters inside the instance-aware pixel-wise contrastive loss that control negative sampling and inter-instance separation.
axioms (1)
  • domain assumption: Pseudo-labels generated by the adapted teacher are sufficiently accurate on unlabeled data to serve as supervision for the student.
    Invoked in stage 1 self-training and stage 2 distillation; if false the entire pipeline amplifies errors.
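Read together, the ledger's free parameters enter an overall objective whose schematic form, in our notation rather than the paper's exact equations, is:

```latex
% Schematic form only; the weights, temperature, and margin below are the
% free parameters listed in the ledger above.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{\text{kd}}\,\mathcal{L}_{\text{kd}}
  + \lambda_{\text{pxl}}\,\mathcal{L}_{\text{pxl}},
\qquad
\mathcal{L}_{\text{pxl}}
  = -\log \frac{\exp(s^{+}/\tau)}
               {\exp(s^{+}/\tau) + \sum_{j \in \mathcal{N}} \exp\!\big((s_{j}^{-} + m)/\tau\big)}
```

Here s⁺ is the similarity of a pixel embedding to a same-instance positive, s_j⁻ are similarities to the selected cross-instance negatives in N, τ is the contrastive temperature, m the margin, and λkd, λpxl the weighting coefficients.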

pith-pipeline@v0.9.0 · 5526 in / 1496 out tokens · 38117 ms · 2026-05-13T16:48:23.531331+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 8 internal anchors

  1. [1] Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank
    Inigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, and Ana C Murillo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8219–8228, 2021.

  2. [2] Foundation models defining a new era in vision: a survey and outlook
    Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  3. [3] Guided distillation for semi-supervised instance segmentation
    Tariq Berrada, Camille Couprie, Karteek Alahari, and Jakob Verbeek. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 475–483.

  4. [4] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. arXiv preprint arXiv:2410.02073, 2024.

  5. [5] On the Opportunities and Risks of Foundation Models
    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. arXiv preprint arXiv:2108.07258, 2021.

  6. [6] Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning
    Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6912–6920, 2021.

  7. [7] A simple framework for contrastive learning of visual representations
    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  8. [8] Big self-supervised models are strong semi-supervised learners
    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Advances in Neural Information Processing Systems, 33:22243–22255, 2020.

  9. [9] An empirical study of training self-supervised vision transformers
    Xinlei Chen, Saining Xie, and Kaiming He. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.

  10. [10] Semi-supervised semantic segmentation with cross pseudo supervision
    Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622.

  11. [11] Depth-guided semi-supervised instance segmentation
    Xin Chen, Jie Hu, Xiawu Zheng, Jianghang Lin, Liujuan Cao, and Rongrong Ji. arXiv preprint arXiv:2406.17413, 2024.

  12. [12] Masked-attention mask transformer for universal image segmentation
    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.

  13. [13] The Cityscapes dataset for semantic urban scene understanding
    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  14. [14] Augmentation-free dense contrastive knowledge distillation for efficient semantic segmentation
    Jiawei Fan, Chao Li, Xiaolong Liu, Meina Song, and Anbang Yao. Advances in Neural Information Processing Systems, 36:51359–51370, 2023.

  15. [15] SEED: Self-supervised distillation for visual representation
    Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. arXiv preprint arXiv:2101.04731, 2021.

  16. [16] Foundation models in robotics: Applications, challenges, and the future
    Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. The International Journal of Robotics Research, page 02783649241281508, 2023.

  17. [17] Erasing the bias: Fine-tuning foundation models for semi-supervised learning
    Kai Gan and Tong Wei. arXiv preprint arXiv:2405.11756, 2024.

  18. [18] Mask R-CNN
    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

  19. [19] Distilling the Knowledge in a Neural Network
    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. arXiv preprint arXiv:1503.02531, 2015.

  20. [20] Pseudo-label alignment for semi-supervised instance segmentation
    Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, and Rongrong Ji. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16337–16347, 2023.

  21. [21] Pixel-wise contrastive distillation
    Junqiang Huang and Zichao Guo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16359–16369, 2023.

  22. [22] VL2Lite: Task-specific knowledge distillation from large vision-language models to lightweight networks
    Jinseong Jang, Chunfei Ma, and Byeongwon Lee. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 30073–30083, 2025.

  23. [23] MTKD: Multi-teacher knowledge distillation for image super-resolution
    Yuxuan Jiang, Chen Feng, Fan Zhang, and David Bull. In European Conference on Computer Vision, pages 364–382. Springer, 2024.

  24. [24] Segment Anything
    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. arXiv:2304.02643, 2023.

  25. [25] CustomKD: Customizing large vision foundation for edge model improvement via knowledge distillation
    Jungsoo Lee, Debasmit Das, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, and Fatih Porikli. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25176–25186.

  26. [26] Task-specific knowledge distillation from the vision foundation model for enhanced medical image segmentation
    Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, and Qing Chang. arXiv preprint arXiv:2503.06976, 2025.

  27. [27] Pseudo-label quality decoupling and correction for semi-supervised instance segmentation
    Jianghang Lin, Yilin Lu, Yunhang Shen, Chaoyang Zhu, Shengchuan Zhang, Liujuan Cao, and Rongrong Ji. arXiv preprint arXiv:2505.11075, 2025.

  28. [28] Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  29. [29] Unbiased teacher for semi-supervised object detection
    Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. arXiv preprint arXiv:2102.09480, 2021.

  30. [30] General lightweight framework for vision foundation model supporting multi-task and multi-center medical image analysis
    S. Lu, Y. Chen, Y. Chen, et al. Nature Communications, 16:2097, 2025.

  31. [31] DINOv2: Learning Robust Visual Features without Supervision
    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. arXiv preprint arXiv:2304.07193, 2023.

  32. [32] Self-supervised knowledge distillation for few-shot learning
    Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. arXiv preprint arXiv:2006.09785, 2020.

  33. [33] Vision transformers for dense prediction
    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  34. [34] SAM 2: Segment Anything in Images and Videos
    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. arXiv preprint arXiv:2408.00714, 2024.

  35. [35] Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. arXiv preprint arXiv:2401.14159, 2024.

  36. [36] Channel-wise knowledge distillation for dense prediction
    Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320.

  37. [37] Foundation versus domain-specific models: Performance comparison, fusion, and explainability in face recognition
    Redwan Sony, Parisa Farmanifard, Arun Ross, and Anil K Jain. arXiv preprint arXiv:2507.03541, 2025.

  38. [38] DIME-FM: Distilling multimodal and efficient foundation models
    Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, and Xide Xia. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15521–15533, 2023.

  39. [39] Navidrivevlm: Decoupling high-level reasoning and motion planning for autonomous driving
    Ximeng Tao, Pardis Taghavi, Dimitar Filev, Reza Langari, and Gaurav Pandey. arXiv preprint arXiv:2603.07901, 2026.

  40. [40] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
    Antti Tarvainen and Harri Valpola. Advances in Neural Information Processing Systems, 30, 2017.

  41. [41] Contrastive representation distillation
    Yonglong Tian, Dilip Krishnan, and Phillip Isola. arXiv preprint arXiv:1910.10699, 2019.

  42. [42] Knowledge transfer from vision foundation models for efficient training of small task-specific models
    Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, and Oncel Tuzel. ICML, 2024.

  43. [43] SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding
    Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–3647.

  44. [44] Dense contrastive learning for self-supervised visual pre-training
    Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033.

  45. [45] ContrastMask: Contrastive learning to segment every thing
    Xuehui Wang, Kai Zhao, Ruixin Zhang, Shouhong Ding, Yan Wang, and Wei Shen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11604–11613, 2022.

  46. [46] DetCo: Unsupervised contrastive learning for object detection
    Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8392–8401, 2021.

  47. [47] Self-training with noisy student improves ImageNet classification
    Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698.

  48. [48] Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning
    Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.

  49. [49] A survey on knowledge distillation of large language models
    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. arXiv preprint arXiv:2402.13116, 2024.

  50. [50] Forging vision foundation models for autonomous driving: Challenges, methodologies, and opportunities
    Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao, et al. arXiv preprint arXiv:2401.08045, 2024.

  51. [51] Cross-image relational knowledge distillation for semantic segmentation
    Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.

  52. [52] CLIP-KD: An empirical study of CLIP model distillation
    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.

  53. [53] Multi-teacher knowledge distillation with reinforcement learning for visual recognition
    Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, and Yongjun Xu. arXiv preprint arXiv:2502.18510, 2025.

  54. [54] Revisiting weak-to-strong consistency in semi-supervised semantic segmentation
    Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7236–7246, 2023.

  55. [55] Depth Anything V2
    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. arXiv preprint arXiv:2406.09414, 2024.

  56. [56] G-DetKD: Towards general distillation framework for object detectors via contrastive and semantic-guided feature imitation
    Lewei Yao, Renjie Pi, Hang Xu, Wei Zhang, Zhenguo Li, and Tong Zhang. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3591–3600.

  57. [57] S^4M: Boosting semi-supervised instance segmentation with SAM
    Heeji Yoon, Heeseong Shin, Eunbeen Hong, Hyunwook Choi, Hansang Cho, Daun Jeong, and Seungryong Kim. arXiv preprint arXiv:2504.05301, 2025.

  58. [58] Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos
    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. arXiv preprint arXiv:2501.04001, 2025.

  59. [59] Accessing vision foundation models via ImageNet-1K
    Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, and Yun Fu. In The Thirteenth International Conference on Learning Representations, 2025.

  60. [60] An open and comprehensive pipeline for unified object grounding and detection
    Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. arXiv preprint arXiv:2401.02361, 2024.

  61. [61] Pixel contrastive-consistent semi-supervised semantic segmentation
    Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7273–7282, 2021.

  62. [62] Semantic understanding of scenes through the ADE20K dataset
    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. International Journal of Computer Vision, 127:302–321, 2019.

  63. [63] Complementary relation contrastive distillation
    Jinguo Zhu, Shixiang Tang, Dapeng Chen, Shijie Yu, Yakun Liu, Mingzhe Rong, Aijun Yang, and Xiaohua Wang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9260–9269, 2021.

  64. [64] Argus: A compact and versatile foundation model for vision
    Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.