pith. machine review for the scientific record.

arxiv: 2604.03841 · v1 · submitted 2026-04-04 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Training a Student Expert via Semi-Supervised Foundation Model Distillation

Pardis Taghavi, Renjie Li, Reza Langari, Tian Liu, Zhengzhong Tu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords semi-supervised knowledge distillation · vision foundation models · instance segmentation · contrastive loss · model compression · self-training · domain adaptation · pseudo-labeling

The pith

Semi-supervised distillation compresses vision foundation models into compact experts that surpass their teachers on instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a three-stage semi-supervised knowledge distillation framework that adapts vision foundation models via self-training with contrastive calibration, transfers knowledge through a unified multi-objective loss, and refines the student to correct residual bias. The central mechanism is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to select informative negatives and enforce inter-instance margins, preserving this signal from adaptation through distillation. On Cityscapes and ADE20K the resulting student, roughly 11 times smaller, exceeds both zero-shot and adapted teachers while beating prior semi-supervised distillation baselines. A sympathetic reader would care because the method reduces the annotation burden for per-pixel tasks and produces deployable models without sacrificing accuracy.
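As a rough orientation, the three stages could be wired together along the lines below. This is a hedged sketch, not the authors' code: the object interface (predict, embed, supervised_loss, step), the loss weights, and every function name are illustrative assumptions.

```python
# Minimal sketch of the three-stage pipeline, assuming `teacher` and `student`
# expose predict / embed / supervised_loss / step. Names are illustrative.
import torch.nn.functional as F

def adapt_teacher(teacher, labeled, unlabeled, pxl_loss, lam_pxl=0.1):
    """Stage 1: self-train the VFM teacher with pixel-level contrastive calibration."""
    for images, labels in labeled:
        teacher.supervised_loss(images, labels).backward()
        teacher.step()
    for images in unlabeled:
        pseudo = teacher.predict(images)                      # masks + class scores
        loss = (teacher.supervised_loss(images, pseudo)
                + lam_pxl * pxl_loss(teacher.embed(images), pseudo))
        loss.backward()
        teacher.step()
    return teacher

def distill_student(teacher, student, unlabeled, pxl_loss, lam_kd=1.0, lam_pxl=0.1):
    """Stage 2: transfer knowledge through a unified multi-objective loss."""
    for images in unlabeled:
        pseudo = teacher.predict(images)
        kd = F.mse_loss(student.embed(images), teacher.embed(images).detach())
        loss = (student.supervised_loss(images, pseudo)
                + lam_kd * kd
                + lam_pxl * pxl_loss(student.embed(images), pseudo))
        loss.backward()
        student.step()
    return student

def refine_student(student, labeled):
    """Stage 3: fine-tune on labeled data to correct residual pseudo-label bias."""
    for images, labels in labeled:
        student.supervised_loss(images, labels).backward()
        student.step()
    return student
```

The detail the review keeps returning to is that the same pxl_loss term appears in both adapt_teacher and distill_student, which is what "preserving this signal from adaptation through distillation" means operationally.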

Core claim

The authors establish that maintaining an instance-aware pixel-wise contrastive loss across self-training adaptation and distillation stages aligns teacher and student embeddings, enabling a compact student to leverage abundant unlabeled images and achieve higher instance segmentation accuracy than its larger vision foundation model teachers on Cityscapes and ADE20K.

What carries the argument

An instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins, applied consistently in both adaptation and distillation.
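One plausible PyTorch rendering of such a loss is sketched below. How mask and class scores are fused, how hard negatives are selected, and how the margin enters are assumptions about the general shape of the mechanism, not the paper's equations.

```python
import torch
import torch.nn.functional as F

def instance_pixel_contrastive_loss(emb, inst_id, mask_score, cls_score,
                                    tau=0.1, margin=0.2, num_neg=64):
    """Sketch of an instance-aware pixel-wise contrastive loss.

    emb:        [N, D] pixel embeddings sampled from one image
    inst_id:    [N]    instance index owning each pixel
    mask_score: [N]    mask confidence of that instance
    cls_score:  [N]    class confidence of that instance
    """
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / tau                          # [N, N] scaled cosine similarity
    same = inst_id.unsqueeze(0) == inst_id.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)

    # Fused confidence: pixels backed by confident masks AND classes
    # make more informative negatives.
    weight = mask_score * cls_score                    # [N]
    pos_mask, neg_mask = same & ~eye, ~same

    losses = []
    for i in range(len(emb)):
        pos = sim[i][pos_mask[i]]
        neg_sim, neg_w = sim[i][neg_mask[i]], weight[neg_mask[i]]
        k = min(num_neg, neg_sim.numel())
        if pos.numel() == 0 or k == 0:
            continue
        # informative negatives: high similarity weighted by fused confidence
        idx = torch.topk(neg_sim * neg_w, k).indices
        neg = neg_sim[idx] + margin / tau              # enforce an inter-instance margin
        logits = torch.cat([pos.unsqueeze(1),
                            neg.unsqueeze(0).expand(pos.numel(), -1)], dim=1)
        target = torch.zeros(pos.numel(), dtype=torch.long, device=emb.device)
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean() if losses else emb.sum() * 0.0
```

In this reading, "informative negatives" means cross-instance pixels whose fused mask-and-class confidence is high, and the margin added to the negative logits is what enforces the inter-instance separation the claim refers to.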

If this is right

  • The approximately 11 times smaller student improves over zero-shot VFM teachers by +11.9 AP on Cityscapes and +8.6 AP on ADE20K.
  • It surpasses the adapted teachers by +3.4 AP on Cityscapes and +1.5 AP on ADE20K.
  • It outperforms state-of-the-art semi-supervised knowledge distillation methods on the evaluated benchmarks.
  • The framework enables effective exploitation of unlabeled images to offset the high cost of per-pixel annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive calibration and distillation stages could be tested on other dense prediction tasks such as semantic segmentation or monocular depth estimation.
  • Consistent embedding alignment might reduce sensitivity to domain shifts when the student is deployed on data distributions differing from the adaptation set.
  • The refinement stage could be examined for its effect on even smaller model scales or on foundation models with different backbone architectures.

Load-bearing premise

The pseudo-labels generated during self-training with contrastive calibration remain sufficiently accurate and unbiased across domains so that distillation and refinement can improve upon them rather than amplify errors.
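That premise is easiest to see at the gate where pseudo-labels are admitted into training. The sketch below is a generic confidence gate of the kind self-training pipelines typically use, with a fused mask-and-class score standing in for the paper's calibration; the threshold and the dict layout are assumptions, not values from the paper.

```python
# Generic confidence gate for pseudo-instances, fusing mask and class scores.
# Threshold and field names are illustrative assumptions.
from typing import Any, Dict, List

def keep_pseudo_instances(instances: List[Dict[str, Any]],
                          score_thresh: float = 0.5) -> List[Dict[str, Any]]:
    """Keep only pseudo-instances whose fused confidence clears the threshold.

    Each instance dict is assumed to carry 'mask_score' and 'cls_score' in [0, 1].
    """
    kept = []
    for inst in instances:
        fused = inst["mask_score"] * inst["cls_score"]
        if fused >= score_thresh:
            kept.append({**inst, "fused_score": fused})
    return kept
```

If this gate admits too much noise, or filters so aggressively that the class distribution skews, the distillation and refinement stages inherit exactly the bias the premise rules out.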

What would settle it

Running the full pipeline on a new dataset where contrastive calibration produces sharply inaccurate pseudo-labels and checking whether the final student accuracy falls below the teacher's zero-shot performance.
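The test reduces to three measured numbers per dataset plus a sanity condition that the stress was real; a toy decision rule (placeholder names, no results implied) makes the interpretation explicit.

```python
# Toy decision rule for the stress test above; all inputs are measured AP
# values to be filled in, nothing here is a result from the paper.
def stress_test_verdict(student_ap: float, teacher_zeroshot_ap: float,
                        pseudo_label_ap: float, noise_ceiling: float) -> str:
    """Interpret the proposed falsification experiment."""
    if pseudo_label_ap >= noise_ceiling:
        return "inconclusive: pseudo-labels were not actually degraded"
    if student_ap < teacher_zeroshot_ap:
        return "premise falsified: the pipeline amplified pseudo-label errors"
    return "premise supported: the student still beat the zero-shot teacher"
```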

Figures

Figures reproduced from arXiv: 2604.03841 by Pardis Taghavi, Renjie Li, Reza Langari, Tian Liu, Zhengzhong Tu.

Figure 1. Framework overview. Top: three-stage pipeline: (1) adapt a pre-trained VFM teacher to the target domain via self-training with pixel-level contrastive calibration; (2) distill knowledge into a compact student using instance-aware contrastive sampling; (3) fine-tune the student on labeled data to correct residual pseudo-label bias. Bottom: detailed view of stage (2): fused mask and class score maps produce … view at source ↗

Figure 2. Efficiency comparison of knowledge distillation approaches (log scale). view at source ↗

Figure 3. Left: empirical margin (NegMean − PosMean) measured every 10k iterations for different values of λpxl. Center: false negative rate (FNR) for λpxl = 0.1, with the dashed line marking p = 0.5. Right: contrastive loss for λpxl = 0.1. view at source ↗

Figure 4. Qualitative results on Cityscapes. Guided distillation [3] (top) versus our method (bottom). view at source ↗

Figure 5. Qualitative bias reduction in stage-wise distillation. view at source ↗

Figure 6. Qualitative results on ADE20K. view at source ↗

Figure 7. Performance–complexity radar chart (normalized). view at source ↗

Figure 8. Attention maps of the teacher model before and after adaptation. Left: zero-shot VFM teacher before adaptation. Right: teacher after adaptation with self-training and contrastive supervision. The adapted teacher exhibits more localized attention on target objects (person, bus, car, truck, rider) and reduced background activation, indicating improved spatial discrimination that leads to higher-quality pse… view at source ↗

Figure 9. More qualitative results. view at source ↗
read the original abstract

Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a three-stage semi-supervised knowledge distillation framework to compress vision foundation models into compact instance segmentation experts. Stage 1 performs domain adaptation of the teacher(s) via self-training with an instance-aware pixel-wise contrastive loss that fuses mask and class scores; stage 2 transfers knowledge through a unified multi-objective loss; stage 3 refines the student to mitigate residual pseudo-label bias. On Cityscapes and ADE20K the ~11× smaller student is reported to improve over zero-shot teachers by +11.9 and +8.6 AP, over adapted teachers by +3.4 and +1.5 AP, and to surpass prior SSKD methods.

Significance. If the central assumption holds, the work would demonstrate a practical route to adapting and distilling large VFMs for dense prediction under limited annotation budgets, with the cross-stage contrastive signal as a potentially reusable mechanism for controlling label noise.

major comments (2)
  1. [§3.1] Domain Adaptation: the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.
  2. [§4] Experiments: the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.
minor comments (2)
  1. [Abstract] The phrase '≈11× smaller student' should be accompanied by explicit parameter counts or FLOPs for both teacher and student to permit immediate assessment of the compression ratio.
  2. [§2] Related Work: the discussion of prior SSKD methods could more explicitly contrast the proposed cross-stage contrastive calibration against existing pixel-wise or mask-aware contrastive losses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [§3.1] Domain Adaptation: the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.

    Authors: We agree that direct quantitative validation of pseudo-label quality would make the contribution of the contrastive loss more transparent. In the revised manuscript we will add a new table reporting pseudo-label mAP (and per-class precision/recall) on a held-out portion of the labeled training data for the adapted teacher, comparing the full instance-aware contrastive loss against an ablation that removes the contrastive term. This will directly quantify the reduction in label noise and support the claim that the loss keeps pseudo-labels within the recovery range of the later stages. revision: yes

  2. Referee: [§4] Experiments: the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.

    Authors: We acknowledge that the current experimental section lacks statistical robustness indicators. In the revision we will rerun the main Cityscapes and ADE20K experiments across five random seeds and report mean AP together with standard deviation. We will also add a dedicated ablation subsection that varies the loss-weighting coefficients, contrastive temperature, and margin over reasonable ranges, showing that the reported gains remain stable and that the chosen operating point is not an outlier. revision: yes
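Both promised revisions boil down to short measurement scripts. The pseudo-label quality table, for instance, could be produced with COCO-style tooling; the sketch below assumes pycocotools and COCO-format annotation/result files, with placeholder file names.

```python
# Sketch of the pseudo-label quality check: score the adapted teacher's
# pseudo-labels against held-out ground truth with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def pseudo_label_map(gt_json: str, pseudo_json: str) -> float:
    """Return mask mAP of pseudo-labels measured against held-out ground truth."""
    coco_gt = COCO(gt_json)                    # held-out labeled split
    coco_dt = coco_gt.loadRes(pseudo_json)     # teacher pseudo-labels, COCO result format
    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]                         # AP @ IoU=0.50:0.95

# e.g. compare calibration on versus off (placeholder paths):
# pseudo_label_map("heldout_gt.json", "pseudo_with_contrastive.json")
# pseudo_label_map("heldout_gt.json", "pseudo_without_contrastive.json")
```

The robustness reporting is likewise an aggregation over seeds and sweep points; train_and_eval below is a stand-in for a full training plus evaluation run, and the seed list and λpxl grid are illustrative choices, not the authors' protocol.

```python
# Minimal aggregation for the promised multi-seed and hyper-parameter study.
import statistics

def report(train_and_eval, seeds=(0, 1, 2, 3, 4), lam_pxl_grid=(0.05, 0.1, 0.2)):
    for lam_pxl in lam_pxl_grid:
        aps = [train_and_eval(seed=s, lam_pxl=lam_pxl) for s in seeds]
        mean, std = statistics.mean(aps), statistics.stdev(aps)
        print(f"lambda_pxl={lam_pxl}: AP = {mean:.2f} ± {std:.2f} over {len(seeds)} seeds")
```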

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains are measured outcomes, not reductions to fitted inputs or self-citations

full rationale

The paper outlines a three-stage SSKD framework (domain adaptation via self-training with contrastive calibration, unified multi-objective distillation, and refinement) centered on an instance-aware pixel-wise contrastive loss. However, the headline claims consist of measured AP improvements on held-out Cityscapes and ADE20K test sets (+11.9, +8.6, +3.4, +1.5 AP), presented as experimental results rather than quantities derived by construction from parameters fitted inside the same chain. No equations, self-citations, or uniqueness theorems are invoked that would reduce the reported gains to tautological redefinitions of the inputs. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that self-training pseudo-labels can be made reliable enough via contrastive calibration and that the multi-objective loss can transfer useful knowledge without explicit supervision on most images. No new physical entities are postulated.

free parameters (2)
  • loss weighting coefficients
    The unified multi-objective loss requires coefficients that balance the contrastive term against segmentation and classification losses; these are chosen during training (a schematic form of the combined objective is sketched just after this ledger).
  • contrastive temperature and margin
    Hyperparameters inside the instance-aware pixel-wise contrastive loss that control negative sampling and inter-instance separation.
axioms (1)
  • domain assumption: Pseudo-labels generated by the adapted teacher are sufficiently accurate on unlabeled data to serve as supervision for the student.
    Invoked in stage 1 self-training and stage 2 distillation; if false the entire pipeline amplifies errors.
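Read together, the ledger's free parameters enter an overall objective whose schematic form, in our notation rather than the paper's exact equations, is:

```latex
% Schematic form only; the weights, temperature, and margin below are the
% free parameters listed in the ledger above.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{\text{kd}}\,\mathcal{L}_{\text{kd}}
  + \lambda_{\text{pxl}}\,\mathcal{L}_{\text{pxl}},
\qquad
\mathcal{L}_{\text{pxl}}
  = -\log \frac{\exp(s^{+}/\tau)}
               {\exp(s^{+}/\tau) + \sum_{j \in \mathcal{N}} \exp\!\big((s_{j}^{-} + m)/\tau\big)}
```

Here s⁺ is the similarity of a pixel embedding to a same-instance positive, s_j⁻ are similarities to the selected cross-instance negatives in N, τ is the contrastive temperature, m the margin, and λkd, λpxl the weighting coefficients.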

pith-pipeline@v0.9.0 · 5526 in / 1496 out tokens · 38117 ms · 2026-05-13T16:48:23.531331+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 8 internal anchors

  1. [1] Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank
    Inigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, and Ana C Murillo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8219–8228, 2021.

  2. [2] Foundation models defining a new era in vision: a survey and outlook
    Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  3. [3] Guided distillation for semi-supervised instance segmentation
    Tariq Berrada, Camille Couprie, Karteek Alahari, and Jakob Verbeek. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 475–483.

  4. [4] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. arXiv preprint arXiv:2410.02073, 2024.

  5. [5] On the Opportunities and Risks of Foundation Models
    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. arXiv preprint arXiv:2108.07258, 2021.

  6. [6] Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning
    Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6912–6920, 2021.

  7. [7] A simple framework for contrastive learning of visual representations
    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  8. [8] Big self-supervised models are strong semi-supervised learners
    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Advances in Neural Information Processing Systems, 33:22243–22255, 2020.

  9. [9] An empirical study of training self-supervised vision transformers
    Xinlei Chen, Saining Xie, and Kaiming He. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.

  10. [10] Semi-supervised semantic segmentation with cross pseudo supervision
    Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622.

  11. [11] Depth-guided semi-supervised instance segmentation
    Xin Chen, Jie Hu, Xiawu Zheng, Jianghang Lin, Liujuan Cao, and Rongrong Ji. arXiv preprint arXiv:2406.17413, 2024.

  12. [12] Masked-attention mask transformer for universal image segmentation
    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.

  13. [13] The Cityscapes dataset for semantic urban scene understanding
    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  14. [14] Augmentation-free dense contrastive knowledge distillation for efficient semantic segmentation
    Jiawei Fan, Chao Li, Xiaolong Liu, Meina Song, and Anbang Yao. Advances in Neural Information Processing Systems, 36:51359–51370, 2023.

  15. [15] SEED: Self-supervised distillation for visual representation
    Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. arXiv preprint arXiv:2101.04731, 2021.

  16. [16] Foundation models in robotics: Applications, challenges, and the future
    Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. The International Journal of Robotics Research, page 02783649241281508, 2023.

  17. [17] Erasing the bias: Fine-tuning foundation models for semi-supervised learning
    Kai Gan and Tong Wei. arXiv preprint arXiv:2405.11756, 2024.

  18. [18] Mask R-CNN
    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

  19. [19] Distilling the Knowledge in a Neural Network
    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. arXiv preprint arXiv:1503.02531, 2015.

  20. [20] Pseudo-label alignment for semi-supervised instance segmentation
    Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, and Rongrong Ji. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16337–16347, 2023.

  21. [21] Pixel-wise contrastive distillation
    Junqiang Huang and Zichao Guo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16359–16369, 2023.

  22. [22] VL2Lite: Task-specific knowledge distillation from large vision-language models to lightweight networks
    Jinseong Jang, Chunfei Ma, and Byeongwon Lee. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 30073–30083, 2025.

  23. [23] MTKD: Multi-teacher knowledge distillation for image super-resolution
    Yuxuan Jiang, Chen Feng, Fan Zhang, and David Bull. In European Conference on Computer Vision, pages 364–382. Springer, 2024.

  24. [24] Segment Anything
    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. arXiv:2304.02643, 2023.

  25. [25] CustomKD: Customizing large vision foundation for edge model improvement via knowledge distillation
    Jungsoo Lee, Debasmit Das, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, and Fatih Porikli. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25176–25186.

  26. [26] Task-specific knowledge distillation from the vision foundation model for enhanced medical image segmentation
    Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, and Qing Chang. arXiv preprint arXiv:2503.06976, 2025.

  27. [27] Pseudo-label quality decoupling and correction for semi-supervised instance segmentation
    Jianghang Lin, Yilin Lu, Yunhang Shen, Chaoyang Zhu, Shengchuan Zhang, Liujuan Cao, and Rongrong Ji. arXiv preprint arXiv:2505.11075, 2025.

  28. [28] Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  29. [29] Unbiased teacher for semi-supervised object detection
    Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. arXiv preprint arXiv:2102.09480, 2021.

  30. [30] General lightweight framework for vision foundation model supporting multi-task and multi-center medical image analysis
    S. Lu, Y. Chen, Y. Chen, et al. Nature Communications, 16:2097, 2025.

  31. [31] DINOv2: Learning Robust Visual Features without Supervision
    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. arXiv preprint arXiv:2304.07193, 2023.

  32. [32] Self-supervised knowledge distillation for few-shot learning
    Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. arXiv preprint arXiv:2006.09785, 2020.

  33. [33] Vision transformers for dense prediction
    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  34. [34] SAM 2: Segment Anything in Images and Videos
    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. arXiv preprint arXiv:2408.00714, 2024.

  35. [35] Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. arXiv preprint arXiv:2401.14159, 2024.

  36. [36] Channel-wise knowledge distillation for dense prediction
    Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320.

  37. [37] Foundation versus domain-specific models: Performance comparison, fusion, and explainability in face recognition
    Redwan Sony, Parisa Farmanifard, Arun Ross, and Anil K Jain. arXiv preprint arXiv:2507.03541, 2025.

  38. [38] DIME-FM: Distilling multimodal and efficient foundation models
    Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, and Xide Xia. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15521–15533, 2023.

  39. [39] Navidrivevlm: Decoupling high-level reasoning and motion planning for autonomous driving
    Ximeng Tao, Pardis Taghavi, Dimitar Filev, Reza Langari, and Gaurav Pandey. arXiv preprint arXiv:2603.07901, 2026.

  40. [40] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
    Antti Tarvainen and Harri Valpola. Advances in Neural Information Processing Systems, 30, 2017.

  41. [41] Contrastive representation distillation
    Yonglong Tian, Dilip Krishnan, and Phillip Isola. arXiv preprint arXiv:1910.10699, 2019.

  42. [42] Knowledge transfer from vision foundation models for efficient training of small task-specific models
    Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, and Oncel Tuzel. ICML, 2024.

  43. [43] SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding
    Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–3647.

  44. [44] Dense contrastive learning for self-supervised visual pre-training
    Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033.

  45. [45] ContrastMask: Contrastive learning to segment every thing
    Xuehui Wang, Kai Zhao, Ruixin Zhang, Shouhong Ding, Yan Wang, and Wei Shen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11604–11613, 2022.

  46. [46] DetCo: Unsupervised contrastive learning for object detection
    Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8392–8401, 2021.

  47. [47] Self-training with noisy student improves ImageNet classification
    Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698.

  48. [48] Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning
    Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.

  49. [49] A survey on knowledge distillation of large language models
    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. arXiv preprint arXiv:2402.13116, 2024.

  50. [50] Forging vision foundation models for autonomous driving: Challenges, methodologies, and opportunities
    Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao, et al. arXiv preprint arXiv:2401.08045, 2024.

  51. [51] Cross-image relational knowledge distillation for semantic segmentation
    Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.

  52. [52] CLIP-KD: An empirical study of CLIP model distillation
    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.

  53. [53] Multi-teacher knowledge distillation with reinforcement learning for visual recognition
    Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, and Yongjun Xu. arXiv preprint arXiv:2502.18510, 2025.

  54. [54] Revisiting weak-to-strong consistency in semi-supervised semantic segmentation
    Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7236–7246, 2023.

  55. [55] Depth Anything V2
    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. arXiv preprint arXiv:2406.09414, 2024.

  56. [56] G-DetKD: Towards general distillation framework for object detectors via contrastive and semantic-guided feature imitation
    Lewei Yao, Renjie Pi, Hang Xu, Wei Zhang, Zhenguo Li, and Tong Zhang. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3591–3600.

  57. [57] S^4M: Boosting semi-supervised instance segmentation with SAM
    Heeji Yoon, Heeseong Shin, Eunbeen Hong, Hyunwook Choi, Hansang Cho, Daun Jeong, and Seungryong Kim. arXiv preprint arXiv:2504.05301, 2025.

  58. [58] Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos
    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. arXiv preprint arXiv:2501.04001, 2025.

  59. [59] Accessing vision foundation models via ImageNet-1K
    Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, and Yun Fu. In The Thirteenth International Conference on Learning Representations, 2025.

  60. [60] An open and comprehensive pipeline for unified object grounding and detection
    Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. arXiv preprint arXiv:2401.02361, 2024.

  61. [61] Pixel contrastive-consistent semi-supervised semantic segmentation
    Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7273–7282, 2021.

  62. [62] Semantic understanding of scenes through the ADE20K dataset
    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. International Journal of Computer Vision, 127:302–321, 2019.

  63. [63] Complementary relation contrastive distillation
    Jinguo Zhu, Shixiang Tang, Dapeng Chen, Shijie Yu, Yakun Liu, Mingzhe Rong, Aijun Yang, and Xiaohua Wang. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9260–9269, 2021.

  64. [64] Argus: A compact and versatile foundation model for vision
    Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.