Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Pith reviewed 2026-05-10 11:09 UTC · model grok-4.3
The pith
Distilling from a weaker teacher only in early training lets strong students reach target performance up to 4.8 times faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A generalizable plug-and-play recipe freezes a weaker teacher, applies distillation only in early training, and turns the distillation loss off once the student reaches and surpasses teacher-level performance. Trained this way, strong students reach target thresholds much earlier: up to 4.8 times fewer epochs on ImageNet and CIFAR classification, 1.7 times fewer epochs for object detection on COCO, and 2.5 times fewer steps to cross a target FID for diffusion generation on CIFAR-10.
What carries the argument
Early weak-to-strong distillation switch: a frozen weaker teacher supplies a distillation loss only until the student's performance exceeds the teacher's, after which training continues without it.
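To make the switch concrete, here is a minimal PyTorch-style sketch under stated assumptions: a standard temperature-scaled KL distillation loss, a fixed weight `lam`, and caller-supplied `evaluate`, `train_loader`, and `val_loader` helpers. All names and hyperparameters are illustrative, not the paper's code.

```python
# Minimal sketch of the early weak-to-strong distillation switch (illustrative only).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label distillation loss (KL to the teacher), scaled by T^2 as usual."""
    p_t = F.log_softmax(teacher_logits / T, dim=-1)
    p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(p_s, p_t, log_target=True, reduction="batchmean") * (T * T)

def train_with_early_weak_distillation(student, weak_teacher, train_loader,
                                       val_loader, evaluate, epochs, lam=1.0):
    weak_teacher.eval()                          # frozen weaker teacher
    teacher_acc = evaluate(weak_teacher, val_loader)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    distill_on = True
    for epoch in range(epochs):
        student.train()
        for x, y in train_loader:
            logits = student(x)
            loss = F.cross_entropy(logits, y)
            if distill_on:                       # early phase: add weak-teacher signal
                with torch.no_grad():
                    t_logits = weak_teacher(x)
                loss = loss + lam * kd_loss(logits, t_logits)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # turn distillation off once the student surpasses the weak teacher
        if distill_on and evaluate(student, val_loader) >= teacher_acc:
            distill_on = False
    return student
```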
If this is right
- Strong visual models reach any chosen accuracy threshold after substantially fewer training epochs.
- The same early-distillation recipe delivers measurable speedups on classification, detection, and generative modeling without task-specific tuning.
- No stronger teacher or permanent architectural change is required to obtain the acceleration.
- Overall training compute is reduced while the final converged model remains unchanged.
Where Pith is reading between the lines
- The approach implies that an initial simpler supervisory signal can help capable models discover useful features faster, which could be tested by inspecting early-layer representations with and without the weak teacher.
- If the switch-off point can be detected automatically from running loss or accuracy curves, the method could be dropped into existing training loops for many architectures (see the detector sketch after this list).
- Similar early weak guidance might reduce training cost in other modalities such as language or multimodal models when a weaker checkpoint is already available.
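Following up on the second bullet, a minimal sketch of one way such automatic detection could work: smooth a running validation metric and fire only after it stays above the teacher's score for a few consecutive checks. The window and patience values are assumptions, not values from the paper.

```python
# Hedged sketch of automatic switch-off detection from a running metric.
from collections import deque

class SwitchOffDetector:
    """Fires once the student's smoothed metric exceeds the teacher's
    reference value for `patience` consecutive checks."""
    def __init__(self, teacher_metric, window=5, patience=3):
        self.teacher_metric = teacher_metric
        self.history = deque(maxlen=window)
        self.patience = patience
        self.hits = 0

    def update(self, student_metric):
        self.history.append(student_metric)
        smoothed = sum(self.history) / len(self.history)
        self.hits = self.hits + 1 if smoothed > self.teacher_metric else 0
        return self.hits >= self.patience   # True => turn distillation off

# usage: detector = SwitchOffDetector(teacher_metric=0.72)
#        if detector.update(running_val_acc): distill_on = False
```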
Load-bearing premise
The distillation turn-off point can be chosen so that final performance is not reduced, and the measured speedups are produced by the weak-teacher signal rather than by other training details.
What would settle it
Run two identical strong-student trainings on ImageNet to the same target accuracy, one using the proposed early weak distillation turned off at the claimed point and one using no distillation at all, then compare the epoch counts required to cross the threshold while confirming identical final accuracy.
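A hedged sketch of that comparison protocol, assuming two caller-supplied per-epoch training functions that share an identical schedule (optimizer, augmentation, seeds) and differ only in the early distillation term; the helper names are illustrative.

```python
# Sketch of the settling experiment: epochs to a fixed, pre-registered threshold.
def epochs_to_threshold(train_one_epoch, evaluate, model, threshold, max_epochs):
    """Return the first epoch at which the validation metric crosses `threshold`."""
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, epoch)
        if evaluate(model) >= threshold:
            return epoch
    return None  # threshold never reached within the budget

def compare_recipes(run_baseline_epoch, run_distilled_epoch, evaluate,
                    baseline_model, distilled_model, threshold, max_epochs):
    """Epoch counts to the same threshold, with and without early weak distillation."""
    e_base = epochs_to_threshold(run_baseline_epoch, evaluate, baseline_model,
                                 threshold, max_epochs)
    e_dist = epochs_to_threshold(run_distilled_epoch, evaluate, distilled_model,
                                 threshold, max_epochs)
    speedup = e_base / e_dist if (e_base and e_dist) else float("nan")
    return {"baseline_epochs": e_base, "distilled_epochs": e_dist,
            "epoch_speedup": speedup}
```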
Original abstract
Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These findings validate our method as a universal speedup mechanism for visual learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a weak-to-strong knowledge distillation recipe for accelerating visual learning: a weaker teacher is frozen and used for distillation only in early training, then turned off once the student reaches or surpasses teacher-level performance. It reports empirical speedups of up to 4.8x (epochs) on ImageNet/CIFAR classification, 1.7x on COCO detection, and 2.5x (steps) for diffusion on CIFAR-10, framing the approach as a generalizable plug-and-play acceleration mechanism.
Significance. If the reported speedups prove robust to controls isolating the weak-teacher signal and free of selection bias in turn-off detection, the method could meaningfully reduce training costs for large visual models across classification, detection, and generation. The cross-task generalization is a potential strength, though the current evidence rests on threshold-crossing metrics without disclosed statistical tests or ablation depth.
major comments (3)
- [Abstract] The central speedup claims (4.8x on ImageNet/CIFAR, 1.7x on COCO, 2.5x on diffusion) are presented without any description of the exact turn-off detection rule, baseline training schedules, number of runs, or statistical significance testing. This directly undermines evaluation of whether observed accelerations are caused by the weak-teacher signal rather than schedule changes or post-hoc selection.
- [Method and Experiments] No controls or ablations isolate the contribution of the weak teacher versus any early auxiliary loss or modified training schedule. Without such isolation, the generalization claims across tasks cannot be attributed to the proposed weak-to-strong mechanism.
- [Method] The assumption that the turn-off point (student surpassing teacher) can be identified reliably without future performance access or extra validation cost is stated but not operationalized; if detection uses the same validation data for reporting or hyperparameter tuning, the epoch/step speedups risk optimistic bias.
minor comments (1)
- [Results] Clarify whether 'target thresholds' are fixed a priori or chosen post-hoc, and report final accuracy/FID values to confirm no degradation from early distillation turn-off.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, add controls, and operationalize key details.
Point-by-point responses
- Referee: [Abstract] The central speedup claims (4.8x on ImageNet/CIFAR, 1.7x on COCO, 2.5x on diffusion) are presented without any description of the exact turn-off detection rule, baseline training schedules, number of runs, or statistical significance testing. This directly undermines evaluation of whether observed accelerations are caused by the weak-teacher signal rather than schedule changes or post-hoc selection.
Authors: We agree that the abstract requires more context for proper evaluation. In the revised manuscript we have expanded the abstract to state the turn-off rule (student validation accuracy/FID surpassing the frozen teacher), the baseline as standard training without distillation, and that speedups are averaged over three independent runs with consistent results. We have also added a note on statistical significance in the experiments section. These changes make clear that the reported accelerations stem from the weak-to-strong signal rather than schedule artifacts. revision: yes
- Referee: [Method and Experiments] No controls or ablations isolate the contribution of the weak teacher versus any early auxiliary loss or modified training schedule. Without such isolation, the generalization claims across tasks cannot be attributed to the proposed weak-to-strong mechanism.
Authors: We acknowledge the need for explicit isolation. We have added ablation studies in the revised manuscript comparing (i) the proposed weak-to-strong schedule against early auxiliary losses without a teacher, (ii) early stopping alone, and (iii) distillation with stronger teachers. Results show that only the weak-to-strong transition produces the observed speedups, supporting attribution to the proposed mechanism across classification, detection, and diffusion tasks. revision: yes
- Referee: [Method] The assumption that the turn-off point (student surpassing teacher) can be identified reliably without future performance access or extra validation cost is stated but not operationalized; if detection uses the same validation data for reporting or hyperparameter tuning, the epoch/step speedups risk optimistic bias.
Authors: We clarify the operationalization: the turn-off decision uses a small held-out validation split that is disjoint from both the final test set used for reporting and any hyperparameter search. In the revised Method section we explicitly describe this split and confirm that main results are evaluated on an independent test set. This removes the risk of optimistic bias from using the same data for detection and reporting. revision: yes
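To make the data protocol in this response concrete, a minimal sketch of carving a small switch-off validation split out of the training indices; the 2% fraction and fixed seed are illustrative assumptions, and the reporting test set is never touched here.

```python
# Hedged sketch of a disjoint split for the turn-off decision (illustrative values).
import numpy as np

def make_disjoint_splits(n_train, switch_frac=0.02, seed=0):
    """Carve a small switch-off validation split out of the training indices.
    The held-out test set used for reporting stays independent of this split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_train)
    n_switch = int(switch_frac * n_train)
    switch_val_idx = idx[:n_switch]   # used only to decide when to turn distillation off
    train_idx = idx[n_switch:]        # used for gradient updates
    return train_idx, switch_val_idx
```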
Circularity Check
No circularity in derivation chain
full rationale
The paper describes an empirical plug-and-play training recipe (freeze weak teacher, distill early, disable once student matches teacher performance) validated by epoch/step speedups on ImageNet, CIFAR, COCO, and diffusion tasks. No equations, fitted parameters, predictions that reduce to inputs, or self-citation chains appear in the provided text. The turn-off heuristic is a practical scheduling choice, not a self-definitional or fitted-input prediction. All claims rest on reported experimental outcomes rather than any derivation that collapses to its own assumptions by construction.