Knowledge Diffusion for Distillation

Chang Xu; Chen Qian; Fei Wang; Mingkai Zheng; Shan You; Tao Huang; Yuan Zhang

arxiv: 2305.15712 · v2 · pith:RAZMIADGnew · submitted 2023-05-25 · 💻 cs.CV · cs.AI

Knowledge Diffusion for Distillation

Tao Huang , Yuan Zhang , Mingkai Zheng , Shan You , Fei Wang , Chen Qian , Chang Xu This is my paper

classification 💻 cs.CV cs.AI

keywords featuresdiffusionfeaturestudentteacherdiffkddistillationmodel

0 comments

read the original abstract

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noises than teacher features due to the smaller capacity of student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at https://github.com/hunto/DiffKD.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 6.0

LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...