Deep Mutual Learning

Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu

arxiv: 1706.00384 · v1 · pith:DJOD6QDJnew · submitted 2017-06-01 · 💻 cs.CV

keywords: network, learning, mutual, teacher, powerful, student, transfer, deep

Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast-execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static pre-defined teacher and a student, an ensemble of students learns collaboratively, with the students teaching each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on the CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, no prior powerful teacher network is necessary: mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
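
The abstract does not spell out the training objective, so the following is only a minimal sketch of one way two students could "teach each other". It assumes each student's loss is its own cross-entropy on the true label plus a KL divergence term that pulls its prediction toward its peer's. The function names (`softmax`, `cross_entropy`, `kl_divergence`, `mutual_losses`) and the equal weighting of the two terms are illustrative assumptions, not taken from the page.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Supervised loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q): how far prediction q is from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_losses(logits_a, logits_b, label):
    """Per-student losses for one sample (assumed form, equal weighting).

    Each student is trained on the label and, in addition, is nudged
    toward the class distribution currently predicted by its peer.
    """
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, label) + kl_divergence(p_a, p_b)
    return loss_a, loss_b


# Hypothetical logits from two students for a 3-class sample with true class 2.
loss_a, loss_b = mutual_losses([0.5, 1.0, 2.0], [2.0, 0.1, 1.5], label=2)
print(loss_a, loss_b)
```

In a real training loop, each student would likely treat its peer's prediction as a fixed target when computing its own gradient, with the two students updated in alternation. In contrast to one-way distillation, neither network is pre-trained or frozen.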

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT · 2026-05 · unverdicted · novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT · 2026-05 · unverdicted · novelty 7.0

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.

OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
cs.IR · 2026-03 · unverdicted · novelty 4.0

OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.