Distilling the Knowledge in a Neural Network
Original abstract
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
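The compression technique the abstract alludes to is training the single "student" model to match the ensemble's temperature-softened output distribution rather than only the hard labels. Below is a minimal NumPy sketch of that objective, assuming a single example with raw logits already in hand; the function names, the `alpha` mixing weight, and the specific temperature are illustrative choices, not the paper's exact training setup.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, alpha=0.5):
    """Weighted sum of a soft-target term (match the teacher at
    temperature T) and a hard-target term (ordinary cross-entropy
    with the true label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # T**2 rescaling keeps soft-target gradient magnitudes comparable
    # across temperatures, as noted in the paper.
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12)) * T**2
    hard = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft + (1 - alpha) * hard
```

A student whose logits already match the teacher's incurs only the irreducible entropy of the soft targets plus its own hard-label loss, so the loss drops as the student's distribution approaches the ensemble's.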
Forward citations
Cited by 60 Pith papers
-
PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
PrimeKG-CL supplies the first continual graph learning benchmark using authentic temporal snapshots from nine biomedical databases, showing strong interactions between embedding decoders and learning strategies plus l...
-
Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity a...
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
On the Generalization of Knowledge Distillation: An Information-Theoretic View
Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
-
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
-
When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression
Attenuation bias from confidence thresholding in pseudo-labelled regression equals a closed-form function of residual score variance V* after partialling out controls X, yielding a (V*, κ) safety rule computable befor...
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Minimax Rates and Spectral Distillation for Tree Ensembles
Spectral analysis of tree ensembles produces minimax rates for random forests governed by kernel eigenvalue decay and enables distillation of RFs and GBMs into compact models via leading eigenfunctions and singular vectors.
-
DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
DORA uses an online RL agent to adaptively merge tokens in Vision Transformers, reporting better accuracy-efficiency trade-offs than static baselines on ImageNet and OOD sets.
-
Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, Universit...
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer
GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
-
Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Aggregation distorts parametric behavioral curve peaks by factors of 3-5x via Simpson's paradox and survival bias, shown by individual vs. aggregate comparisons on Goodreads and Amazon datasets with a negative control.
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
-
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
-
Characterizing and Correcting Effective Target Shift in Online Learning
Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Stochastic Transition-Map Distillation for Fast Probabilistic Inference
STMD distills the full transition map of diffusion sampling SDEs into a conditional Mean Flow model to enable fast one- or few-step stochastic sampling without teacher models or bi-level optimization.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is min...
-
SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
-
A Testable Certificate for Constant Collapse in Teacher-Guided VAEs
For any fixed nonconstant teacher T, the best constant student has alignment cost exactly equal to the teacher mutual information I_T(X;T); a latent-only witness below this threshold with margin cannot be constant.
-
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
-
Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis
CT-Lite combines Feature Attention Style Transfer (FAST) and Structured Factorized Projections (SFP) with contrastive learning to reach AUROC within 5-7% of uncompressed baselines on compressed CT volumes across three...
-
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
-
S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
S-SONDO distills general audio foundation models into students up to 61 times smaller while retaining up to 96% of teacher performance using only output embeddings.
-
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
-
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
-
Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?
Routing optimization for in-orbit federated learning is polynomial-time solvable under some settings like certain unicast or multicast flows and NP-hard under others, with rigorous proofs establishing the boundaries.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Rethinking Dataset Distillation: Hard Truths about Soft Labels
Soft labels hide the value of high-quality data subsets in dataset distillation, and a new compute-aware method outperforms existing approaches in hard-label settings on ImageNet-1K.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
-
First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
A new utility-based framework optimizes performance-fairness trade-offs in decisions by modeling decision-maker and decision-subject utilities and using a social planner's utility to capture group inequalities under d...
-
Sparse Contrastive Learning for Content-Based Cold Item Recommendation
SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.
-
Fast and accurate AI-based pre-decoders for surface codes
AI pre-decoders achieve O(1 μs) per round decoding runtimes on GPUs for surface codes while improving logical error rates over global decoding alone and enabling data-driven noise weight estimation.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
-
Learning Robustness at Test-Time from a Non-Robust Teacher
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
-
BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection
BRIDGE creates the first formal heterogeneous multi-dataset benchmark for IoT botnet detection with LODO evaluation, and TCH-Net achieves mean LODO F1 of 0.5577 while reaching F1 0.8296 on standard tests, outperformin...
-
Teaching the Teachers: Boosting unsupervised domain adaptation in speech recognition by ensemble update
Simultaneous ensemble teacher update with the student model improves unsupervised domain adaptation for ASR, reducing WER by 4.6% on the Switchboard eval00 set.
-
Transferable FB-GNN-MBE Framework for Potential Energy Surfaces: Data-Adaptive Transfer Learning in Deep Learned Many-Body Expansion Theory
FB-GNN-MBE integrates fragment-based graph neural networks into many-body expansion to predict two- and three-body energies for water, phenol, and mixture systems at chemical accuracy, with a teacher-student protocol ...
-
SenBen: Sensitive Scene Graphs for Explainable Content Moderation
SenBen is the first large-scale scene graph benchmark for sensitive content, paired with a 241M distilled model that outperforms most VLMs and safety APIs on grounded detection while running much faster.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while makin...
-
Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.