Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, Jeff Dean · 2015 · stat.ML · arXiv 1503.02531

Canonical reference. 79% of citing Pith papers cite this work as background.

686 Pith papers citing it

Background 79% of classified citations

abstract

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

citation-role summary

background 70 · method 14 · other 2 · dataset 1

citation-polarity summary

background 69 · use method 13 · unclear 3 · support 1 · use dataset 1

authors

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean

representative citing papers

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

Proofs of Ownership for Machine Learning Models

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

A formal game-based study establishes that black-box proofs of ownership for ML classifiers are possible precisely when the concept class is not self-correctable.

PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

cs.AI · 2026-05-11 · conditional · novelty 8.0

PrimeKG-CL supplies the first continual graph learning benchmark using authentic temporal snapshots from nine biomedical databases, showing strong interactions between embedding decoders and learning strategies plus limits of standard metrics on retention versus forgetting.

Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Purified OPSD subtracts a reference-only teacher's signal from standard OPSD supervision and applies PMI to create a cleaner distillation target, yielding gains on long-CoT models while preserving epistemic behavior.

Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

Empirical study on production-scale clinical NLP shows direct learning from verifier rejections fails due to sparse data while fixed ontology and evidence-support filters succeed, with selectivity determined by matching verifier evidence.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

RPM-Distill: Physiology-guided Adaptive Cross-modal Distillation for Robust Remote Physiological Measurement

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

RPM-Distill uses synchronized radar only at training time to distill spectral periodic features into a video model via adaptive per-sample gating, yielding 81% lower MAE on remote physiological measurement tasks.

Learning 1-Bit LiDAR-based Localization with Auxiliary Objective

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.

REDI-Match: Rotation-Equivariant Distillation for Efficient and Robust Dense Matching

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

REDI-Match uses rotation-equivariant distillation to transfer VFM semantics into a strictly equivariant encoder plus an entropy-driven alignment module, claiming SOTA accuracy and 1.9x speed on rotation-heavy benchmarks.

S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning

cs.SD · 2026-06-17 · unverdicted · novelty 7.0

S-JEPA uses soft GMM posteriors in a JEPA framework for self-supervised speech learning, achieving lowest WER below 90M parameters without offline re-clustering.

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Learning from the Self-future: On-policy Self-distillation for dLLMs

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

d-OPSD reframes on-policy self-distillation for dLLMs via suffix conditioning from self-generated answers and step-level supervision, outperforming RLVR and SFT on reasoning benchmarks with ~10% of the optimization steps.

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

cs.LG · 2026-06-12 · unverdicted · novelty 7.0

Linear recoverability of transformer FFN blocks varies widely across depth, is learned during training, and is independent of the activation function.

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

citing papers explorer

Showing 7 of 7 citing papers after filters.

PGUDA: Pressure-Guided Unsupervised Domain Adaptation with Cross-Modal Knowledge Distillation for sEMG-Based Gesture Recognition

eess.SP · 2026-06-30 · unverdicted · none · ref 10 · internal anchor

PGUDA uses pressure signals to train a teacher network that distills modality-invariant knowledge into an sEMG student via cross-modal distillation, reaching 58.08% cross-subject accuracy with only 5% labeled data for the teacher.

Topology-Aware Two-Stage Federated Learning via Proxy Models for Sub-THz Heterogeneous LEO Communications

eess.SP · 2026-05-06 · unverdicted · none · ref 37 · internal anchor

A two-stage FL aggregation method with proxy models for heterogeneous LEO networks extends contact windows and achieves 86.59-90.57% accuracy with 1.5-2.2x faster convergence than baselines.

Knowledge Distillation for mmWave Beam Prediction Using Sub-6 GHz Channels

eess.SP · 2026-02-04 · unverdicted · none · ref 21 · internal anchor

Knowledge distillation produces compact student models that match large teacher models in mmWave beam prediction accuracy from sub-6 GHz channels while cutting parameters and complexity by 99%.

Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions

eess.SP · 2026-05-04 · conditional · none · ref 22

Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwater reconstruction error from 2.809 to 0.215 MAE and raises downstream CNS-OT AUROC.

A Case Study on Energy-Efficient Edge AI Crack Segmentation

eess.SP · 2026-04-15 · unverdicted · none · ref 36 · internal anchor

Knowledge distillation plus quantization on U-Net variants, combined with custom FPGA hardware, yields 398 FPS at 204.99 Frames/J and raises mean IoU to 71.92% on CrackVision12K, an 8.82 pps gain over prior results.

Knowledge Distillation for Sensing-Assisted Long-Term Beam Tracking in mmWave Communications

eess.SP · 2025-09-14 · unverdicted · none · ref 40 · internal anchor

Knowledge distillation creates a compact neural network for long-term beam tracking in mmWave communications that matches a larger teacher's accuracy with far fewer parameters and shorter input sequences.

Deep learning in ultrasound imaging

eess.SP · 2019-07-05 · unverdicted · none · ref 122 · internal anchor

A review outlining deep learning strategies for adaptive beamforming, spectral Doppler, compressive color Doppler encodings, and structured signal recovery in ultrasound.
