PrimeKG-CL supplies the first continual graph learning benchmark using authentic temporal snapshots from nine biomedical databases, showing strong interactions between embedding decoders and learning strategies plus limits of standard metrics on retention versus forgetting.
Distilling the Knowledge in a Neural Network
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
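The abstract does not spell out the compression technique. As a minimal sketch, assuming the commonly described soft-target formulation (ensemble predictions softened by a temperature, and a student trained to match them with a cross-entropy loss scaled by the squared temperature), the idea might look like the following. The function names and example logits are hypothetical and chosen for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature):
    """Average the softened class probabilities of several models (hypothetical helper)."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def distillation_loss(student_logits, soft_targets, temperature):
    """Cross-entropy between soft targets and the student's softened output.

    Scaling by temperature**2 is an assumption of this sketch: it keeps the
    gradient magnitude roughly comparable across temperatures.
    """
    student = softmax(student_logits, temperature)
    cross_entropy = -sum(t * math.log(s) for t, s in zip(soft_targets, student))
    return temperature ** 2 * cross_entropy

# Two illustrative "teacher" models voting on three classes.
targets = ensemble_soft_targets([[2.0, 1.0, 0.0], [1.5, 1.0, 0.5]], temperature=2.0)
loss = distillation_loss([1.0, 1.0, 1.0], targets, temperature=2.0)
```

In this sketch a single student is trained against `targets` rather than against hard labels, so only the student needs to be deployed. A higher temperature spreads probability mass over more classes, which is what lets the student see how the ensemble ranks the incorrect classes.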
representative citing papers
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
Slimmable ConvNeXt adapts ConvNeXt for width-adaptive inference using LayerNorm and inverted bottlenecks, reaching 80.8% top-1 at 4.5 GMACs and outperforming HydraViT, MatFormer, and SortedNet on ImageNet-1k.
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, medicine, manufacturing, and ecology.
DIPA learns preconditioning operators via distillation from a teacher with a better sensing matrix to improve reconstruction quality for the student's physically constrained matrix in imaging inverse problems.
TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
Genetic programming evolves heterogeneous layer-specific scalar functions to approximate layer normalization in pre-trained ViTs, capturing 91.6% variance versus 70.2% for uniform baselines and recovering 84.25% ImageNet Top-1 accuracy after 20 epochs of adaptation.
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
citing papers explorer
- Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Minimax Rates and Spectral Distillation for Tree Ensembles
Spectral analysis of tree ensembles produces minimax rates for random forests governed by kernel eigenvalue decay and enables distillation of RFs and GBMs into compact models via leading eigenfunctions and singular vectors.
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
- Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
- Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- Characterizing and Correcting Effective Target Shift in Online Learning
Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
- Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
- Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
- Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
- SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
A model trained only by proposing and solving its own verifiable code tasks achieves state-of-the-art results on math and coding benchmarks without external data.
- Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
- Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
- PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.
- Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection
HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency trade-offs on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.
- Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and 3D-FRONT.
- AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
- Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective
FedF-ADMM uses function-space ADMM updates projected via knowledge distillation plus a PI-like stabilization term to deliver faster, more stable convergence and higher accuracy than prior decentralized FL methods under severe non-IID conditions.
- Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning
CORF unifies domain generalization and class-incremental learning via selective sample refinement with spatial maps and confidence weighting plus cascaded relational distillation.
- Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
- Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
- Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
- Hybrid Policy Distillation for LLMs
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
- Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
ESFM is a single open foundation model that unifies heterogeneous Earth data sources and forecasts missing regions while preserving inter-variable physical relationships.
- Extraction of linearized models from pre-trained networks via knowledge distillation
Koopman theory plus knowledge distillation yields linearized models from pre-trained nets that outperform standard least-squares Koopman approximations on MNIST and Fashion-MNIST in accuracy and stability.
- Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
- Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
- Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
- Depth Anything V2
Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
- H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
- MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
- Progressive Neural Networks
Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.
- ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
- Curriculum Learning-Guided Progressive Distillation in Large Language Models
CLPD improves LLM distillation for reasoning by combining explicit data curriculum with progressive teacher scheduling of increasing capacity.
- Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.
- Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach
A neurosymbolic imitation learning approach uses privileged gaze data during training to handle high-dimensional inputs while achieving better generalization than pure neural or symbolic methods.
- EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
- Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.