Distilling the Knowledge in a Neural Network
249 Pith papers cite this work.
abstract
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
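The compression technique the abstract alludes to is the temperature-scaled distillation objective: the teacher's logits are softened by a temperature T before the softmax, and the student is trained to match the resulting soft targets while also fitting the true labels. Below is a minimal PyTorch sketch of that objective, not the authors' code; the function name distillation_loss and the default values of T and alpha are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation in the style of Hinton et al. (2015).

    The teacher's logits are softened with temperature T and the student
    is trained to match them via KL divergence, mixed with the usual
    hard-label cross-entropy on the true labels.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Soft-target gradients scale as 1/T^2, so the paper multiplies the
    # soft loss by T^2 to keep the two terms on comparable scales.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Higher temperatures expose the relative probabilities the teacher assigns to incorrect classes, which the paper argues carry much of the knowledge being transferred.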
citing papers explorer
- PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
PrimeKG-CL supplies the first continual graph learning benchmark built from authentic temporal snapshots of nine biomedical databases, showing strong interactions between embedding decoders and learning strategies, and the limits of standard metrics for distinguishing retention from forgetting.
- Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.
- Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
- Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- TILT: Target-induced loss tilting under covariate shift
TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
- Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
- Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
- On the Generalization of Knowledge Distillation: An Information-Theoretic View
Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
- When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression
Attenuation bias from confidence thresholding in pseudo-labelled regression equals a closed-form function of residual score variance V* after partialling out controls X, yielding a (V*, κ) safety rule computable before inference.
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- Minimax Rates and Spectral Distillation for Tree Ensembles
Spectral analysis of tree ensembles produces minimax rates for random forests governed by kernel eigenvalue decay and enables distillation of RFs and GBMs into compact models via leading eigenfunctions and singular vectors.
- DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
DORA uses an online RL agent to adaptively merge tokens in Vision Transformers, reporting better accuracy-efficiency trade-offs than static baselines on ImageNet and OOD sets.
- Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, University-1652, and DenseUAV while widening gains under weather corruptions.
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
- Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer
GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.
- Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
- Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
- Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access
Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.
- Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
- Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Aggregation distorts parametric behavioral curve peaks by factors of 3-5x via Simpson's paradox and survival bias, shown by individual vs. aggregate comparisons on Goodreads and Amazon datasets with a negative control.
- Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
On-policy distillation has an extrapolation cliff at a closed-form λ*(p, b, c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
- Characterizing and Correcting Effective Target Shift in Online Learning
Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
- Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than training from scratch.
- Stochastic Transition-Map Distillation for Fast Probabilistic Inference
STMD distills the full transition map of diffusion sampling SDEs into a conditional Mean Flow model to enable fast one- or few-step stochastic sampling without teacher models or bi-level optimization.
- Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
- Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
- Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.
- SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
- A Testable Certificate for Constant Collapse in Teacher-Guided VAEs
For any fixed nonconstant teacher T, the best constant student has alignment cost exactly equal to the teacher mutual information I_T(X;T); a latent-only witness below this threshold with margin cannot be constant.
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis
CT-Lite combines Feature Attention Style Transfer (FAST) and Structured Factorized Projections (SFP) with contrastive learning to reach AUROC within 5-7% of uncompressed baselines on compressed CT volumes across three datasets while using far fewer parameters.
- Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
- S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
S-SONDO distills general audio foundation models into students up to 61 times smaller while retaining up to 96% of teacher performance using only output embeddings.
- When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
- RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
- Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?
Routing optimization for in-orbit federated learning is polynomial-time solvable under some settings like certain unicast or multicast flows and NP-hard under others, with rigorous proofs establishing the boundaries.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Rethinking Dataset Distillation: Hard Truths about Soft Labels
Soft labels hide the value of high-quality data subsets in dataset distillation, and a new compute-aware method outperforms existing approaches in hard-label settings on ImageNet-1K.
- Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
- First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
A new utility-based framework optimizes performance-fairness trade-offs in decisions by modeling decision-maker and decision-subject utilities and using a social planner's utility to capture group inequalities under different justice principles.
- Sparse Contrastive Learning for Content-Based Cold Item Recommendation
SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.
- Fast and accurate AI-based pre-decoders for surface codes
AI pre-decoders achieve O(1 μs) per round decoding runtimes on GPUs for surface codes while improving logical error rates over global decoding alone and enabling data-driven noise weight estimation.