A Survey on Knowledge Distillation of Large Language Models

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li · 2024 · cs.CL · arXiv 2402.13116

Canonical reference. 83% of citing Pith papers cite this work as background.

42 Pith papers citing it

Background: 83% of classified citations

abstract

In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.

citation-role summary

background 6

citation-polarity summary

background 5 · support 1

representative citing papers

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

When Context Returns: Toward Robust Internalization in On-Policy Distillation

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

A stop-gradient consistency regularizer mitigates context-induced degradation in on-policy distillation, improving robustness across 12 configurations.

Escaping the KL Agreement Trap in On-Policy Distillation

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

KAT detects persistent low-KL agreement traps in on-policy distillation via a dynamic threshold to filter weak supervision, improving avg@k by 2.66% and pass@k by 3.43% on four math benchmarks while shortening rollouts by 59.73%.

When Does Model Collapse Occur in Structured Interactive Learning?

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning benchmarks under strong/weak initialization.

Logic-Regularized Verifier Elicits Reasoning from LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

Training a Student Expert via Semi-Supervised Foundation Model Distillation

cs.CV · 2026-04-04 · conditional · novelty 7.0

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

cs.SE · 2025-10-21 · conditional · novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8K at GPT-2 scale, with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

DuoMem: Towards Capable On-Device Memory Agents via Dual-Space Distillation

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

DuoMem distills from a 72B teacher to a 4B student via context space and parameter space, achieving 77.9% success on ALFWorld versus a 4.3% baseline.

ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

ARKD uses an RL policy network to adaptively balance forward KL (FKL) and reverse KL (RKL) in LLM distillation, claiming gains of 0.4-0.6 points on ROUGE-L and BERTScore over baselines.

Labeling Training Data for Entity Matching Using Large Language Models

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

LLM-labeled training sets for entity matching produce student models with F1 scores within 2 points of benchmark-trained models on five datasets at a cost of $28-41 versus 470 hours of manual work.

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.

Llamion Technical Report

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

A new conversion method (KEPT) transforms Orion-14B into Llama-format models while preserving benchmark performance using ~123M tokens of distillation.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

Firefly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.

OpenJarvis: Personal AI, On Personal Devices

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Training a Student Expert via Semi-Supervised Foundation Model Distillation

cs.CV · 2026-04-04 · conditional · none · ref 49 · internal anchor

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

cs.CV · 2026-04-03 · unverdicted · none · ref 20 · 2 links · internal anchor

DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.

Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction

cs.CV · 2026-07-02 · unverdicted · none · ref 33 · internal anchor

Distillation of a 688M-parameter MASt3R teacher yields up to 7x smaller students that retain most lunar reconstruction accuracy and outperform sparse-supervised baselines.

Cross-Resolution Semantic Transfer for Robust Text-to-Image Retrieval in Low-Resolution Surveillance

cs.CV · 2026-06-29 · unverdicted · none · ref 11 · internal anchor

CRST improves ultra-low-resolution text-to-image person retrieval by 5.7% Rank-1 and 5.3% mAP on average across three datasets while stabilizing mixed-resolution galleries.

A drone-based framework for coral habitat mapping via weakly supervised segmentation

cs.CV · 2025-08-26 · unverdicted · none · ref 38 · internal anchor

A multi-scale weakly supervised framework converts underwater point classifications into coarse masks to train UAV coral segmentation models, then refines them via self-training to reach 86.07% pixel accuracy and 52.23% mIoU without pixel-level labels.
