Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions
3 papers cite this work.
Fields: cs.CL · 2026

3 representative citing papers:
- Evolving Knowledge Distillation for Lightweight Neural Machine Translation
  EKD trains lightweight NMT students progressively along a chain of teachers of rising capacity, achieving BLEU scores within 0.08 points of the largest teacher on IWSLT-14 (a schematic sketch of teacher-chain distillation follows this list).
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, keeping memory use constant while matching or beating standard LLM performance (see the gated-cache sketch below).
- ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
  ReAD applies a contextual bandit to allocate a fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods (see the bandit-allocation sketch below).
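
To make the summaries above concrete, here is a minimal sketch of progressive teacher-chain distillation as described for EKD: the student is distilled against a sequence of teachers of rising capacity, one after another. The model call signature, temperature, and training loop are illustrative assumptions, not EKD's actual implementation.

```python
# Hypothetical sketch of progressive teacher-chain distillation (not EKD's actual code).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One step: match the student's logits to one teacher's soft targets via KL divergence."""
    with torch.no_grad():
        teacher_logits = teacher(batch["src"], batch["tgt_in"])   # assumed NMT forward signature
    student_logits = student(batch["src"], batch["tgt_in"])
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train_progressively(student, teachers, loader, optimizer, epochs_per_teacher=1):
    """Distill from teachers of rising capacity, one after another (e.g. small -> medium -> large)."""
    for teacher in teachers:
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for batch in loader:
                distill_step(student, teacher, batch, optimizer)
```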
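The MELT summary mentions a single gated KV cache shared across loop iterations. The sketch below shows one plausible form of such a gate, assuming a sigmoid blend of cached and fresh key/value states collapsed into a single tensor per layer; the real per-head cache layout, gating mechanism, and the two-phase chunk-wise distillation from Ouro are not reproduced here.

```python
# Rough sketch of a gated, shared KV cache across loop iterations (assumed mechanism, not MELT's code).
import torch
import torch.nn as nn

class GatedSharedKVCache(nn.Module):
    """One K/V buffer per layer, blended with the new K/V by a learned gate at every loop iteration,
    so memory stays constant regardless of loop depth."""

    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def update(self, cached_kv, new_kv):
        # cached_kv, new_kv: (batch, seq, d_model); K and V are merged here for simplicity.
        if cached_kv is None:
            return new_kv
        g = torch.sigmoid(self.gate(torch.cat([cached_kv, new_kv], dim=-1)))
        return g * new_kv + (1 - g) * cached_kv
```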
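The ReAD summary describes bandit-driven allocation of a fixed token budget across capabilities. The sketch below simplifies this to a context-free epsilon-greedy bandit (the contextual part is omitted for brevity); the `distill_chunk` and `evaluate` callbacks are hypothetical placeholders, and the reward is taken to be the utility gain measured after each distillation chunk.

```python
# Illustrative bandit-style budget allocator (assumed setup, not ReAD's implementation).
import random

class BudgetBandit:
    """Epsilon-greedy bandit that decides which capability receives the next chunk of tokens."""

    def __init__(self, capabilities, epsilon=0.1):
        self.capabilities = list(capabilities)
        self.epsilon = epsilon
        self.value = {c: 0.0 for c in self.capabilities}   # running reward estimate per capability
        self.count = {c: 0 for c in self.capabilities}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.capabilities)
        return max(self.capabilities, key=lambda c: self.value[c])

    def update(self, capability, reward):
        self.count[capability] += 1
        n = self.count[capability]
        self.value[capability] += (reward - self.value[capability]) / n

def allocate(bandit, total_tokens, chunk, distill_chunk, evaluate):
    """Spend the fixed budget chunk by chunk; reward each choice with the measured utility gain."""
    spent = 0
    while spent + chunk <= total_tokens:
        cap = bandit.choose()
        before = evaluate(cap)
        distill_chunk(cap, chunk)            # run distillation on `chunk` tokens for this capability
        bandit.update(cap, evaluate(cap) - before)
        spent += chunk
```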