Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions
3 papers cite this work.
Fields: cs.CL · 2026

3 representative citing papers:
- Evolving Knowledge Distillation for Lightweight Neural Machine Translation
  EKD trains lightweight NMT students progressively along a chain of teachers of rising capacity, achieving BLEU scores within 0.08 points of the largest teacher on IWSLT-14 (a schematic sketch of teacher-chain distillation follows this list).
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, keeping memory use constant while matching or beating standard LLM performance (see the gated-cache sketch below).
- ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
  ReAD applies a contextual bandit to allocate a fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods (see the bandit-allocation sketch below).
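
To make the summaries above concrete, here is a minimal sketch of progressive teacher-chain distillation as described for EKD: the student is distilled against a sequence of teachers of rising capacity, one after another. The model call signature, temperature, and training loop are illustrative assumptions, not EKD's actual implementation.

```python
# Hypothetical sketch of progressive teacher-chain distillation (not EKD's actual code).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One step: match the student's logits to one teacher's soft targets via KL divergence."""
    with torch.no_grad():
        teacher_logits = teacher(batch["src"], batch["tgt_in"])   # assumed NMT forward signature
    student_logits = student(batch["src"], batch["tgt_in"])
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train_progressively(student, teachers, loader, optimizer, epochs_per_teacher=1):
    """Distill from teachers of rising capacity, one after another (e.g. small -> medium -> large)."""
    for teacher in teachers:
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for batch in loader:
                distill_step(student, teacher, batch, optimizer)
```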
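The MELT summary mentions a single gated KV cache shared across loop iterations. The sketch below shows one plausible form of such a gate, assuming a sigmoid blend of cached and fresh key/value states collapsed into a single tensor per layer; the real per-head cache layout, gating mechanism, and the two-phase chunk-wise distillation from Ouro are not reproduced here.

```python
# Rough sketch of a gated, shared KV cache across loop iterations (assumed mechanism, not MELT's code).
import torch
import torch.nn as nn

class GatedSharedKVCache(nn.Module):
    """One K/V buffer per layer, blended with the new K/V by a learned gate at every loop iteration,
    so memory stays constant regardless of loop depth."""

    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def update(self, cached_kv, new_kv):
        # cached_kv, new_kv: (batch, seq, d_model); K and V are merged here for simplicity.
        if cached_kv is None:
            return new_kv
        g = torch.sigmoid(self.gate(torch.cat([cached_kv, new_kv], dim=-1)))
        return g * new_kv + (1 - g) * cached_kv
```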
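The ReAD summary describes bandit-driven allocation of a fixed token budget across capabilities. The sketch below simplifies this to a context-free epsilon-greedy bandit (the contextual part is omitted for brevity); the `distill_chunk` and `evaluate` callbacks are hypothetical placeholders, and the reward is taken to be the utility gain measured after each distillation chunk.

```python
# Illustrative bandit-style budget allocator (assumed setup, not ReAD's implementation).
import random

class BudgetBandit:
    """Epsilon-greedy bandit that decides which capability receives the next chunk of tokens."""

    def __init__(self, capabilities, epsilon=0.1):
        self.capabilities = list(capabilities)
        self.epsilon = epsilon
        self.value = {c: 0.0 for c in self.capabilities}   # running reward estimate per capability
        self.count = {c: 0 for c in self.capabilities}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.capabilities)
        return max(self.capabilities, key=lambda c: self.value[c])

    def update(self, capability, reward):
        self.count[capability] += 1
        n = self.count[capability]
        self.value[capability] += (reward - self.value[capability]) / n

def allocate(bandit, total_tokens, chunk, distill_chunk, evaluate):
    """Spend the fixed budget chunk by chunk; reward each choice with the measured utility gain."""
    spent = 0
    while spent + chunk <= total_tokens:
        cap = bandit.choose()
        before = evaluate(cap)
        distill_chunk(cap, chunk)            # run distillation on `chunk` tokens for this capability
        bandit.update(cap, evaluate(cap) - before)
        spent += chunk
```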