Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
arXiv preprint arXiv:2502.08606 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7representative citing papers
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.