Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
arXiv preprint arXiv:2502.08606 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7representative citing papers
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
DOPD: Dual On-policy Distillation
DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
-
GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.