Sequence-level knowledge distillation

Yoon Kim, Alexander M Rush · 2016

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

cs.CL · 2026-02-03 · unverdicted · novelty 7.0

A learned transformation matrix minimizes the conditional mutual information (CMI) in teacher logits, degrading distillation performance while preserving task accuracy.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.

citing papers explorer

Showing 3 of 3 citing papers.

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

cs.CL · 2026-02-03 · unverdicted · none · ref 5

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · none · ref 8

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

cs.LG · 2026-05-21 · unverdicted · none · ref 13