Sequence-level knowledge distillation

Yoon Kim, Alexander M Rush · 2016 · arXiv 1606.07947

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Accelerating Large Language Model Decoding with Speculative Sampling

cs.CL · 2023-02-02 · accept · novelty 7.0

Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

TIP: Token Importance in On-Policy Distillation

cs.LG · 2026-04-15 · conditional · novelty 6.0

In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

citing papers explorer

Showing 3 of 3 citing papers.

Accelerating Large Language Model Decoding with Speculative Sampling cs.CL · 2023-02-02 · accept · none · ref 11
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
TIP: Token Importance in On-Policy Distillation cs.LG · 2026-04-15 · conditional · none · ref 8
In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing cs.LG · 2026-05-07 · unverdicted · none · ref 10
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

Sequence-level knowledge distillation

fields

years

verdicts

representative citing papers

citing papers explorer