pith. machine review for the scientific record.

TIP: Token Importance in On-Policy Distillation

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
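
To make the selection rule concrete, below is a minimal PyTorch sketch of a TIP-style token-selective distillation loss over a student rollout: per-token student entropy and teacher-student divergence pick out the two regions named in the abstract, and the loss is averaged over only the selected positions. The function name, the threshold values, and the choice of a reverse-KL objective are illustrative assumptions, not the paper's reported configuration.

import torch
import torch.nn.functional as F

def tip_masked_opd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        entropy_thresh: float = 1.0,
                        divergence_thresh: float = 1.0) -> torch.Tensor:
    """Token-selective on-policy distillation loss (illustrative sketch).

    Keeps the two regions from the abstract's two-axis taxonomy:
      (1) high student-entropy positions, and
      (2) low-entropy positions with high teacher-student divergence,
          i.e. where the student is confidently wrong.
    Thresholds and the reverse-KL objective are assumptions, not the
    paper's reported settings.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)        # [batch, seq, vocab]
    log_p_t = F.log_softmax(teacher_logits, dim=-1)        # teacher logits come from a frozen model
    p_s = log_p_s.exp()

    # Axis 1: per-token student entropy H(p_s).
    entropy = -(p_s * log_p_s).sum(dim=-1)                 # [batch, seq]

    # Axis 2: per-token teacher-student divergence, here KL(p_s || p_t).
    divergence = (p_s * (log_p_s - log_p_t)).sum(dim=-1)   # [batch, seq]

    # Select high-entropy tokens plus low-entropy, high-divergence tokens.
    high_entropy = entropy >= entropy_thresh
    confident_but_wrong = (entropy < entropy_thresh) & (divergence >= divergence_thresh)
    mask = (high_entropy | confident_but_wrong).float()

    # Distill only on the selected positions; the rest contribute no gradient.
    return (divergence * mask).sum() / mask.sum().clamp_min(1.0)

In a full OPD loop, student_logits would be re-scored over the student's own rollout and teacher_logits obtained from the frozen teacher on the same token sequence. A memory-efficient implementation would avoid materializing teacher logits for unselected positions rather than masking after the fact; that detail is omitted here.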

years: 2026 (6) · verdicts: unverdicted (6)

representative citing papers

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses, generating scoring rubrics from contrasts and using them for on-policy optimization; it achieves superior performance to logit-based approaches with up to 10x better sample efficiency.

citing papers explorer

Showing 6 of 6 citing papers.

  • TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment cs.AI · 2026-05-11 · unverdicted · none · ref 19

    TRACE improves math reasoning by distilling only on annotator-marked critical spans, applying forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere; it gains 2.76 points over GRPO while preserving OOD performance.

  • The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs cs.LG · 2026-05-09 · unverdicted · none · ref 42

    On-policy distillation exhibits an extrapolation cliff at a closed-form threshold $\lambda^*(p,b,c)$ set by the teacher's modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  • Rubric-based On-policy Distillation cs.LG · 2026-05-08 · unverdicted · none · ref 17

    Rubric-based on-policy distillation allows training student models using only teacher responses, generating scoring rubrics from contrasts and using them for on-policy optimization; it achieves superior performance to logit-based approaches with up to 10x better sample efficiency.

  • Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 104 · 2 links

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  • SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 41

    SOD reweights on-policy distillation strength step by step using divergence to stabilize tool use in small language model agents, yielding gains of up to 20.86% and a 26.13% score on AIME 2025 for a 0.6B model.

  • SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 39

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation tasks.