pith. sign in

Canonical reference

TIP: Token Importance in On-Policy Distillation

Canonical reference. 80% of citing Pith papers cite this work as background.

9 Pith papers citing it
Background 80% of classified citations
abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.

citation-role summary

background 4 method 1

citation-polarity summary

years

2026 9

verdicts

UNVERDICTED 9

representative citing papers

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

citing papers explorer

Showing 9 of 9 citing papers.

  • Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation cs.LG · 2026-05-26 · unverdicted · none · ref 1 · internal anchor

    Token teachability, based on local compatibility of teacher and student distributions, predicts on-policy distillation gains better than raw KL disagreement and enables TA-OPD to match or exceed full-token performance with 5% tokens across Qwen models.

  • Not only where, But when: Temporal Scheduling for RLVR cs.LG · 2026-05-25 · unverdicted · none · ref 27 · internal anchor

    Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

  • TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.

  • The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs cs.LG · 2026-05-09 · unverdicted · none · ref 42 · internal anchor

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  • Rubric-based On-policy Distillation cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

  • ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents cs.AI · 2026-06-26 · unverdicted · none · ref 22 · internal anchor

    ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.

  • SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 41 · internal anchor

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  • SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 39 · 2 links · internal anchor

    SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

  • Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 23 · 3 links · internal anchor

    On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.