pith. machine review for the scientific record. sign in

hub

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it
abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.

hub tools

years

2026 16

representative citing papers

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 3 refs

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly 2x faster convergence on video reasoning benchmarks.

citing papers explorer

Showing 16 of 16 citing papers.