hub Canonical reference

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

· 2026 · cs.LG · arXiv 2604.13016

Canonical reference. 100% of citing Pith papers cite this work as background.

53 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 53 citing papers arXiv PDF

abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 method 1

citation-polarity summary

background 10

representative citing papers

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Learning from the Self-future: On-policy Self-distillation for dLLMs

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

d-OPSD reframes on-policy self-distillation for dLLMs via suffix conditioning from self-generated answers and step-level supervision, outperforming RLVR and SFT on reasoning benchmarks with ~10% of the optimization steps.

Escaping the KL Agreement Trap in On-Policy Distillation

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

KAT detects persistent low-KL agreement traps in on-policy distillation via a dynamic threshold to filter weak supervision, improving avg@k by 2.66% and pass@k by 3.43% on four math benchmarks while shortening rollouts by 59.73%.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Reinforcement Learning from Rich Feedback with Distributional DAgger

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

cs.AI · 2026-05-01 · accept · novelty 7.0 · 2 refs

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

SEAD applies entropy-guided token selection, KL annealing, and easy-to-hard curriculum to on-policy distillation and reports +4.8 average accuracy gain over vanilla OPD on six math benchmarks with OLMo-3 models.

On the Position Bias of On-Policy Distillation

cs.LG · 2026-06-21 · unverdicted · novelty 6.0 · 2 refs

Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

ViCuR introduces recoverable visual cues as teacher privilege in multimodal on-policy distillation, yielding +1.19 to +1.24 average gains over answer-based baselines across seven benchmarks with Qwen3-VL students.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs cs.LG · 2026-05-09 · unverdicted · none · ref 24 · internal anchor
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer