hub

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song, Mao Zheng · 2026 · cs.LG · arXiv 2604.00626

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

open full Pith review browse 19 citing papers arXiv PDF

abstract

As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as $f$-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained RL. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agentic distillation, and the growing overlap between knowledge distillation and RL.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

cs.AI · 2026-05-01 · accept · novelty 7.0 · 2 refs

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

GRAFT: Graph-Tokenized LLMs for Tool Planning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation tasks.

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

cs.AI · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

Knowledge Distillation Must Account for What It Loses

cs.LG · 2026-04-28 · unverdicted · novelty 4.0 · 2 refs

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

citing papers explorer

Showing 19 of 19 citing papers.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning cs.CL · 2026-05-09 · unverdicted · none · ref 31 · internal anchor
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs cs.LG · 2026-05-09 · unverdicted · none · ref 35 · internal anchor
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
Rubric-based On-policy Distillation cs.LG · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization cs.LG · 2026-05-06 · unverdicted · none · ref 17 · internal anchor
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL · 2026-05-02 · unverdicted · none · ref 33 · internal anchor
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI · 2026-05-01 · accept · none · ref 29 · 2 links · internal anchor
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation cs.CL · 2026-05-13 · unverdicted · none · ref 30 · internal anchor
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 70 · 2 links · internal anchor
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
GRAFT: Graph-Tokenized LLMs for Tool Planning cs.LG · 2026-05-12 · unverdicted · none · ref 26 · internal anchor
GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 36 · internal anchor
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation tasks.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models cs.CV · 2026-05-06 · unverdicted · none · ref 89 · internal anchor
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe cs.LG · 2026-05-05 · unverdicted · none · ref 33 · internal anchor
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training cs.LG · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair cs.AI · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 39 · internal anchor
UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence cs.AI · 2026-05-07 · unverdicted · none · ref 91 · 2 links · internal anchor
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
Knowledge Distillation Must Account for What It Loses cs.LG · 2026-04-28 · unverdicted · none · ref 15 · 2 links · internal anchor
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

A Survey of On-Policy Distillation for Large Language Models

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer