Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
hub Canonical reference
A Survey of On-Policy Distillation for Large Language Models
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as $f$-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained RL. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agentic distillation, and the growing overlap between knowledge distillation and RL.
hub tools
citation-role summary
citation-polarity summary
years
2026 28representative citing papers
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
citing papers explorer
No citing papers match the current filters.