OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
hub
Hdpo: Hybrid distillation policy optimization via privileged self-distillation
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 14roles
background 3polarities
background 3representative citing papers
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
SocraticPO adds Socratic-style teacher guidance and reward decay to RL rollouts for LLMs, improving performance on scientific reasoning benchmarks over baselines.
Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe
MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.
ATESD introduces a Beta-policy controller that adapts teacher exposure ratio during LLM self-distillation training and reports gains over fixed-exposure baselines on math benchmarks.
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.
CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.
GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.
SDAR gates on-policy self-distillation signals into RL training to stabilize and improve multi-turn LLM agent performance on ALFWorld, WebShop, and Search-QA.
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
citing papers explorer
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
-
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
ATESD introduces a Beta-policy controller that adapts teacher exposure ratio during LLM self-distillation training and reports gains over fixed-exposure baselines on math benchmarks.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.