
arxiv: 2605.11182 · v2 · pith:VFLFREZPnew · submitted 2026-05-11 · 💻 cs.AI

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Pith reviewed 2026-05-13 02:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-policy distillation · on-policy self-distillation · large language models · distribution mismatch · privileged information · reverse KL · mathematical reasoning · model alignment

The pith

On-policy distillation can fail in LLMs because of distribution mismatch, biased TopK reverse-KL gradients, and aggregation over privileged information, but targeted fixes mitigate these failures in the tested settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies on-policy distillation and self-distillation, methods that supervise large language models using trajectories sampled from the model itself. It shows these approaches produce mixed results because of three concrete failure mechanisms rather than working reliably across tasks. The mechanisms are a mismatch when the teacher conditions on student-generated prefixes, unstable optimization from biased TopK reverse-KL gradient estimates, and the student's lack of access at test time to the instance-specific privileged information the teacher saw during self-distillation. The authors demonstrate that simple changes to the loss, teacher adaptation, and student initialization address these problems in their tested settings. Readers care because the findings give practical rules for deciding when and how to apply distillation without external data.

Core claim

On-policy distillation on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas on-policy self-distillation fails in the tested settings because instance-specific privileged information (PI) is absent at test time. The three failure mechanisms are distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, optimization instability from biased TopK reverse-KL gradients, and an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers. In contrast, OPSD succeeds when PI represents a shared latent rule such as a system prompt. Stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.

What carries the argument

The three failure mechanisms in on-policy distillation—distribution mismatch from student-generated prefixes, biased TopK reverse-KL gradients, and PI-free policy aggregation in OPSD—together with the mitigations of stop-gradient TopK, RLVR teachers, and SFT stabilization.
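The Figure 4 caption on this page ties one collapse to an unnormalized Top-20 reverse KL. The sketch below is an illustration only, not the paper's objective or its gradient analysis: it assumes the Top-K set is chosen by student probability, uses made-up three-token distributions, and compares the full reverse KL with a truncated sum and a renormalized variant. The helper names `reverse_kl` and `top_k_indices` are hypothetical.

```python
import math

def reverse_kl(q, p, support=None, renormalize=False):
    """sum_i q_i * log(q_i / p_i), optionally restricted to the indices in `support`."""
    idx = list(range(len(q))) if support is None else list(support)
    q_s = [q[i] for i in idx]
    p_s = [p[i] for i in idx]
    if renormalize:
        # Rescale both truncated vectors so each sums to one again.
        zq, zp = sum(q_s), sum(p_s)
        q_s = [x / zq for x in q_s]
        p_s = [x / zp for x in p_s]
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q_s, p_s) if qi > 0)

def top_k_indices(probs, k):
    """Indices of the k largest probabilities (assumption: ranked by the student)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Toy distributions over a three-token vocabulary (illustrative values only).
student = [0.5, 0.3, 0.2]
teacher = [0.6, 0.35, 0.05]
top2 = top_k_indices(student, 2)

full = reverse_kl(student, teacher)
truncated = reverse_kl(student, teacher, support=top2)
renormalized = reverse_kl(student, teacher, support=top2, renormalize=True)

print(f"full={full:.4f} truncated={truncated:.4f} renormalized={renormalized:.6f}")
```

In this toy case the truncated sum is negative even though the full reverse KL is positive, so the unnormalized Top-K quantity is not a proper divergence; renormalizing over the Top-K set restores non-negativity. Whether this is the same effect the paper calls gradient bias is not established here.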

If this is right

  • OPD performance varies sharply with the choice of teacher and the exact loss formulation in reasoning tasks.
  • OPSD succeeds for shared latent rules like system prompts or alignment preferences but cannot capture instance-specific PI.
  • Stop-gradient applied to TopK objectives removes the source of optimization instability.
  • RLVR-adapted teachers and SFT-stabilized students prevent the identified failure modes from appearing.
  • The methods internalize shared information reliably but require additional handling when PI varies per instance.
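The contrast between shared and instance-specific PI can be illustrated numerically under an idealized assumption: at a fixed context, a PI-free student minimizes a weighted average of reverse KL to several PI-conditioned teachers. The minimizer of that objective is the normalized geometric mean of the teachers, which is a standard result. The function name `pi_free_optimum` and the toy distributions are hypothetical, and the paper's exact objective is not reproduced.

```python
import math

def pi_free_optimum(teachers, weights=None):
    """Distribution q minimizing sum_I w_I * KL(q || p_I): the normalized geometric mean."""
    n = len(teachers)
    weights = weights or [1.0 / n] * n
    vocab = len(teachers[0])
    # Weighted average of teacher log-probabilities for each output token.
    log_q = [sum(w * math.log(p[y]) for w, p in zip(weights, teachers))
             for y in range(vocab)]
    z = sum(math.exp(v) for v in log_q)
    return [math.exp(v) / z for v in log_q]

# Shared latent rule: every PI value leads the teacher to the same behavior.
shared_rule = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
# Instance-specific PI: each PI value points the teacher at a different answer.
instance_specific = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]

q_shared = pi_free_optimum(shared_rule)
q_instance = pi_free_optimum(instance_specific)

print([round(x, 3) for x in q_shared])
print([round(x, 3) for x in q_instance])
```

When the teachers agree, the PI-free student recovers the shared behavior. When they disagree, the aggregate keeps only what all teachers support, so it commits to neither teacher's answer.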

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mismatch and gradient issues may appear in other on-policy training loops that mix teacher and student outputs.
  • Combining the fixes with existing post-training pipelines could reduce reliance on large supervised datasets for model improvement.
  • Repeating the experiments at larger model scales would test whether the three mechanisms remain dominant or new interactions emerge.
  • Training pipelines could adopt SFT stabilization as a default first step before attempting on-policy distillation steps.

Load-bearing premise

The tested settings of mathematical reasoning trajectories and system-prompt or alignment privileged information are representative enough that the three failure mechanisms and fixes will apply to other LLM tasks, model scales, and data distributions.

What would settle it

Apply the proposed fixes to a new task requiring instance-specific privileged information, such as personalized multi-turn dialogue, and measure whether performance still degrades relative to a teacher baseline or improves as predicted.

Figures

Figures reproduced from arXiv: 2605.11182 by Ge Liu, Hongyu Lu, Siqi Zhu, Weiye Shi, Xuyan Ye.

Figure 1: Overview. We map the OP(S)D design space (left, top) and its task-dependent success/failure behavior (left, bottom), identify three failure mechanisms (middle): prefix-distorted teacher state, biased Top-K reverse-KL, and PI-marginalized OPSD policy, and propose practical fixes (right): stable Top-K losses, SFT stabilization, and RLVR-adapted teachers.
Figure 2: (Left) On-Policy (Self-)Distillation. In OPSD, the teacher is constructed from the student itself and privileged information (PI) is necessary. In OPD, the teacher is a stronger model and PI is optional. (Right) p: teacher distribution, q: student distribution. Reverse KL is mode-seeking, whereas forward KL is mode-covering.
Figure 3: Qwen3-1.7B, trained on OpenThoughts. OPSD fails to improve student.
Figure 4: Collapse under unnormalized Top-20 reverse KL. The model first becomes verbose, then degenerates into repetitive “maybe” outputs as response length reaches the limit and evaluation accuracy drops. Token statistics show that repetitive tokens dominate as the repeat ratio approaches one.
Figure 5: Training reward (left) and evaluation score (right) curves for OPSD, GRPO, and PPO on …
Figure 6: Comparison of GRPO and OPSD on Qwen3-8B (thinking mode) trained with DAPO-Math …
Figure 7: Train and evaluate Qwen3-1.7B (nothink) on Wildguardmix using their original train and …
Figure 9: Effectiveness of OPSD depends on the structure of privileged information I.
Figure 10: PI does not improve OPD on math reasoning with a stronger teacher. Using a Qwen3-8B teacher and a Qwen3-1.7B student on OpenThoughts, both final-answer PI and full-response PI underperform vanilla OPD. PI-conditioned OPD leads to higher KL loss.
Figure 11: Teacher: Qwen3-1.7B-GRPO (nothink), Student: Qwen3-1.7B (nothink), DAPO, TopK=5.
Figure 12: Whether to put the distillation loss in policy gradient? sampled token KL in policy gradient …
Figure 13: Dataset: OpenThoughts. Left: Qwen3-8B and Qwen3-1.7B-GRPO have similar math reasoning performance. Middle: In OPD, Qwen3-1.7B-GRPO is a more effective teacher. Right: Qwen3-1.7B-GRPO’s Top20 vocabulary distribution is more aligned with the Qwen3-1.7B student.
Figure 14: Qwen3-4B teacher, Qwen3-1.7B-Base student, OpenThoughts.
Figure 15: Teacher: Qwen3-1.7B-GRPO (nothink), Student: Qwen3-1.7B (nothink), training data: …
Figure 16: Comparison of teacher signal on responses generated by different student models.
Figure 17: Comparison of token-level KL supervision distributions for correct and incorrect student …
Figure 18: We show a token-level heatmap of ∆logprob on the last 128 tokens. The experiment is based on OpenThoughts [22]; we show an example question. PI strengthens supervision for the same teacher, yet the sampled-token supervision distribution depends more on teacher capability (as shown in the figure, 3 experiments using a Qwen3-8B teacher show similar distributions, while 2 experiments using a Qwen3-1.7B teacher show an…
Figure 19: General reasoning results of OPD training. The experiment uses the Science subset of …
Figure 20: Comparison of teacher signals on general reasoning trajectories.
Figure 21: Next-token log probs (left), truncated ratio (middle) and evaluation results (right) curves …
Figure 22: An example of thinking mode hacking during OPSD. The student is trained with thinking mode disabled, while the teacher is queried with reasoning enabled. During training, the student gradually learns to emit explicit thinking-mode control tokens in its response, even though such tokens are not intended to appear at inference time.
Figure 23: ∆logprob - token entropy. Teacher: Qwen3-8B w/ PI.
Original abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
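The Figure 2 caption notes that reverse KL is mode-seeking while forward KL is mode-covering. A small numeric sketch of that contrast follows. The bimodal teacher `p` and the two candidate students `q_seek` and `q_cover` are made-up values chosen for illustration, not taken from the paper.

```python
import math

def kl(a, b):
    """KL(a || b) = sum_i a_i * log(a_i / b_i) for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Teacher with two strong modes and a nearly empty middle token.
p = [0.495, 0.01, 0.495]

# Candidate students: one commits to a single mode, one spreads its mass.
q_seek = [0.96, 0.02, 0.02]
q_cover = [0.34, 0.32, 0.34]

for name, q in [("q_seek", q_seek), ("q_cover", q_cover)]:
    reverse = kl(q, p)   # reverse KL: expectation under the student
    forward = kl(p, q)   # forward KL: expectation under the teacher
    print(f"{name}: reverse={reverse:.3f} forward={forward:.3f}")
```

Between these two candidates, reverse KL scores the single-mode student better because it pays little for ignoring a teacher mode but a lot for placing mass where the teacher has almost none. Forward KL prefers the spread-out student for the opposite reason.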

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a comprehensive empirical study of on-policy distillation (OPD) and on-policy self-distillation (OPSD) for LLMs. It identifies three failure mechanisms—distribution mismatch from student-generated prefixes, optimization instability from biased TopK reverse-KL gradients, and OPSD-specific aggregation of PI-conditioned teachers into a PI-free policy when PI is instance-specific—and shows that these explain mixed prior results. The work focuses on mathematical reasoning trajectories and shared-latent PI (e.g., system prompts or alignment preferences), proposing and validating fixes via stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students, with ablations on teacher choice, loss formulation, and PI type.

Significance. If the mechanisms and fixes hold, this provides mechanistic insight into why OPD/OPSD results have been inconsistent, offering practical guidance for LLM post-training. The structured ablations and identification of specific pitfalls represent a useful contribution to understanding dense token-level supervision on self-generated trajectories. However, the restriction to math reasoning and shared PI settings means the work's broader impact depends on whether these failure modes generalize.

major comments (2)
  1. [Abstract and experimental results] The central claim that the three identified failure mechanisms explain mixed prior results on OPD/OPSD rests on the tested regimes (mathematical reasoning trajectories and system-prompt/alignment PI) being representative. No experiments are reported on other domains (e.g., general language modeling, code generation, or larger-scale models), leaving open the possibility that different token distributions or optimization landscapes produce distinct dominant failure modes.
  2. [Abstract] The assertion that OPSD fails due to learning a PI-free policy that aggregates PI-conditioned teachers is load-bearing for the OPSD-specific limitation. However, the paper provides no quantitative measure (e.g., policy divergence or per-instance performance breakdown) of this aggregation effect, making it difficult to confirm that this is the primary cause rather than a symptom of other factors like data scale or conditioning.
minor comments (2)
  1. [Abstract] The abstract introduces OPD, OPSD, and PI without initial expansions or a brief definition, which reduces accessibility for readers outside the immediate subfield.
  2. [Abstract] The description of the fixes (stop-gradient TopK, RLVR teachers, SFT stabilization) would benefit from a short summary table comparing their effects across the ablations to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address each major comment point by point below, indicating the revisions we will make.

  1. Referee: [Abstract and experimental results] The central claim that the three identified failure mechanisms explain mixed prior results on OPD/OPSD rests on the tested regimes (mathematical reasoning trajectories and system-prompt/alignment PI) being representative. No experiments are reported on other domains (e.g., general language modeling, code generation, or larger-scale models), leaving open the possibility that different token distributions or optimization landscapes produce distinct dominant failure modes.

    Authors: We agree that the representativeness of our tested regimes is central to the broader claims. Mathematical reasoning was selected as the primary domain because it permits clean isolation of instance-specific versus shared privileged information, enabling precise diagnosis of the three failure mechanisms. We acknowledge that the absence of experiments on domains such as code generation or general language modeling leaves open the possibility of different dominant failure modes. In the revision we will expand the Limitations and Future Work section to explicitly discuss this scope limitation, qualify the central claim accordingly, and outline why the identified mechanisms (prefix mismatch, biased TopK gradients, and PI aggregation) are expected to be relevant beyond math while calling for targeted follow-up studies. revision: partial

  2. Referee: [Abstract] The assertion that OPSD fails due to learning a PI-free policy that aggregates PI-conditioned teachers is load-bearing for the OPSD-specific limitation. However, the paper provides no quantitative measure (e.g., policy divergence or per-instance performance breakdown) of this aggregation effect, making it difficult to confirm that this is the primary cause rather than a symptom of other factors like data scale or conditioning.

    Authors: We thank the referee for this observation. The current manuscript supports the aggregation claim through comparative performance results and qualitative policy analysis in Section 4.3, but we agree that direct quantitative evidence would strengthen the argument. In the revised version we will add explicit metrics, including estimates of policy divergence (e.g., token-level KL between the student policy and each PI-conditioned teacher) and per-instance performance breakdowns that contrast shared-PI versus instance-specific-PI settings. These additions will help isolate the aggregation effect from confounding factors such as data scale. revision: yes
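The divergence metric mentioned in this response could be computed roughly as follows. This is a minimal sketch under stated assumptions: the KL direction (student relative to teacher) is a guess because the response does not specify it, the per-position distributions are toy values, and the names `token_kl`, `mean_token_kl`, `pi_a`, and `pi_b` are hypothetical.

```python
import math

def token_kl(q, p):
    """KL(q || p) at one token position (assumed direction: student q, teacher p)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def mean_token_kl(student_steps, teacher_steps):
    """Average token-level KL over the positions of one trajectory."""
    per_token = [token_kl(q, p) for q, p in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Next-token distributions at two positions of one student-sampled trajectory.
student = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]

# The same positions scored by teachers conditioned on two different PI values.
teachers = {
    "pi_a": [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]],
    "pi_b": [[0.1, 0.2, 0.7], [0.1, 0.4, 0.5]],
}

report = {name: mean_token_kl(student, steps) for name, steps in teachers.items()}
for name, value in report.items():
    print(f"{name}: mean token KL = {value:.4f}")
```

A per-teacher table like `report`, split by shared versus instance-specific PI, is one way to show whether the student sits close to every PI-conditioned teacher or only to some of them.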

Circularity Check

0 steps flagged

No circularity: purely empirical identification of failure modes


The paper presents a comprehensive empirical study of on-policy distillation and self-distillation, identifying three failure mechanisms and mitigation strategies through direct experiments on mathematical reasoning trajectories and system-prompt/alignment settings. No derivation chain, first-principles prediction, or mathematical reduction is claimed; all central claims rest on observed experimental comparisons (e.g., sensitivity to teacher choice, loss formulation, and presence/absence of instance-specific PI). No self-citations, fitted parameters renamed as predictions, or ansatzes are load-bearing. The analysis is self-contained against the reported benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a purely empirical study; no new mathematical objects, fitted constants, or unverified theoretical entities are introduced. All claims rest on experimental observations under standard LLM training assumptions.

axioms (1)
  • domain assumption: Standard assumptions in supervised fine-tuning, reinforcement learning with verifiable rewards, and KL-regularized distillation hold for the loss formulations and sampling procedures used.
    The study applies common loss functions and sampling without deriving or validating them from first principles.

pith-pipeline@v0.9.0 · 5542 in / 1400 out tokens · 50894 ms · 2026-05-13T02:13:55.930380+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    cs.LG 2026-06 unverdicted novelty 6.0

    RLCSD contrasts teacher-student distributional gaps under correct versus wrong hints to suppress privilege-induced style drift and concentrate supervision on task tokens, outperforming GRPO and prior OPSD on Qwen3 and...

  2. Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

    cs.CV 2026-06 unverdicted novelty 6.0

    Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

  3. A Formula-Driven Survey and Research Agenda for On-Policy Distillation

    cs.AI 2026-06 unverdicted novelty 4.0

    A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.