Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Pith reviewed 2026-05-12 04:25 UTC · model grok-4.3
The pith
On-policy distillation aligns better with ideal learning signals on the student's mistakes than on its successes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy.
What carries the argument
The ideal per-node gradient, estimated via a scalable targeted-rollout algorithm, together with the cosine-similarity gradient alignment score that measures how closely any distillation gradient matches it at the level of individual tokens.
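As a concrete illustration of this machinery, here is a minimal sketch of the alignment score, assuming per-token gradients are available as lists of parameter tensors (e.g., from torch.autograd.grad); the function name and tensor shapes are ours, not the paper's.

```python
import torch

def gradient_alignment_score(ideal_grads, distill_grads):
    """Cosine similarity between an estimated ideal gradient and a
    distillation gradient for one token position. Each argument is a
    list of per-parameter tensors; the score lies in [-1, 1], and higher
    values mean the distillation signal points closer to the ideal update.
    """
    ideal = torch.cat([g.flatten() for g in ideal_grads])
    distill = torch.cat([g.flatten() for g in distill_grads])
    return torch.nn.functional.cosine_similarity(ideal, distill, dim=0).item()
```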
If this is right
- Distillation provides denser and more useful supervision on tokens where the student would otherwise fail.
- Aggregate end-task metrics conceal large token-level differences in signal quality between correct and incorrect rollouts.
- No fixed choice of teacher model or context works best for every student size and task.
- Per-token, per-question diagnostics are required to decide when and where distillation should be applied.
Where Pith is reading between the lines
- Training loops could mask or down-weight distillation loss on low-alignment tokens to avoid injecting noise; a minimal sketch of such masking follows this list.
- The same diagnostic could be run online to switch teacher contexts dynamically as the student improves on different questions.
- The pattern may extend beyond distillation to other forms of dense supervision in reasoning training.
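A minimal sketch of the masking idea from the first item above, assuming precomputed per-token alignment scores and teacher logits; the threshold tau and the forward-KL form are our choices, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def masked_distillation_loss(student_logits, teacher_logits, align_scores, tau=0.2):
    """Per-token forward-KL distillation loss, zeroed on low-alignment tokens.

    student_logits, teacher_logits: (seq_len, vocab) tensors.
    align_scores: (seq_len,) precomputed gradient alignment scores.
    tau: alignment threshold below which a token's loss is masked out.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    p_teacher = log_p_teacher.exp()
    # KL(teacher || student) at each position.
    per_token_kl = (p_teacher * (log_p_teacher - log_p_student)).sum(-1)
    mask = (align_scores > tau).float()
    return (per_token_kl * mask).sum() / mask.sum().clamp_min(1.0)
```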
Load-bearing premise
The targeted-rollout algorithm accurately and scalably estimates the ideal per-node gradient that would maximally increase the student's success probability, even for long chains of intermediate thoughts.
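The paper's estimator is not reproduced here; as a stand-in, a generic score-function (REINFORCE) estimate of the gradient of success probability at a node, branching rollouts from the node's prefix, looks like the sketch below. The helper model_log_prob and the rollout_fn/success_fn contracts are our assumptions.

```python
import torch

def model_log_prob(model, prefix_ids, continuation_ids):
    """Assumed helper: summed log-probability of `continuation_ids` given
    `prefix_ids` under a causal LM whose forward pass returns .logits."""
    input_ids = torch.cat([prefix_ids, continuation_ids])
    logits = model(input_ids.unsqueeze(0)).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)
    token_logp = logp.gather(-1, input_ids[1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[len(prefix_ids) - 1:].sum()  # continuation tokens only

def estimate_ideal_gradient(model, prefix_ids, rollout_fn, success_fn, n_rollouts=16):
    """Monte Carlo estimate of grad_theta P(success | prefix) via the
    identity grad E[r] = E[r * grad log p]; a mean baseline reduces
    variance (at the cost of a small bias from reusing the samples).
    rollout_fn samples a continuation; success_fn returns 1.0 for a
    correct final answer and 0.0 otherwise.
    """
    rollouts = [rollout_fn(model, prefix_ids) for _ in range(n_rollouts)]
    rewards = [success_fn(c) for c in rollouts]
    baseline = sum(rewards) / n_rollouts
    acc = [torch.zeros_like(p) for p in model.parameters()]
    for cont, r in zip(rollouts, rewards):
        logp = model_log_prob(model, prefix_ids, cont)
        grads = torch.autograd.grad(logp, list(model.parameters()))
        for a, g in zip(acc, grads):
            a.add_((r - baseline) / n_rollouts * g)
    return acc
```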
What would settle it
Train one student with distillation applied uniformly and another with distillation applied only on tokens whose alignment score exceeds a threshold; if the selective version shows no gain in final accuracy or even underperforms, the claim that higher alignment identifies more useful signals is undermined.
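What that settling experiment could look like in code, reusing masked_distillation_loss from the sketch above: two arms that differ only in whether the per-token mask is applied. The data-loader contract (precomputed teacher logits and alignment scores per token) is assumed.

```python
import torch

def train_variant(model, data_loader, optimizer, selective=False, tau=0.2):
    """One epoch of distillation training; the uniform arm keeps every
    token, the selective arm masks tokens with alignment score <= tau.
    Each batch yields (input_ids, teacher_logits, align_scores)."""
    for input_ids, teacher_logits, align_scores in data_loader:
        student_logits = model(input_ids.unsqueeze(0)).logits[0]
        scores = align_scores if selective else torch.ones_like(align_scores)
        loss = masked_distillation_loss(student_logits, teacher_logits,
                                        scores, tau=tau if selective else 0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```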
Original abstract
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a training-free diagnostic framework for analyzing on-policy distillation in reasoning models. It defines an ideal per-node gradient as the parameter update that maximally increases the student's probability of success on a given rollout, develops a scalable targeted-rollout algorithm to estimate this gradient for long thought chains, and uses the cosine similarity between this ideal gradient and actual distillation gradients (the gradient alignment score) to evaluate different teachers and contexts. The key observations are that alignment is substantially higher on incorrect rollouts than on correct ones (where the teacher's signal becomes noisy) and that no single distillation context is universally optimal; the best choice depends jointly on student capacity and task.
Significance. If the targeted-rollout estimator proves reliable, the framework offers a high-resolution, training-free tool for diagnosing when and why distillation helps or hurts at the per-token level. This could guide more efficient use of teacher signals in reasoning model training and reduce reliance on aggregate performance metrics from costly runs. The approach's emphasis on per-question, per-token analysis is a clear methodological strength.
Major comments (2)
- [Targeted-rollout algorithm description] Targeted-rollout algorithm (described after the ideal gradient derivation): the central observations on alignment differences rest on this estimator accurately approximating the true ideal per-node gradient for long intermediate-thought chains. The manuscript provides no error analysis, truncation bounds, or validation against exact computation on short chains, leaving open the possibility that rollout truncation or reward sparsity introduces systematic bias that could render the reported alignment gaps uninterpretable.
- [Experimental results] Experimental results section (alignment score comparisons): the claim of 'substantially higher alignment on incorrect rollouts' lacks accompanying variance estimates, statistical tests, or controls for rollout length and question difficulty. Without these, it is unclear whether the difference is robust or driven by a subset of cases.
Minor comments (2)
- [Ideal gradient derivation] The ideal per-node gradient definition would benefit from an explicit mathematical formulation (e.g., as an optimization problem over parameters) rather than a purely verbal description, to aid reproducibility.
- [Gradient alignment score] Notation for the gradient alignment score (cosine similarity) should be introduced with a clear equation and distinguished from any other similarity measures used in the experiments; a candidate formalization of both quantities is sketched after this list.
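One candidate formalization of the two quantities the minor comments ask for, in our notation rather than the paper's: the ideal gradient as the gradient of success probability at the node's prefix, and the alignment score as a cosine.

```latex
% Candidate formalization (our notation): the ideal per-node gradient at
% token position t, and the alignment score for a distillation gradient.
\[
  g^{\star}_{t} \;=\; \nabla_{\theta}\,\Pr\!\left[\text{success}\mid s_{\le t};\,\theta\right],
  \qquad
  \operatorname{align}\!\left(g^{\mathrm{distill}}_{t}\right) \;=\;
  \frac{\left\langle g^{\star}_{t},\, g^{\mathrm{distill}}_{t}\right\rangle}
       {\bigl\lVert g^{\star}_{t}\bigr\rVert_{2}\,
        \bigl\lVert g^{\mathrm{distill}}_{t}\bigr\rVert_{2}}\,.
\]
```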
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and outlining specific revisions to strengthen the analysis and presentation.
Point-by-point responses
- Referee: [Targeted-rollout algorithm description] Targeted-rollout algorithm (described after the ideal gradient derivation): the central observations on alignment differences rest on this estimator accurately approximating the true ideal per-node gradient for long intermediate-thought chains. The manuscript provides no error analysis, truncation bounds, or validation against exact computation on short chains, leaving open the possibility that rollout truncation or reward sparsity introduces systematic bias that could render the reported alignment gaps uninterpretable.
  Authors: We acknowledge that the manuscript does not include a formal error analysis or validation for the targeted-rollout estimator. To address this, we will add a dedicated subsection deriving truncation error bounds based on reward sparsity and chain length, along with empirical validation experiments on short thought chains where exact gradients can be computed by enumeration. These additions will quantify the approximation error and confirm it does not systematically bias the observed alignment differences between correct and incorrect rollouts. Revision: yes.
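What the promised short-chain validation could look like: enumerate all fixed-length continuations over a tiny vocabulary to get the exact gradient of success probability, then compare it (e.g., by cosine similarity) against the targeted-rollout estimate. model_log_prob is the assumed helper from the earlier sketch; fixed-length completions are a simplification.

```python
import itertools
import torch

def exact_success_gradient(model, prefix_ids, chain_len, vocab_size, success_fn):
    """Exact grad_theta P(success | prefix) by enumerating every
    continuation of exactly `chain_len` tokens; tractable only for tiny
    vocabularies and short chains. Only successful continuations
    contribute, since grad P = sum over successes of grad p(seq).
    """
    total = None
    for cont in itertools.product(range(vocab_size), repeat=chain_len):
        cont_ids = torch.tensor(cont)
        if success_fn(cont_ids) == 0.0:
            continue  # zero-reward sequences contribute nothing
        prob = model_log_prob(model, prefix_ids, cont_ids).exp()
        grads = torch.autograd.grad(prob, list(model.parameters()))
        total = list(grads) if total is None else [t + g for t, g in zip(total, grads)]
    return total
```

The estimator's error can then be read off as the cosine similarity (via gradient_alignment_score above) between this exact gradient and the Monte Carlo estimate.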
- Referee: [Experimental results] Experimental results section (alignment score comparisons): the claim of 'substantially higher alignment on incorrect rollouts' lacks accompanying variance estimates, statistical tests, or controls for rollout length and question difficulty. Without these, it is unclear whether the difference is robust or driven by a subset of cases.
  Authors: We agree that the experimental section would benefit from greater statistical rigor. In the revision, we will report variance estimates (standard errors across questions and seeds), include paired statistical tests (e.g., Wilcoxon signed-rank) to assess the significance of alignment differences, and add controls by stratifying results according to rollout length and question difficulty. These changes will demonstrate that the higher alignment on incorrect rollouts is robust and not an artifact of specific subsets. Revision: yes.
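The paired test named in the response, as a minimal sketch with scipy: one mean alignment score per question under each condition, compared with a one-sided Wilcoxon signed-rank test. The array contract is our assumption.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_alignment_test(align_incorrect, align_correct):
    """One-sided Wilcoxon signed-rank test that alignment on incorrect
    rollouts exceeds alignment on correct ones, paired per question.

    align_incorrect, align_correct: shape (n_questions,) arrays of each
    question's mean alignment score under the two conditions.
    """
    inc = np.asarray(align_incorrect)
    cor = np.asarray(align_correct)
    return wilcoxon(inc, cor, alternative="greater")  # (statistic, p-value)
```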
Circularity Check
No circularity in derivation chain
Full rationale
The paper defines the ideal per-node gradient directly from the objective of maximizing the student's success probability and quantifies alignment via standard cosine similarity, with no fitted parameters or self-referential predictions. The targeted-rollout algorithm is introduced as a scalable estimator and does not, by construction, reduce any central claim to its own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the framework is anchored to an external criterion (the student's success probability) and neither renames known results nor smuggles in assumptions via prior work.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the parameter update that maximally increases the student's probability of success defines the ideal gradient.
- Standard math: cosine similarity between gradients quantifies useful alignment for distillation.
Invented entities (2)
- ideal per-node gradient: no independent evidence
- targeted-rollout algorithm: no independent evidence
Reference graph
Works this paper leans on
- [1] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300
- [2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
- [3] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv:2503.20783
- [4] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. International Conference on Learning Representations
- [5] Distilling the Knowledge in a Neural Network. arXiv:1503.02531
- [6] Thinking Tokens for Language Modeling. arXiv preprint
- [7] MiniLLM: Knowledge Distillation of Large Language Models. arXiv:2306.08543
- [8] Qwen3 Technical Report. arXiv:2505.09388
- [9] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations
- [10] American Invitational Mathematics Examination
- [11] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. NAACL
- [12] Let's Verify Step by Step. arXiv:2305.20050
- [13] Solving Math Word Problems with Process- and Outcome-based Feedback. arXiv:2211.14275
- [14] Eligibility Traces for Off-Policy Policy Evaluation. International Conference on Machine Learning
- [15] Efficient Memory Management for Large Language Model Serving with PagedAttention. ACM SIGOPS Symposium on Operating Systems Principles
- [16] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv:2604.13016
- [17] Kim, Jeonghye; Luo, Xufang; Kim, Minbeom; Lee, Sangmook; Kim, Dohyung; Jeon, Jiwon; Li, Dongsheng; Yang, Yuqing. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of …
- [18] Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
- [19] Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Advances in Neural Information Processing Systems
- [20] Learning by Distilling Context. arXiv:2209.15189, 2022
- [21] Reinforcement Learning via Self-Distillation. arXiv:2601.20802
- [22] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv:2601.18734
- [23] On-Policy Context Distillation for Language Models. arXiv:2602.12275
- [24] Self-Distillation Enables Continual Learning. arXiv:2601.19897
- [25] Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation. arXiv:2602.12125
- [26] MiMo-V2-Flash Technical Report. arXiv:2601.02780
- [27] Zeng, Aohan; Lv, Xin; Hou, Zhenyu; Du, Zhengxiao; Zheng, Qinkai; Chen, Bin; Yin, Da; Ge, Chendi; Xie, Chengxing; Wang, Cunxiang; et al.
- [28] Privileged Information Distillation for Language Models. arXiv:2602.04942, 2026
- [29] On the Efficacy of Knowledge Distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision
- [30] Improved Knowledge Distillation via Teacher Assistant. Proceedings of the AAAI Conference on Artificial Intelligence
- [31] Distillation Scaling Laws. arXiv:2502.08606
- [32] Small Models Struggle to Learn from Strong Reasoners. Findings of the Association for Computational Linguistics: ACL 2025
- [33] Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Lingjun; et al.