Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Pith reviewed 2026-05-12 04:25 UTC · model grok-4.3
The pith
On-policy distillation aligns better with ideal learning signals on the student's mistakes than on its successes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy.
What carries the argument
The ideal per-node gradient, estimated via a scalable targeted-rollout algorithm, together with the cosine-similarity gradient alignment score that measures how closely any distillation gradient matches it at the level of individual tokens.
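As a concrete illustration of this machinery, here is a minimal sketch of the alignment score, assuming per-token gradients are available as lists of parameter tensors (e.g., from torch.autograd.grad); the function name and tensor shapes are ours, not the paper's.

```python
import torch

def gradient_alignment_score(ideal_grads, distill_grads):
    """Cosine similarity between an estimated ideal gradient and a
    distillation gradient for one token position. Each argument is a
    list of per-parameter tensors; the score lies in [-1, 1], and higher
    values mean the distillation signal points closer to the ideal update.
    """
    ideal = torch.cat([g.flatten() for g in ideal_grads])
    distill = torch.cat([g.flatten() for g in distill_grads])
    return torch.nn.functional.cosine_similarity(ideal, distill, dim=0).item()
```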
If this is right
- Distillation provides denser and more useful supervision on tokens where the student would otherwise fail.
- Aggregate end-task metrics conceal large token-level differences in signal quality between correct and incorrect rollouts.
- No fixed choice of teacher model or context works best for every student size and task.
- Per-token, per-question diagnostics are required to decide when and where distillation should be applied.
Where Pith is reading between the lines
- Training loops could mask or down-weight distillation loss on low-alignment tokens to avoid injecting noise; a minimal sketch of such masking follows this list.
- The same diagnostic could be run online to switch teacher contexts dynamically as the student improves on different questions.
- The pattern may extend beyond distillation to other forms of dense supervision in reasoning training.
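A minimal sketch of the masking idea from the first item above, assuming precomputed per-token alignment scores and teacher logits; the threshold tau and the forward-KL form are our choices, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def masked_distillation_loss(student_logits, teacher_logits, align_scores, tau=0.2):
    """Per-token forward-KL distillation loss, zeroed on low-alignment tokens.

    student_logits, teacher_logits: (seq_len, vocab) tensors.
    align_scores: (seq_len,) precomputed gradient alignment scores.
    tau: alignment threshold below which a token's loss is masked out.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    p_teacher = log_p_teacher.exp()
    # KL(teacher || student) at each position.
    per_token_kl = (p_teacher * (log_p_teacher - log_p_student)).sum(-1)
    mask = (align_scores > tau).float()
    return (per_token_kl * mask).sum() / mask.sum().clamp_min(1.0)
```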
Load-bearing premise
The targeted-rollout algorithm accurately and scalably estimates the ideal per-node gradient that would maximally increase the student's success probability, even for long chains of intermediate thoughts.
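The paper's estimator is not reproduced here; as a stand-in, a generic score-function (REINFORCE) estimate of the gradient of success probability at a node, branching rollouts from the node's prefix, looks like the sketch below. The helper model_log_prob and the rollout_fn/success_fn contracts are our assumptions.

```python
import torch

def model_log_prob(model, prefix_ids, continuation_ids):
    """Assumed helper: summed log-probability of `continuation_ids` given
    `prefix_ids` under a causal LM whose forward pass returns .logits."""
    input_ids = torch.cat([prefix_ids, continuation_ids])
    logits = model(input_ids.unsqueeze(0)).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)
    token_logp = logp.gather(-1, input_ids[1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[len(prefix_ids) - 1:].sum()  # continuation tokens only

def estimate_ideal_gradient(model, prefix_ids, rollout_fn, success_fn, n_rollouts=16):
    """Monte Carlo estimate of grad_theta P(success | prefix) via the
    identity grad E[r] = E[r * grad log p]; a mean baseline reduces
    variance (at the cost of a small bias from reusing the samples).
    rollout_fn samples a continuation; success_fn returns 1.0 for a
    correct final answer and 0.0 otherwise.
    """
    rollouts = [rollout_fn(model, prefix_ids) for _ in range(n_rollouts)]
    rewards = [success_fn(c) for c in rollouts]
    baseline = sum(rewards) / n_rollouts
    acc = [torch.zeros_like(p) for p in model.parameters()]
    for cont, r in zip(rollouts, rewards):
        logp = model_log_prob(model, prefix_ids, cont)
        grads = torch.autograd.grad(logp, list(model.parameters()))
        for a, g in zip(acc, grads):
            a.add_((r - baseline) / n_rollouts * g)
    return acc
```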
What would settle it
Train one student with distillation applied uniformly and another with distillation applied only on tokens whose alignment score exceeds a threshold; if the selective version shows no gain in final accuracy or even underperforms, the claim that higher alignment identifies more useful signals is undermined.
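What that settling experiment could look like in code, reusing masked_distillation_loss from the sketch above: two arms that differ only in whether the per-token mask is applied. The data-loader contract (precomputed teacher logits and alignment scores per token) is assumed.

```python
import torch

def train_variant(model, data_loader, optimizer, selective=False, tau=0.2):
    """One epoch of distillation training; the uniform arm keeps every
    token, the selective arm masks tokens with alignment score <= tau.
    Each batch yields (input_ids, teacher_logits, align_scores)."""
    for input_ids, teacher_logits, align_scores in data_loader:
        student_logits = model(input_ids.unsqueeze(0)).logits[0]
        scores = align_scores if selective else torch.ones_like(align_scores)
        loss = masked_distillation_loss(student_logits, teacher_logits,
                                        scores, tau=tau if selective else 0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```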
Original abstract
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a training-free diagnostic framework for analyzing on-policy distillation in reasoning models. It defines an ideal per-node gradient as the parameter update that maximally increases the student's probability of success on a given rollout, develops a scalable targeted-rollout algorithm to estimate this gradient for long thought chains, and uses the cosine similarity between this ideal gradient and actual distillation gradients (the gradient alignment score) to evaluate different teachers and contexts. The key observations are that alignment is substantially higher on incorrect rollouts than on correct ones (where the teacher's signal becomes noisy) and that no single distillation context is universally optimal; the best choice depends jointly on student capacity and task.
Significance. If the targeted-rollout estimator proves reliable, the framework offers a high-resolution, training-free tool for diagnosing when and why distillation helps or hurts at the per-token level. This could guide more efficient use of teacher signals in reasoning model training and reduce reliance on aggregate performance metrics from costly runs. The approach's emphasis on per-question, per-token analysis is a clear methodological strength.
Major comments (2)
- [Targeted-rollout algorithm description] Targeted-rollout algorithm (described after the ideal gradient derivation): the central observations on alignment differences rest on this estimator accurately approximating the true ideal per-node gradient for long intermediate-thought chains. The manuscript provides no error analysis, truncation bounds, or validation against exact computation on short chains, leaving open the possibility that rollout truncation or reward sparsity introduces systematic bias that could render the reported alignment gaps uninterpretable.
- [Experimental results] Experimental results section (alignment score comparisons): the claim of 'substantially higher alignment on incorrect rollouts' lacks accompanying variance estimates, statistical tests, or controls for rollout length and question difficulty. Without these, it is unclear whether the difference is robust or driven by a subset of cases.
Minor comments (2)
- [Ideal gradient derivation] The ideal per-node gradient definition would benefit from an explicit mathematical formulation (e.g., as an optimization problem over parameters) rather than a purely verbal description, to aid reproducibility.
- [Gradient alignment score] Notation for the gradient alignment score (cosine similarity) should be introduced with a clear equation and distinguished from any other similarity measures used in the experiments; a candidate formalization of both quantities is sketched after this list.
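One candidate formalization of the two quantities the minor comments ask for, in our notation rather than the paper's: the ideal gradient as the gradient of success probability at the node's prefix, and the alignment score as a cosine.

```latex
% Candidate formalization (our notation): the ideal per-node gradient at
% token position t, and the alignment score for a distillation gradient.
\[
  g^{\star}_{t} \;=\; \nabla_{\theta}\,\Pr\!\left[\text{success}\mid s_{\le t};\,\theta\right],
  \qquad
  \operatorname{align}\!\left(g^{\mathrm{distill}}_{t}\right) \;=\;
  \frac{\left\langle g^{\star}_{t},\, g^{\mathrm{distill}}_{t}\right\rangle}
       {\bigl\lVert g^{\star}_{t}\bigr\rVert_{2}\,
        \bigl\lVert g^{\mathrm{distill}}_{t}\bigr\rVert_{2}}\,.
\]
```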
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and outlining specific revisions to strengthen the analysis and presentation.
Point-by-point responses
- Referee: [Targeted-rollout algorithm description] Targeted-rollout algorithm (described after the ideal gradient derivation): the central observations on alignment differences rest on this estimator accurately approximating the true ideal per-node gradient for long intermediate-thought chains. The manuscript provides no error analysis, truncation bounds, or validation against exact computation on short chains, leaving open the possibility that rollout truncation or reward sparsity introduces systematic bias that could render the reported alignment gaps uninterpretable.
  Authors: We acknowledge that the manuscript does not include a formal error analysis or validation for the targeted-rollout estimator. To address this, we will add a dedicated subsection deriving truncation error bounds based on reward sparsity and chain length, along with empirical validation experiments on short thought chains where exact gradients can be computed by enumeration. These additions will quantify the approximation error and confirm it does not systematically bias the observed alignment differences between correct and incorrect rollouts. Revision: yes.
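What the promised short-chain validation could look like: enumerate all fixed-length continuations over a tiny vocabulary to get the exact gradient of success probability, then compare it (e.g., by cosine similarity) against the targeted-rollout estimate. model_log_prob is the assumed helper from the earlier sketch; fixed-length completions are a simplification.

```python
import itertools
import torch

def exact_success_gradient(model, prefix_ids, chain_len, vocab_size, success_fn):
    """Exact grad_theta P(success | prefix) by enumerating every
    continuation of exactly `chain_len` tokens; tractable only for tiny
    vocabularies and short chains. Only successful continuations
    contribute, since grad P = sum over successes of grad p(seq).
    """
    total = None
    for cont in itertools.product(range(vocab_size), repeat=chain_len):
        cont_ids = torch.tensor(cont)
        if success_fn(cont_ids) == 0.0:
            continue  # zero-reward sequences contribute nothing
        prob = model_log_prob(model, prefix_ids, cont_ids).exp()
        grads = torch.autograd.grad(prob, list(model.parameters()))
        total = list(grads) if total is None else [t + g for t, g in zip(total, grads)]
    return total
```

The estimator's error can then be read off as the cosine similarity (via gradient_alignment_score above) between this exact gradient and the Monte Carlo estimate.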
- Referee: [Experimental results] Experimental results section (alignment score comparisons): the claim of 'substantially higher alignment on incorrect rollouts' lacks accompanying variance estimates, statistical tests, or controls for rollout length and question difficulty. Without these, it is unclear whether the difference is robust or driven by a subset of cases.
  Authors: We agree that the experimental section would benefit from greater statistical rigor. In the revision, we will report variance estimates (standard errors across questions and seeds), include paired statistical tests (e.g., Wilcoxon signed-rank) to assess the significance of alignment differences, and add controls by stratifying results according to rollout length and question difficulty. These changes will demonstrate that the higher alignment on incorrect rollouts is robust and not an artifact of specific subsets. Revision: yes.
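The paired test named in the response, as a minimal sketch with scipy: one mean alignment score per question under each condition, compared with a one-sided Wilcoxon signed-rank test. The array contract is our assumption.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_alignment_test(align_incorrect, align_correct):
    """One-sided Wilcoxon signed-rank test that alignment on incorrect
    rollouts exceeds alignment on correct ones, paired per question.

    align_incorrect, align_correct: shape (n_questions,) arrays of each
    question's mean alignment score under the two conditions.
    """
    inc = np.asarray(align_incorrect)
    cor = np.asarray(align_correct)
    return wilcoxon(inc, cor, alternative="greater")  # (statistic, p-value)
```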
Circularity Check
No circularity in derivation chain
Full rationale
The paper defines the ideal per-node gradient directly from the objective of maximizing the student's success probability and quantifies alignment via standard cosine similarity, with no fitted parameters or self-referential predictions. The targeted-rollout algorithm is introduced as a scalable estimator and does not, by construction, reduce any central claim to its own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the framework is anchored to an external criterion (the student's success probability) and neither renames known results nor smuggles in assumptions via prior work.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the parameter update that maximally increases the student's probability of success defines the ideal gradient.
- Standard math: cosine similarity between gradients quantifies useful alignment for distillation.
Invented entities (2)
- ideal per-node gradient: no independent evidence
- targeted-rollout algorithm: no independent evidence
Reference graph
Works this paper leans on
- [1] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300
- [2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
- [3] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv:2503.20783
- [4] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. International Conference on Learning Representations
- [5] Distilling the Knowledge in a Neural Network. arXiv:1503.02531
- [6] Thinking Tokens for Language Modeling. arXiv preprint
- [7] MiniLLM: Knowledge Distillation of Large Language Models. arXiv:2306.08543
- [8] Qwen3 Technical Report. arXiv:2505.09388
- [9] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations
- [10] American Invitational Mathematics Examination
- [11] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. NAACL
- [12] Let's Verify Step by Step. arXiv:2305.20050
- [13] Solving Math Word Problems with Process- and Outcome-based Feedback. arXiv:2211.14275
- [14] Eligibility Traces for Off-Policy Policy Evaluation. International Conference on Machine Learning
- [15] Efficient Memory Management for Large Language Model Serving with PagedAttention. ACM SIGOPS Symposium on Operating Systems Principles
- [16] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv:2604.13016
- [17] Kim, Jeonghye; Luo, Xufang; Kim, Minbeom; Lee, Sangmook; Kim, Dohyung; Jeon, Jiwon; Li, Dongsheng; Yang, Yuqing. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of …
- [18] Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
- [19] Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Advances in Neural Information Processing Systems
- [20] Learning by Distilling Context. arXiv:2209.15189, 2022
- [21] Reinforcement Learning via Self-Distillation. arXiv:2601.20802
- [22] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv:2601.18734
- [23] On-Policy Context Distillation for Language Models. arXiv:2602.12275
- [24] Self-Distillation Enables Continual Learning. arXiv:2601.19897
- [25] Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation. arXiv:2602.12125
- [26] MiMo-V2-Flash Technical Report. arXiv:2601.02780
- [27] Zeng, Aohan; Lv, Xin; Hou, Zhenyu; Du, Zhengxiao; Zheng, Qinkai; Chen, Bin; Yin, Da; Ge, Chendi; Xie, Chengxing; Wang, Cunxiang; et al.
- [28] Privileged Information Distillation for Language Models. arXiv:2602.04942, 2026
- [29] On the Efficacy of Knowledge Distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision
- [30] Improved Knowledge Distillation via Teacher Assistant. Proceedings of the AAAI Conference on Artificial Intelligence
- [31] Distillation Scaling Laws. arXiv:2502.08606
- [32] Small Models Struggle to Learn from Strong Reasoners. Findings of the Association for Computational Linguistics: ACL 2025
- [33] Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Lingjun; et al.