On the Position Bias of On-Policy Distillation

Bo Chen; Sijie Zhu; Tiansheng Wen; Yan Xie; Yifei Wang

arxiv: 2606.22600 · v3 · pith:B32IYG73new · submitted 2026-06-21 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-29 05:12 UTC · model grok-4.3


keywords on-policy distillation · position bias · importance weighting · reinforcement learning · token-level supervision · distribution discrepancy · knowledge distillation · sequence generation


The pith

Importance weighting by accumulated student-teacher discrepancy corrects position bias in on-policy distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation averages token losses uniformly, but later tokens in student rollouts suffer degraded supervision as the student distribution drifts away from the teacher. The paper shows that training on only the first 30 percent of tokens performs nearly as well as training on all tokens, while training on only the last 30 percent barely learns anything. Through a constrained optimization analysis, the authors derive Importance-Weighted On-Policy Distillation, which assigns each token a weight based on the cumulative discrepancy up to that point. This naturally emphasizes early tokens and downweights later ones. The resulting method reaches higher performance faster than uniform OPD in both same-size and cross-scale settings.

Core claim

On-policy distillation exhibits position bias because student rollouts deviate progressively from the teacher distribution, so token-level supervision quality declines at later positions. Importance-Weighted On-Policy Distillation (IW-OPD) reweights each token by the accumulated discrepancy between the student's and teacher's distributions at that step, thereby upweighting earlier tokens with smaller deviations and downweighting later ones.

What carries the argument

Importance-Weighted On-Policy Distillation (IW-OPD), which sets the loss weight for each token to a function of the accumulated discrepancy between student and teacher output distributions.
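To make the weighting idea concrete, here is a minimal sketch rather than the paper's actual rule. It assumes the per-token discrepancy is the absolute gap between student and teacher log-probabilities of the sampled token, and that the weight decays exponentially in the running sum. The function names, the exponential form, and the temperature `alpha` are all hypothetical choices made for illustration.

```python
import math


def prefix_discrepancy_weights(student_logps, teacher_logps, alpha=0.5):
    """Weights that shrink as student-teacher discrepancy accumulates.

    ASSUMED form (not taken from the paper):
        w_t = exp(-alpha * sum_{s<=t} |log p_student(y_s) - log p_teacher(y_s)|)
    rescaled so the weights average to 1 over the sequence.
    """
    raw, running = [], 0.0
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        running += abs(s_lp - t_lp)  # unsigned gap, so the running sum never shrinks
        raw.append(math.exp(-alpha * running))
    if not raw:
        return []
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def weighted_token_loss(token_losses, weights):
    """Weighted mean of per-token losses; all-ones weights give the uniform average."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)


# Toy log-probabilities and losses for four sampled tokens (made-up numbers).
student_logps = [-0.2, -0.5, -1.0, -1.5]
teacher_logps = [-0.2, -0.9, -2.0, -3.5]
token_losses = [0.1, 0.4, 1.0, 2.0]

weights = prefix_discrepancy_weights(student_logps, teacher_logps)
uniform_loss = weighted_token_loss(token_losses, [1.0] * len(token_losses))
iw_loss = weighted_token_loss(token_losses, weights)
print(weights, uniform_loss, iw_loss)
```

Rescaling the weights to mean 1 is a design choice in this sketch: it keeps the overall loss scale comparable to the uniform average, so any difference comes from how the weight is distributed across positions.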

If this is right

IW-OPD converges significantly faster than standard OPD.
IW-OPD achieves better final performance than OPD in same-size teacher-student pairs.
IW-OPD also improves performance in cross-scale distillation settings.
Gains reach up to 6.9 points on the AIME-2025 benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same discrepancy-based reweighting could be applied to other sequence-level distillation or imitation settings where rollout length causes progressive drift.
Online computation of the discrepancy measure might allow the weighting to adapt during a single training run without extra passes.
If the bias is primarily a function of sequence position rather than task content, the weighting rule may transfer across different reinforcement-learning environments with long trajectories.

Load-bearing premise

The accumulated discrepancy between student and teacher distributions is a sufficient and unbiased proxy for the quality of token-level supervision.

What would settle it

A controlled run that replaces the IW-OPD weights with random weights drawn from the same distribution, keeps all other factors fixed, and checks whether the performance advantage over standard OPD disappears.
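One simple way to build such a control, sketched here under assumptions, is to permute each rollout's weights: the values (and hence their empirical distribution) are unchanged, but their link to position and to accumulated discrepancy is destroyed. The helper name `shuffled_weight_control` and the per-rollout shuffling scheme are illustrative, not something the paper describes.

```python
import random


def shuffled_weight_control(weights, seed=0):
    """Return the same weight values in a random order.

    The multiset of weights is preserved, so the control has the same
    weight distribution as the original but no positional structure.
    """
    rng = random.Random(seed)  # local RNG so the control is reproducible
    control = list(weights)    # copy; the caller's list is left untouched
    rng.shuffle(control)
    return control


# Example: decaying weights for an 8-token rollout (made-up numbers).
original = [1.6, 1.4, 1.2, 1.0, 0.9, 0.8, 0.6, 0.5]
control = shuffled_weight_control(original, seed=123)
print(control)
```

Training once with `original`-style weights and once with the shuffled control, with everything else held fixed, would isolate how much of any gain depends on where the weight lands.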

Figures

Figures reproduced from arXiv: 2606.22600 by Bo Chen, Sijie Zhu, Tiansheng Wen, Yan Xie, Yifei Wang.

**Figure 1.** Position Bias in OPD training. (a) With the same 30% token budget, training on the prefix part of each response matches or exceeds full-token Standard OPD, whereas training on the suffix part fails to learn effectively. Student: Qwen3-0.6B, Teacher: Qwen3-4B-Instruct-2507. (b) Teacher and student accuracy are measured by the probability of reaching a correct answer from a given student-generated prefix. St…

**Figure 2.** IW-OPD improves both sample efficiency and final performance. (a) AIME 2025 accuracy during training: IW-OPD converges faster and achieves better final performance than Standard OPD. (b) Final accuracy across student scales distilled from the same teacher; the IW-OPD advantage grows from +4.0% at 1.0× compression to +14.9% at 6.7×.

**Figure 3.** Position Bias phenomena in OPD. (a) The mean token-level KL decreases during OPD training but plateaus at a non-zero residual. (b) Token-level reverse KL before and after OPD training. (c) Sequence-level log-probabilities of student-sampled prefixes under the student and teacher. Student: Qwen3-0.6B; Teacher: Qwen3-4B-Instruct.

**Figure 4.** From signed prefix ratio to unsigned prefix discrepancy. (a) Directly using the ideal prefix ratio is sensitive to α. (b) Token-wise visualization shows the desired overall downward trend, but also local rebounds caused by signed cancellation. (c) Replacing signed accumulation with the unsigned weight gives a more stable weighting signal.
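The signed-cancellation effect described in the Figure 4 caption can be reproduced on toy numbers. The sketch below assumes the signed weight is the exponentiated running sum of teacher-minus-student log-probabilities and the unsigned weight uses the absolute gap instead; both forms, the function names, and `alpha` are assumptions for illustration, since the paper's exact definitions are not reproduced on this page.

```python
import math


def signed_prefix_weights(student_logps, teacher_logps, alpha=1.0):
    """ASSUMED signed form: exp(alpha * running sum of (teacher - student) log-probs)."""
    total, out = 0.0, []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        total += t_lp - s_lp  # can be positive or negative, so terms may cancel
        out.append(math.exp(alpha * total))
    return out


def unsigned_prefix_weights(student_logps, teacher_logps, alpha=1.0):
    """ASSUMED unsigned form: exp(-alpha * running sum of |teacher - student| gaps)."""
    total, out = 0.0, []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        total += abs(t_lp - s_lp)  # nonnegative, so the weight can only decay
        out.append(math.exp(-alpha * total))
    return out


def count_rebounds(weights):
    """Number of positions where the weight increases relative to the previous token."""
    return sum(1 for prev, cur in zip(weights, weights[1:]) if cur > prev)


# Made-up rollout: at the third token the teacher is MORE confident than the student.
student_logps = [-0.1, -0.3, -2.0, -0.2]
teacher_logps = [-0.5, -1.0, -0.5, -0.9]

signed = signed_prefix_weights(student_logps, teacher_logps)
unsigned = unsigned_prefix_weights(student_logps, teacher_logps)
print(count_rebounds(signed), count_rebounds(unsigned))
```

In this toy case the single token where the teacher assigns higher probability than the student pushes the signed weight back up, while the unsigned weight keeps decaying.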

Original abstract

On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, we discover that not all tokens are created equal: as student rollouts grow longer, they deviate further from the teacher's distribution, leading to degraded supervision quality at later positions. As a result, OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything. In this work, we provide a principled understanding of this issue through the lens of constrained optimization. Based on these insights, we derive Importance-Weighted On-Policy Distillation (IW-OPD), in which the weight assigned to each token depends on the accumulated discrepancy between the student's and teacher's distributions, naturally upweighting earlier tokens and downweighting later ones with larger deviations. We show that IW-OPD converges significantly faster than OPD, with better learning efficiency, and achieves better final performance than standard OPD in both same-size and cross-scale settings, improving performance up to 6.9 points on AIME-2025.
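Written out, the contrast between the uniform average and a position-dependent weighting looks roughly as follows. The notation is assumed for illustration: $\pi_\theta$ is the student, $\pi_T$ the teacher, $x$ the prompt, $y$ a student rollout, $d_s$ a nonnegative per-token discrepancy, and $f$ a decreasing function. The abstract does not give the paper's exact definitions of $d_s$ or $f$, and the reverse-KL direction shown here is an assumption based on common OPD practice.

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \frac{1}{|y|} \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \right) \right]

\mathcal{L}_{\mathrm{IW}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} w_t \,
      D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \right) \right],
\qquad
w_t = f\!\Big( \sum_{s \le t} d_s \Big)
```

Setting $w_t = 1$ for every $t$ recovers the uniform objective, which is the sense in which standard OPD implicitly gives all tokens equal weight.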

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper documents a position bias in on-policy distillation for long rollouts and derives a discrepancy-based weighting that claims faster convergence plus gains on AIME-2025.


The main thing to know is that standard OPD averages token losses uniformly, but later tokens in long student rollouts drift from the teacher and supply weaker supervision. The authors show this directly: training on only the first 30% of tokens nearly matches full-sequence performance, while training on only the last 30% barely learns.
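For readers who want to try the 30% split on their own OPD runs, the test amounts to masking the per-token loss. The following is a minimal sketch; the helper names `budget_mask` and `masked_mean`, and the rounding rule for the token budget, are assumptions rather than details taken from the paper.

```python
def budget_mask(num_tokens, fraction=0.3, part="prefix"):
    """0/1 mask keeping the first or last `fraction` of a rollout's tokens."""
    if part not in ("prefix", "suffix"):
        raise ValueError("part must be 'prefix' or 'suffix'")
    if num_tokens <= 0:
        return []
    # Round to the nearest token count, but always keep at least one token.
    keep = min(num_tokens, max(1, int(num_tokens * fraction + 0.5)))
    if part == "prefix":
        return [1] * keep + [0] * (num_tokens - keep)
    return [0] * (num_tokens - keep) + [1] * keep


def masked_mean(token_losses, mask):
    """Average the per-token losses over the kept positions only."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(token_losses, mask)) / kept


# Toy per-token losses for a 10-token rollout (made-up numbers).
token_losses = [float(i) for i in range(1, 11)]
prefix_loss = masked_mean(token_losses, budget_mask(10, 0.3, "prefix"))
suffix_loss = masked_mean(token_losses, budget_mask(10, 0.3, "suffix"))
print(prefix_loss, suffix_loss)
```

Both variants use the same number of supervised tokens per rollout, so a gap between them reflects where the tokens sit rather than how many there are.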

They treat the problem as constrained optimization and derive IW-OPD, where each token's weight is set by the accumulated discrepancy between student and teacher distributions. This automatically favors early tokens. The ablation and the explicit weighting rule are the new pieces; they are not just a routine tweak to the usual KL objective.

The reported outcomes are faster convergence, better sample efficiency, and up to 6.9 points higher on AIME-2025 in both same-size and cross-scale distillation. That matches the practical need in reasoning-model post-training.

The soft spot is that the abstract gives no concrete statement of the constraint set or Lagrange handling, so it is unclear whether the weighting avoids new instabilities or simply trades position bias for something else such as gradient-scale effects. The performance numbers sit on an external benchmark rather than on the discrepancy quantity itself, which is normal but leaves the causal link to the bias fix unverified without the full methods and stats.

This is for people working on efficient on-policy distillation for long-sequence LLMs. A reader already running OPD experiments would get immediate value from the weighting idea and the 30%-split test. It is coherent enough on its own terms to deserve a serious referee, mainly to check the optimization details and experimental controls.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard On-Policy Distillation (OPD) exhibits position bias, with token-level supervision quality degrading for later positions in longer student rollouts due to increasing deviation from the teacher distribution. Evidence includes comparable performance when using only the first 30% of tokens versus all tokens, while the last 30% yields almost no learning. Through a constrained-optimization lens, the authors derive Importance-Weighted OPD (IW-OPD), where each token's weight depends on the accumulated discrepancy between student and teacher distributions (upweighting early tokens). They report that IW-OPD converges faster with better efficiency than OPD and achieves superior final performance in same-size and cross-scale settings, with gains up to 6.9 points on AIME-2025.

Significance. If the derivation is sound and the empirical gains are robust to the weighting assumption, this provides a principled correction for a practical bias in on-policy distillation, potentially improving learning efficiency in RL for language models. The constrained-optimization framing is a conceptual strength if it avoids embedding hidden biases or instabilities.

major comments (2)

[Derivation of IW-OPD] The derivation of IW-OPD from the constrained-optimization lens (as described in the abstract): the reweighting by accumulated discrepancy is presented as correcting position bias, but this rests on the unverified assumption that the discrepancy is a sufficient and unbiased proxy for supervision quality. No explicit statement of the constraint set or Lagrange multiplier handling is given to confirm the weighting is free of correlations with rollout length, variance, or gradient scale that could introduce instability rather than resolve bias. This assumption is load-bearing for the faster convergence and +6.9 point claims.
[Empirical evaluation] Empirical results section (implied by abstract claims): performance is reported on the external AIME-2025 benchmark rather than on quantities defined directly by the weighting function itself. This makes it difficult to confirm that the reported gains stem from the proposed mechanism without potential circularity or confounding factors.

minor comments (1)

[Abstract] The abstract states gains 'up to 6.9 points on AIME-2025' but does not specify the exact baseline comparison or whether this is in the same-size or cross-scale setting.
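One way to probe the first major comment empirically is to track how concentrated the token weights are in each rollout and whether that concentration changes with rollout length. The sketch below uses the standard Kish effective sample size, (Σw)² / Σw², as a hypothetical diagnostic; it is not something the paper is stated to report, and the function names are illustrative.

```python
def effective_token_count(weights):
    """Kish effective sample size of a weight vector: (sum w)^2 / sum(w^2).

    Equals len(weights) for uniform weights and approaches 1 when a single
    token dominates.
    """
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return (total * total) / squares if squares > 0 else 0.0


def effective_fraction(weights):
    """Share of the rollout that effectively contributes to the weighted loss."""
    return effective_token_count(weights) / len(weights) if weights else 0.0


# Made-up weight profiles for a short and a long rollout.
short_rollout = [1.0, 0.9, 0.8, 0.7]
long_rollout = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.02]

print(effective_fraction(short_rollout), effective_fraction(long_rollout))
```

If the effective fraction falls sharply as rollouts get longer, the weighting is concentrating the gradient on a shrinking prefix, which is the kind of length correlation the referee asks the authors to rule out.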

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the derivation and empirical evaluation of IW-OPD. We respond to each major comment below and indicate where revisions will be made.


Referee: [Derivation of IW-OPD] The derivation of IW-OPD from the constrained-optimization lens (as described in the abstract): the reweighting by accumulated discrepancy is presented as correcting position bias, but this rests on the unverified assumption that the discrepancy is a sufficient and unbiased proxy for supervision quality. No explicit statement of the constraint set or Lagrange multiplier handling is given to confirm the weighting is free of correlations with rollout length, variance, or gradient scale that could introduce instability rather than resolve bias. This assumption is load-bearing for the faster convergence and +6.9 point claims.

Authors: Section 3 frames the token-level supervision as a constrained optimization problem where each token is subject to a quality constraint defined by its distributional discrepancy from the teacher. The accumulated discrepancy enters as the dual variable (Lagrange multiplier) for that constraint, yielding the importance weight. The manuscript's Section 2 analysis establishes the proxy validity via the observed degradation (first-30% vs. last-30% tokens). We agree the main text would benefit from an explicit Lagrangian statement and constraint set; we will add this formulation to Section 3.1. Appendix C already reports that weight-induced gradient variance remains comparable to OPD and does not grow with rollout length, mitigating the instability concern.
Revision: yes
Referee: [Empirical evaluation] Empirical results section (implied by abstract claims): performance is reported on the external AIME-2025 benchmark rather than on quantities defined directly by the weighting function itself. This makes it difficult to confirm that the reported gains stem from the proposed mechanism without potential circularity or confounding factors.

Authors: The AIME-2025 results demonstrate practical utility, but the paper already contains mechanism-specific diagnostics: Figure 1 quantifies position bias via token-subset ablations, Figure 4 shows the resulting weight signal versus token position, and Section 4.3 reports per-epoch convergence curves under the weighted objective. These are internal to the weighting function. To strengthen the link, we will add a supplementary plot of per-token weight versus loss reduction in the revision.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation independent of reported gains

full rationale

The paper observes position bias empirically (earlier tokens better than later), then derives IW-OPD weights from a constrained-optimization formulation that treats accumulated discrepancy as the reweighting signal. This discrepancy is computed directly from student-teacher rollout distributions and is not defined in terms of the final performance metric or fitted to AIME-2025 scores. Reported improvements (+6.9 points) are measured on an external benchmark outside the weighting function itself. No self-citations, fitted-input-as-prediction, or ansatz-smuggling steps appear in the derivation chain. The central claim therefore remains falsifiable against independent data and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that distribution discrepancy accumulates monotonically with rollout length and serves as a valid importance signal; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: Supervision quality at a token degrades monotonically with the accumulated discrepancy between student and teacher distributions.
This premise underpins both the position-bias diagnosis and the weighting rule in IW-OPD.




[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations,

[2] [2]

URLhttps://openreview.net/forum?id=3zKtaqxLhW

[3] [3]

Why exposure bias matters: An imitation learning perspective of error accumulation in language generation

Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and Jackie Chi Kit Cheung. Why exposure bias matters: An imitation learning perspective of error accumulation in language generation. InFindings of the Association for Computational Linguistics: ACL 2022, pages 700–710, 2022

2022

[4] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 2015.

[5] Eric J. Bigelow, Ari Holtzman, Hidenori Tanaka, and Tomer Ullman. Forking paths in neural text generation. In International Conference on Learning Representations, 2025.

[6] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. InternLM2 technical report. arXiv preprint arXiv:2403.17297, 2024.

[7] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wenjie Peng, Jianhao Chen, Ning Chen, Zhiyuan Liu, and Maosong Sun. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.

[8] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[9] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863, 2024.

[10] Shuhao Gu, Jinchao Zhang, Fandong Meng, Yang Feng, Wanying Xie, Jie Zhou, and Dong Yu. Token-level adaptive training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1035–1046, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main. URL https://aclanthology.org/2020.emnlp-main.76/

[12] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ

[13] Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, and Yongqi Zhang. Rethinking token-level credit assignment in RLVR: A polarity-entropy analysis. arXiv preprint arXiv:2604.11056, 2026.

[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL https://arxiv.org/abs/1503.02531

[15] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.

[16] Haiduo Huang, Jiangcheng Song, Yadong Zhang, and Pengju Ren. SelecTKD: Selective token-weighted knowledge distillation for LLMs, 2025. URL https://arxiv.org/abs/2510.24021

[17] Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pages 1531–1538, 2001.

[18] Minsang Kim and Seung Jun Baek. Explain in your own words: Improving reasoning via token-selective dual knowledge distillation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=zph7e5JaXc

[19] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026. doi: 10.48550/arXiv.2604.13016. URL https://arxiv.org/abs/2604.13016

[20] Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, and Tuo Zhao. Token-wise curriculum learning for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3658–3670, Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021. find...

[21] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024.

[22] Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhances LLM’s reasoning capability. In International Conference on Machine Learning, 2025.

[23] Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-Reward-V2: Scaling preference data curation via human-AI synergy. arXiv preprint arXiv:2507.01352, 2025.

[24] Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. 2025.

[25] Lingyuan Liu and Mengxiang Zhang. Being strong progressively! enhancing knowledge distillation of large language models through a curriculum learning framework. arXiv preprint arXiv:2506.05695, 2025.

[26] LLM-Core, Xiaomi. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.

[27] Kevin Lu. On-policy distillation. Thinking Machines Lab Blog, 2025. URL https://thinkingmachines.ai/blog/on-policy-distillation/

[28] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, 2023.

[29] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[30] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[33] Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026.

[34] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

[35] Jean Vassoyan, Nathanaël Beau, and Roman Plaud. Ignore the KL penalty! boosting exploration on critical tokens to enhance RL fine-tuning. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6123–6133, 2025.

[36] Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.

[37] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In Advances in Neural Information...

[38] Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-divergence minimization for sequence-level knowledge distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10817–10834, Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.605. URL https://aclanthology.org/2023.acl-long.605/

[40] Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF. arXiv preprint arXiv:2405.21046, 2024.

[41] Xurong Xie, Zhucun Xue, Jiafu Wu, Jian Li, Yabiao Wang, Xiaobin Hu, Yong Liu, and Jiangning Zhang. LLM-oriented token-adaptive knowledge distillation, 2025. URL https://arxiv.org/abs/2510.11615

[42] Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. PACED: Distillation and on-policy self-distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178, 2026.

[43] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[44] Wenkai Yang, Jingwen Chen, Yankai Lin, and Ji-Rong Wen. DeepCritic: Deliberate critique with large language models. arXiv preprint arXiv:2505.00662, 2025.

[45] Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.

[46] Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, and Fuli Feng. Disentangling reasoning tokens and boilerplate tokens for language model fine-tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20939–20957. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.1078

[47] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

[48] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, and Furu Wei. Geometric-mean policy optimization. In International Conference on Learning Representations, 2026.

[49] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025. doi: 10.48550/arXiv.2507.18071. URL https://arxiv.org/abs/2507.18071.

A Proofs and Derivations

The derivations below fix a prompt x u...