Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
Pith reviewed 2026-05-08 16:19 UTC · model grok-4.3
The pith
PBSD derives a reward-regularized objective whose optimum is a reward-reweighted teacher distribution superior to the original teacher.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a reward-regularized objective for self-distillation has the reward-reweighted teacher distribution as its analytic optimum, which yields a target policy that is provably superior to the original teacher. In practice, this is achieved by optimizing the preference gaps between samples from the teacher and the student while maintaining on-policy sampling for the student. This framework is supported by a statistical analysis of the induced preference-learning problem that identifies conditions under which on-policy self-distillation outperforms learning from an external teacher.
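The reweighting that the claimed analytic optimum performs can be sketched numerically, assuming a discrete candidate set and a scalar reward; the function name, toy data, and temperature value below are illustrative, not taken from the paper.

```python
import math

def reweight_teacher(samples, beta=1.0):
    """Reward-reweight a discrete teacher distribution.

    `samples` maps each candidate completion y to a pair
    (teacher_prob, reward). Illustrative sketch only: the paper's
    reward source and temperature beta are not specified here.
    """
    # Unnormalized target weights: pi_teach(y) * exp(r(y) / beta)
    weights = {y: p * math.exp(r / beta) for y, (p, r) in samples.items()}
    z = sum(weights.values())  # partition function Z(x)
    return {y: w / z for y, w in weights.items()}

# Toy check: the higher-reward completion gains probability mass.
target = reweight_teacher({"good": (0.5, 1.0), "bad": (0.5, 0.0)})
assert target["good"] > 0.5 > target["bad"]
```

Lower beta concentrates the target on high-reward completions; as beta grows large the exponential factor flattens and the original teacher distribution is recovered.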
What carries the argument
The reward-regularized objective whose analytic optimum is the reward-reweighted teacher distribution.
If this is right
- The target policy is provably superior to the original teacher under the reward-regularized objective.
- Training optimizes preference gaps between teacher and student while maintaining on-policy sampling.
- On-policy self-distillation is preferable to external teacher under certain statistical conditions.
- Consistently the strongest average performance, with improved stability, on math and tool-use benchmarks across model scales.
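The preference-gap training signal in the bullets above can be sketched as a DPO-style logistic loss over sequence log-probabilities. This is a hedged reconstruction, not the paper's exact objective; all argument names are illustrative.

```python
import math

def preference_gap_loss(lp_student_on_teacher_y, lp_student_on_student_y,
                        lp_ref_on_teacher_y, lp_ref_on_student_y, beta=0.1):
    """DPO-style loss on the gap between a teacher sample and a student sample.

    Each argument is a sequence log-probability; `lp_student_on_teacher_y`
    is the student's log-prob of the teacher-context sample, and so on.
    Sketch under assumed notation, not the paper's exact formulation.
    """
    # Implicit reward gap, measured relative to a frozen reference policy.
    gap = beta * ((lp_student_on_teacher_y - lp_ref_on_teacher_y)
                  - (lp_student_on_student_y - lp_ref_on_student_y))
    # Logistic (Bradley-Terry) loss drives the gap positive, i.e. the
    # student learns to prefer the teacher-context sample over its own.
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# At zero gap the loss is log 2; widening the gap lowers it.
assert abs(preference_gap_loss(0.0, 0.0, 0.0, 0.0) - math.log(2.0)) < 1e-12
```

On-policy sampling is preserved because the student-side sample is drawn from the current student policy at each step; only the loss, not the sampler, changes relative to KL matching.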
Where Pith is reading between the lines
- This could enable iterative self-improvement in models without needing larger external teachers.
- The preference gap optimization might be adapted for other on-policy learning tasks.
- Choosing different reward functions could lead to different superiority margins in the target policy.
Load-bearing premise
A suitable reward function exists and can be used to reweight the teacher distribution without introducing biases that invalidate the provable superiority.
What would settle it
An experiment showing that the PBSD student policy is not superior to the teacher or that it underperforms KL-based self-distillation on the benchmarks would challenge the claim.
Original abstract
On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy self-distillation through a reward-regularized perspective. Instead of directly matching the teacher distribution, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective. Practically, PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy student sampling. We support this framework with a statistical analysis of the induced preference-learning problem, formally establishing when on-policy self-distillation is preferable to learning from an external teacher in our setting. Experiments on mathematical reasoning and tool-use benchmarks across multiple model scales demonstrate that PBSD consistently achieves the strongest average performance among comparable baselines, showing improved training stability over prior self-distillation baselines while preserving token efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Preference-Based Self-Distillation (PBSD) for on-policy self-distillation in LLMs. It derives a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, claimed to yield a target policy provably superior to the original teacher under this objective. PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy sampling. A statistical analysis formally establishes conditions under which on-policy self-distillation is preferable to an external teacher. Experiments on mathematical reasoning and tool-use benchmarks across model scales show PBSD achieving the strongest average performance with improved stability over KL-based self-distillation baselines.
Significance. If the derivation is free of circularity and the reward is independent, the framework offers a principled alternative to KL matching in self-distillation, potentially improving stability and reasoning performance in efficient LLM training. The statistical analysis of the induced preference-learning problem is a strength, providing formal conditions for preferring self-distillation. Experimental gains on multiple benchmarks and scales, if robust, indicate practical value for token-efficient training.
major comments (2)
- [§3.2] §3.2 (Reward-Regularized Objective): The claim that the analytic optimum is a reward-reweighted teacher distribution yielding a 'provably superior' target policy holds only under the assumption that the reward function is fixed and independent of the student policy. In the self-distillation setting, where the same model generates both teacher and student samples, the paper must explicitly state how the reward is obtained (e.g., from an external preference model or fixed dataset) to ensure the superiority is not tautological due to reweighting by construction.
- [§4] §4 (Statistical Analysis): The formal conditions establishing when on-policy self-distillation outperforms external-teacher learning depend on assumptions about preference gaps and distribution shift. These conditions should be checked against the experimental setups (e.g., math reasoning tasks); if the reward model is fitted on the same data distribution as the student, the analysis risks violating the independence required for the preference-learning guarantees.
minor comments (2)
- [Abstract] Abstract and §5 (Experiments): The claim of 'improved training stability' is not quantified (e.g., via loss variance or performance fluctuation over epochs); add a specific metric or plot reference.
- [§5] §5 (Experiments): Report standard deviations over at least 3 random seeds for all methods and confirm that token budgets and sampling temperatures are matched exactly across PBSD and baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on the framework's assumptions and commit to revisions that strengthen the exposition without altering the core contributions.
Point-by-point responses
Referee: [§3.2] §3.2 (Reward-Regularized Objective): The claim that the analytic optimum is a reward-reweighted teacher distribution yielding a 'provably superior' target policy holds only under the assumption that the reward function is fixed and independent of the student policy. In the self-distillation setting, where the same model generates both teacher and student samples, the paper must explicitly state how the reward is obtained (e.g., from an external preference model or fixed dataset) to ensure the superiority is not tautological due to reweighting by construction.
Authors: We agree that the 'provably superior' claim is with respect to the fixed reward-regularized objective and requires an independent reward. In PBSD, the reward is obtained from an external preference model trained on a fixed, separate dataset of human preferences (distinct from the self-distillation data). This model is held fixed during training and does not depend on the student policy, so the reweighting is not tautological. We will revise §3.2 to explicitly describe the reward acquisition process, state the independence assumption, and clarify that the analytic optimum is taken with respect to this external reward. revision: yes
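The rebuttal's external preference model is not specified. As a sketch of how a fixed Bradley-Terry reward could be fitted on held-out preference pairs and then frozen before self-distillation, assuming scalar per-item rewards (illustrative only, not the paper's architecture):

```python
import math

def fit_reward(pairs, steps=200, lr=0.5):
    """Fit a scalar Bradley-Terry reward per item by gradient descent.

    `pairs` is a list of (chosen_id, rejected_id) preference pairs from a
    held-out dataset. The fitted rewards are then frozen: during PBSD
    training they do not depend on the student policy, which is what
    keeps the reweighting non-tautological.
    """
    rewards = {}
    for c, r in pairs:
        rewards.setdefault(c, 0.0)
        rewards.setdefault(r, 0.0)
    for _ in range(steps):
        grads = {k: 0.0 for k in rewards}
        for c, r in pairs:
            # p = sigmoid(reward_chosen - reward_rejected)
            p = 1.0 / (1.0 + math.exp(-(rewards[c] - rewards[r])))
            grads[c] += -(1.0 - p)  # d(-log p)/d(reward_chosen)
            grads[r] += (1.0 - p)
        for k in rewards:
            rewards[k] -= lr * grads[k]
    return rewards

# Toy usage: pairwise judgments a > b, a > c, b > c.
rewards = fit_reward([("a", "b"), ("a", "c"), ("b", "c")])
assert rewards["a"] > rewards["b"] > rewards["c"]
```

Freezing the fitted rewards before distillation is the step that enforces the independence assumption the referee asks about.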
Referee: [§4] §4 (Statistical Analysis): The formal conditions establishing when on-policy self-distillation outperforms external-teacher learning depend on assumptions about preference gaps and distribution shift. These conditions should be checked against the experimental setups (e.g., math reasoning tasks); if the reward model is fitted on the same data distribution as the student, the analysis risks violating the independence required for the preference-learning guarantees.
Authors: We acknowledge the importance of verifying the independence and distribution-shift assumptions in the statistical analysis. In the reported experiments on mathematical reasoning and tool-use benchmarks, the preference model is trained on a held-out preference dataset that does not overlap with the task data used for on-policy sampling and distillation. This preserves the required independence. We will add explicit discussion in §4 (and the experimental appendix) that checks the conditions against the benchmarks, reports the data separation, and notes any limitations under mild violations of the assumptions. revision: yes
Circularity Check
Superiority of reweighted teacher holds by construction under self-defined reward-regularized objective
specific steps
- self-definitional [Abstract]:
"we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective"
The target policy is defined to be the analytic optimum of the newly introduced reward-regularized objective; therefore its superiority to the original teacher follows immediately from the mathematical property that an optimum cannot be worse than the starting distribution under the objective being optimized. The 'provable superiority' is thus equivalent to the definition of the objective rather than an independent first-principles result.
full rationale
The paper's core derivation introduces a reward-regularized objective and shows its analytic optimum is a reward-reweighted teacher that is superior under that same objective. This superiority is tautological once the objective is posited, as any optimum is at least as good as the starting point by definition of optimality. The practical PBSD implementation optimizes preference gaps on-policy, and the statistical analysis of when self-distillation beats external teachers may add independent content, but the load-bearing 'provably superior' claim reduces to the framework's own construction without external grounding or independent verification of the reward's unbiasedness. No explicit self-citations or fitted-parameter circularity beyond this definitional step.
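The definitional step can be made explicit. The derivation below assumes the standard KL-regularized form of the objective, inferred from the appendix's Jensen argument; the paper's exact statement may differ.

```latex
% Assumed reward-regularized objective (KL-regularized form):
J(\pi) \;=\; \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\big[r(x,y)\big]
  \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{teach}}(\cdot\mid x)\big).

% Its analytic maximizer is the reward-reweighted teacher:
\pi^{*}(y\mid x) \;=\;
  \frac{\pi_{\mathrm{teach}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)}{Z(x)},
\qquad
Z(x) \;=\; \mathbb{E}_{y\sim\pi_{\mathrm{teach}}(\cdot\mid x)}\!\big[\exp\!\big(r(x,y)/\beta\big)\big].

% Superiority under J is then immediate: \pi_{\mathrm{teach}} is feasible
% and has zero KL to itself, so
J(\pi^{*}) \;\ge\; J(\pi_{\mathrm{teach}})
  \;=\; \mathbb{E}_{y\sim\pi_{\mathrm{teach}}(\cdot\mid x)}\!\big[r(x,y)\big].
```

Under this reading, the substantive content lies in the choice of $r$ and $\beta$ and in the statistical analysis, not in the inequality itself.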
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The reward-regularized objective admits an analytic optimum that is exactly a reward-reweighted version of the teacher distribution.
- Domain assumption: On-policy self-distillation is preferable to external-teacher learning under identifiable statistical conditions.
Reference graph
Works this paper leans on
- [1] Amirhossein Afsharrad, Amirhesam Abedsoltan, Ahmadreza Moradipari, and Sanjay Lall. On-policy distillation of language models for autonomous vehicle motion planning. arXiv preprint arXiv:2604.07944.
- [2] Jiamu Bai, Xin Yu, Meilong Xu, Weitao Lu, Xin Pan, Kiwan Maeng, Daniel Kifer, Jian Wang, and Yu Wang. Towards better optimization for listwise preference in diffusion models. arXiv preprint arXiv:2510.01540.
- [3] Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, et al. Onesearch-v2: The latent reasoning enhanced self-distillation generative search framework. arXiv preprint arXiv:2603.24422.
- [4] Ken Ding. HDPO: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871.
- [5] Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Zhiyuan Liu. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562.
- [6] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.
- [7] Jonas Hübotter, Frederike Lübeck, Lejs Deen Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Test-time self-distillation. arXiv preprint arXiv:2502.07750.
- [8] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.
- [9] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472.
- [10] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026a.
  Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wen...
- [11] On-policy distillation. https://thinkingmachines.ai/blog/on-policy-distillation/
  Sahand Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
- [12] Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733.
- [13] Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, and Bowen Zhou. Online DPO: Online direct preference optimization with fast-slow chasing. arXiv preprint arXiv:2406.05534.
- [14] Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. CRISP: Compressed reasoning via iterative self-policy distillation (on-policy self-distillation for reasoning compression). arXiv preprint arXiv:2603.05433.
- [15] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yi Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [16] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897.
- [17] Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626.
- [18] Chaoqi Wang, Yunchu Wang, Wei Zheng, Yunzhi Li, Yuwei Ye, Xiaolong Wang, and Jingfeng Yang. Skill-SD: Skill-conditioned self-distillation for multi-turn LLM agents. arXiv preprint arXiv:2604.10674, 2026a.
  Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026b...
- [19] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128.
- [20] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, and Jiangjie Chen. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXi...
- [21] Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. Embarrassingly simple self-distillation improves code generation. arXiv preprint arXiv:2604.01193.
- [22] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026a.
  Ziyu Zhao, Yixiao Zhou, Xin Yu, Zhi Zhang, Didi Zhu, Tao Shen, Zexi Li, Jinluan Yang, Xuwu Wang, Jing Su, et al. Each rank could be an expert: Sin...
- [23] Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF. arXiv preprint arXiv:2401.16335.
- [24] Appendix E excerpt: Develops the technical details behind the statistical analysis in the main text, justifying the objective-level motivation for replacing pure KL matching with reward-aware reweighting. Section E.1 derives the sample-level gradient and the local Hessian form of the online PBSD objective, which are the ingredients needed to...
- [25] Appendix derivation excerpt: Therefore, F(π_teach) = E_{x∼D} E_{y∼π_teach(·|x)}[r(x, y)]. (25) To compare the two values, fix x and write Z(x) = E_{y∼π_teach(·|x)}[exp(r(x, y)/β)]. Since log(·) is concave, Jensen's inequality implies β log E_{y∼π_teach(·|x)}[exp(r(x, y)/β)] ≥ E_{y∼π_teach(·|x)}[r(x, y)], (26) where equality holds only when r(x, y) is constant over the support of π_teach(·|x). Combining...
- [26] Appendix F excerpt: Our implementation follows the OPSD protocol of Zhao et al. [2026a] whenever applicable so that the comparison against prior baselines isolates the effect of the proposed PBSD objective as cleanly as possible. The appendix is organized as follows. Appendix F.1 summarizes the datasets used in the mathematical reasoning and tool-use experiments. Appendix F....
- [27] Appendix excerpt (tool-use data): In the main paper, all reported mathematical reasoning numbers use the same Avg@12 evaluation protocol. For the additional tool-use study, we follow the setup in Shenfeld et al. [2026], which uses ToolAlpaca as the underlying domain. Each example consists of a user query together with tool or API information, and the model must generate the ...
- [28] Appendix excerpt (base student configuration): The shared LoRA configuration is rank r = 64, LoRA alpha α = 128, and target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. To accelerate rollout generation for on-policy methods, we use vLLM for inference. We keep the optimization hyperparameters of the baselines aligned with OPSD [Zhao et al., 2026a] as closely as possible so...