A Brief Overview: On-Policy Self-Distillation In Large Language Models

Fangming Cui; Jiahong Li; Sunan Li

arxiv: 2605.18141 · v2 · pith:AE72JTNInew · submitted 2026-05-18 · 💻 cs.HC


Pith reviewed 2026-05-22 09:48 UTC · model grok-4.3

classification 💻 cs.HC

keywords On-Policy Self-Distillation · Large Language Models · Knowledge Distillation · Reasoning Alignment · Memory Reduction · Self-Teaching Frameworks


The pith

On-policy self-distillation lets one large language model serve as both teacher and student to align reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper overviews On-Policy Self-Distillation (OPSD), a framework where a single LLM plays dual roles during training. The teacher role accesses verified reasoning traces while the student role sees only the problem. Training minimizes per-token divergence on trajectories from the student role to align the model with better rationalizations. This removes the need for a separate teacher model and addresses distribution mismatch issues. The overview explains the foundations for newcomers to the field.
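The objective summarized above can be written out as a sketch. The notation is assumed here for illustration and is not taken from the paper: $x$ is the problem, $r^\star$ a verified reasoning trace, $\pi_\theta$ the single shared model, $y$ a trajectory sampled from the student role, and $D$ a generic per-token divergence whose exact form and direction are left open. The length normalization is also an assumption.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x}\;
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \frac{1}{|y|} \sum_{t=1}^{|y|}
      D\!\left(
        \pi_\theta(\cdot \mid x, r^\star, y_{<t})
        \;\middle\|\;
        \pi_\theta(\cdot \mid x, y_{<t})
      \right)
    \right]
```

The first argument of $D$ is the teacher role (same weights, conditioned on the trace) and the second is the student role (problem only). Both are evaluated on the same student-sampled prefix $y_{<t}$, which is what makes the scheme on-policy.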

Core claim

What carries the argument

Dual-role single model with teacher granted verified reasoning traces and student limited to problem statement, trained by minimizing distributional divergence on student-sampled trajectories.

Load-bearing premise

That verified reasoning traces are reliably available to the teacher role and that the divergence minimization on student trajectories produces stable alignment without amplifying errors.

What would settle it

A test where removing access to verified traces or using noisy traces causes the model to perform worse than baseline training methods on reasoning tasks.

Figures

Figures reproduced from arXiv: 2605.18141 by Fangming Cui, Jiahong Li, Sunan Li.

Original abstract

On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.
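As a minimal sketch of the per-token divergence the abstract describes, the snippet below compares teacher-role and student-role next-token distributions at each position of a student-sampled trajectory. It assumes the generalized Jensen-Shannon divergence JSD_β as the divergence, and the function names (`softmax`, `kl`, `jsd_beta`, `opsd_loss`) and the toy logits are hypothetical. In practice the two sets of logits would come from two forward passes of the same model under different contexts.

```python
import math

def softmax(logits):
    """Convert one position's logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd_beta(p, q, beta=0.5):
    """Generalized JSD: beta*KL(p||m) + (1-beta)*KL(q||m), m = beta*p + (1-beta)*q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)

def opsd_loss(teacher_logits, student_logits, beta=0.5):
    """Mean per-token divergence over one student-sampled trajectory."""
    per_token = [
        jsd_beta(softmax(t), softmax(s), beta)
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy values (assumed): 2 token positions, vocabulary of 3.
# teacher_logits: same model, context = problem + verified trace + sampled prefix.
# student_logits: same model, context = problem + sampled prefix only.
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[1.0, 1.0, 0.0], [0.0, 0.2, 0.1]]
loss = opsd_loss(teacher_logits, student_logits)
```

The loss is zero when both roles predict identical distributions and, for β = 0.5, is bounded above by log 2. A real training loop would typically treat the teacher-role distribution as a fixed target, but that choice is an assumption here rather than something the abstract states.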

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a short beginner overview of on-policy self-distillation that explains the single-model setup and memory claims but adds no new experiments or checks.

This paper is a short overview aimed at beginners on On-Policy Self-Distillation in large language models. It describes a framework where one model serves as both teacher and student to align reasoning without needing a separate larger model.

The paper does well in laying out the key differences from standard approaches. It notes that the teacher role gets privileged access to verified reasoning traces, while the student sees only the problem. Training then minimizes per-token distributional divergence on trajectories sampled from the student policy. This setup is said to avoid off-policy mismatches and cut GPU memory use by about 40 to 60 percent. The discussion of methodological innovations and design principles is clear and organized, which should make the ideas accessible to people new to the area.

The soft spots are mainly around the lack of supporting evidence. The memory reduction claim is stated directly but without any experimental results, baselines, or even a simple ablation to show how it was measured. There is also no analysis of potential downsides, such as whether the student sampling could introduce or amplify errors that then get reinforced through the divergence minimization. The overview presents these elements as advantages by design, yet without stability checks or counterexamples, it's difficult to gauge how robust the method really is in practice.

This work is best suited for readers who are just entering the field of LLM alignment and distillation techniques and want a high-level summary of emerging patterns. It does not have the empirical depth or original contributions that would typically warrant sending it to peer review at a major venue. I would recommend against full peer review unless the authors expand it with concrete validation and address the open questions around trace availability and error stability.

Referee Report

2 major / 1 minor

Summary. The manuscript is a brief overview of On-Policy Self-Distillation (OPSD) for large language models. It defines OPSD as a unified framework in which a single LLM simultaneously serves as teacher (with privileged access to verified reasoning traces) and student (observing only the problem statement). Training minimizes per-token distributional divergence between the two roles on trajectories sampled from the student policy. The paper claims this eliminates external teachers, leverages ground-truth solutions directly, resolves off-policy distribution mismatch, and reduces GPU memory consumption by 40%-60% relative to standard On-Policy Distillation (OPD). The discussion covers conceptual foundations, methodological innovations, and design principles aimed at beginners.

Significance. If the described benefits hold, OPSD could provide a practical route to memory-efficient, self-contained alignment of LLMs without separate teacher models. The claimed 40-60% memory reduction would be a concrete engineering advantage for scaling distillation. However, because the manuscript supplies only descriptive synthesis and no new derivations, experiments, or stability analysis, its significance is limited to potential utility as an introductory summary rather than a substantive advance.
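To illustrate why the size of any memory saving depends on implementation details, here is a back-of-envelope sketch. It is not a measurement and not the paper's accounting. It assumes that OPD holds a trainable student plus a frozen, inference-only teacher, that OPSD holds only the student, that mixed-precision Adam training costs about 16 bytes per parameter, and that a frozen 16-bit model costs 2 bytes per parameter. Activations and KV cache are ignored, and the function names are hypothetical.

```python
def memory_gb(params_billion, bytes_per_param):
    """Model-state memory in GB: 1e9 params times bytes, over 1e9 bytes per GB."""
    return params_billion * bytes_per_param

def fraction_saved(student_b, teacher_b, train_bytes=16.0, infer_bytes=2.0):
    """Share of OPD model-state memory that disappears once the teacher is dropped."""
    student_state = memory_gb(student_b, train_bytes)   # weights, grads, optimizer
    teacher_state = memory_gb(teacher_b, infer_bytes)   # frozen weights only
    return teacher_state / (student_state + teacher_state)

# Assumed scenarios: an 8B student with an 8B teacher, then with a 64B teacher.
same_size = fraction_saved(student_b=8, teacher_b=8)
large_teacher = fraction_saved(student_b=8, teacher_b=64)
print(f"{same_size:.1%} {large_teacher:.1%}")
```

Under these assumed numbers the saving is about 11% for an equal-size teacher and 50% for a teacher eight times larger, so a figure such as 40%-60% only becomes interpretable once the teacher size, precision, and training setup are stated. The teacher-role forward pass still has to be run in OPSD, so activation memory is not removed by the single-model design.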

major comments (2)

[Abstract] Abstract: the quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.
[Abstract and framework description] Framework description (Abstract and subsequent sections): the setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.

minor comments (1)

The manuscript would benefit from explicit section headings or numbered subsections to improve readability for the intended beginner audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our brief overview manuscript. We appreciate the identification of areas where claims require better qualification given the paper's scope as a conceptual synthesis rather than a source of new empirical results. We address each major comment below and outline the planned revisions.


Referee: [Abstract] Abstract: the quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.

Authors: We agree that the manuscript, being an overview without new experiments, should not present the 40-60% figure as an unsubstantiated assertion. This range is intended to reflect reported outcomes from prior OPSD implementations in the literature that the paper synthesizes. In the revised version we will either insert citations to the specific studies documenting these memory reductions or rephrase the statement to indicate that such savings have been observed in existing OPSD work, thereby removing any implication that the figure is a new result of this overview.

Revision: yes

Referee: [Abstract and framework description] Framework description (Abstract and subsequent sections): the setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.

Authors: The referee is correct that the paper supplies no new derivations or formal stability analysis; this is consistent with its purpose as a beginner-oriented overview of existing design principles. The on-policy sampling combined with privileged access to verified traces is presented as the mechanism intended to limit distribution shift and error accumulation. We will revise the abstract and framework sections to state these assumptions more explicitly and to include a concise paragraph noting potential limitations, such as dependence on trace quality and the desirability of empirical checks for drift in particular domains, while pointing readers to the referenced empirical studies for further investigation.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive overview without derivations or fitted predictions

full rationale

The paper is explicitly framed as a 'brief overview' and 'brief analysis of the conceptual foundations' of OPSD. It defines the framework in prose (single model as teacher/student with privileged traces, minimize per-token divergence on student trajectories) but supplies no equations, no parameter fitting, no predictions, and no self-citation chains that bear the central claim. The memory-reduction claim is stated as an empirical observation rather than a derived result. No load-bearing step reduces to its own inputs by construction; the content remains self-contained description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an overview paper the work introduces no new mathematical objects, fitted parameters, or invented entities; it describes an existing framework.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself" (eq. 6, JSD_β in eq. 7)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "single large language model acts simultaneously as both teacher and student... privileged access to verified reasoning traces"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 50 internal anchors

[1]

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. 2024. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. arXiv:2306.13649 [cs.LG] https://arxiv.org/abs/2306.13649

work page arXiv 2024
[2]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, and Haoliang Li. 2026. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models. arXiv:2605.11854 [cs.CL] https://arxiv.org/abs/2605.11854

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.arXiv preprint arXiv:2110.14168(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Ganqu Cui, Liangyuan Yuan, Ning Ding, Zhiwei Yao, Wei Ye, Yujia Wang, Yue Zhang, Jing Xu, Han Zhang, Zini Chen, et al. 2023. UltraFeedback: Boosting Language Models with High-quality Feedback.arXiv preprint arXiv:2310.01377(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Ken Ding. 2026. HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation.arXiv preprint arXiv:2603.23871(2026)

work page arXiv 2026
[8]

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. 2025. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Zihao Han, Tiangang Zhang, Huaibin Wang, and Yilun Sun. 2026. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning. arXiv:2605.11458 [cs.AI] https://arxiv.org/abs/2605.11458

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. 2026. Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision.arXiv preprint arXiv:2604.12002(2026). https: //arxiv.org/abs/2604.12002

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset.Advances in Neural Information Processing Systems34 (2021), 25774–25786

work page 2021
[13]

HuggingFaceH4. 2024. AIME 2024 Dataset. https://huggingface.co/datasets/HuggingFaceH4/aime_2024

work page 2024
[14]

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. 2026. Reinforcement Learning via Self-Distillation. arXiv:2601.20802 [cs.LG] https://arxiv.org/abs/2601.20802

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.arXiv preprint(2024)

work page 2024
[16]

Minbyul Jeong. 2026. Healthcare AI GYM for Medical Agents. arXiv:2605.02943 [cs.LG] https://arxiv.org/abs/2605.02943

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Jiaming Ji, Meng Liu, Juntao Dai, Xuehai Pan, Ce Zhang, Chi Bian, Botao Chen, Rui Sun, Yashi Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset.Advances in Neural Information Processing Systems36 (2023), 24621–24658

work page 2023
[18]

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, and Steven Hoi. 2026. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models. arXiv:2605.05204 [cs.CV] https://arxiv.org/abs/2605.05204

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, and Hao Henry Wang. 2025. Importance-Aware Data Selection for Efficient LLM Instruction Tuning. arXiv:2511.07074 [cs.CL] https://arxiv.org/abs/2511.07074

work page arXiv 2025
[20]

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, and Srijan Kumar. 2026. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models. arXiv:2605.06597 [cs.CL] https://arxiv.org/abs/2605.06597

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, and Linfeng Zhang. 2026. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning. arXiv:2605.13255 [cs.AI] https://arxiv.org/abs/2605.13255

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Jeonghye Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. 2026. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR. arXiv:2605.10781 [cs.LG] https://arxiv.org/abs/2605.10781 A Brief Overview: On-Policy Self-Distillation In Large Language Models 9

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. 2026. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? arXiv:2603.24472 [cs.CL] https://arxiv.org/abs/2603.24472

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, and Tom Goldstein. 2026. Multi-Token Prediction via Self-Distillation. arXiv:2602.06019 [cs.CL] https://arxiv.org/abs/2602.06019

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. 2026. Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing. arXiv:2604.02288 [cs.LG] https://arxiv.org/abs/2604.02288

work page arXiv 2026
[26]

Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, and Rui Wang. 2026. GEAR: Granularity- Adaptive Advantage Reweighting for LLM Agents via Self-Distillation. arXiv:2605.11853 [cs.LG] https://arxiv.org/abs/2605.11853

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. 2026. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv:2604.13016 [cs.LG] https://arxiv.org/abs/2604.13016

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Wenjie Liao, Like Wu, Liangjie Zhao, Shihui Xu, and Shigeru Fujimura. 2026. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning. arXiv:2604.20933 [cs.LG] https://arxiv.org/abs/2604.20933

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, and Hongbo Jin. 2026. VISD: Enhancing Video Reasoning via Structured Self-Distillation. arXiv:2605.06094 [cs.CV] https://arxiv.org/abs/2605.06094

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context.arXiv preprint arXiv:1405.0312(2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[31]

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, and Hinrich Schütze. 2026. Crosslingual On-Policy Self-Distillation for Multilingual Reasoning. arXiv:2605.09548 [cs.CL] https://arxiv.org/abs/2605.09548

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Kevin Lu and Thinking Machines Lab. 2025. On-Policy Distillation.Thinking Machines Lab: Connectionism(2025). doi:10.64434/tml.20251026 https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[33]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. 2026. Privileged Information Distillation for Language Models.arXiv preprint arXiv:2602.04942(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Ruiyang Qin, Qingzhuo Wang, Dongrui Liu, Qiang Li, Zhihua Wei, and Wen Shen. 2026. Multilingual Safety Alignment via Self-Distillation. arXiv:2605.02971 [cs.LG] https://arxiv.org/abs/2605.02971

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. 2026. On-Policy Self-Distillation for Reasoning Compression. arXiv preprint arXiv:2603.05433(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024b. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024b)

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, and Xing Yu. 2026. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information. arXiv:2605.11609 [cs.LG] https://arxiv.org/abs/2605.11609

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Guobin Shen, Lei Huang, Xiang Cheng, Chenxiao Zhao, Jindong Li, Dongcheng Zhao, and Xing Yu. 2026. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation. arXiv:2605.11613 [cs.LG] https://arxiv.org/abs/2605.11613

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. 2026. Self-Distillation Enables Continual Learning. arXiv:2601.19897 [cs.LG] https://arxiv.org/abs/2601.19897

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning.arXiv preprint arXiv:2010.03768(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[42]

Mingyang Song and Mao Zheng. 2026. A Survey of On-Policy Distillation for Large Language Models. arXiv:2604.00626 [cs.LG] https://arxiv.org/ abs/2604.00626

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Alex Stein, Furong Huang, and Tom Goldstein. 2026. GATES: Self-Distillation under Privileged Context with Consensus Gating.arXiv preprint arXiv:2602.20574(2026)

work page arXiv 2026
[44]

Zhiquan Tan and Yinrong Hong. 2026. PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners. arXiv:2604.26573 [cs.LG] https://arxiv.org/abs/2604.26573

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, and Honggang Qi. 2026. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents. arXiv:2604.10674 [cs.LG] https://arxiv.org/abs/2604.10674

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, and Sanqiang Zhao. 2026. SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe. arXiv:2410.05248 [cs.CL] https://arxiv.org/abs/2410.05248

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. 2026. PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence. arXiv:2603.11178 [cs.AI] https://arxiv.org/abs/2603.11178

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. 2026. Self-Distilled RLVR. arXiv:2604.03128 [cs.LG] https://arxiv.org/abs/2604.03128

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

Yuxiao Yang, Xiaoyun Wang, and Weitong Zhang. 2026. OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning. arXiv:2605.12400 [cs.LG] https://arxiv.org/abs/2605.12400 10 Fangming Cui, Sunan Li, and Jiahong Li

work page internal anchor Pith review Pith/arXiv arXiv 2026
[50]

Shunyu Yao, Howard Jiang, Zilin Chen, Chen Zhang, Yichen Wang, Ruocheng Chen, Wenlong Gu, Zipeng Zhang, and Li Sha. 2022. Webshop: Towards Scalable Real-World Web Interaction with Grounded Language Agents.arXiv preprint arXiv:2207.01206(2022)

work page arXiv 2022
[51]

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. 2026. Online Experiential Learning for Language Models.arXiv preprint arXiv:2603.16856(2026)

work page arXiv 2026
[52]

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. 2026. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yu, Weinan Dai, TianTian Fan, Gaohong Liu, Lingjun Liu, and et al. 2025e. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476(2025e)

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, and Qinzhen Guo. 2026. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization. arXiv:2605.05040 [cs.LG] https://arxiv.org/abs/2605.05040

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Xiangyu Yue, Yu Zheng, Zhang Zhang, Steven Gao, Yuhang Wang, Runzhe Chen, Yukun Jia, Yitong Sun, Yizhi Gao, Mark Zhao, et al. 2023. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.arXiv preprint arXiv:2311.16502(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. 2026. Embarrassingly Simple Self-Distillation Improves Code Generation. arXiv:2604.01193 [cs.CL] https://arxiv.org/abs/2604.01193

work page arXiv 2026
[57]

Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, and Jingnan Gu. 2026. OPSDL: On-Policy Self-Distillation for Long-Context Language Models. arXiv:2604.17535 [cs.CL] https://arxiv.org/abs/2604.17535

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, and Yu Zhou. 2026. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding. arXiv:2605.00642 [cs.AI] https://arxiv.org/abs/2605.00642

work page internal anchor Pith review Pith/arXiv arXiv 2026
[59]

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, and Dongbin Zhao

work page
[60]

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

𝜋-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data. arXiv:2604.14054 [cs.LG] https://arxiv.org/abs/2604.14054

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. 2026. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models.arXiv preprint arXiv:2601.18734(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[62]

Zhengyang Zhao, Lu Ma, and Wentao Zhang. 2026. Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning. arXiv:2605.08741 [cs.CL] https://arxiv.org/abs/2605.08741

work page internal anchor Pith review Pith/arXiv arXiv 2026
[63]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Cheng, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025a. Group sequence policy optimization.arXiv preprint arXiv:2507.18071(2025a)

work page internal anchor Pith review Pith/arXiv arXiv
[64]

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less Is More for Alignment. arXiv:2305.11206 [cs.CL] https://arxiv.org/abs/2305.11206


[1]

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. 2024. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. arXiv:2306.13649 [cs.LG] https://arxiv.org/abs/2306.13649

[2]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021)

[3]

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, and Haoliang Li. 2026. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models. arXiv:2605.11854 [cs.CL] https://arxiv.org/abs/2605.11854

[4]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[5]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021)

[6]

Ganqu Cui, Liangyuan Yuan, Ning Ding, Zhiwei Yao, Wei Ye, Yujia Wang, Yue Zhang, Jing Xu, Han Zhang, Zini Chen, et al. 2023. UltraFeedback: Boosting Language Models with High-quality Feedback. arXiv preprint arXiv:2310.01377 (2023)

[7]

Ken Ding. 2026. HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation. arXiv preprint arXiv:2603.23871 (2026)

[8]

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. 2025. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347 (2025)

[9]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

[10]

Zihao Han, Tiangang Zhang, Huaibin Wang, and Yilun Sun. 2026. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning. arXiv:2605.11458 [cs.AI] https://arxiv.org/abs/2605.11458

[11]

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. 2026. Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision. arXiv preprint arXiv:2604.12002 (2026). https://arxiv.org/abs/2604.12002

[12]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. Advances in Neural Information Processing Systems 34 (2021), 25774–25786

[13]

HuggingFaceH4. 2024. AIME 2024 Dataset. https://huggingface.co/datasets/HuggingFaceH4/aime_2024

[14]

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. 2026. Reinforcement Learning via Self-Distillation. arXiv:2601.20802 [cs.LG] https://arxiv.org/abs/2601.20802

[15]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint (2024)

[16]

Minbyul Jeong. 2026. Healthcare AI GYM for Medical Agents. arXiv:2605.02943 [cs.LG] https://arxiv.org/abs/2605.02943

[17]

Jiaming Ji, Meng Liu, Juntao Dai, Xuehai Pan, Ce Zhang, Chi Bian, Botao Chen, Rui Sun, Yashi Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. Advances in Neural Information Processing Systems 36 (2023), 24621–24658

[18]

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, and Steven Hoi. 2026. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models. arXiv:2605.05204 [cs.CV] https://arxiv.org/abs/2605.05204

[19]

Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, and Hao Henry Wang. 2025. Importance-Aware Data Selection for Efficient LLM Instruction Tuning. arXiv:2511.07074 [cs.CL] https://arxiv.org/abs/2511.07074

[20]

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, and Srijan Kumar. 2026. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models. arXiv:2605.06597 [cs.CL] https://arxiv.org/abs/2605.06597

[21]

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, and Linfeng Zhang. 2026. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning. arXiv:2605.13255 [cs.AI] https://arxiv.org/abs/2605.13255

[22]

Jeonghye Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. 2026. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR. arXiv:2605.10781 [cs.LG] https://arxiv.org/abs/2605.10781

[23]

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. 2026. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? arXiv:2603.24472 [cs.CL] https://arxiv.org/abs/2603.24472

[24]

John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, and Tom Goldstein. 2026. Multi-Token Prediction via Self-Distillation. arXiv:2602.06019 [cs.CL] https://arxiv.org/abs/2602.06019

[25]

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. 2026. Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing. arXiv:2604.02288 [cs.LG] https://arxiv.org/abs/2604.02288

[26]

Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, and Rui Wang. 2026. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation. arXiv:2605.11853 [cs.LG] https://arxiv.org/abs/2605.11853

[27]

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. 2026. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv:2604.13016 [cs.LG] https://arxiv.org/abs/2604.13016

[28]

Wenjie Liao, Like Wu, Liangjie Zhao, Shihui Xu, and Shigeru Fujimura. 2026. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning. arXiv:2604.20933 [cs.LG] https://arxiv.org/abs/2604.20933

[29]

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, and Hongbo Jin. 2026. VISD: Enhancing Video Reasoning via Structured Self-Distillation. arXiv:2605.06094 [cs.CV] https://arxiv.org/abs/2605.06094

[30]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312 (2014)

[31]

Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, and Hinrich Schütze. 2026. Crosslingual On-Policy Self-Distillation for Multilingual Reasoning. arXiv:2605.09548 [cs.CL] https://arxiv.org/abs/2605.09548

[32]

Kevin Lu and Thinking Machines Lab. 2025. On-Policy Distillation. Thinking Machines Lab: Connectionism (2025). doi:10.64434/tml.20251026 https://thinkingmachines.ai/blog/on-policy-distillation

[33]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

[34]

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. 2026. Privileged Information Distillation for Language Models. arXiv preprint arXiv:2602.04942 (2026)

[35]

Ruiyang Qin, Qingzhuo Wang, Dongrui Liu, Qiang Li, Zhihua Wei, and Wen Shen. 2026. Multilingual Safety Alignment via Self-Distillation. arXiv:2605.02971 [cs.LG] https://arxiv.org/abs/2605.02971

[36]

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. 2026. On-Policy Self-Distillation for Reasoning Compression. arXiv preprint arXiv:2603.05433 (2026)

[37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[38]

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, and Xing Yu. 2026. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information. arXiv:2605.11609 [cs.LG] https://arxiv.org/abs/2605.11609

[39]

Guobin Shen, Lei Huang, Xiang Cheng, Chenxiao Zhao, Jindong Li, Dongcheng Zhao, and Xing Yu. 2026. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation. arXiv:2605.11613 [cs.LG] https://arxiv.org/abs/2605.11613

[40]

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. 2026. Self-Distillation Enables Continual Learning. arXiv:2601.19897 [cs.LG] https://arxiv.org/abs/2601.19897

[41]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768 (2021)

[42]

Mingyang Song and Mao Zheng. 2026. A Survey of On-Policy Distillation for Large Language Models. arXiv:2604.00626 [cs.LG] https://arxiv.org/abs/2604.00626

[43]

Alex Stein, Furong Huang, and Tom Goldstein. 2026. GATES: Self-Distillation under Privileged Context with Consensus Gating. arXiv preprint arXiv:2602.20574 (2026)

[44]

Zhiquan Tan and Yinrong Hong. 2026. PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners. arXiv:2604.26573 [cs.LG] https://arxiv.org/abs/2604.26573

[45]

Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, and Honggang Qi. 2026. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents. arXiv:2604.10674 [cs.LG] https://arxiv.org/abs/2604.10674

[46]

Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, and Sanqiang Zhao. 2026. SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe. arXiv:2410.05248 [cs.CL] https://arxiv.org/abs/2410.05248

[47]

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. 2026. PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence. arXiv:2603.11178 [cs.AI] https://arxiv.org/abs/2603.11178

[48]

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. 2026. Self-Distilled RLVR. arXiv:2604.03128 [cs.LG] https://arxiv.org/abs/2604.03128

[49]

Yuxiao Yang, Xiaoyun Wang, and Weitong Zhang. 2026. OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning. arXiv:2605.12400 [cs.LG] https://arxiv.org/abs/2605.12400

[50]

Shunyu Yao, Howard Jiang, Zilin Chen, Chen Zhang, Yichen Wang, Ruocheng Chen, Wenlong Gu, Zipeng Zhang, and Li Sha. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv preprint arXiv:2207.01206 (2022)

[51]

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. 2026. Online Experiential Learning for Language Models. arXiv preprint arXiv:2603.16856 (2026)

[52]

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. 2026. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275 (2026)

[53]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yu, Weinan Dai, TianTian Fan, Gaohong Liu, Lingjun Liu, et al. 2025. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

[54]

Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, and Qinzhen Guo. 2026. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization. arXiv:2605.05040 [cs.LG] https://arxiv.org/abs/2605.05040

[55]

Xiangyu Yue, Yu Zheng, Zhang Zhang, Steven Gao, Yuhang Wang, Runzhe Chen, Yukun Jia, Yitong Sun, Yizhi Gao, Mark Zhao, et al. 2023. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv preprint arXiv:2311.16502 (2023)

[56]

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. 2026. Embarrassingly Simple Self-Distillation Improves Code Generation. arXiv:2604.01193 [cs.CL] https://arxiv.org/abs/2604.01193
