
arxiv: 2605.18141 · v2 · pith:AE72JTNInew · submitted 2026-05-18 · 💻 cs.HC

A Brief Overview: On-Policy Self-Distillation In Large Language Models

Pith reviewed 2026-05-22 09:48 UTC · model grok-4.3

classification 💻 cs.HC
keywords On-Policy Self-Distillation · Large Language Models · Knowledge Distillation · Reasoning Alignment · Memory Reduction · Self-Teaching Frameworks

The pith

On-policy self-distillation lets one large language model serve as both teacher and student to align reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper overviews On-Policy Self-Distillation (OPSD), a framework where a single LLM plays dual roles during training. The teacher role accesses verified reasoning traces while the student role sees only the problem. Training minimizes per-token divergence on trajectories from the student role to align the model with better rationalizations. This removes the need for a separate teacher model and addresses distribution mismatch issues. The overview explains the foundations for newcomers to the field.

Core claim

On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation.
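
The objective described in this claim can be written compactly. The following is a sketch under stated assumptions, not a formula taken from the paper: the symbols $x$ (problem statement), $r$ (verified reasoning trace), $y$ (student-sampled trajectory), $\pi_\theta$ (the single shared model) and $D$ (a per-token divergence) are notation introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \frac{1}{|y|} \sum_{t=1}^{|y|}
      D\!\left( \pi_\theta(\cdot \mid x, y_{<t})
        \,\middle\|\, \pi_\theta(\cdot \mid x, r, y_{<t}) \right)
    \right]
```

Both distributions come from the same parameters $\theta$; only the conditioning context differs. The source does not state which divergence $D$ is used, its direction, or whether gradients flow through the teacher-role term, so those choices are left open here.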

What carries the argument

A single model in two roles: the teacher role is given verified reasoning traces, the student role sees only the problem statement, and training minimizes the per-token distributional divergence between them on student-sampled trajectories.
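
To make the mechanism concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: `toy_logits` is a hypothetical stand-in for a forward pass of the shared model, the token ids are made up, and the divergence is taken to be KL(student || teacher), a direction the source does not specify.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_logits(context, vocab_size=4):
    """Hypothetical stand-in for one forward pass of the shared model.

    A real LLM would go here; this deterministic toy only makes the
    output depend on the tokens in the context.
    """
    score = sum((pos + 1) * tok for pos, tok in enumerate(context))
    return [((score + 3 * v) % 5) * 0.5 for v in range(vocab_size)]


def sample_student_rollout(prompt, length, rng):
    """Sample a trajectory from the student role (problem statement only)."""
    rollout = []
    for _ in range(length):
        probs = softmax(toy_logits(prompt + rollout))
        token = rng.choices(range(len(probs)), weights=probs)[0]
        rollout.append(token)
    return rollout


def opsd_loss(prompt, trace, rollout):
    """Mean per-token divergence between the two roles on one student rollout."""
    total = 0.0
    for t in range(len(rollout)):
        prefix = rollout[:t]
        student = softmax(toy_logits(prompt + prefix))          # sees the problem only
        teacher = softmax(toy_logits(prompt + trace + prefix))  # also sees the trace
        total += kl_divergence(student, teacher)  # KL direction is an assumption
    return total / len(rollout)


rng = random.Random(0)
prompt = [1, 2]  # token ids of a hypothetical problem statement
trace = [3, 1]   # token ids of a hypothetical verified reasoning trace
rollout = sample_student_rollout(prompt, length=5, rng=rng)
print("rollout:", rollout)
print("loss with trace:", opsd_loss(prompt, trace, rollout))
print("loss without trace:", opsd_loss(prompt, [], rollout))
```

With an empty trace the two roles see identical contexts, so the loss is exactly zero; the training signal comes entirely from the privileged context. In an actual training loop this scalar would be differentiated with respect to the model parameters, which the toy omits.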

Load-bearing premise

That verified reasoning traces are reliably available to the teacher role and that the divergence minimization on student trajectories produces stable alignment without amplifying errors.

What would settle it

A test where removing access to verified traces or using noisy traces causes the model to perform worse than baseline training methods on reasoning tasks.

Figures

Figures reproduced from arXiv: 2605.18141 by Fangming Cui, Jiahong Li, Sunan Li.

Figure 1: Demonstration of SFT, GRPO, OPD and OPSD.
Original abstract

On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript is a brief overview of On-Policy Self-Distillation (OPSD) for large language models. It defines OPSD as a unified framework in which a single LLM simultaneously serves as teacher (with privileged access to verified reasoning traces) and student (observing only the problem statement). Training minimizes per-token distributional divergence between the two roles on trajectories sampled from the student policy. The paper claims this eliminates external teachers, leverages ground-truth solutions directly, resolves off-policy distribution mismatch, and reduces GPU memory consumption by 40%-60% relative to standard On-Policy Distillation (OPD). The discussion covers conceptual foundations, methodological innovations, and design principles aimed at beginners.

Significance. If the described benefits hold, OPSD could provide a practical route to memory-efficient, self-contained alignment of LLMs without separate teacher models. The claimed 40-60% memory reduction would be a concrete engineering advantage for scaling distillation. However, because the manuscript supplies only descriptive synthesis and no new derivations, experiments, or stability analysis, its significance is limited to potential utility as an introductory summary rather than a substantive advance.
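
A back-of-envelope calculation shows what a memory saving of this kind depends on. The sketch below is illustrative arithmetic, not a measurement and not the paper's accounting. It assumes mixed-precision Adam training at roughly 16 bytes per trainable parameter, a frozen 16-bit teacher at 2 bytes per parameter, and an OPSD teacher role that reuses the student's live weights. It ignores activations, KV caches, and sharding. The function names and model sizes are hypothetical.

```python
def weights_memory_gb(trainable_b, frozen_b,
                      bytes_per_trainable=16.0, bytes_per_frozen=2.0):
    """Approximate weight-related memory in GB.

    Parameter counts are in billions, so billions * bytes gives GB.
    The per-parameter byte costs are assumptions (see the text above).
    """
    return trainable_b * bytes_per_trainable + frozen_b * bytes_per_frozen


def relative_saving(student_b, teacher_b):
    """Fraction of weight-related memory saved by dropping the separate teacher."""
    opd = weights_memory_gb(student_b, teacher_b)  # trained student + frozen teacher
    opsd = weights_memory_gb(student_b, 0.0)       # one model plays both roles
    return 1.0 - opsd / opd


for teacher_b in (7.0, 32.0, 70.0):
    saving = relative_saving(7.0, teacher_b)
    print(f"7B student, {teacher_b:.0f}B teacher: saving {saving:.1%}")
```

Under these assumptions a same-size teacher yields a saving of about 11%, while a teacher ten times larger yields about 56%. The result is driven almost entirely by the size of the teacher being replaced, which is why the baseline configuration matters when interpreting a 40%-60% figure.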

major comments (2)
  1. [Abstract] The quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.
  2. [Abstract and framework description] The setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.
minor comments (1)
  1. The manuscript would benefit from explicit section headings or numbered subsections to improve readability for the intended beginner audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our brief overview manuscript. We appreciate the identification of areas where claims require better qualification given the paper's scope as a conceptual synthesis rather than a source of new empirical results. We address each major comment below and outline the planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.

    Authors: We agree that the manuscript, being an overview without new experiments, should not present the 40-60% figure as an unsubstantiated assertion. This range is intended to reflect reported outcomes from prior OPSD implementations in the literature that the paper synthesizes. In the revised version we will either insert citations to the specific studies documenting these memory reductions or rephrase the statement to indicate that such savings have been observed in existing OPSD work, thereby removing any implication that the figure is a new result of this overview. revision: yes

  2. Referee: [Abstract and framework description] The setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.

    Authors: The referee is correct that the paper supplies no new derivations or formal stability analysis; this is consistent with its purpose as a beginner-oriented overview of existing design principles. The on-policy sampling combined with privileged access to verified traces is presented as the mechanism intended to limit distribution shift and error accumulation. We will revise the abstract and framework sections to state these assumptions more explicitly and to include a concise paragraph noting potential limitations, such as dependence on trace quality and the desirability of empirical checks for drift in particular domains, while pointing readers to the referenced empirical studies for further investigation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive overview without derivations or fitted predictions

Full rationale

The paper is explicitly framed as a 'brief overview' and 'brief analysis of the conceptual foundations' of OPSD. It defines the framework in prose (single model as teacher/student with privileged traces, minimize per-token divergence on student trajectories) but supplies no equations, no parameter fitting, no predictions, and no self-citation chains that bear the central claim. The memory-reduction claim is stated as an empirical observation rather than a derived result. No load-bearing step reduces to its own inputs by construction; the content remains self-contained description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an overview paper the work introduces no new mathematical objects, fitted parameters, or invented entities; it describes an existing framework.



