pith. machine review for the scientific record.

arxiv: 2604.13016 · v2 · submitted 2026-04-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Bingxiang He, Chaojun Xiao, Cheng Qian, Huan-ang Gao, Jinqian Zhang, Ning Ding, Tianyu Yu, Wenkai Yang, Yaxuan Li, Yuxin Zuo, Zhiyuan Liu

Pith reviewed 2026-05-10 14:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords on-policy distillation · large language models · thinking patterns · token alignment · reverse distillation · distillation mechanisms · post-training

The pith

On-policy distillation succeeds only when student and teacher share compatible thinking patterns and the teacher supplies genuinely new capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why on-policy distillation improves some large language models but leaves others unchanged or worse. It identifies two governing conditions: the student and teacher must share compatible thinking patterns, and the teacher must offer abilities the student has not already encountered in training. These conditions are tested in reverse weak-to-strong experiments using same-family models, where the larger teacher appears distributionally identical to the smaller student. At the token level, success shows as progressive alignment on high-probability tokens in states the student visits, with 97 to 99 percent of probability mass concentrated in a small shared token set. The work supplies two recovery methods for failing cases and notes that the dense per-token signal may limit scaling to long-horizon tasks.
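
The token-level claim above is directly measurable from model outputs. Below is a minimal sketch, not the authors' code, of the quantity in question: how much probability mass the student and teacher each place on the union of their top-k tokens at a single student-visited state. The function name and the toy tensors are illustrative; the k=16 default matches the LogProb top-K value reported in the paper's hyperparameter appendix.

    import torch
    import torch.nn.functional as F

    def shared_topk_mass(student_logits, teacher_logits, k=16):
        """Mass each model places on the union of their top-k tokens
        at one student-visited state (both logits have shape [V])."""
        p_s = F.softmax(student_logits, dim=-1)  # student next-token distribution
        p_t = F.softmax(teacher_logits, dim=-1)  # teacher next-token distribution
        shared = torch.unique(torch.cat([
            torch.topk(p_s, k).indices,          # student's high-probability tokens
            torch.topk(p_t, k).indices,          # teacher's high-probability tokens
        ]))                                      # the small shared token set
        return p_s[shared].sum().item(), p_t[shared].sum().item()

    # Toy check on peaked distributions over a 1,000-token vocabulary.
    torch.manual_seed(0)
    logits = 5.0 * torch.randn(1000)
    print(shared_topk_mass(logits, logits + torch.randn(1000)))

On the paper's account, successful OPD runs would show both returned masses climbing toward the reported 97 to 99 percent as training proceeds, while failing runs would not.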

Core claim

The authors establish that on-policy distillation is governed by two conditions. The student and teacher must share compatible thinking patterns. Even when patterns align and the teacher scores higher, the teacher must still provide new capabilities beyond the student's prior training exposure. In same-family reverse distillation tests, 1.5B and 7B teachers prove distributionally indistinguishable from the student's perspective. Successful runs exhibit progressive alignment on high-probability tokens at visited states, with most probability mass held by a small shared token set. Two practical strategies, off-policy cold start and teacher-aligned prompt selection, restore failing distillation.

What carries the argument

The pair of conditions on thinking-pattern compatibility and novel capability provision, supported by the token-level mechanism of progressive alignment on high-probability tokens at student-visited states.

If this is right

  • OPD fails whenever thinking patterns between student and teacher are incompatible, even if the teacher has higher scores.
  • OPD fails when the teacher merely reinforces capabilities the student has already seen, regardless of pattern match.
  • Off-policy cold start recovers failing OPD by altering the initial state distribution the student encounters.
  • Teacher-aligned prompt selection improves OPD by increasing the chance the student visits states where the teacher adds value.
  • The concentration of 97-99% probability mass in a small token set explains the dense reward signal but raises scalability concerns for long sequences; a sketch of this per-token signal follows the list.
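
A minimal sketch of the dense per-token signal referenced in the last point, assuming full teacher logits are available at every student-visited state, as in reverse-KL formulations of OPD; the function and shapes are illustrative rather than the authors' training code.

    import torch
    import torch.nn.functional as F

    def opd_loss(student_logits, teacher_logits):
        """Mean per-token reverse KL(student || teacher) along one rollout.

        Both arguments: [T, V] logits at the T states the student actually
        visited, scored by each model over the same vocabulary of size V.
        """
        log_p_s = F.log_softmax(student_logits, dim=-1)
        log_p_t = F.log_softmax(teacher_logits, dim=-1)
        # One scalar per generated token: a dense signal, in contrast to the
        # single sequence-level reward typical of RL post-training.
        per_token_kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # [T]
        return per_token_kl.mean()

The density is also the scaling worry: every token of a rollout must be scored by the teacher, so supervision cost grows linearly with horizon length.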

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Distillation gains are likely restricted to closely related model families rather than arbitrary architectures.
  • The small shared token set may reflect a general property of language model probability distributions worth examining in other training regimes.
  • Explicit tests across different model families would clarify how far the two conditions extend.
  • Long-horizon applications of OPD may require supplementary techniques to offset the hidden costs of dense token supervision.

Load-bearing premise

The conditions and token-level mechanisms observed in same-family weak-to-strong experiments are assumed to generalize to other model families, architectures, and tasks.

What would settle it

A clear case of successful on-policy distillation where the student and teacher have incompatible thinking patterns or where the teacher adds no new capabilities would disprove the two conditions as necessary.

read the original abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
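
The abstract's second recovery recipe, teacher-aligned prompt selection, can be pictured as ranking candidate training prompts by similarity to prompts from the teacher's own post-training data. Below is a minimal sketch assuming a Sentence-BERT-style encoder (the paper cites Sentence-BERT among its references); the model name, scoring rule, and top_n cutoff are assumptions, not the paper's recipe.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def teacher_aligned_prompts(candidates, teacher_prompts, top_n=1000):
        """Keep the candidates most similar to the teacher's training prompts."""
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
        c = encoder.encode(candidates, normalize_embeddings=True)
        t = encoder.encode(teacher_prompts, normalize_embeddings=True)
        best = (c @ t.T).max(axis=1)   # closest teacher prompt per candidate
        keep = np.argsort(-best)[:top_n]
        return [candidates[i] for i in keep]

The intended effect, on the paper's framing, is to raise the chance that the student visits states where the teacher genuinely adds capability.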

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper investigates the training dynamics of on-policy distillation (OPD) for large language models. It identifies two conditions that govern OPD success or failure: (i) the student and teacher must share compatible thinking patterns, and (ii) the teacher must provide genuinely new capabilities beyond those seen in the student's training. These are validated via weak-to-strong reverse distillation on same-family 1.5B/7B models, where teachers are shown to be distributionally indistinguishable and successful OPD correlates with progressive high-probability token alignment (97-99% mass in a small shared set). The authors propose off-policy cold start and teacher-aligned prompt selection as recovery strategies for failing OPD and raise questions about scalability to long-horizon distillation.

Significance. If the identified conditions and token-level mechanisms prove robust, this work supplies a useful phenomenological and mechanistic account of OPD, a central post-training technique. The concrete experimental observations on distributional indistinguishability and token alignment, together with the two practical recovery recipes, could directly inform distillation practice. The paper is credited for its systematic use of reverse distillation and token probing to surface falsifiable patterns rather than purely theoretical claims.

major comments (2)
  1. [§4] §4 (Validation via weak-to-strong reverse distillation): The central claim that the two conditions 'govern' OPD success or failure rests on experiments confined to same-family 1.5B/7B models. No cross-family or cross-architecture trials are reported to test whether thinking-pattern compatibility or new-capability detection remain predictive when pretraining distributions or inductive biases differ; this untested universality assumption is load-bearing for the governance assertion.
  2. [§3] §3 (Phenomenology of the two conditions): The second condition (teacher must supply genuinely new capabilities) is operationalized primarily through score improvement and distributional overlap; without an independent metric for 'new' capabilities (e.g., out-of-distribution task performance or capability probes), the condition risks being under-specified and difficult to apply outside the tested model family.
minor comments (3)
  1. The methods description would benefit from explicit reporting of statistical tests (e.g., confidence intervals or p-values) supporting the 97-99% probability-mass concentration claim and the progressive alignment curves; a bootstrap sketch of such an interval follows these comments.
  2. Clarify the precise definition and measurement of 'compatible thinking patterns' (e.g., which divergence or probing method is used) so that the first condition can be replicated on new model pairs.
  3. The discussion of long-horizon limitations is acknowledged but remains qualitative; adding even a small-scale experiment on sequence length would strengthen the final claim.
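
A minimal sketch of the interval requested in minor comment 1: a percentile bootstrap over per-state shared-set mass measurements. The data below are hypothetical stand-ins for the paper's measurements.

    import numpy as np

    def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the mean of samples."""
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
        means = samples[idx].mean(axis=1)            # bootstrap replicate means
        lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
        return float(lo), float(hi)

    # Hypothetical per-state masses near the reported 97-99% band.
    rng = np.random.default_rng(1)
    masses = np.clip(rng.normal(0.98, 0.01, size=500), 0.0, 1.0)
    print(bootstrap_ci(masses))  # 95% interval around the observed mean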

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment point by point below, offering clarifications and indicating where we will revise the manuscript to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: §4 (Validation via weak-to-strong reverse distillation): The central claim that the two conditions 'govern' OPD success or failure rests on experiments confined to same-family 1.5B/7B models. No cross-family or cross-architecture trials are reported to test whether thinking-pattern compatibility or new-capability detection remain predictive when pretraining distributions or inductive biases differ; this untested universality assumption is load-bearing for the governance assertion.

    Authors: We deliberately restricted the experiments to same-family 1.5B/7B models to isolate the effects of thinking-pattern compatibility and capability gaps without confounding variables from differing pretraining corpora or architectures. In this controlled weak-to-strong reverse distillation setup, the two conditions are shown to be predictive of OPD outcomes. The manuscript does not claim these conditions universally govern OPD for arbitrary model pairs; we will revise the abstract, introduction, and §4 to explicitly qualify the governance claim as holding within the tested same-family regime and to list cross-family and cross-architecture validation as an important open direction for future work. revision: partial

  2. Referee: §3 (Phenomenology of the two conditions): The second condition (teacher must supply genuinely new capabilities) is operationalized primarily through score improvement and distributional overlap; without an independent metric for 'new' capabilities (e.g., out-of-distribution task performance or capability probes), the condition risks being under-specified and difficult to apply outside the tested model family.

    Authors: We agree that a more independent metric would improve applicability. In the current work, 'new capabilities' are operationalized as the teacher's ability to produce higher task scores together with token distributions that the student cannot initially match, as directly measured in the reverse distillation experiments where the 7B teacher remains distributionally indistinguishable from the 1.5B student's perspective. We will revise §3 to state this operationalization more explicitly and add a brief discussion of potential independent metrics (such as OOD task probes) in the limitations section. No additional experiments are feasible within the current revision cycle. revision: partial

Circularity Check

0 steps flagged

Empirical findings on OPD conditions are self-contained

full rationale

The paper's central claims consist of two conditions for OPD success identified through direct experimental observation in weak-to-strong reverse distillation on same-family models, followed by token-level probing and practical strategy proposals. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive or justify the conditions; the results are presented as outcomes of the described experiments rather than reductions to prior inputs by construction. The derivation chain remains observational and does not loop back on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest primarily on empirical observations from experiments rather than new mathematical axioms or invented entities; standard machine learning assumptions about model behavior and data distributions are invoked implicitly.

axioms (1)
  • domain assumption: Standard assumptions about model training dynamics and token probability distributions in language models
    The analysis of thinking patterns and token alignment relies on typical LLM training and evaluation setups without explicit new axioms.

pith-pipeline@v0.9.0 · 5553 in / 1263 out tokens · 57117 ms · 2026-05-10T14:59:55.284414+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

    cs.LG 2026-05 unverdicted novelty 7.0

    Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

  2. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  3. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  4. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  5. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  6. MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

    cs.CL 2026-05 unverdicted novelty 7.0

    MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

  7. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  8. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  9. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  10. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  11. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  12. Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    Prune-OPD dynamically prunes unreliable teacher rewards in on-policy distillation by monitoring prefix drift via top-k overlap, reducing training time 37.6-68% on AMC/AIME/HMMT while preserving or improving performance.

  13. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  14. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  15. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  16. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  17. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

  18. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  19. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  20. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  21. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

Reference graph

Works this paper leans on

36 extracted references · 25 canonical work pages · cited by 16 Pith papers · 15 internal anchors

  1. [1]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281.

  2. [2]

    Distillation Scaling Laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. arXiv preprint arXiv:2502.08606.

  3. [3]

    HDPO: Hybrid distillation policy optimization via privileged self-distillation

    Ken Ding. HDPO: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026.

  4. [4]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562.

  5. [5]

    MiniLLM: Knowledge Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.

  6. [6]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

  7. [7]

    JustRL: Scaling a 1.5B LLM with a simple RL recipe

    Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. JustRL: Scaling a 1.5B LLM with a simple RL recipe. arXiv preprint arXiv:2512.16649, 2025a. Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. H...

  8. [8]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025b. Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. DeepMath-103K: A la...

  9. [9]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.

  10. [10]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155.

  11. [11]

    TinyBERT: Distilling BERT for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174.

  12. [12]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079.

  13. [13]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472.

  14. [14]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.

  15. [15]

    Scaling reasoning efficiently via relaxed on-policy distillation

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026.

  16. [16]

    Unifying group-relative and self-distillation policy optimization via sample routing

    Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026.

  17. [17]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25366–25394.

  18. [18]

    On-Policy Distillation (Thinking Machines Lab: Connectionism)

    On-policy distillation. Thinking Machines Lab: Connectionism. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5191–5198.

  19. [19]

    Sentence-BERT: Sentence embeddings using Siamese BERT-networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

  20. [20]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    URL http://arxiv.org/abs/1908.10084. Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. CRISP: Compressed reasoning via iterative self-policy distillation.

  21. [21]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  23. [23]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897.

  24. [24]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780.

  25. [25]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  26. [26]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026a. Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv ...

  27. [27]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.

  28. [28]

    Self-distillation for multi-token prediction

    Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, and Xingwu Sun. Self-distillation for multi-token prediction, 2026a. URL https://arxiv.org/abs/2603.23911. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint ...

  29. [29]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Association for Computational Linguistics. URL http://arxiv.org/abs/2403.13372. A. Details for Section 3. A.1. GRPO Training Details. Base Model. We initialize GRPO training from Qwen3-4B-Base. Training Dataset. We use the processed DAPO-Math-17K dataset for GR...

  30. [30]

    Benchmark-wise breakdown of thinking-pattern compatibility

    A.3. Benchmark-wise breakdown of thinking-pattern compatibility. To further unpack the averaged result in Figure 2, Figure 17 presents a benchmark-wise breakdown. The advantage of distillation from Qwen3-4B-Base-GRPO is broadly consistent across datasets rather than being driven by a single benchmark. The gap is more pronounced on AMC 2023 and AIME 2024, a...

  31. [31]

    This per-benchmark view supports the interpretation that better early-stage thinking-pattern compatibility leads to better downstream distillation performance. Table 2 | Default hyperparameters for OPD. Item / Value: Training temperature 1.0; Global batch size 64; Mini batch size 64; Rollout number 4; LogProb top-K 16; Top-K strategy Student Top-K; Top-p 1.0; Max prompt length 1024; Max response l...

  32. [32]

    Distillation from Qwen3-4B-Base-GRPO consistently matches or outperforms distillation from Qwen3-4B (Non-thinking) across the three benchmarks

    We report results on AIME 2024, AIME 2025, and AMC 2023 separately. Distillation from Qwen3-4B-Base-GRPO consistently matches or outperforms distillation from Qwen3-4B (Non-thinking) across the three benchmarks, supporting the interpretation that better early-stage thinking-pattern compatibility leads to better downstream distillation performance, and the loss from...

  33. [33]

    The second is the gradient norm, which measures the overall magnitude of the update signal reaching the student. The third is the probability difference p_t(v) − q_t(v) on the token with the largest absolute advantage, which tracks whether the student can reduce the most pronounced local disagreement with the teacher on the tokens that carry the strongest optimization signal. Together...

  34. [34]

    In contrast, with R1-Distill-14B as the teacher, training shows little improvement and the alignment metrics remain poor or unstable

    With Skywork-OR1-Math-7B as the teacher, distillation improves student performance and is accompanied by steadily increasing overlap ratio, overlap-token advantage approaching zero, and a small entropy gap. In contrast, with R1-Distill-14B as the teacher, training shows little improvement and the alignment metrics remain poor or unstable. This provides ad...

  35. [35]

    Table 3 | SFT hyperparameters for cold-start distillation from Qwen3-4B (Non-thinking) to Qwen3-1.7B-Base. Hyper-parameter / Value: Student model Qwen3-1.7B-Base; Training objective Full-parameter SFT; Template qwen3; Training epochs 1; Sequence length 14,336; Per-d...

  36. [36]

    Using the teacher-aligned template consistently matches or outperforms the original DAPO template across the three benchmarks. C.3. Deduplication Details for the DeepMath Subset. For the cross-size setting, we construct a DeepMath subset deduplicated against DAPO-Math-17K in order to compare prompts aligned with the teacher's RL post-training data against ...