Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Pith reviewed 2026-05-10 14:59 UTC · model grok-4.3
The pith
On-policy distillation succeeds only when student and teacher share compatible thinking patterns and the teacher supplies genuinely new capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that on-policy distillation is governed by two conditions. First, the student and teacher must share compatible thinking patterns. Second, even when patterns align and the teacher scores higher, the teacher must still provide capabilities genuinely new relative to the student's prior training exposure. In same-family reverse distillation tests, 1.5B and 7B teachers prove distributionally indistinguishable from the student's perspective. Successful runs exhibit progressive alignment on high-probability tokens at visited states, with most of the probability mass held by a small shared token set. Two practical strategies, off-policy cold start and teacher-aligned prompt selection, restore failing distillation.
What carries the argument
The pair of conditions on thinking-pattern compatibility and novel capability provision, supported by the token-level mechanism of progressive alignment on high-probability tokens at student-visited states.
If this is right
- OPD fails whenever thinking patterns between student and teacher are incompatible, even if the teacher has higher scores.
- OPD fails when the teacher merely reinforces capabilities the student has already seen, regardless of pattern match.
- Off-policy cold start recovers failing OPD by altering the initial state distribution the student encounters.
- Teacher-aligned prompt selection improves OPD by increasing the chance the student visits states where the teacher adds value.
- The concentration of 97-99% probability mass in a small token set explains the dense reward signal but raises scalability concerns for long sequences.
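The 97-99% figure is a measurable quantity. As a minimal sketch of how such a concentration could be checked, assuming access to both models' next-token distributions at a visited state (this is not the paper's actual measurement code, and the function names are illustrative):

```python
def topk_mass(probs, k):
    """Fraction of next-token probability mass held by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])

def shared_topk(p_student, p_teacher, k):
    """Union of both models' top-k token ids: a candidate for the 'small
    shared token set' that the paper reports holds most of the mass."""
    def top(p):
        return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    return sorted(set(top(p_student)) | set(top(p_teacher)))
```

On a sharply peaked distribution, `topk_mass(p, 16)` would be expected to sit near 1.0, mirroring the reported 97-99% concentration; measuring `topk_mass` over the ids returned by `shared_topk` at many visited states would reproduce the kind of statistic the paper reports.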
Where Pith is reading between the lines
- Distillation gains are likely restricted to closely related model families rather than arbitrary architectures.
- The small shared token set may reflect a general property of language model probability distributions worth examining in other training regimes.
- Explicit tests across different model families would clarify how far the two conditions extend.
- Long-horizon applications of OPD may require supplementary techniques to offset the hidden costs of dense token supervision.
Load-bearing premise
The conditions and token-level mechanisms observed in same-family weak-to-strong experiments are assumed to generalize to other model families, architectures, and tasks.
What would settle it
A clear case of successful on-policy distillation where the student and teacher have incompatible thinking patterns or where the teacher adds no new capabilities would disprove the two conditions as necessary.
Original abstract
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
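For readers unfamiliar with the setup, the dense token-level reward the abstract refers to is commonly the negated per-token reverse KL between student and teacher at each state the student visits during its own rollout. A minimal sketch under that common formulation (the paper may use a variant; `eps` and the function names are assumptions of this sketch):

```python
import math

def per_token_reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL D(student || teacher) at one student-visited state.

    On-policy distillation negates this to obtain a dense per-token
    reward on the student's own rollout.
    """
    return sum(max(p, eps) * (math.log(max(p, eps)) - math.log(max(q, eps)))
               for p, q in zip(student_probs, teacher_probs))

def dense_rewards(student_dists, teacher_dists):
    """One reward per sampled token: zero when the student already matches
    the teacher at that state, increasingly negative as they diverge."""
    return [-per_token_reverse_kl(p, q)
            for p, q in zip(student_dists, teacher_dists)]
```

Because every token of every rollout yields a reward, the signal is far denser than outcome-level RL, which is exactly the "free lunch" whose long-horizon cost the abstract questions.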
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the training dynamics of on-policy distillation (OPD) for large language models. It identifies two conditions that govern OPD success or failure: (i) the student and teacher must share compatible thinking patterns, and (ii) the teacher must provide genuinely new capabilities beyond those seen in the student's training. These are validated via weak-to-strong reverse distillation on same-family 1.5B/7B models, where teachers are shown to be distributionally indistinguishable and successful OPD correlates with progressive high-probability token alignment (97-99% mass in a small shared set). The authors propose off-policy cold start and teacher-aligned prompt selection as recovery strategies for failing OPD and raise questions about scalability to long-horizon distillation.
Significance. If the identified conditions and token-level mechanisms prove robust, this work supplies a useful phenomenological and mechanistic account of OPD, a central post-training technique. The concrete experimental observations on distributional indistinguishability and token alignment, together with the two practical recovery recipes, could directly inform distillation practice. The paper is credited for its systematic use of reverse distillation and token probing to surface falsifiable patterns rather than purely theoretical claims.
Major comments (2)
- [§4] §4 (Validation via weak-to-strong reverse distillation): The central claim that the two conditions 'govern' OPD success or failure rests on experiments confined to same-family 1.5B/7B models. No cross-family or cross-architecture trials are reported to test whether thinking-pattern compatibility or new-capability detection remain predictive when pretraining distributions or inductive biases differ; this untested universality assumption is load-bearing for the governance assertion.
- [§3] §3 (Phenomenology of the two conditions): The second condition (teacher must supply genuinely new capabilities) is operationalized primarily through score improvement and distributional overlap; without an independent metric for 'new' capabilities (e.g., out-of-distribution task performance or capability probes), the condition risks being under-specified and difficult to apply outside the tested model family.
Minor comments (3)
- The methods description would benefit from explicit reporting of statistical tests (e.g., confidence intervals or p-values) supporting the 97-99% probability-mass concentration claim and the progressive alignment curves.
- Clarify the precise definition and measurement of 'compatible thinking patterns' (e.g., which divergence or probing method is used) so that the first condition can be replicated on new model pairs.
- The discussion of long-horizon limitations is acknowledged but remains qualitative; adding even a small-scale experiment on sequence length would strengthen the final claim.
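The request to pin down "compatible thinking patterns" can be made concrete. One replicable proxy is the top-k next-token overlap between student and teacher at student-visited states; a hypothetical sketch (the paper's own divergence or probing method may differ, and `k=16` here simply echoes the top-K value common in OPD setups):

```python
def topk_overlap_ratio(p_student, p_teacher, k=16):
    """Jaccard overlap of the two models' top-k next-token sets.

    A hypothetical proxy for 'compatible thinking patterns': sustained
    high overlap at student-visited states suggests both models favor
    the same continuations. Not necessarily the paper's exact metric.
    """
    def top(p):
        return set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    s, t = top(p_student), top(p_teacher)
    return len(s & t) / len(s | t)
```

Reporting this ratio averaged over rollout positions, with a stated k, would make the first condition testable on new model pairs.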
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment point by point below, offering clarifications and indicating where we will revise the manuscript to strengthen the presentation of our claims.
Point-by-point responses
Referee: §4 (Validation via weak-to-strong reverse distillation): The central claim that the two conditions 'govern' OPD success or failure rests on experiments confined to same-family 1.5B/7B models. No cross-family or cross-architecture trials are reported to test whether thinking-pattern compatibility or new-capability detection remain predictive when pretraining distributions or inductive biases differ; this untested universality assumption is load-bearing for the governance assertion.
Authors: We deliberately restricted the experiments to same-family 1.5B/7B models to isolate the effects of thinking-pattern compatibility and capability gaps without confounding variables from differing pretraining corpora or architectures. In this controlled weak-to-strong reverse distillation setup, the two conditions are shown to be predictive of OPD outcomes. The manuscript does not claim these conditions universally govern OPD for arbitrary model pairs; we will revise the abstract, introduction, and §4 to explicitly qualify the governance claim as holding within the tested same-family regime and to list cross-family and cross-architecture validation as an important open direction for future work. revision: partial
Referee: §3 (Phenomenology of the two conditions): The second condition (teacher must supply genuinely new capabilities) is operationalized primarily through score improvement and distributional overlap; without an independent metric for 'new' capabilities (e.g., out-of-distribution task performance or capability probes), the condition risks being under-specified and difficult to apply outside the tested model family.
Authors: We agree that a more independent metric would improve applicability. In the current work, 'new capabilities' are operationalized as the teacher's ability to produce higher task scores together with token distributions that the student cannot initially match, as directly measured in the reverse distillation experiments where the 7B teacher remains distributionally indistinguishable from the 1.5B student's perspective. We will revise §3 to state this operationalization more explicitly and add a brief discussion of potential independent metrics (such as OOD task probes) in the limitations section. No additional experiments are feasible within the current revision cycle. revision: partial
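The operationalization in this response ("token distributions the student cannot initially match") could be probed with a simple two-teacher comparison at student-visited states. A hypothetical sketch using total variation distance; the divergence, the tolerance `tol`, and the data layout are all assumptions of this sketch, not the paper's:

```python
def total_variation(p, q):
    """Total variation distance between two next-token distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def teachers_indistinguishable(visited_states, teacher_a, teacher_b, tol=0.02):
    """Hypothetical check: from the student's perspective, two teachers are
    indistinguishable if their next-token distributions stay within tol
    total variation at every state the student visits.

    teacher_a[i] / teacher_b[i] are the teachers' distributions at state i.
    """
    return all(total_variation(teacher_a[i], teacher_b[i]) < tol
               for i in visited_states)
```

A teacher that passes this check against a weaker one, as the 7B teacher reportedly does against the 1.5B model, offers no signal the student can exploit, which is the failure mode the second condition names.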
Circularity Check
Empirical findings on OPD conditions are self-contained
Full rationale
The paper's central claims consist of two conditions for OPD success identified through direct experimental observation in weak-to-strong reverse distillation on same-family models, followed by token-level probing and practical strategy proposals. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive or justify the conditions; the results are presented as outcomes of the described experiments rather than reductions to prior inputs by construction. The derivation chain remains observational and does not loop back on itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard assumptions about model training dynamics and token probability distributions in language models.
Forward citations
Cited by 21 Pith papers
- Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why. Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs. On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- KL for a KL: On-Policy Distillation with Control Variate Baseline. vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Rubric-based On-policy Distillation. Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
- VISD: Enhancing Video Reasoning via Structured Self-Distillation. VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate. MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding. GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding. GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation. On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation. On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information. Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
- Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning. Prune-OPD dynamically prunes unreliable teacher rewards in on-policy distillation by monitoring prefix drift via top-k overlap, reducing training time 37.6-68% on AMC/AIME/HMMT while preserving or improving performance.
- SOD: Step-wise On-policy Distillation for Small Language Model Agents. SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation. SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
- Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe. Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
- Co-Evolving Policy Distillation. CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training. Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection. BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection. BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation. VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
- VISD: Enhancing Video Reasoning via Structured Self-Distillation. VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.