Multi-Rollout On-Policy Distillation via Peer Successes and Failures
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-14 21:25 UTC · model grok-4.3
The pith
By conditioning teacher signals on both successful and failed peer rollouts from the same prompt, multi-rollout on-policy distillation supplies denser and better-aligned supervision than single-rollout baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MOPD constructs teacher signals by conditioning on the student's local rollout group, employing both positive peer imitation and contrastive success-failure conditioning; the resulting mixed contexts yield teacher scores that align more closely with verifier rewards and deliver consistent gains over standard on-policy distillation baselines on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks.
What carries the argument
The peer-conditioned distillation framework that builds teacher targets from the student's own multi-rollout group by contrasting successful and failed trajectories for the identical prompt.
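The abstract does not spell out the conditioning format, so the sketch below is only one plausible reading of the mechanism: peer rollouts for a prompt are split by verifier outcome, formatted into a teacher context (successes as positive evidence; failures as labelled mistakes under contrastive conditioning), and the teacher then scores the student's own trajectory token by token against that context. The names `Rollout`, `build_peer_context`, and `teacher_logprobs` are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Rollout:
    prompt: str
    tokens: List[str]   # one student-generated trajectory
    reward: float       # binary verifier outcome (1.0 = success)

def build_peer_context(group: Sequence[Rollout], mode: str = "contrastive") -> str:
    """Format peer rollouts for the same prompt into a teacher-conditioning context."""
    successes = [r for r in group if r.reward > 0.5]
    failures = [r for r in group if r.reward <= 0.5]
    parts = [f"Problem:\n{group[0].prompt}"]
    for r in successes:  # positive peer imitation: successes as worked examples
        parts.append("A correct attempt:\n" + " ".join(r.tokens))
    if mode == "contrastive":  # mixed success-failure conditioning
        for r in failures:
            parts.append("An incorrect attempt (a mistake to avoid):\n" + " ".join(r.tokens))
    return "\n\n".join(parts)

def distillation_targets(
    group: Sequence[Rollout],
    student: Rollout,
    teacher_logprobs: Callable[[str, List[str]], List[float]],
    mode: str = "contrastive",
) -> List[float]:
    """Per-token teacher log-probabilities for the student's rollout, conditioned on its peers."""
    peers = [r for r in group if r is not student]
    return teacher_logprobs(build_peer_context(peers, mode), student.tokens)
```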
If this is right
- Distillation performance improves when the teacher sees both correct and incorrect student attempts for the same prompt rather than one attempt at a time.
- Mixed success-failure contexts increase the correlation between the teacher's token-level scores and the external verifier's binary reward (a minimal sketch of this alignment check follows the list).
- On-policy methods become more effective when they treat the student's trial-and-error set as a structured source of positive and negative evidence.
- The gains appear across four distinct reasoning domains, suggesting the mechanism is not tied to any single task format.
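The alignment check in the second bullet can be made concrete as follows, under the assumption that each rollout's token-level teacher log-probabilities are averaged into a sequence score and then correlated with the 0/1 verifier reward (point-biserial, i.e. Pearson with a binary variable). The data values are made up for illustration.

```python
from statistics import mean, pstdev

def alignment_score(teacher_token_logps: list, rewards: list) -> float:
    """Correlation between per-rollout mean teacher log-prob and binary verifier reward."""
    scores = [mean(lp) for lp in teacher_token_logps]   # one sequence score per rollout
    mx, my = mean(scores), mean(rewards)
    cov = mean((x - mx) * (y - my) for x, y in zip(scores, rewards))
    return cov / (pstdev(scores) * pstdev(rewards))

# Toy example: verified-correct rollouts receive higher (less negative) teacher scores.
token_logps = [[-0.3, -0.5], [-1.2, -0.9], [-0.4, -0.2], [-1.0, -1.4]]
rewards = [1, 0, 1, 0]
print(round(alignment_score(token_logps, rewards), 3))  # close to 1.0 on this toy data
```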
Where Pith is reading between the lines
- The same peer-conditioning idea could be applied in other on-policy RL settings where multiple trajectories are sampled per state to sharpen value estimates.
- If the peer-group construction is kept instance-adaptive, it may reduce the need for hand-crafted negative examples or additional preference data.
- The alignment result suggests that future verifier design could be guided by how well its signals match what a multi-rollout teacher already discovers.
Load-bearing premise
The student's local set of rollouts for a given prompt supplies teacher signals that are more informative and better aligned with verifier rewards without injecting new selection biases.
What would settle it
An experiment in which mixed success-failure conditioning produces no improvement in task accuracy or no increase in correlation between teacher scores and verifier rewards compared with single-rollout distillation.
Original abstract
Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned framework that constructs teacher signals by conditioning on both successful and failed rollouts sampled from the student's local rollout group for each prompt. It evaluates two constructions—positive peer imitation and contrastive success-failure conditioning—on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks, claiming consistent improvements over standard on-policy baselines. Teacher-signal analysis is reported to show that mixed success-failure contexts produce better alignment between teacher scores and external verifier rewards.
Significance. If the empirical gains and alignment results hold under rigorous controls, the work indicates that exploiting intra-prompt multi-rollout diversity can yield more informative, instance-adaptive supervision in on-policy distillation for LLMs trained with sparse verifier rewards, without requiring additional external data.
major comments (3)
- [Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.
- [Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.
- [Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.
minor comments (2)
- [Abstract] Abstract: Specify the number of rollouts per prompt and the precise on-policy baselines (e.g., standard OPD, PPO variants) used for comparison.
- [Method] Notation: Define the exact conditioning mechanism for contrastive success-failure (e.g., how failures are formatted as negative evidence) with a short illustrative example.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We provide point-by-point responses to the major comments below and will update the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.
Authors: The manuscript's experimental results section includes tables with performance numbers on all benchmarks, showing improvements over baselines and ablations for the two peer-context constructions. To make these more prominent and address the concern directly, we will add the quantitative deltas, specific ablation comparisons, and statistical significance tests (including p-values) to the abstract, introduction, and a new subsection on statistical analysis in the revised manuscript. revision: yes
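The promised significance analysis is not specified in detail; a generic paired bootstrap over per-problem 0/1 outcomes, sketched below, is one standard way to carry it out. The function and data layout are assumptions, not the authors' code.

```python
import random

def paired_bootstrap_p(mopd, baseline, n_boot=10_000, seed=0):
    """Fraction of bootstrap resamples in which the MOPD-minus-baseline accuracy
    delta is <= 0; a small value supports a reliable improvement on this benchmark."""
    assert len(mopd) == len(baseline)
    rng = random.Random(seed)
    n = len(mopd)
    non_positive = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]          # resample problems with replacement
        delta = sum(mopd[i] - baseline[i] for i in idx) / n
        if delta <= 0:
            non_positive += 1
    return non_positive / n_boot
```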
-
Referee: [Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.
Authors: We agree that this is an important point to verify. Although the success and failure labels provide a natural distinction, we will add experiments reporting rollout similarity metrics (e.g., average pairwise BLEU scores or embedding cosine similarities within rollout groups) and diversity statistics. We will also include a control experiment using cross-prompt negative examples to rule out bias reinforcement and demonstrate the benefit of intra-prompt peer failures. revision: yes
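As a rough illustration of the proposed diversity check, the snippet below computes a cheap within-group similarity proxy (mean pairwise token-level Jaccard); the rebuttal's actual metrics (pairwise BLEU, embedding cosine similarity) would be drop-in replacements. This is a sketch of the proposed control, not reported code.

```python
from itertools import combinations

def group_similarity(rollouts):
    """Mean pairwise Jaccard similarity over whitespace tokens; 1.0 means identical rollouts."""
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
    pairs = list(combinations(rollouts, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
```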
-
Referee: [Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.
Authors: We will revise the teacher-signal analysis to include concrete quantitative metrics. Specifically, we will report correlation coefficients (Pearson and Spearman) between teacher-assigned scores and verifier rewards for positive-only, failure-only, and mixed constructions. We will also add controls accounting for rollout correlations and present per-construction alignment scores to substantiate the claim with numerical evidence. revision: yes
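A minimal sketch of the promised per-construction report, assuming sequence-level teacher scores and matching binary verifier rewards have already been collected for each conditioning variant; `scipy.stats` provides the two correlation tests the rebuttal names. The variant names and data layout are assumptions.

```python
from scipy.stats import pearsonr, spearmanr

def alignment_report(scores: dict, rewards: dict) -> dict:
    """Per-construction Pearson/Spearman correlation between teacher scores and verifier rewards."""
    report = {}
    for name in ("positive_only", "failure_only", "mixed"):
        r, _ = pearsonr(scores[name], rewards[name])
        rho, _ = spearmanr(scores[name], rewards[name])
        report[name] = {"pearson": r, "spearman": rho}
    return report
```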
Circularity Check
No significant circularity; empirical method with external benchmark validation
Full rationale
The paper defines MOPD as a framework that constructs teacher signals from the student's own multi-rollout group for each prompt, then reports empirical gains on independent benchmarks (competitive programming, math reasoning, scientific QA, tool-use) against standard on-policy baselines. No equations, derivations, or fitted parameters are shown that reduce the claimed improvements or alignment metrics to quantities defined by the method inputs by construction. Teacher-signal analysis compares against external verifier rewards rather than self-referential quantities. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename known results or smuggle ansatzes. The derivation chain is self-contained as an empirical proposal with measurable external outcomes.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption On-policy distillation offers denser token-level supervision than sparse verifier rewards
- domain assumption Conditioning the teacher on both successful and failed peer rollouts produces more informative signals