pith. machine review for the scientific record

arxiv: 2605.12652 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link · Lean Theorem

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: on-policy distillation · multi-rollout · peer conditioning · success-failure contrast · language model post-training · reasoning benchmarks · verifier alignment

The pith

By conditioning teacher signals on both successful and failed peer rollouts from the same prompt, multi-rollout on-policy distillation supplies denser and better-aligned supervision than single-rollout baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard on-policy distillation wastes information by treating each student rollout in isolation, even though the student has already generated multiple attempts for the same prompt. MOPD instead feeds the teacher both the successes and the failures within that local group so that positive patterns can be reinforced and plausible mistakes can be explicitly discouraged. This produces token-level targets that track external verifier rewards more closely than isolated distillation does. A sympathetic reader would care because sparse verifier rewards are the dominant training signal for reasoning models, and any method that extracts more signal from the same samples could reduce the cost of post-training. The experiments show the approach works across competitive programming, mathematical reasoning, scientific question answering, and tool-use tasks.

Core claim

MOPD constructs teacher signals by conditioning on the student's local rollout group, employing both positive peer imitation and contrastive success-failure conditioning; the resulting mixed contexts yield teacher scores that align more closely with verifier rewards and deliver consistent gains over standard on-policy distillation baselines on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks.

What carries the argument

The peer-conditioned distillation framework that builds teacher targets from the student's own multi-rollout group by contrasting successful and failed trajectories for the identical prompt.
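
To make the mechanism concrete, here is a minimal sketch of one MOPD-style update, assuming a "2 success + 1 failure" peer context like the one Figure 5 reports as strongest. The rollout sampler, verifier, and teacher/student scoring functions are stubs standing in for real models, and the squared-error surrogate stands in for whatever token-level distillation loss the paper actually uses; none of these interfaces appear in the excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real components; every name below is hypothetical,
# since the paper excerpt specifies none of these interfaces.

def sample_rollouts(prompt, n=8):
    """Sample n student rollouts for one prompt (stubbed with toy strings)."""
    return [f"{prompt} :: attempt-{i}" for i in range(n)]

def verify(rollout):
    """Binary verifier reward (stubbed with a coin flip)."""
    return int(rng.random() < 0.4)

def teacher_token_logprobs(rollout, peer_context, n_tokens=16):
    """Token-level teacher log-probs for `rollout`, conditioned on peers.
    A real system would prepend the peer successes/failures to the teacher's
    input before scoring the rollout's tokens; here we return noise."""
    return rng.normal(-1.0, 0.3, size=n_tokens)

def student_token_logprobs(rollout, n_tokens=16):
    """Token-level student log-probs for its own rollout (stubbed)."""
    return rng.normal(-1.2, 0.3, size=n_tokens)

# One MOPD-style update for a single prompt.
prompt = "Solve: 12 * 7 = ?"
rollouts = sample_rollouts(prompt)
rewards = [verify(r) for r in rollouts]
successes = [r for r, w in zip(rollouts, rewards) if w == 1]
failures = [r for r, w in zip(rollouts, rewards) if w == 0]

losses = []
for rollout in rollouts:
    # Contrastive success-failure conditioning: the teacher sees peer
    # successes as positive evidence and peer failures as negative evidence,
    # excluding the rollout currently being scored.
    peer_context = {
        "successes": [s for s in successes if s != rollout][:2],
        "failures": [f for f in failures if f != rollout][:1],
    }
    t_lp = teacher_token_logprobs(rollout, peer_context)
    s_lp = student_token_logprobs(rollout)
    # Squared error between log-probs as a stand-in for the token-level
    # distillation objective (the paper's exact loss is not shown here).
    losses.append(np.mean((s_lp - t_lp) ** 2))

print(f"mean distillation loss over {len(rollouts)} rollouts: {np.mean(losses):.4f}")
```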

If this is right

  • Distillation performance improves when the teacher sees both correct and incorrect student attempts for the same prompt rather than one attempt at a time.
  • Mixed success-failure contexts increase the correlation between the teacher's token-level scores and the external verifier's binary reward.
  • On-policy methods become more effective when they treat the student's trial-and-error set as a structured source of positive and negative evidence.
  • The gains appear across four distinct reasoning domains, suggesting the mechanism is not tied to any single task format.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same peer-conditioning idea could be applied in other on-policy RL settings where multiple trajectories are sampled per state to sharpen value estimates.
  • If the peer-group construction is kept instance-adaptive, it may reduce the need for hand-crafted negative examples or additional preference data.
  • The alignment result suggests that future verifier design could be guided by how well its signals match what a multi-rollout teacher already discovers.

Load-bearing premise

The student's local set of rollouts for a given prompt supplies teacher signals that are more informative and better aligned with verifier rewards without injecting new selection biases.

What would settle it

An experiment in which mixed success-failure conditioning produces no improvement in task accuracy or no increase in correlation between teacher scores and verifier rewards compared with single-rollout distillation.
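
A hedged sketch of that decisive check, with simulated numbers standing in for real measurements: per-rollout teacher scores under two context conditions are correlated against binary verifier rewards. The coupling strengths below are invented for illustration; on real data, the claim survives only if the mixed-context coefficients exceed the single-rollout ones.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Binary verifier rewards for 200 held-out rollouts (simulated).
verifier = rng.integers(0, 2, size=200)

# Two hypothetical teacher scorings of the same rollouts (e.g., mean token
# log-prob per rollout). Real scores under each context condition would
# replace these simulated ones.
single_ctx = 0.3 * verifier + rng.normal(0.0, 1.0, size=200)
mixed_ctx = 0.9 * verifier + rng.normal(0.0, 1.0, size=200)

for name, scores in [("single-rollout", single_ctx),
                     ("mixed success-failure", mixed_ctx)]:
    r, _ = pearsonr(scores, verifier)
    rho, _ = spearmanr(scores, verifier)
    print(f"{name:>22}: Pearson r = {r:+.3f}, Spearman rho = {rho:+.3f}")
```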

Figures

Figures reproduced from arXiv: 2605.12652 by Chen Henry Wu, Gaurav Mittal, Haixin Wang, Matt Fredrikson, Ruowang Zhang, Weichen Yu, Xiaomin Li, Xiaoze Liu, Yinyi Luo, Yizhou Zhao, Yu Hu.

Figure 1: MOPD Illustration. To directly examine whether peer conditioning improves the self-teacher signal itself, we introduce an analysis of self-teacher signal quality. For each prompt, we fix a set of student-generated rollouts containing both successful and failed attempts, vary only the context shown to the self-teacher, and compare the self-teacher's normalized logits or scores with ground-truth verifier r…
Figure 2: MOPD Pipeline. …to the successes and failures observed in the other rollouts. This prevents the teacher from exploiting local, instance-specific evidence contained in the rollout group. We propose Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that exploits the local structure of multiple on-policy rollouts generated for the same…
Figure 3: Number of training examples that have ever generated a correct answer within the N rollouts during training. Case study: during training, we save the generated rollouts and compare them on the same question across training steps. Additionally, after training for the same number of steps, we save checkpoints from both SDPO and MOPD, then sample from these checkpoints to evaluate whether each…
Figure 4: Self-teacher-signal quality across seven context conditions. Each panel reports an averaged prompt…
Figure 5: Diversity Analysis. …evidence sharpens decision boundaries that positive evidence alone leaves blurred. 4) Combining both types yields the best results: the “2 success + 1 failure” context achieves the highest score on 5 of the 6 ranking and discrimination metrics in the signal-quality analysis, with a competitive Brier score, and the highest LCB downstream mean@8 among the compact peer-context settings. 5)…
Original abstract

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned framework that constructs teacher signals by conditioning on both successful and failed rollouts sampled from the student's local rollout group for each prompt. It evaluates two constructions—positive peer imitation and contrastive success-failure conditioning—on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks, claiming consistent improvements over standard on-policy baselines. Teacher-signal analysis is reported to show that mixed success-failure contexts produce better alignment between teacher scores and external verifier rewards.

Significance. If the empirical gains and alignment results hold under rigorous controls, the work indicates that exploiting intra-prompt multi-rollout diversity can yield more informative, instance-adaptive supervision in on-policy distillation for LLMs trained with sparse verifier rewards, without requiring additional external data.

major comments (3)
  1. [Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.
  2. [Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.
  3. [Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.
minor comments (2)
  1. [Abstract] Abstract: Specify the number of rollouts per prompt and the precise on-policy baselines (e.g., standard OPD, PPO variants) used for comparison.
  2. [Method] Notation: Define the exact conditioning mechanism for contrastive success-failure (e.g., how failures are formatted as negative evidence) with a short illustrative example.
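
One hypothetical shape for that conditioning context, purely illustrative: the paper's actual template, ordering, and delimiter conventions are not given in this excerpt, and every string below is invented.

```python
def build_teacher_context(problem, peer_successes, peer_failures):
    """Assemble a contrastive success-failure context for the teacher.
    Illustrative only; the paper's real formatting is not shown here."""
    parts = [f"Problem:\n{problem}\n"]
    for i, s in enumerate(peer_successes, 1):
        parts.append(f"Verified CORRECT peer attempt {i}:\n{s}\n")
    for i, f in enumerate(peer_failures, 1):
        parts.append(f"Verified INCORRECT peer attempt {i} (mistakes to avoid):\n{f}\n")
    parts.append("Score the following new attempt token by token:")
    return "\n".join(parts)

print(build_teacher_context(
    "Compute 12 * 7.",
    peer_successes=["12 * 7 = 84."],
    peer_failures=["12 * 7 = 74."],
))
```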

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We provide point-by-point responses to the major comments below and will update the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.

    Authors: The manuscript's experimental results section includes tables with performance numbers on all benchmarks, showing improvements over baselines and ablations for the two peer-context constructions. To make these more prominent and address the concern directly, we will add the quantitative deltas, specific ablation comparisons, and statistical significance tests (including p-values) to the abstract, introduction, and a new subsection on statistical analysis in the revised manuscript. revision: yes

  2. Referee: [Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.

    Authors: We agree that this is an important point to verify. Although the success and failure labels provide a natural distinction, we will add experiments reporting rollout similarity metrics (e.g., average pairwise BLEU scores or embedding cosine similarities within rollout groups) and diversity statistics. We will also include a control experiment using cross-prompt negative examples to rule out bias reinforcement and demonstrate the benefit of intra-prompt peer failures. revision: yes
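
A dependency-free sketch of the kind of within-group similarity metric this response proposes, using token-set Jaccard as a cheap stand-in for pairwise BLEU or embedding cosine similarity (the metric choice and the toy rollouts are assumptions, not the authors'):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: a cheap stand-in for the pairwise
    BLEU or embedding-cosine metrics the response proposes."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def mean_pairwise_similarity(rollouts):
    """Average similarity over all rollout pairs in one prompt's group.
    Values near 1.0 would warn that the 'peers' largely share the same
    trajectory, i.e., the same systematic errors."""
    pairs = list(combinations(rollouts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

group = [
    "factor 84 as 12 times 7 so the answer is 84",
    "12 times 7 equals 84",
    "12 times 7 equals 74 because 12 times 6 is 72",
]
print(f"mean pairwise similarity: {mean_pairwise_similarity(group):.3f}")
```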

  3. Referee: [Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.

    Authors: We will revise the teacher-signal analysis to include concrete quantitative metrics. Specifically, we will report correlation coefficients (Pearson and Spearman) between teacher-assigned scores and verifier rewards for positive-only, failure-only, and mixed constructions. We will also add controls accounting for rollout correlations and present per-construction alignment scores to substantiate the claim with numerical evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmark validation

Full rationale

The paper defines MOPD as a framework that constructs teacher signals from the student's own multi-rollout group for each prompt, then reports empirical gains on independent benchmarks (competitive programming, math reasoning, scientific QA, tool-use) against standard on-policy baselines. No equations, derivations, or fitted parameters are shown that reduce the claimed improvements or alignment metrics to quantities defined by the method inputs by construction. Teacher-signal analysis compares against external verifier rewards rather than self-referential quantities. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename known results or smuggle ansatzes. The derivation chain is self-contained as an empirical proposal with measurable external outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that multi-rollout trial-and-error behavior contains structured positive and negative evidence that can be turned into faithful teacher signals; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption: On-policy distillation offers denser token-level supervision than sparse verifier rewards.
    Stated directly in the opening of the abstract as the motivation for OPD.
  • domain assumption: Conditioning the teacher on both successful and failed peer rollouts produces more informative signals.
    Core premise of the MOPD framework introduced in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1335 out tokens · 32483 ms · 2026-05-14T21:25:53.077522+00:00 · methodology

