Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

Jialu Liu; Jing Nathan Yan; Woojeong Kim; Ziyi Yang

arxiv: 2605.26606 · v1 · pith:BQLYM5GVnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-29 19:44 UTC · model grok-4.3


keywords: rollout allocation, group-based RL, RL post-training, language model fine-tuning, math reasoning, policy optimization, sampling efficiency, reward variance


The pith

Pilot-Commit allocates rollouts only to high-variance prompts to cut sampling costs in group-based RL while matching accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group-based RL post-training for language models generates multiple rollouts per prompt to compute advantages, but many prompts quickly develop low reward variance and therefore contribute negligible learning signal. The paper argues that rollout budget should be spent only where reward variance remains high, yet the evolving policy means informativeness cannot be precomputed and must be assessed online. Pilot-Commit solves this by spending a small pilot fraction of the budget to rank prompts by estimated informativeness, then committing the remaining rollouts exclusively to the high-signal prompts and skipping the rest. Across math reasoning benchmarks and models from 1.5B to 14B parameters, the method reaches target accuracy using substantially fewer total rollouts than GRPO or DAPO.
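To make the variance argument concrete, here is a minimal sketch of group-relative advantage normalization in the style of GRPO: subtract the group mean and divide by the group standard deviation. The function name and the `eps` constant are illustrative assumptions, not taken from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Standardize the rewards of one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed constant) avoids division by zero when all rewards agree.
    return [(r - mean) / (std + eps) for r in rewards]

# Every rollout solved the prompt: the reward distribution has collapsed.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Half solved, half failed: the highest-variance case for binary rewards.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Under this normalization, a group whose rollouts all receive the same reward produces all-zero advantages, so those rollouts add cost without adding gradient signal.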

Core claim

Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to 1.9× faster than GRPO and 4.0× faster than DAPO in cumulative rollouts.
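A rough way to see where the savings could come from is to compute the expected rollouts spent on one prompt. The sketch below assumes binary rewards, independent rollouts with a fixed success rate `p`, and a skip rule that drops a prompt whenever all of its pilot rewards are identical. All three are simplifying assumptions made here for illustration; the paper's actual selection rule may differ.

```python
def expected_rollouts(p, pilot_n, group_size):
    """Expected rollouts for one prompt under the assumed skip rule.

    p          : assumed per-rollout success probability
    pilot_n    : rollouts spent in the pilot stage
    group_size : rollouts a committed prompt ends up with in total
    """
    # Probability that the pilot sees only successes or only failures.
    skip_prob = p ** pilot_n + (1 - p) ** pilot_n
    # The pilot is always paid; the top-up is paid only if the prompt is kept.
    return pilot_n + (1 - skip_prob) * (group_size - pilot_n)
```

With these assumptions, a fully solved prompt (`p = 1.0`) costs only the pilot, while a prompt at `p = 0.5` costs close to the full group. Uniform allocation would pay `group_size` in both cases.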

What carries the argument

Pilot-Commit, a budget-aware rollout allocation framework that uses a pilot stage to estimate per-prompt informativeness from reward variance and commits the rest of the budget accordingly.
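The two-stage idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `sample_reward` is a hypothetical callable that generates one rollout and returns its reward, prompts with zero pilot variance are skipped, and pilot rollouts are reused inside the committed group.

```python
import itertools
import statistics

def pilot_commit_allocate(prompts, sample_reward, pilot_n, group_size):
    """Pilot every prompt, then commit budget only where pilot variance > 0."""
    # Pilot stage: a few rollouts per prompt.
    pilot = {p: [sample_reward(p) for _ in range(pilot_n)] for p in prompts}
    used = pilot_n * len(prompts)

    # Rank by pilot reward variance (highest first); drop zero-variance prompts.
    variance = {p: statistics.pvariance(pilot[p]) for p in prompts}
    ranked = sorted(prompts, key=lambda p: variance[p], reverse=True)
    committed = [p for p in ranked if variance[p] > 0.0]

    # Commit stage: top up the surviving prompts to a full group.
    groups = {}
    for p in committed:
        extra = [sample_reward(p) for _ in range(group_size - pilot_n)]
        used += len(extra)
        groups[p] = pilot[p] + extra  # assumption: pilot rollouts are reused
    return groups, used

# Toy reward source: "easy" always succeeds, "hard" always fails, "mid" alternates.
_counter = itertools.count()

def toy_reward(prompt):
    if prompt == "easy":
        return 1.0
    if prompt == "hard":
        return 0.0
    return float(next(_counter) % 2)

groups, used = pilot_commit_allocate(
    ["easy", "mid", "hard"], toy_reward, pilot_n=4, group_size=8
)
```

In this toy setup only the "mid" prompt is committed, so the sketch spends 16 rollouts where a uniform group of 8 per prompt would spend 24.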

Load-bearing premise

A small pilot set of rollouts per prompt yields a reliable online estimate of the learning signal that prompt will provide as the policy continues to change during training.

What would settle it

An experiment in which Pilot-Commit produces the same cumulative-rollout accuracy curve as uniform allocation (GRPO) on the same benchmarks and models would falsify the efficiency claim.
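The comparison described above reduces to reading two accuracy curves at a fixed target. A minimal sketch of that measurement follows, assuming each curve is a list of `(cumulative_rollouts, accuracy)` points sorted by rollouts; the function names and the curve format are assumptions for illustration.

```python
def rollouts_to_target(curve, target):
    """Return the first cumulative rollout count at which accuracy >= target."""
    for rollouts, accuracy in curve:
        if accuracy >= target:
            return rollouts
    return None  # the run never reached the target

def speedup(baseline_curve, method_curve, target):
    """Ratio of rollouts needed by the baseline to rollouts needed by the method."""
    baseline = rollouts_to_target(baseline_curve, target)
    method = rollouts_to_target(method_curve, target)
    if baseline is None or method is None:
        return None
    return baseline / method
```

A speedup of 1.0 at every target would mean the two curves coincide, which is the outcome that would undercut the efficiency claim.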

Original abstract

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pilot-Commit tries to save rollout budget in group RL by piloting for high-variance prompts then committing the rest, but the abstract leaves the key stability assumption untested and supplies no experimental details.


The paper's core move is Pilot-Commit: run a few pilot rollouts per prompt to rank prompts by reward variance, then allocate the remaining budget only to the high-signal ones and skip the rest. This is positioned as an online, budget-aware fix for the fact that group-based methods like GRPO waste samples on prompts whose rewards have already collapsed.

What the work does cleanly is name the dominant cost (rollout generation) and tie effectiveness to variance regimes. The decoupling of pilot evaluation from commit allocation is a straightforward engineering step that prior group-based methods do not appear to include. The reported speedups (up to 1.9× vs GRPO and 4.0× vs DAPO in cumulative rollouts) are the kind of number that would matter if they hold.

The soft spot is the load-bearing premise that a small pilot stays predictive. The policy is updated after every group step, so a variance estimate from a small pilot can easily drift before the commit phase uses it. The abstract states that informativeness must be estimated online but gives no indication of how the method protects against that drift, what pilot fraction was used, or any ablation on estimator noise. Without those numbers, the efficiency claim cannot be checked.
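Estimator noise is easy to quantify in a simplified setting. Assuming binary rewards and independent rollouts with true success rate `p` (assumptions made here, not details from the paper), the chance that a pilot of `k` rollouts shows zero variance even though the prompt is informative is:

```python
def prob_pilot_looks_collapsed(p, k):
    """P(all k pilot rewards are identical) for a true success rate p."""
    # Either all k rollouts succeed or all k fail.
    return p ** k + (1 - p) ** k
```

Under these assumptions a maximally informative prompt (`p = 0.5`) looks collapsed half the time with a pilot of 2 and one time in eight with a pilot of 4, which is why the pilot size matters for the efficiency claim.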

The abstract also reports no variance across runs, no statistical tests, and no breakdown by model size or benchmark, so the empirical section is currently unverifiable. That is a real gap for a methods paper whose value rests on measured savings.

This is worth a serious referee for groups doing RL post-training at scale. The problem is concrete and the proposed split is simple enough to implement and test; a full paper with the missing controls would let people decide whether the pilot-commit idea actually moves the economics. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces Pilot-Commit, a rollout allocation method for group-based RL post-training of LLMs. It decouples evaluation from exploitation by using a small pilot budget per prompt to estimate online informativeness (via reward variance), then commits the remaining budget to high-variance prompts while skipping low-signal ones. The central claim is that this matches baseline accuracy at substantially lower cost, reaching target accuracy up to 1.9× faster than GRPO and 4.0× faster than DAPO in cumulative rollouts across math benchmarks and models from 1.5B to 14B parameters.

Significance. If the empirical claims hold under rigorous validation, the approach could meaningfully reduce the dominant sampling cost in on-policy group RL for LLMs by avoiding rollouts on collapsed prompts. The online estimation requirement is correctly identified as necessary given policy evolution, and the framework is a practical response to that constraint.

major comments (2)

[Abstract] Abstract and experimental claims: the reported 1.9× and 4.0× speedups are presented without any mention of run count, standard deviation, statistical tests, or ablation on the pilot budget fraction, so the reliability of the efficiency gains cannot be assessed from the provided text.
[Method] Pilot-Commit description: the method relies on the pilot-stage variance estimate remaining predictive of learning signal after the policy update that occurs between pilot and commit phases, yet no analysis, sensitivity study, or mechanism to mitigate temporal mismatch is described despite the on-policy setting and the explicit note that informativeness must be estimated online.

minor comments (1)

[Abstract] The abstract would benefit from stating the specific pilot fraction used in the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and note planned changes to the manuscript.


Referee: [Abstract] Abstract and experimental claims: the reported 1.9× and 4.0× speedups are presented without any mention of run count, standard deviation, statistical tests, or ablation on the pilot budget fraction, so the reliability of the efficiency gains cannot be assessed from the provided text.

Authors: We agree that reproducibility details strengthen the claims. The speedups are computed from experiments averaged over 3 independent random seeds per setting, with per-seed standard deviations reported in the appendix. We will revise the abstract to state the number of runs and that results are averaged. We will also add an ablation on pilot budget fraction (e.g., 10-30% of group size) to the main text or supplementary material. Statistical significance tests can be added where space permits.

revision: yes

Referee: [Method] Pilot-Commit description: the method relies on the pilot-stage variance estimate remaining predictive of learning signal after the policy update that occurs between pilot and commit phases, yet no analysis, sensitivity study, or mechanism to mitigate temporal mismatch is described despite the on-policy setting and the explicit note that informativeness must be estimated online.

Authors: The concern about temporal mismatch is valid in the on-policy regime. The pilot stage uses a small fixed fraction of the group budget (typically 2-4 rollouts) immediately before the commit phase to limit policy drift. While the current manuscript does not contain a dedicated sensitivity study on this mismatch, the empirical gains hold consistently across model scales (1.5B-14B) and benchmarks. We will add a discussion paragraph on the assumption and its practical implications; a full sensitivity ablation on pilot size can be included if additional experiments are feasible within the revision timeline.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical allocation method with no fitted predictions or self-referential derivations


The paper presents Pilot-Commit as an empirical rollout allocation heuristic that estimates prompt informativeness from a pilot subset and commits remaining budget accordingly. No equations, fitted parameters, or derivation chain appear in the provided text. The central claim is supported by direct comparisons to external baselines (GRPO, DAPO) on accuracy vs. cumulative rollouts, without any step that reduces a 'prediction' to a fit by construction or relies on load-bearing self-citations. The online estimation requirement is stated as an assumption but is not justified via circular logic or renamed known results. This is a standard empirical contribution with independent experimental content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that reward variance is a good proxy for learning signal and that a cheap pilot stage can estimate it reliably online.

free parameters (1)

pilot budget fraction
Fraction of total rollouts spent on the pilot stage per prompt; design choice not derived from first principles.
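As a small worked example of how this parameter turns into a rollout count, the sketch below converts a pilot fraction into a per-prompt pilot size. The rounding rule, the floor of 2 (the fewest samples that can show any variance), and the group sizes in the usage note are assumptions for illustration, not values reported by the paper.

```python
def pilot_rollouts(pilot_fraction, group_size, minimum=2):
    """Convert a pilot budget fraction into a per-prompt pilot rollout count."""
    count = round(pilot_fraction * group_size)
    # At least `minimum` rollouts are needed to observe variance;
    # the pilot can never exceed the full group.
    return min(group_size, max(minimum, count))
```

For a hypothetical group size of 16, fractions of 0.10 and 0.25 give pilots of 2 and 4 rollouts under this rule.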

axioms (1)

domain assumption: Group-based policy updates are most effective when reward variance across rollouts is high.
Stated directly in abstract as the regime where updates are effective.
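A short calculation shows why this assumption is plausible for binary rewards. It is a standard derivation written out here for illustration and is not quoted from the paper; the symbols $G$ (group size), $k$ (number of correct rollouts), and $\hat{p} = k/G$ are notation introduced for this sketch.

```latex
% Assumptions: rewards r_i \in \{0,1\}, group size G, k correct rollouts,
% empirical success rate \hat{p} = k/G, population standard deviation.
\[
\hat{\sigma}^2 = \hat{p}\,(1-\hat{p}), \qquad
A_i = \frac{r_i - \hat{p}}{\hat{\sigma}} \quad (\hat{\sigma} > 0),
\]
\[
\sum_{i=1}^{G} |A_i|
  = \frac{k\,(1-\hat{p}) + (G-k)\,\hat{p}}{\hat{\sigma}}
  = \frac{2G\,\hat{p}\,(1-\hat{p})}{\hat{\sigma}}
  = 2G\sqrt{\hat{p}\,(1-\hat{p})}.
\]
```

The total advantage magnitude in a group therefore scales with the reward standard deviation: it peaks at $\hat{p} = 1/2$, where it equals $G$, and shrinks to zero as $\hat{p}$ approaches 0 or 1.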



