pith. machine review for the scientific record.

arxiv: 2605.07114 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

Dongsheng Ding, Edgar Dobriban, Shuo Li, Tao Wang, Yan Sun

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning with verifiable rewards · rollout allocation · group-based policy optimization · mathematical reasoning · large language models · hit utility · GRPO

The pith

Allocating rollouts to maximize the posterior probability of hitting at least one correct answer improves Pass@K in group-based RLVR while preserving Pass@1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uniform rollout allocation in group-based methods like GRPO wastes samples on prompts whose groups are already saturated or unlikely to succeed. It defines hit utility as the posterior probability that at least one rollout in a proposed extra allocation for a prompt will be correct, then introduces HORA as the policy that chooses allocations to maximize the total hit utility across a batch. HORA leaves reward evaluation and the group-based advantage estimator untouched yet raises Pass@K over compute-matched GRPO in ten of twelve model-benchmark settings. A sympathetic reader cares because the change is learning-free and drop-in compatible, offering a direct way to spend limited rollout compute more effectively during RLVR training for reasoning models.

Core claim

HORA is a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. Hit utility for a prompt is the posterior probability, under a uniform prior over success probabilities, that at least one rollout in an additional allocation will be correct. By reallocating budgets toward prompts with higher hit utility, HORA improves Pass@K over fixed-allocation GRPO across four mathematical reasoning benchmarks and three model scales while keeping Pass@1 comparable, and it remains compatible with other group-based estimators such as RLOO.
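
To make the definition concrete, here is a hedged closed form, assuming the Beta(1, 1) prior named in the appendix anchors, independent Bernoulli correctness per rollout, and the G_0 / c_i notation used in the figure captions; the paper's own expression is not reproduced on this page and may differ.

```latex
% Hedged reconstruction, not an equation quoted from the paper: hit utility of
% m extra rollouts for prompt i, after observing c_i correct among G_0
% pre-rollouts, under a Beta(1,1) prior and independent Bernoulli correctness.
\[
  p_i \sim \mathrm{Beta}(1,1), \qquad
  c_i \mid p_i \sim \mathrm{Binomial}(G_0, p_i)
  \;\Longrightarrow\;
  p_i \mid c_i \sim \mathrm{Beta}(c_i + 1,\; G_0 - c_i + 1)
\]
\[
  u_i(m) \;=\; 1 - \mathbb{E}\!\left[(1 - p_i)^m \mid c_i\right]
         \;=\; 1 - \prod_{j=0}^{m-1} \frac{G_0 - c_i + 1 + j}{G_0 + 2 + j}.
\]
```

Per the abstract, HORA then chooses the per-prompt allocations m_i that maximize the batch total Σ_i u_i(m_i) under the rollout budget.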

What carries the argument

Hit utility: the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. HORA maximizes the sum of these hit utilities across prompts to decide the batch allocation.
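
A minimal sketch of how that sum could be maximized in practice, assuming the closed form above and a greedy one-rollout-at-a-time solver; G0 and the counts c_i follow the figure captions, but the solver and the budget handling are this sketch's assumptions, not the paper's published algorithm.

```python
# Hedged sketch: allocate extra rollouts to maximize total posterior hit
# utility under a Beta(1, 1) prior.  G0 (pre-rollouts per prompt) and the
# observed correct counts c_i follow the figure captions; the closed form and
# the greedy solver are reconstructions, not the paper's stated algorithm.
import heapq


def miss_prob_next(c: int, g0: int, m: int) -> float:
    """Posterior Pr(the (m+1)-th extra rollout is wrong | c correct of g0,
    and the first m extra rollouts were wrong) = (g0 - c + 1 + m) / (g0 + 2 + m)."""
    return (g0 - c + 1 + m) / (g0 + 2 + m)


def marginal_hit_gain(c: int, g0: int, m: int) -> float:
    """Increase in hit utility from granting a prompt its (m+1)-th extra rollout."""
    miss_all_m = 1.0
    for j in range(m):
        miss_all_m *= miss_prob_next(c, g0, j)
    return miss_all_m * (1.0 - miss_prob_next(c, g0, m))


def allocate(counts: list[int], g0: int, budget: int) -> list[int]:
    """Spend `budget` extra rollouts across prompts so that the summed hit
    utility is maximized.  Hit utility is concave in each prompt's allocation,
    so granting one rollout at a time to the largest marginal gain is exact
    for this separable objective."""
    alloc = [0] * len(counts)
    heap = [(-marginal_hit_gain(c, g0, 0), i) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-marginal_hit_gain(counts[i], g0, alloc[i]), i))
    return alloc


if __name__ == "__main__":
    # Four hypothetical prompts with G0 = 8 pre-rollouts each.
    print(allocate(counts=[0, 2, 7, 8], g0=8, budget=16))
```

Whether the paper caps per-prompt allocations, imposes a minimum group size, or treats fully saturated groups specially is not determined by this page; the sketch only instantiates the literal "at least one correct among the extra rollouts" objective.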

If this is right

  • HORA serves as a drop-in replacement for uniform allocation inside GRPO and similar group-based methods such as RLOO.
  • Pass@1 remains comparable while Pass@K rises, indicating better coverage of correct trajectories without harming single-sample accuracy (see the Pass@K estimator sketch after this list).
  • A uniform prior for computing hit utilities performs competitively with five prompt-conditioned learned priors in ablation studies.
  • The gains appear consistently across four mathematical reasoning benchmarks and three model scales in ten of twelve configurations.
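
For reference on the metric these bullets lean on: Pass@K at large K is usually reported with the unbiased estimator of Chen et al. [2], which the paper cites; whether this exact estimator is used here is an assumption of the sketch below, and the correct count in the example is purely hypothetical.

```python
# Minimal sketch of the standard unbiased Pass@K estimator (Chen et al. [2]).
# Sample sizes mirror the N = 256 (MATH500, AMC23) and N = 1024 (AIME 2024/2025)
# settings quoted in the reference anchors; the correct count c is hypothetical.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pr(at least one correct among k samples drawn without replacement
    from n generations of which c are correct) = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(round(pass_at_k(n=1024, c=12, k=1024), 3))  # 1.0: any hit counts at K = n
print(round(pass_at_k(n=1024, c=12, k=64), 3))    # coverage at a smaller K
```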

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hit-utility logic could be tested in other verifiable-reward domains such as code generation to check whether the uniform prior still suffices.
  • Because HORA operates batch-wise and leaves the estimator unchanged, it could be combined with existing prompt-level difficulty predictors for further refinement.
  • If the uniform prior assumption weakens on highly correlated prompt sets, replacing it with a cheap learned prior might amplify the observed Pass@K gains.

Load-bearing premise

The posterior hit utility can be reliably estimated using a uniform prior over prompt success probabilities.
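
A small numerical probe of this premise, assuming the Beta-binomial closed form sketched earlier; Beta(0.5, 0.5) is an arbitrary comparison prior chosen here only to show sensitivity, not one the paper evaluates.

```python
# Hedged probe of the load-bearing premise: sensitivity of the estimated hit
# utility to the prior.  Uses the Beta-binomial closed form sketched earlier;
# Beta(0.5, 0.5) is an arbitrary comparison prior, not one the paper evaluates.

def hit_utility(c: int, g0: int, m: int, a: float = 1.0, b: float = 1.0) -> float:
    """Pr(at least one of m extra rollouts correct | c correct of g0), Beta(a, b) prior."""
    miss = 1.0
    for j in range(m):
        miss *= (b + g0 - c + j) / (a + b + g0 + j)
    return 1.0 - miss


for c in (0, 4, 8):  # observed correct counts out of G0 = 8 pre-rollouts
    print(f"c={c}: "
          f"uniform={hit_utility(c, 8, 4):.3f}  "
          f"jeffreys={hit_utility(c, 8, 4, a=0.5, b=0.5):.3f}")
```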

What would settle it

An experiment in which HORA assigns extra rollouts to prompts that ultimately produce fewer correct trajectories than the same prompts would under uniform allocation, yielding no net gain or a loss in total correct hits on the fixed compute budget.

Figures

Figures reproduced from arXiv: 2605.07114 by Dongsheng Ding, Edgar Dobriban, Shuo Li, Tao Wang, Yan Sun.

Figure 1: Comparison of fixed-group GRPO (top) and HORA (bottom). GRPO allocates a uniform group …
Figure 2: Pass@K curves on Qwen2.5-7B for in-distribution MATH500 (left) and out-of-distribution AIME 2025 (right). HORA (blue) lies strictly above the Base checkpoint (gray) at every K we evaluate. GRPO (orange) falls below the base model on MATH500 from K = 64 onward and on AIME 2025 at K = 1024 (66.7 vs. 70.0), whereas HORA (blue) remains above the Base model across the evaluated range in both cases.
Figure 3: Allocation dynamics on the Qwen2.5-7B run. Left: average phase-A input fraction Pr(C_i = k) and average phase-B allocation share for each observed pre-rollout correct-count bucket c_i = k, averaged over 196 optimizer steps. Right: phase-B allocation share over training, grouped by c_i = 0, c_i ∈ {1, 2, 3}, c_i ∈ {4, 5, 6, 7}, and c_i = 8; curves are smoothed with a 6-step moving average.
Figure 4: Pass@K curves for Base, RLOO, and HORA + RLOO on Qwen2.5-7B. HORA changes only the rollout-allocation stage, while the downstream advantage estimator is RLOO for both RL methods.
Figure 5: These more complex priors do not yield a clear advantage over the fixed prior. On AMC23, several prompt-conditioned priors are slightly ahead at intermediate values of K, but the fixed prior catches up by K = 256. On AIME 2025, the fixed prior is consistently better across the evaluated range and reaches Pass@1024 = 70.0, compared with at most 66.7 among the predictor-based variants. …
Figure 6: Pass@K curves for Qwen2.5-1.5B-Instruct across all four benchmarks. Three update modes are evaluated: Hidden-static keeps the pretrained probe frozen during RL training (no online updates). Hidden-online updates the probe online: each optimizer step replays the most recent 10 generation cycles, takes 5 Adam steps with learning rate 10^-3, weight decay 10^-2, geometric importance weight γ_t = 0.97^t, and …
Figure 7: Pass@K curves for Qwen2.5-3B across all four benchmarks. GP: We adopt the recursive Gaussian-process predictor proposed by VIP [Nguyen et al., 2026]: an RBF kernel over MPNet features with median-distance bandwidth, prior mean µ_0 = −1 in logit space (so the prior p̂_i ≈ 0.27), and recursive Bayesian updates on the observed empirical rates c_i/G_0. We use the implementation provided by the original authors and …
Figure 8: Pass@K curves for Qwen2.5-7B across all four benchmarks. The MATH500 and AIME 2025 panels reproduce …
Figure 9: Mean response length (left) and mean token entropy (right) over training for HORA + GRPO …
Figure 10: Prior ablation across all four benchmarks on Qwen2.5-3B. The fixed Beta(1, 1) prior used by HORA (solid blue) is compared against the five learned-prior variants described above.
Original abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model-benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces hit utility, defined as the posterior probability that at least one rollout in a proposed additional batch for a given prompt will be correct, computed under a uniform prior over per-prompt success probabilities. It proposes HORA, a learning-free policy that reallocates rollout budgets within each batch to maximize aggregate hit utility. The method leaves the downstream reward evaluation and group-based advantage estimator (e.g., GRPO) unchanged. Experiments across four mathematical reasoning benchmarks and three model scales report that HORA preserves Pass@1 while improving Pass@K relative to compute-matched GRPO in ten of twelve configurations (one tie, one saturated case), and is compatible with other estimators such as RLOO. Ablations indicate the uniform prior is competitive with five prompt-conditioned learned alternatives.

Significance. If the results hold under fuller verification, the work is significant for RLVR in LLMs: it offers a simple, parameter-free mechanism to reduce wasteful uniform rollout allocation on saturated prompts while improving coverage of useful trajectories, without requiring changes to existing group-based estimators. The learning-free design and reported compatibility with GRPO/RLOO make adoption straightforward, and the prior ablations provide a useful robustness check. This could meaningfully improve sample efficiency in reasoning model training.

major comments (2)
  1. [Experimental Results] Experimental Results section: the abstract and main claims report consistent Pass@K gains in ten of twelve model-benchmark pairs, yet no error bars, standard deviations, or number of independent runs are referenced, which is load-bearing for assessing whether the improvements are reliable or could be explained by sampling variance.
  2. [Method] Method (hit-utility derivation): the allocation policy is derived from the posterior hit-utility objective under a fixed uniform prior; while the manuscript states that ablations show this prior is competitive with five learned alternatives, the absence of training details, hyper-parameters, or quantitative ablation tables for those alternatives makes it difficult to confirm the comparison is fair and does not undermine the central claim.
minor comments (3)
  1. The four mathematical reasoning benchmarks are not named in the abstract or early sections; explicitly listing them (MATH500, AMC23, AIME 2024, and AIME 2025, per the figure captions) would improve immediate clarity.
  2. [Method] Notation for hit utility and the incremental batch size should be defined with a clear equation reference in the main text rather than only in the appendix.
  3. [Experiments] A summary table of exact Pass@K deltas versus GRPO for each of the twelve configurations would make the empirical claims easier to parse at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the two major comments point by point below.

Point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the abstract and main claims report consistent Pass@K gains in ten of twelve model-benchmark pairs, yet no error bars, standard deviations, or number of independent runs are referenced, which is load-bearing for assessing whether the improvements are reliable or could be explained by sampling variance.

    Authors: We agree that explicit reporting of variability measures strengthens the claims. The experiments underlying the ten-of-twelve improvements were performed with multiple independent runs, but the current manuscript does not state the run count or include error bars. In the revised version we will add the number of independent runs and standard deviations (or equivalent variability measures) to the Experimental Results section and tables. revision: yes

  2. Referee: [Method] Method (hit-utility derivation): the allocation policy is derived from the posterior hit-utility objective under a fixed uniform prior; while the manuscript states that ablations show this prior is competitive with five learned alternatives, the absence of training details, hyper-parameters, or quantitative ablation tables for those alternatives makes it difficult to confirm the comparison is fair and does not undermine the central claim.

    Authors: We thank the referee for highlighting the need for greater transparency in the ablation studies. The manuscript reports that the uniform prior is competitive with five prompt-conditioned learned priors, yet omits training details, hyperparameters, and a quantitative table. We will expand the ablation section in the revision to include these elements, thereby allowing readers to verify the fairness of the comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces hit utility as a new posterior probability definition and derives HORA directly as its maximizing allocation policy under a stated uniform prior. This is an explicit optimization step from the introduced objective rather than any reduction to fitted data, self-citations, or prior results by construction. Empirical Pass@K gains are reported from multi-benchmark experiments and are not presented as mathematical derivations. The core method is self-contained, leaves the group-based advantage estimator unchanged, and includes ablations on the prior choice; none of the seven enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the definition of hit utility as a posterior probability and the use of a uniform prior to compute it; no fitted parameters are introduced and the downstream RLVR components remain unchanged.

axioms (2)
  • domain assumption: Hit utility is the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct.
    This definition is the foundation for the allocation objective and is invoked to motivate the optimization in the abstract.
  • ad hoc to paper: A uniform prior over prompt success probabilities is sufficient to estimate hit utilities for allocation decisions.
    The abstract states that HORA uses this prior and that it is competitive with learned alternatives, making it a key modeling choice.
invented entities (1)
  • Hit utility (no independent evidence)
    purpose: Quantify the expected value of allocating additional rollouts to a given prompt.
    Newly introduced scalar that drives the optimal allocation policy.

pith-pipeline@v0.9.0 · 5540 in / 1553 out tokens · 37430 ms · 2026-05-11T01:05:50.827115+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 18 canonical work pages · 11 internal anchors

  1. [1]

    Step guided reasoning: Improving mathematical reasoning using guidance generation and step reasoning

    Lang Cao, Yingtian Zou, Chao Peng, Renhong Chen, Wu Ning, and Yitong Li. Step guided reasoning: Improving mathematical reasoning using guidance generation and step reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21112–21129,

  2. [2]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  3. [3]

    Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

    Jingchu Gai, Guanning Zeng, Huaqing Zhang, and Aditi Raghunathan. Differential smoothing mitigates sharpening and improves LLM reasoning. arXiv preprint arXiv:2511.19942,

  4. [4]

    Rewarding the unlikely: Lifting grpo beyond distribution sharpening

    Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25559–25571,

  5. [5]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874,

  6. [6]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

  7. [7]

    Adaptive rollout allocation for online reinforcement learning with verifiable rewards, 2026

    Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, and Viet Anh Nguyen. Adaptive rollout allocation for online reinforcement learning with verifiable rewards. arXiv preprint arXiv:2602.01601,

  8. [8]

    Hermes: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

    Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, and Farzan Farnia. Hermes: Towards efficient and verifiable mathematical reasoning in LLMs. arXiv preprint arXiv:2511.18760,

  9. [9]

    AIME: AI System Optimization via Multiple LLM Evaluators

    Bhrij Patel, Souradip Chakraborty, Wesley A Suttle, Mengdi Wang, Amrit Singh Bedi, and Dinesh Manocha. AIME: AI system optimization via multiple LLM evaluators. arXiv preprint arXiv:2410.03131,

  10. [10]

    Can Prompt Difficulty Be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

    Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, and Xiangyang Ji. Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models? In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1240–1250, 2026a. Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive...

  11. [11]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  12. [12]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939,

  13. [13]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245,

  14. [14]

    The Invisible Leash: Why RLVR May or May Not Escape Its Origin

    Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why RLVR may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025.

  15. [15]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

  16. [16]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  17. [17]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837,

  18. [18]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

  19. [19]

    Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

    Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, and Xing Hu. Revisiting entropy regularization: Adaptive coefficient unlocks its potential for LLM reinforcement learning. arXiv preprint arXiv:2510.10959,

  20. [20]

    The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

    Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. arXiv preprint arXiv:2506.01347, 2025.

  21. [21]

    All runs use the TRL implementation of GRPO with vLLM rollout in colocate mode

    GRPO baselines use identical hyperparameters except for the HORA-specific rows (pre-rollout count G_0 and Beta prior). All runs use the TRL implementation of GRPO with vLLM rollout in colocate mode. To accelerate format learning, we run vanilla GRPO as a warm-up at the start of each HORA run and switch to HORA once the global format-reward mean stays at or ...

  22. [22]

    E Length and entropy statistics: To corroborate the claim in Section 4.1 that HORA preserves more exploratory sampling behavior, Figure 9 plots mean response length and mean token entropy over training for the Qwen2.5-7B HORA run against ... Table 4: Configuration for Qwen2.5-3B (Parameter / Value): Pretrained model: Qwen2.5-3B; Training set: MAT...

  23. [23]

    We use N = 256 for MATH500 and AMC23, and N = 1024 for AIME 2024 and AIME 2025.

  24. [24]

    Bold marks the better value in each column.
    Method | MATH500 Pass@1 | MATH500 Pass@256 | AMC23 Pass@1 | AMC23 Pass@256 | AIME 2024 Pass@1 | AIME 2024 Pass@1024 | AIME 2025 Pass@1 | AIME 2025 Pass@1024
    RLOO baseline | 71.0 | 92.0 | 49.6 | 97.5 | 12.3 | 70.0 | 10.2 | 73.3
    HORA + RLOO | 71.9 | 94.2 | 51.1 | 100.0 | 11.6 | 73.3 | 8.8 | 73.3

  25. [25]

    The fixed Beta(1, 1) prior used by HORA (solid blue) is compared against the five learned-prior variants described above

    Legend: HORA (fixed prior), Hidden-static, Hidden-online, Hidden-dual, MPNet-dual, GP. Figure 10: Prior ablation across all four benchmarks on Qwen2.5-3B. The fixed Beta(1, 1) prior used by HORA (solid blue) is compared against the five learned-prior variants described above. No learned variant uniformly dominates the fixed prior across all four benchmarks: on AIME 2025 the fixed pr...