arxiv: 2606.05606 · v1 · pith:ZJM3G5UMnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI · math.OC

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Pith reviewed 2026-06-28 02:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords adaptive rollout allocation · reinforcement learning post-training · Beta posterior · Bernoulli variance · online resource allocation · Fenchel dual · sample efficiency · LLM reasoning

The pith

CERO allocates a fixed rollout budget adaptively across prompts using posterior expected Bernoulli variance to improve sample efficiency in LLM RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a fixed rollout budget per prompt is inefficient because, under a shared global budget, different prompts yield different amounts of training signal. CERO tracks a Beta posterior for each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian measure of the marginal value of one more rollout. This value is turned into a concave, saturating utility over cumulative allocations, and the global budget couples allocations across all prompts and all epochs. The temporally nonseparable objective is solved by a Fenchel dual reformulation whose dual variables are updated with projected online gradient descent; under fixed prompt utilities this carries an $O(\sqrt{K})$ regret guarantee against the offline optimum. Experiments on mathematical-reasoning tasks show that the resulting policy outperforms GRPO, the fixed-allocation baseline, across several open-weight models.
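The Beta bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name `BetaPosterior`, the uniform Beta(1, 1) prior, and the example counts are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over one prompt's success probability."""
    alpha: float = 1.0  # assumed uniform prior; the paper's prior is not stated here
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> None:
        # Conjugate update: Bernoulli outcomes add directly to the counts.
        self.alpha += successes
        self.beta += failures

    def expected_bernoulli_variance(self) -> float:
        # E[p(1-p)] under Beta(a, b) equals a*b / ((a+b) * (a+b+1)).
        a, b = self.alpha, self.beta
        return a * b / ((a + b) * (a + b + 1.0))


mixed = BetaPosterior()
mixed.update(successes=4, failures=4)   # prompt solved about half the time
solved = BetaPosterior()
solved.update(successes=8, failures=0)  # prompt solved on every rollout

print(mixed.expected_bernoulli_variance())   # larger: outcomes still vary
print(solved.expected_bernoulli_variance())  # smaller: outcomes look settled
```

With the same number of observed rollouts, the prompt with mixed outcomes scores higher than the prompt that always succeeds, which is the ordering an adaptive allocator would act on.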

Core claim

CERO maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. This estimate is used to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. The objective is solved via a Fenchel-dual reformulation whose prompt-level and budget-level dual variables are updated by projected online gradient descent. Under fixed prompt utilities, this delivers an $O(\sqrt{K})$ regret bound against the offline allocation benchmark.
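Why a Fenchel dual helps with a temporally nonseparable objective can be seen in a generic sketch. The notation is assumed for illustration and is not taken from the paper: $x_{i,k} \ge 0$ is the allocation to prompt $i$ in epoch $k$, and $f_i$ is a closed concave utility of the cumulative allocation. The concave conjugate gives

$$
f_i\Big(\sum_{k} x_{i,k}\Big) \;=\; \inf_{\lambda_i}\Big[\lambda_i \sum_{k} x_{i,k} \;-\; f_i^{*}(\lambda_i)\Big],
\qquad
f_i^{*}(\lambda) \;=\; \inf_{z}\big[\lambda z - f_i(z)\big].
$$

For a fixed $\lambda_i$ the bracketed term is linear in the allocations, so it splits into one term per epoch. The coupling across epochs moves into the dual variable, which can then be adjusted online.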

What carries the argument

The posterior expected Bernoulli variance, turned into a concave saturating utility over cumulative allocations and optimized through Fenchel dual variables with projected online gradient descent.

If this is right

  • Under fixed prompt utilities the method achieves O(sqrt(K)) regret relative to the best offline allocation.
  • Adaptive budgeting improves sample efficiency over fixed rollout counts on mathematical-reasoning benchmarks.
  • Allocations for different prompts and different training epochs become interdependent through the shared global budget.
  • The same framework applies across multiple open-weight LLMs without per-prompt hyperparameter retuning.
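The offline benchmark in the first bullet can be illustrated for a simplified case: separable concave utilities and integer rollout counts, where giving each next rollout to the prompt with the largest marginal gain is optimal. The utility form `value * (1 - exp(-n / tau))`, the function names, and the example values are hypothetical and are not claimed to match the paper.

```python
import heapq
import math


def saturating_utility(value: float, n: int, tau: float = 4.0) -> float:
    """Hypothetical concave, saturating utility of n cumulative rollouts."""
    return value * (1.0 - math.exp(-n / tau))


def greedy_offline_allocation(values: list[float], budget: int) -> list[int]:
    """Spend `budget` rollouts one at a time on the largest marginal gain."""
    alloc = [0] * len(values)
    # Max-heap of (negated marginal gain of the next rollout, prompt index).
    heap = [(-(saturating_utility(v, 1) - saturating_utility(v, 0)), i)
            for i, v in enumerate(values)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        n = alloc[i]
        gain = saturating_utility(values[i], n + 1) - saturating_utility(values[i], n)
        heapq.heappush(heap, (-gain, i))
    return alloc


# Per-prompt values could be posterior expected Bernoulli variances.
values = [0.23, 0.08, 0.15]
alloc = greedy_offline_allocation(values, budget=24)
print(alloc, sum(alloc))
```

Because marginal gains shrink with every rollout a prompt receives, the budget spreads across prompts instead of piling onto the single highest-value one. An online method is then judged by how far its total utility falls short of such a hindsight allocation.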

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variance-based utility could be reused for other instance-dependent sampling decisions such as choosing which trajectories to retain or which prompts to up-weight in the data mix.
  • If the Beta success model is replaced by a richer posterior that tracks gradient magnitude, the same dual machinery might automatically allocate compute where learning progress is fastest.
  • The online dual updates could be run in a streaming fashion, allowing the rollout budget to be adjusted mid-epoch without restarting the optimizer.

Load-bearing premise

The posterior expected Bernoulli variance accurately estimates the marginal training value of an extra rollout for any given prompt.
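For reference, the quantity in this premise has a closed form under a Beta$(\alpha, \beta)$ posterior over the success probability $p$ (the symbols $\alpha$ and $\beta$ are the usual Beta parameters, assumed here for the worked equation):

$$
\mathbb{E}\big[p(1-p)\big]
= \mathbb{E}[p] - \mathbb{E}[p^2]
= \frac{\alpha}{\alpha+\beta} - \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}
= \frac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)}.
$$

The value is largest when the posterior mass sits near $p = 1/2$ and shrinks toward zero as the posterior concentrates near $0$ or $1$. The premise therefore amounts to assuming that prompts with still-mixed outcomes are the ones where another rollout helps training most.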

What would settle it

Running CERO and GRPO on the same set of prompts with identical total rollout count and observing that CERO does not produce higher final accuracy on the target benchmarks.

Original abstract

LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
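The projected online gradient descent step on a budget dual variable can be sketched generically. This is not CERO's algorithm: it keeps only a single budget-level price, drops the prompt-level duals, treats each epoch's utility separately, and uses a hypothetical utility `value * (1 - exp(-n / tau))`. All names, the step size, and the example values are assumptions.

```python
import math


def best_response(value: float, price: float, n_max: int = 16, tau: float = 4.0) -> int:
    """Rollout count maximizing a hypothetical utility minus the budget price."""
    def net(n: int) -> float:
        return value * (1.0 - math.exp(-n / tau)) - price * n
    return max(range(n_max + 1), key=net)


def run_dual_descent(values, per_epoch_budget, epochs, step=0.002):
    """Projected online gradient descent on a single budget dual variable."""
    price = 0.0
    history = []
    for _ in range(epochs):
        alloc = [best_response(v, price) for v in values]
        spent = sum(alloc)
        # Dual subgradient step: raise the price after overspending, lower it
        # after underspending, then project back onto price >= 0.
        price = max(0.0, price + step * (spent - per_epoch_budget))
        history.append((spent, price))
    return history


history = run_dual_descent(values=[0.23, 0.08, 0.15, 0.02], per_epoch_budget=16, epochs=200)
print(history[0], history[-1])
```

At a zero price every prompt requests the maximum, so the first epoch overspends and the price rises. Since the projection can only increase the price relative to the unprojected step, cumulative overspending is bounded by the final price divided by the step size, which keeps the average spend close to the budget over many epochs.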

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper formulates adaptive rollout allocation as an online resource allocation problem with diminishing returns, constructs a concave utility from posterior expected Bernoulli variance, derives a Fenchel-dual reformulation, and applies projected online gradient descent. It then proves an O(sqrt(K)) regret bound against an offline benchmark specifically under fixed prompt utilities. This bound is a standard result in online convex optimization and does not reduce to or depend on the Beta-posterior estimation procedure or the fitted utilities used in the algorithm. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatzes smuggled via citation appear in the derivation. The central theoretical claim remains independent of the empirical utility estimates.
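The standard result referred to here is the textbook regret bound for projected online gradient descent. In assumed notation (not the paper's): the feasible set has diameter $D$, subgradient norms are bounded by $G$, and a constant step size $\eta$ is used over $K$ rounds. Then

$$
\mathrm{Regret}_K \;\le\; \frac{D^2}{2\eta} + \frac{\eta\, G^2 K}{2},
\qquad
\eta = \frac{D}{G\sqrt{K}} \;\Longrightarrow\; \mathrm{Regret}_K \;\le\; D\,G\,\sqrt{K}.
$$

The bound depends only on convexity and boundedness, which is consistent with the rationale's point that the guarantee does not rely on how the utilities were estimated.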

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on modeling rollout value via Beta posteriors and assuming diminishing returns that justify a concave utility; these modeling choices are domain assumptions whose validity is not independently verified in the abstract.

free parameters (1)
  • Beta prior hyperparameters
    Initial Beta parameters for each prompt's posterior must be chosen before any data arrives.
axioms (1)
  • domain assumption: Each prompt's rollout outcomes are Bernoulli with a prompt-level success probability, and the Bernoulli variance estimates marginal rollout value
    Invoked to turn uncertainty into a utility score that saturates.

pith-pipeline@v0.9.1-grok · 5738 in / 1161 out tokens · 36221 ms · 2026-06-28T02:30:09.213036+00:00 · methodology
