Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
Pith reviewed 2026-06-28 02:30 UTC · model grok-4.3
The pith
CERO allocates a fixed rollout budget adaptively across prompts using posterior expected Bernoulli variance to improve sample efficiency in LLM RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CERO maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. This estimate defines a concave, saturating utility over cumulative allocations, so decisions across prompts and epochs are coupled through the global budget. Because the resulting objective is temporally nonseparable, it is solved via a Fenchel-dual reformulation whose prompt-level and budget-level dual variables are updated by projected online gradient descent. Under fixed prompt utilities, this yields an $O(\sqrt{K})$ regret bound against the offline allocation benchmark.
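For reference, the posterior expected Bernoulli variance has a closed form under a Beta posterior. The notation below ($\alpha_0$, $\beta_0$ for the prior, $s$ and $f$ for observed successes and failures) is assumed here for illustration and is not taken from the paper; the identity itself follows from the first two moments of the Beta distribution.

```latex
\begin{align*}
p \mid \text{data} &\sim \mathrm{Beta}(\alpha, \beta),
  \qquad \alpha = \alpha_0 + s, \quad \beta = \beta_0 + f, \\
\mathbb{E}[p(1-p)] &= \mathbb{E}[p] - \mathbb{E}[p^2]
  = \frac{\alpha}{\alpha+\beta} - \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}
  = \frac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)}.
\end{align*}
```

The quantity is largest when the posterior mean is near $1/2$ and approaches zero as the posterior concentrates near 0 or 1.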
What carries the argument
The posterior expected Bernoulli variance, turned into a concave saturating utility over cumulative allocations and optimized through Fenchel dual variables with projected online gradient descent.
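The price-based flavor of this mechanism can be sketched in a few lines. The code below is a simplified stand-in, not the paper's algorithm: it keeps only a single budget-level price `lam` (the prompt-level dual variables of the Fenchel reformulation are omitted), treats one allocation round in isolation, and uses a hypothetical exponential-saturation utility. The names `utility`, `best_response`, `allocate`, `tau`, `n_max`, and `eta` are all assumptions made for this illustration.

```python
import math


def utility(value, n, tau=4.0):
    """Hypothetical concave, saturating utility of giving n rollouts to a prompt."""
    return value * (1.0 - math.exp(-n / tau))


def best_response(value, lam, n_max=16, tau=4.0):
    """Rollout count in 0..n_max maximizing utility(value, n) - lam * n."""
    return max(range(n_max + 1), key=lambda n: utility(value, n, tau) - lam * n)


def allocate(values, budget_per_step, steps=200, eta=0.001):
    """Adjust a single budget price lam with projected online gradient steps."""
    lam = 0.0
    alloc = [0] * len(values)
    for _ in range(steps):
        # Each prompt responds to the current price independently.
        alloc = [best_response(v, lam) for v in values]
        # Raise the price when over budget, lower it when under budget.
        overspend = sum(alloc) - budget_per_step
        lam = max(0.0, lam + eta * overspend)  # projection onto lam >= 0
    return alloc, lam


# Toy per-prompt value estimates (assumed numbers, e.g. variance-like scores).
alloc, lam = allocate([0.25, 0.05, 0.0], budget_per_step=12)
print(alloc, lam)
```

In this sketch a prompt with a higher value estimate never receives fewer rollouts than one with a lower estimate, and a prompt with zero estimated value receives none. The shared price is what couples the otherwise independent per-prompt decisions.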
If this is right
- Under fixed prompt utilities, the method achieves $O(\sqrt{K})$ regret relative to the best offline allocation.
- Adaptive budgeting improves sample efficiency over fixed rollout counts on mathematical-reasoning benchmarks.
- Allocations for different prompts and different training epochs become interdependent through the shared global budget.
- The same framework applies across multiple open-weight LLMs without per-prompt hyperparameter retuning.
Where Pith is reading between the lines
- The variance-based utility could be reused for other instance-dependent sampling decisions such as choosing which trajectories to retain or which prompts to up-weight in the data mix.
- If the Beta success model is replaced by a richer posterior that tracks gradient magnitude, the same dual machinery might automatically allocate compute where learning progress is fastest.
- The online dual updates could be run in a streaming fashion, allowing the rollout budget to be adjusted mid-epoch without restarting the optimizer.
Load-bearing premise
The posterior expected Bernoulli variance accurately estimates the marginal training value of an extra rollout for any given prompt.
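To make the premise concrete, here is a minimal sketch of how the estimate behaves for prompts with different outcome histories. It assumes a uniform Beta(1, 1) prior and uses the standard closed form for E[p(1-p)] under a Beta distribution; the function name and the example counts are hypothetical and not taken from the paper.

```python
def expected_bernoulli_variance(successes, failures, alpha0=1.0, beta0=1.0):
    """E[p(1-p)] for p ~ Beta(alpha0 + successes, beta0 + failures).

    alpha0 and beta0 are assumed prior hyperparameters (uniform prior by default).
    """
    a = alpha0 + successes
    b = beta0 + failures
    return a * b / ((a + b) * (a + b + 1.0))


# Hypothetical outcome histories after 8 rollouts per prompt.
histories = [("always solved", 8, 0), ("never solved", 0, 8), ("mixed", 4, 4)]
for name, s, f in histories:
    print(name, expected_bernoulli_variance(s, f))
```

The mixed prompt scores highest, while prompts that are always or never solved score low and equally. Whether that ordering tracks actual training value is exactly what the premise asserts.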
What would settle it
Running CERO and GRPO on the same prompts with an identical total rollout count and observing that CERO does not reach higher final accuracy on the target benchmarks.
Original abstract
LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No significant circularity in the derivation chain
The paper formulates adaptive rollout allocation as an online resource allocation problem with diminishing returns, constructs a concave utility from posterior expected Bernoulli variance, derives a Fenchel-dual reformulation, and applies projected online gradient descent. It then proves an $O(\sqrt{K})$ regret bound against an offline benchmark specifically under fixed prompt utilities. This bound is a standard result in online convex optimization and does not reduce to or depend on the Beta-posterior estimation procedure or the fitted utilities used in the algorithm. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatzes smuggled via citation appear in the derivation. The central theoretical claim remains independent of the empirical utility estimates.
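For context, the generic textbook bound for projected online gradient descent has the same $\sqrt{K}$ shape. The statement below is that generic result, not the paper's theorem. The symbols ($f_k$ for the convex per-round losses, $x_k$ for the iterates, $\mathcal{X}$ for the feasible set with diameter $D$, $G$ for a bound on gradient norms, $\eta$ for the step size) are standard notation assumed here, and how the paper's dual iterates map onto them is not specified in this summary.

```latex
\begin{align*}
x_{k+1} &= \Pi_{\mathcal{X}}\bigl(x_k - \eta \nabla f_k(x_k)\bigr), \\
\sum_{k=1}^{K} f_k(x_k) - \min_{x \in \mathcal{X}} \sum_{k=1}^{K} f_k(x)
  &\le \frac{D^2}{2\eta} + \frac{\eta G^2 K}{2}
  = D G \sqrt{K}
  \quad \text{for } \eta = \frac{D}{G\sqrt{K}}.
\end{align*}
```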
Axiom & Free-Parameter Ledger
free parameters (1)
- Beta prior hyperparameters
axioms (1)
- Domain assumption: each prompt's rollout outcomes follow a Bernoulli process with a prompt-level success probability, and the variance of that process estimates the marginal value of a rollout.