Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Jiashuo Jiang; Yige Wang; Yiming Zong

arxiv: 2606.05606 · v1 · pith:ZJM3G5UMnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI · math.OC

Pith reviewed 2026-06-28 02:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC

keywords adaptive rollout allocation · reinforcement learning post-training · Beta posterior · Bernoulli variance · online resource allocation · Fenchel dual · sample efficiency · LLM reasoning

The pith

CERO allocates a fixed rollout budget adaptively across prompts using posterior expected Bernoulli variance to improve sample efficiency in LLM RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fixed rollout budgets per prompt are inefficient because different prompts yield different amounts of training signal under a shared global budget. CERO tracks a Beta posterior for each prompt's success probability and uses the expected Bernoulli variance as a Bayesian measure of the marginal value of one more rollout. This value is turned into a concave saturating utility function whose optimization couples allocations across all prompts and all epochs. The temporally nonseparable objective is solved by a Fenchel dual reformulation updated with projected online gradient descent, which carries an O(sqrt(K)) regret guarantee against the offline optimum. Experiments show the resulting policy outperforms the fixed-allocation baseline GRPO on mathematical-reasoning tasks across several open-weight models.

Core claim

CERO maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. This estimate is used to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. The objective is solved via a Fenchel-dual reformulation whose dual variables are updated by projected online gradient descent, delivering an O(sqrt(K)) regret bound against the offline allocation benchmark.
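
As a minimal sketch of the quantity named above: for a success probability p with a Beta(a, b) belief, the expected Bernoulli variance E[p(1 - p)] has the closed form a*b / ((a + b)(a + b + 1)). The class name `BetaPosterior`, the uniform Beta(1, 1) starting point, and the rollout counts below are illustrative assumptions; the abstract does not say how CERO initializes or updates its posteriors in the training loop.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over one prompt's success probability."""
    alpha: float = 1.0  # assumed uniform prior; not taken from the paper
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> "BetaPosterior":
        # Conjugate Beta-Bernoulli update: add observed counts to the parameters.
        return BetaPosterior(self.alpha + successes, self.beta + failures)

    def expected_bernoulli_variance(self) -> float:
        # E[p(1 - p)] under Beta(a, b) equals a*b / ((a + b) * (a + b + 1)).
        a, b = self.alpha, self.beta
        return a * b / ((a + b) * (a + b + 1.0))


fresh = BetaPosterior()
solved = fresh.update(successes=8, failures=0)  # prompt the model always solves
mixed = fresh.update(successes=4, failures=4)   # prompt near 50% success

for name, post in [("fresh", fresh), ("solved", solved), ("mixed", mixed)]:
    print(name, round(post.expected_bernoulli_variance(), 4))
```

In this toy run the always-solved prompt scores lower than the mixed one, which matches the intuition that prompts with uncertain outcomes carry more training signal per rollout.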

What carries the argument

The posterior expected Bernoulli variance, turned into a concave saturating utility over cumulative allocations and optimized through Fenchel dual variables with projected online gradient descent.
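
To make "concave saturating utility" concrete, here is one hypothetical shape, value * (1 - exp(-n / scale)), where n is the cumulative number of rollouts given to a prompt. The exponential form, the `scale` parameter, and the stand-in value 0.2 are assumptions chosen for illustration; the paper's actual utility is not given in the abstract.

```python
import math


def saturating_utility(n: float, value: float, scale: float = 8.0) -> float:
    """Hypothetical utility of n cumulative rollouts: value * (1 - exp(-n / scale))."""
    return value * (1.0 - math.exp(-n / scale))


def marginal_gain(n: int, value: float, scale: float = 8.0) -> float:
    # Utility added by the (n + 1)-th rollout; it shrinks as n grows.
    return saturating_utility(n + 1, value, scale) - saturating_utility(n, value, scale)


value = 0.2  # stand-in for a posterior expected Bernoulli variance
gains = [marginal_gain(n, value) for n in range(0, 32, 8)]
print([round(g, 4) for g in gains])
```

Any utility with positive, strictly decreasing marginal gains and a finite ceiling would serve the same role; the point is only that each extra rollout on the same prompt is worth less than the previous one.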

If this is right

Under fixed prompt utilities the method achieves O(sqrt(K)) regret relative to the best offline allocation.
Adaptive budgeting improves sample efficiency over fixed rollout counts on mathematical-reasoning benchmarks.
Allocations for different prompts and different training epochs become interdependent through the shared global budget.
The same framework applies across multiple open-weight LLMs without per-prompt hyperparameter retuning.
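
For readers less used to regret statements, the first consequence above can be written out as follows. The symbols U^off (total utility of the best offline allocation), U^on (total utility collected by the online policy), and the constant C are notation introduced here for illustration, and K is the horizon over which the paper states its bound; none of these names are taken from the paper.

```latex
\mathrm{Regret}(K) \;=\; U^{\mathrm{off}} - U^{\mathrm{on}} \;\le\; C\sqrt{K}
\quad\Longrightarrow\quad
\frac{\mathrm{Regret}(K)}{K} \;\le\; \frac{C}{\sqrt{K}} \;\longrightarrow\; 0
\quad \text{as } K \to \infty .
```

A sublinear bound of this kind says the average per-step gap to the offline benchmark vanishes as the horizon grows; it does not by itself say how large the gap is at any practical K.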

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The variance-based utility could be reused for other instance-dependent sampling decisions such as choosing which trajectories to retain or which prompts to up-weight in the data mix.
If the Beta success model is replaced by a richer posterior that tracks gradient magnitude, the same dual machinery might automatically allocate compute where learning progress is fastest.
The online dual updates could be run in a streaming fashion, allowing the rollout budget to be adjusted mid-epoch without restarting the optimizer.

Load-bearing premise

The posterior expected Bernoulli variance accurately estimates the marginal training value of an extra rollout for any given prompt.

What would settle it

Running CERO and GRPO on the same set of prompts with identical total rollout count and observing that CERO does not produce higher final accuracy on the target benchmarks.

Original abstract

LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CERO gives a Bayesian adaptive rollout scheme with a dual trick and regret bound, but the abstract leaves the experiments and proof details too thin to assess the gains.

The main takeaway is that this work treats rollout count as an optimizable resource rather than a fixed hyperparameter. It maintains per-prompt Beta posteriors, takes expected Bernoulli variance as the utility of one more rollout, builds a concave saturating function from that, and then uses a Fenchel dual plus projected online gradient descent to allocate a global budget across prompts and epochs. Under fixed utilities they prove an O(sqrt(K)) regret bound against the offline optimum.
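
The allocation loop described in that paragraph can be sketched in a heavily simplified form. The version below keeps a single budget-level price and updates it by projected online gradient descent; it omits the prompt-level dual variables and the Fenchel reformulation the abstract mentions, treats the budget as a soft target rather than a hard cap, and reuses an assumed exponential utility. The function names, the step size, the per-epoch cap, and the prompt values are all hypothetical.

```python
import math


def marginal_gain(n: int, value: float, scale: float = 8.0) -> float:
    # Gain of the (n + 1)-th rollout under the assumed utility
    # value * (1 - exp(-n / scale)).
    return value * (math.exp(-n / scale) - math.exp(-(n + 1) / scale))


def run_dual_allocation(values, budget_per_epoch, epochs, step=1e-4, cap=16):
    """Allocate rollouts with one budget price updated by projected OGD."""
    price = 0.0                     # dual variable for the budget constraint
    cumulative = [0] * len(values)  # rollouts given to each prompt so far
    history = []
    for _ in range(epochs):
        spent = 0
        for i, value in enumerate(values):
            given = 0
            # Best response to the price: keep sampling while the next
            # rollout is worth more than it costs.
            while given < cap and marginal_gain(cumulative[i], value) > price:
                cumulative[i] += 1
                given += 1
            spent += given
        # Gradient step on the dual, then projection onto price >= 0.
        price = max(0.0, price + step * (spent - budget_per_epoch))
        history.append((spent, price))
    return cumulative, history


values = [0.05, 0.10, 0.20, 0.25]  # stand-ins for posterior expected variances
cumulative, history = run_dual_allocation(values, budget_per_epoch=8, epochs=20)
print("cumulative rollouts per prompt:", cumulative)
print("final price:", round(history[-1][1], 5))
```

Because utilities depend on cumulative counts, spending early on a prompt lowers its marginal gain in every later epoch, which is the cross-epoch coupling the letter refers to. This sketch makes no claim to reproduce the stated regret guarantee.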

That combination of posterior variance utility and the dual reformulation for temporal coupling is the concrete new piece. Framing the problem as online resource allocation with diminishing returns is also a clean way to think about it, and the claimed outperformance over GRPO on math-reasoning tasks with open-weight models would matter if it holds.

The soft spots are the missing pieces. The abstract states the regret bound and the experimental wins but gives no derivation steps, no description of how the posteriors are initialized or updated in the actual training loop, and no information on baseline controls, dataset sizes, or statistical reporting. The bound is stated for fixed utilities, yet the method fits those utilities from data, so the connection between the theorem and the running algorithm needs checking. The central modeling choice, that posterior expected variance is a good proxy for marginal value, also rests on an assumption that is plausible but not obviously tight.

This is aimed at people working on sample-efficient RL post-training for LLMs. A reader who cares about adaptive budgeting or online convex optimization applied to training loops would find the formulation worth seeing. It deserves a serious referee because it has both a formal result and empirical claims on a practical issue, even though the current write-up leaves enough gaps that heavy revision would be expected.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper formulates adaptive rollout allocation as an online resource allocation problem with diminishing returns, constructs a concave utility from posterior expected Bernoulli variance, derives a Fenchel-dual reformulation, and applies projected online gradient descent. It then proves an O(sqrt(K)) regret bound against an offline benchmark specifically under fixed prompt utilities. This bound is a standard result in online convex optimization and does not reduce to or depend on the Beta-posterior estimation procedure or the fitted utilities used in the algorithm. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatzes smuggled via citation appear in the derivation. The central theoretical claim remains independent of the empirical utility estimates.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on modeling rollout value via Beta posteriors and assuming diminishing returns that justify a concave utility; these modeling choices are domain assumptions whose validity is not independently verified in the abstract.

free parameters (1)

Beta prior hyperparameters
Initial Beta parameters for each prompt's posterior must be chosen before any data arrives.
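
To show why the prior counts as a free parameter, the sketch below feeds the same four rollouts (3 successes, 1 failure) to three candidate priors and compares the resulting expected Bernoulli variance. The three priors and their labels are illustrative choices only; the source does not say which prior CERO uses.

```python
def expected_bernoulli_variance(alpha: float, beta: float) -> float:
    # E[p(1 - p)] for p ~ Beta(alpha, beta).
    return alpha * beta / ((alpha + beta) * (alpha + beta + 1.0))


# Candidate priors (hypothetical; not taken from the paper).
priors = {
    "uniform": (1.0, 1.0),
    "jeffreys": (0.5, 0.5),
    "confident_half": (5.0, 5.0),
}
successes, failures = 3, 1  # the same observed rollouts for every prior

for name, (a0, b0) in priors.items():
    before = expected_bernoulli_variance(a0, b0)
    after = expected_bernoulli_variance(a0 + successes, b0 + failures)
    print(f"{name}: before={before:.4f} after={after:.4f}")
```

With only a handful of rollouts per prompt the prior visibly shifts the score, so early allocation decisions can depend on this choice until enough outcomes have been observed.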

axioms (1)

domain assumption: Rollout outcomes for each prompt are Bernoulli draws with a prompt-level success probability, and the variance of that Bernoulli estimates the marginal value of a rollout
Invoked to turn uncertainty into a utility score that saturates.

pith-pipeline@v0.9.1-grok · 5738 in / 1161 out tokens · 36221 ms · 2026-06-28T02:30:09.213036+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 17 canonical work pages · 7 internal anchors

[1] N. Buchbinder and J. Naor. Improved bounds for online routing and packing via a primal-dual approach. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 293–304. IEEE, 2006.

[2] X. Chen, J. Lu, M. Kim, D. Zhang, J. Tang, A. Piché, N. Gontier, Y. Bengio, and E. Kamalloo. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970.

[3] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[4] Z. Hu, J. Qiu, T. Bai, H. Yang, B. Yuan, Q. Jing, C. He, and W. Zhang. Vade: Variance-aware dynamic sampling via online sample-level difficulty estimation for multimodal rl. arXiv preprint arXiv:2511.18902.

[5] J. Jiang and J. Zhang. Online resource allocation with stochastic resource consumption. arXiv preprint arXiv:2012.07933.

[6] Z. Li, C. Chen, T. Yang, T. Ding, R. Sun, G. Zhang, W. Huang, and Z.-Q. Luo. Knapsack rl: Unlocking exploration of llms via optimizing budget allocation. arXiv preprint arXiv:2509.25849.

[7] Z. Lin, M. Lin, Y. Xie, and R. Ji. Cppo: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342.

[8] I. Mahrooghi, A. Lotfi, and E. Abbe. Goldilocks rl: Tuning task difficulty to escape sparse rewards for reasoning. arXiv preprint arXiv:2602.14868.

[9] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseek-math: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[10] V. Shrivastava, A. Awadallah, V. Balachandran, S. Garg, H. Behl, and D. Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726.

[11] Y. E. Xu, Y. Savani, F. Fang, and J. Z. Kolter. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818.

[12] A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

[13] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[14] Z. Yao, Y.-K. Zhang, Y. Chen, Y. Sun, Z. Xu, Y. Yang, T. Hu, Q. Gu, H. Su, and X. Cai. Coba-rl: Capability-oriented budget allocation for reinforcement learning in llms. arXiv preprint arXiv:2602.03048.

[15] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[16] R. Zhang, D. Arora, S. Mei, and A. Zanette. Speed-rl: Faster training of reasoning models via online curriculum learning. arXiv preprint arXiv:2506.09016.

[17] Z. Zhang, Z. Han, C. Mavromatis, Q. Zhu, Y. Zhang, S. Guan, D. Wang, X. Zhou, S. Wang, S. Adeshina, et al. Train less, learn more: Adaptive efficient rollout optimization for group-based reinforcement learning. arXiv preprint arXiv:2602.14338.

[18] Y. Zong and J. Jiang. Online semi-infinite linear programming: Efficient algorithms via function approximation. arXiv preprint arXiv:2603.16200.
