How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization
4 Pith papers cite this work.
abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance heterogeneity across problems, and (ii) the softmax policy structure causes gradient attenuation for high-confidence correct actions, while excessive gradient updates may destabilize training. We therefore propose DynaMO, a theoretically grounded dual-pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive a variance-minimizing allocation from first principles, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient-aware advantage modulation grounded in theoretical analysis of gradient magnitude bounds. Our framework compensates for the gradient attenuation of high-confidence correct actions while using entropy changes as a computable indicator to rein in excessive update magnitudes. Extensive experiments on diverse mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at: https://github.com/GithubX-F/DynaMO-RL.
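The abstract names its two mechanisms but gives no formulas, so the sketch below is one plausible reading rather than DynaMO's actual algorithm: `allocate_rollouts`, `modulated_advantage`, and the `comp_cap` / `entropy_tol` knobs are all invented for illustration. The facts it leans on are standard: for a softmax policy, the gradient of $\log \pi_\theta(a)$ with respect to the chosen action's logit is $1 - \pi_\theta(a)$, so the update for a correct action shrinks toward zero as its probability approaches 1, which is the "gradient attenuation" the abstract compensates for. Likewise, the Bernoulli variance $p(1-p)$ of a prompt's pass rate $p$ peaks at $p = 0.5$ and vanishes at $p \in \{0, 1\}$, matching the intuition that always-solved and never-solved prompts yield no learning signal.

```python
import numpy as np

def allocate_rollouts(pass_rates, total_budget, min_per_prompt=1):
    """Split a fixed rollout budget across prompts in proportion to the
    Bernoulli variance p*(1-p) of each prompt's pass rate, the abstract's
    proxy for gradient informativeness. Prompts the policy always solves
    (p=1) or always fails (p=0) carry no signal and get only the minimum."""
    p = np.asarray(pass_rates, dtype=float)
    assert total_budget >= min_per_prompt * len(p), "budget too small"
    var = p * (1.0 - p)                       # Bernoulli variance per prompt
    if var.sum() == 0.0:                      # degenerate batch: fall back to uniform
        weights = np.full_like(p, 1.0 / len(p))
    else:
        weights = var / var.sum()
    spare = total_budget - min_per_prompt * len(p)
    alloc = min_per_prompt + np.floor(weights * spare).astype(int)
    # hand rounding leftovers to the highest-variance prompts
    for i in np.argsort(-var)[: total_budget - alloc.sum()]:
        alloc[i] += 1
    return alloc

def modulated_advantage(advantage, token_prob, entropy_delta,
                        comp_cap=4.0, entropy_tol=0.05):
    """Illustrative token-level modulation: boost positive advantages to
    offset the (1 - pi) softmax gradient factor on confident correct
    tokens, and damp the update when per-token entropy is collapsing,
    a cheap computable instability indicator."""
    scale = 1.0
    if advantage > 0:                         # correct action: compensate attenuation
        scale = min(1.0 / max(1.0 - token_prob, 1e-6), comp_cap)
    if entropy_delta < -entropy_tol:          # entropy dropping fast: damp the update
        scale *= 0.5
    return advantage * scale

if __name__ == "__main__":
    # eight prompts with estimated pass rates, 64 rollouts to spend
    rates = [0.0, 0.1, 0.3, 0.5, 0.5, 0.7, 0.9, 1.0]
    print(allocate_rollouts(rates, total_budget=64))   # e.g. [ 1  5 12 14 14 12  5  1]
    print(modulated_advantage(1.0, token_prob=0.95, entropy_delta=-0.1))  # 2.0
```

Floor-then-distribute keeps the budget exactly spent, and the cap on the compensation factor is needed because $1/(1 - \pi)$ blows up for near-deterministic tokens; the paper's actual bounds and indicators may differ.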
citing papers explorer
- DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
  DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
- Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
  Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates to ~50% in binary-reward GRPO, delivering 2.01x and 1.55x speedups on Qwen3-14B/32B with slight score improvements on SWE-bench Verified.
- Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
  CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.
- From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
  PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.