DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Adaptive rollout allocation for online reinforcement learning with verifiable rewards
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
citing papers explorer
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.