pith. machine review for the scientific record.

arxiv: 2504.13818 · v5 · submitted 2025-04-18 · cs.LG · cs.AI · cs.CL

Recognition: unknown

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Fei Fang, J. Zico Kolter, Yash Savani, Yixuan Even Xu

classification: cs.LG · cs.AI · cs.CL
keywords: policy, down-sampling, learning, rollouts, generation, GRPO, optimization, PODS

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
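The max-variance selection step described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function name `max_variance_subset` and its interface are hypothetical, and the sketch assumes the standard property that a fixed-size subset of maximal variance can always be built from the k highest and m - k lowest rewards for some k. Under that assumption, sorting once and scanning k with prefix sums gives the $O(n\log n)$ cost mentioned above.

```python
from typing import List, Sequence


def max_variance_subset(rewards: Sequence[float], m: int) -> List[int]:
    """Return indices of m rollouts whose rewards have maximal variance.

    Hypothetical helper for illustration only. Assumes an optimal subset
    consists of the (m - k) lowest and k highest rewards for some k.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    # Sort rollout indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])

    # Prefix sums of sorted rewards and squared rewards.
    prefix = [0.0]
    prefix_sq = [0.0]
    for i in order:
        r = float(rewards[i])
        prefix.append(prefix[-1] + r)
        prefix_sq.append(prefix_sq[-1] + r * r)

    best_k, best_var = 0, -1.0
    for k in range(m + 1):  # k = number of rollouts taken from the top
        low = m - k         # number of rollouts taken from the bottom
        total = prefix[low] + (prefix[n] - prefix[n - k])
        total_sq = prefix_sq[low] + (prefix_sq[n] - prefix_sq[n - k])
        var = total_sq / m - (total / m) ** 2  # population variance
        if var > best_var:
            best_k, best_var = k, var

    # Lowest (m - best_k) indices followed by the highest best_k indices.
    return order[: m - best_k] + order[n - best_k:]


# Example: six rollouts for one prompt, keep four for the policy update.
rewards = [0.0, 0.0, 0.0, 1.0, 1.0, 0.5]
selected = max_variance_subset(rewards, m=4)
```

In this example the selection keeps two failing and two fully correct rollouts and drops the partially rewarded one, which is the sense in which the criterion favors reward diversity. In a GRPO-style loop, only the selected rollouts would be passed to the policy update; how advantages are normalized over the selected subset is not specified in the abstract and is left out of this sketch.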

This paper has not been read by Pith yet.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  2. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  3. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  4. TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

    eess.SP 2026-04 unverdicted novelty 5.0

    TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

  5. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  6. Your Model Diversity, Not Method, Determines Reasoning Strategy

    cs.AI 2026-04 unverdicted novelty 5.0

    The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.