XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Fan Lai; Haizhong Zheng; Minghao Fang; Udbhav Bamba; Yifan Yu

arxiv: 2510.06672 · v3 · pith:R3RNULAHnew · submitted 2025-10-08 · 💻 cs.LG

classification 💻 cs.LG

keywords xrpo, grpo, prompts, exploration, reasoning, rollout, across, advances

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, owing to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing methods (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7x.
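
The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, written under stated assumptions rather than taken from the paper. `group_relative_advantages` shows the usual GRPO-style normalization of rewards within one prompt's rollout group. `allocate_rollouts` is a hypothetical greedy allocator that hands each extra rollout to the prompt whose estimated success rate would gain the largest drop in standard error; the function names, the Laplace smoothing, and the standard-error criterion are all illustrative assumptions, not XRPO's actual allocator.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def allocate_rollouts(stats, budget):
    """Greedily split `budget` extra rollouts across prompts.

    stats: list of (successes, count) per prompt, with count >= 1.
    ASSUMPTION: "uncertainty reduction" is modeled as the decrease in the
    standard error of a Bernoulli success-rate estimate. This is an
    illustrative stand-in, not the allocator defined in the paper.
    """
    # Laplace-smoothed success rate, so all-fail or all-pass prompts
    # still carry a small, non-zero uncertainty.
    spread = []
    for successes, count in stats:
        p = (successes + 1) / (count + 2)
        spread.append(math.sqrt(p * (1 - p)))

    counts = [count for _, count in stats]
    extra = [0] * len(stats)
    for _ in range(budget):
        # Standard-error drop from one more rollout on each prompt.
        gains = [
            spread[i] * (1 / math.sqrt(counts[i]) - 1 / math.sqrt(counts[i] + 1))
            for i in range(len(stats))
        ]
        best = max(range(len(stats)), key=gains.__getitem__)
        extra[best] += 1
        counts[best] += 1
    return extra


if __name__ == "__main__":
    # Binary rewards for four rollouts of a single prompt.
    print(group_relative_advantages([1, 0, 1, 0]))
    # Prompt 0 is solved half the time, prompt 1 always: the uncertain
    # prompt should receive more of the extra budget.
    print(allocate_rollouts([(4, 8), (8, 8)], budget=4))
```

Under this toy criterion, a prompt solved about half the time attracts more rollouts than one that is always solved, which mirrors the abstract's point that a fixed per-prompt count (e.g., 16) spends budget where little is left to learn.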

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
stat.ML 2026-05 unverdicted novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRP...

LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
cs.DB 2026-04 unverdicted novelty 7.0

LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
cs.LG 2026-03 unverdicted novelty 7.0

ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG 2026-02 unverdicted novelty 6.0

PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG 2026-02 unverdicted novelty 5.0

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution o...