F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Alexey Gorbatovski; Alexey Malakhov; Boris Shaposhnikov; Daniil Gavrilov; Daniil Plyusov; Daria Korotyshova; Viacheslav Sinii

arxiv: 2602.06717 · v2 · pith:LRNCKBYCnew · submitted 2026-02-06 · 💻 cs.LG · cs.AI

keywords: group, categorical, GRPO, updates, behavior, CISPO, computational

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) commonly relies on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so training proceeds with finite rollout sets, and an update can reinforce only the correct behavior those rollouts expose. At practical group sizes, a group can contain mixed rewards yet still miss rare-correct trajectories, so updates concentrate probability on the more common sampled solutions. We derive the probability of such prompt-local tail-miss events as a function of group size, showing that it is non-monotonic, and, in the categorical abstraction, characterize how unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware scaling coefficient, inspired by Focal loss, that down-weights updates on high-success sampled groups. Empirically, a categorical simulation illustrates the effect, Maze provides a single-solution test, and LLM experiments include a representative GRPO group-size sweep together with fixed-$N$ transfer across GRPO, DAPO, and CISPO. On Qwen2.5-7B at $N{=}8$, our method improves average math pass@256 from 64.1 to 70.3 (GRPO), from 69.3 to 72.5 (DAPO), and from 73.2 to 76.8 (CISPO); OOD pass@256 also improves in all three cases, without increasing group size or computational cost.
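The two ideas in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's derivation: the abstract does not give the tail-miss formula or the exact form of the scaling coefficient. The first function uses a hypothetical three-outcome categorical model (common-correct, rare-correct, incorrect) and counts a "tail miss" as a group with mixed rewards and no rare-correct sample. The second applies the standard Focal loss modulating factor, `(1 - p) ** gamma`, to the group success rate and uses it to scale group-normalized advantages. The names `tail_miss_probability`, `focal_scaled_advantages`, `p_common`, `p_rare`, and `gamma` are all hypothetical.

```python
import numpy as np


def tail_miss_probability(p_common: float, p_rare: float, n: int) -> float:
    """P(a group of n i.i.d. rollouts has mixed rewards but no rare-correct sample).

    Assumed toy model: each rollout is common-correct with probability
    p_common, rare-correct with probability p_rare, and incorrect otherwise.
    """
    p_wrong = 1.0 - p_common - p_rare
    # No rare-correct sample: every rollout is common-correct or incorrect.
    no_rare = (p_common + p_wrong) ** n
    # Subtract the two unmixed cases: all common-correct, or all incorrect.
    return no_rare - p_common ** n - p_wrong ** n


def focal_scaled_advantages(rewards, gamma: float = 1.0, eps: float = 1e-6):
    """Group-normalized advantages scaled by a focal-style weight.

    Assumed form: weight = (1 - success_rate) ** gamma, so groups that are
    already mostly correct contribute smaller updates.
    """
    rewards = np.asarray(rewards, dtype=float)
    success_rate = rewards.mean()
    # Group-relative advantage: center by the group mean, divide by the group std.
    advantages = (rewards - success_rate) / (rewards.std() + eps)
    weight = (1.0 - success_rate) ** gamma
    return weight * advantages


if __name__ == "__main__":
    # Sweep the group size to see the rise-then-fall pattern in the toy model.
    for n in (1, 2, 4, 8, 16, 64):
        print(n, tail_miss_probability(p_common=0.5, p_rare=0.05, n=n))
    # A high-success group is down-weighted relative to a low-success one.
    print(focal_scaled_advantages([1, 1, 1, 0]))
    print(focal_scaled_advantages([1, 0, 0, 0]))
```

In this toy model the probability is zero at `n = 1` (a single rollout cannot have mixed rewards), rises at small group sizes, and decays again as larger groups become likely to contain the rare-correct outcome, which is consistent with the non-monotonic behavior the abstract describes.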

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
cs.LG 2026-05 unverdicted novelty 7.0

UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
cs.LG 2026-05 unverdicted novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...