pith. machine review for the scientific record.

arxiv: 2511.00066 · v4 · submitted 2025-10-29 · 💻 cs.LG

Recognition: unknown

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords grpo, generalization, optimization, rlvr, group, grpo-sg, large, loss
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
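The abstract describes GRPO-SG as a token-weighted variant of GRPO that downweights tokens likely to produce large gradients. The paper's exact weighting rule is not given here, so the following is only a minimal NumPy sketch under stated assumptions: group-relative advantages are the standard GRPO normalization (reward minus group mean, divided by group std), and the per-token sharpness surrogate `g_t = |A| * (1 - p_t)` is a hypothetical proxy (low-probability tokens yield large score-function gradients), not the authors' formula. The weight function `1 / (1 + g / tau)` and the `tau` parameter are likewise illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards):
    # Standard GRPO advantage: normalize each sampled completion's reward
    # against the statistics of its own group of rollouts.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_sg_token_weights(token_logprobs, advantage, tau=1.0):
    # Hypothetical sharpness proxy: low-probability tokens produce large
    # score-function gradients, so use g_t = |A| * (1 - p_t) as a cheap
    # surrogate for the per-token gradient norm. (Assumed form, not the
    # paper's.)
    p = np.exp(token_logprobs)
    g = np.abs(advantage) * (1.0 - p)
    # Downweight sharp tokens; tau controls how aggressively they are
    # suppressed. Weights lie in (0, 1].
    return 1.0 / (1.0 + g / tau)

def grpo_sg_loss(token_logprobs, advantage, tau=1.0):
    # Token-weighted policy-gradient surrogate: plain GRPO would weight all
    # tokens uniformly; GRPO-SG shrinks the contribution of tokens whose
    # updates are likely to be sharp.
    w = grpo_sg_token_weights(token_logprobs, advantage, tau)
    return -(w * advantage * token_logprobs).mean()
```

With this surrogate, a confidently generated token (p = 0.9) keeps a weight near 1, while a rare token (p = 0.1) in a high-advantage rollout is shrunk substantially, which is the qualitative behavior the abstract attributes to GRPO-SG.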

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

    cs.CL · 2026-02 · unverdicted · novelty 6.0

    STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.