pith. machine review for the scientific record. sign in

arxiv: 2510.00977 · v3 · submitted 2025-10-01 · 💻 cs.LG · cs.CL

Recognition: unknown

It Takes Two: Your GRPO Is Secretly DPO

Authors on Pith no claims yet
classification 💻 cs.LG cs.CL
keywords grpocontrastivelearningmethodsonlyrolloutsviewaccurate
0
0 comments X
read the original abstract

GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating the \emph{value baselines} from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains $97.6\%$ of the performance of 16-GRPO, while requiring only $12.5\%$ of the rollouts and $21\%$ of the training time.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  2. SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

  3. Interactive Critique-Revision Training for Reliable Structured LLM Generation

    cs.LG 2026-05 unverdicted novelty 5.0

    DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.