
arxiv: 2601.10201 · v2 · pith:MPY37EC3new · submitted 2026-01-15 · 💻 cs.LG · cs.AI · cs.CL

Future-KL Regularized GRPO: Process-Level Credit Assignment from f-Divergence Regularization

classification 💻 cs.LG · cs.AI · cs.CL
keywords grpo, regularization, policy, advantage, divergence, frpo, future-kl, group

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
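The future KL correction described in the abstract (a reverse cumulative sum of per-token log ratios, added after advantage construction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the coefficient `beta`, the sign convention, the padding mask, and the choice to include the current token's own log ratio in the sum are all hypothetical and not taken from the abstract.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def future_log_ratio_sum(log_ratios, mask):
    """Reverse cumulative sum along the token axis.

    Entry t holds the sum of masked log ratios from position t to the end
    of the sequence (assumption: the current token is included).
    """
    mask = np.asarray(mask, dtype=np.float64)
    masked = np.asarray(log_ratios, dtype=np.float64) * mask
    reverse_cumsum = np.flip(np.cumsum(np.flip(masked, axis=-1), axis=-1), axis=-1)
    return reverse_cumsum * mask


def future_kl_corrected_advantages(rewards, log_ratios, mask, beta=0.01):
    """Build group-standardized advantages first, then add the correction.

    Assumptions: `log_ratios[i, t]` is log pi(y_t) - log pi_ref(y_t) for
    response i, and the correction enters as -beta times the future sum.
    """
    mask = np.asarray(mask, dtype=np.float64)
    token_adv = grpo_advantages(rewards)[:, None] * mask
    return token_adv - beta * future_log_ratio_sum(log_ratios, mask)


# Usage: a group of 4 responses to one prompt, up to 3 tokens each.
rewards = [1.0, 0.0, 1.0, 0.0]            # binary verifier rewards
log_ratios = np.array([
    [0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [-0.1, 0.1, 0.0],
])
mask = np.array([
    [1, 1, 1],
    [1, 1, 0],                            # last position is padding
    [1, 1, 1],
    [1, 1, 1],
])
adv = future_kl_corrected_advantages(rewards, log_ratios, mask, beta=0.5)
```

The point of the sketch is the ordering: rewards are standardized within the group before the regularization term is added, which is consistent with the abstract's statement that reward and KL cannot be fused before normalization without changing the implicit objective.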



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  2. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  3. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.

  4. LLM Reasoning with Process Rewards for Outcome-Guided Steps

    cs.LG 2026-02 unverdicted novelty 5.0

    PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.