Future-KL Regularized GRPO: Process-Level Credit Assignment from f-Divergence Regularization
read the original abstract
Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
-
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.
-
LLM Reasoning with Process Rewards for Outcome-Guided Steps
PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.