Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

Hao Bai; Jiarui Yao; Ruida Wang; Tong Zhang

arxiv: 2601.10201 · v2 · pith:MPY37EC3new · submitted 2026-01-15 · cs.LG · cs.AI · cs.CL

classification: cs.LG, cs.AI, cs.CL

keywords: grpo, regularization, policy, advantage, divergence, frpo, future-kl, group

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
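The two concrete ingredients named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the coefficient `beta`, the sign of the correction, the choice to include the current token in the sum, and the padding `mask` are all hypothetical, since the abstract only says that a reverse cumulative sum of per-token log ratios is added after advantage construction. The helper `binary_utility` reflects one plausible reading of the $2\arcsin\sqrt p$ claim: for Bernoulli rewards with success rate $p$, dividing by the group standard deviation $\sqrt{p(1-p)}$ rescales the reward gradient by $1/\sqrt{p(1-p)}$, which is exactly the derivative of $2\arcsin\sqrt p$.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def future_kl_return_to_go(log_ratios, mask):
    """Reverse cumulative sum over the token axis.

    Entry [i, t] holds sum over s >= t of log_ratios[i, s], ignoring padding.
    Including the current token s = t is an assumption of this sketch.
    """
    masked = np.asarray(log_ratios, dtype=np.float64) * np.asarray(mask, dtype=np.float64)
    return np.cumsum(masked[:, ::-1], axis=1)[:, ::-1]


def frpo_style_advantages(rewards, log_ratios, mask, beta=0.01):
    """Token-level advantages: GRPO advantage minus beta times the future log-ratio sum.

    The correction is applied AFTER group normalization, as the abstract requires.
    The sign and the coefficient beta are assumptions, not taken from the paper.
    """
    mask = np.asarray(mask, dtype=np.float64)
    adv = grpo_advantages(rewards)[:, None]  # shape (G, 1), broadcast over tokens
    future_kl = future_kl_return_to_go(log_ratios, mask)  # shape (G, T)
    return (adv - beta * future_kl) * mask


def binary_utility(p):
    """Prompt-level utility 2*arcsin(sqrt(p)) for binary verifier rewards."""
    return 2.0 * np.arcsin(np.sqrt(p))
```

Here `log_ratios[i, t]` is assumed to be the per-token log ratio between the current policy and the reference policy for response `i` at token `t`, with shape `(G, T)` for a group of `G` responses padded to length `T`. Because the sum runs from each token to the end of the sequence, early tokens are charged for the divergence they lead into later, which a purely local per-token penalty would not capture.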

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unsupervised Process Reward Models
cs.LG · 2026-05 · unverdicted · novelty 7.0
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
cs.CL · 2026-04 · unverdicted · novelty 5.0
A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.

LLM Reasoning with Process Rewards for Outcome-Guided Steps
cs.LG · 2026-02 · unverdicted · novelty 5.0
PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.