CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

· 2025 · cs.LG · arXiv 2509.20712

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

open full Pith review browse 7 citing papers arXiv PDF

abstract

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}oordinating \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.

representative citing papers

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

cs.LG · 2026-02-10 · unverdicted · novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

cs.LG · 2025-12-05 · unverdicted · novelty 6.0

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-11-08 · unverdicted · novelty 6.0

Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

cs.LG · 2026-06-17 · unverdicted · novelty 5.0

STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

cs.AI · 2026-04-20 · conditional · novelty 4.0

Novice programmers completed more tasks with lower workload using GitHub Copilot versus a human partner, but reported significantly more positive and arousing emotions with the human teammate.

citing papers explorer

Showing 7 of 7 citing papers.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control cs.LG · 2026-05-12 · unverdicted · none · ref 23 · 2 links · internal anchor
Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective cs.LG · 2026-02-10 · unverdicted · none · ref 17 · internal anchor
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning cs.LG · 2025-12-05 · unverdicted · none · ref 18 · internal anchor
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
Revisiting Entropy in Reinforcement Learning for Large Reasoning Models cs.CL · 2025-11-08 · unverdicted · none · ref 3 · internal anchor
Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability cs.LG · 2026-06-17 · unverdicted · none · ref 31 · internal anchor
STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.
Targeted Exploration via Unified Entropy Control for Reinforcement Learning cs.AI · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning cs.AI · 2026-04-20 · conditional · none · ref 35 · internal anchor
Novice programmers completed more tasks with lower workload using GitHub Copilot versus a human partner, but reported significantly more positive and arousing emotions with the human teammate.

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

fields

years

verdicts

representative citing papers

citing papers explorer