Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) serves as a cornerstone technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, its training is often plagued by entropy collapse, a rapid decline in policy entropy that limits exploration and undermines training effectiveness. While recent works attempt to mitigate this issue via a variety of heuristic entropy interventions, the underlying mechanisms remain poorly understood. In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) We derive a tight analytical approximation for token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework to explain how existing methods influence entropy; (2) We reveal a fundamental limitation of recent approaches: they rely on heuristic adjustments to one or two of these factors, leaving the remaining relevant factors unaddressed, thus inherently limiting their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically estimated entropy variations. Extensive experiments across six mathematical reasoning and three coding benchmarks demonstrate that STEER effectively mitigates entropy collapse and consistently outperforms state-of-the-art baselines.
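To make the abstract's mechanism concrete, here is a minimal sketch of entropy-aware token reweighting. It is not the paper's actual four-factor estimator; the function name, the alpha hyperparameter, and the scoring rule are all illustrative assumptions. It stands in for the estimated per-token entropy variation with a covariance-style term between token log-probability and advantage, a first-order quantity that related analyses of entropy dynamics in RLVR commonly identify as driving entropy change.

```python
import torch

def entropy_aware_token_weights(logits, actions, advantages, alpha=1.0):
    """Illustrative sketch (not the paper's estimator): reweight tokens
    by an estimated per-token contribution to entropy change.

    Intuition: when a token's high log-probability coincides with a high
    advantage, the policy-gradient update sharpens the distribution and
    pushes entropy down. Down-weighting such tokens is one way to slow
    entropy collapse.
    """
    log_probs = torch.log_softmax(logits, dim=-1)                          # (T, V)
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (T,)

    # Covariance-style score: each token's centered contribution; a
    # positive value suggests this token's update reduces entropy.
    centered_logp = token_logp - token_logp.mean()
    centered_adv = advantages - advantages.mean()
    entropy_drop_score = centered_logp * centered_adv                      # (T,)

    # Down-weight entropy-reducing tokens, up-weight entropy-raising ones,
    # then normalize so the mean weight is 1 (keeps the loss scale stable).
    weights = torch.exp(-alpha * entropy_drop_score)
    return weights / weights.mean()

# Usage: scale the per-token policy-gradient loss before reduction, e.g.
#   w = entropy_aware_token_weights(logits, actions, advantages).detach()
#   loss = -(w * token_logp * advantages).mean()
```

Detaching the weights keeps the modulation from feeding gradients back through the estimator, so it only rescales the update rather than changing the objective itself.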
Forward citations
Cited by 3 Pith papers
- Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
  RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
- Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
  Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
  OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy-change contributions, yielding more stable optimization and better results on math benchmarks.