Recognition: 3 theorem links · Lean Theorem
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Pith reviewed 2026-05-12 12:25 UTC · model grok-4.3
The pith
Reinforcement learning performance for reasoning language models is traded directly from policy entropy and saturates when entropy reaches zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that downstream performance in these RL setups is traded from policy entropy according to the fitted relation R = -a * e^H + b, with the covariance between action probabilities and logit changes driving the entropy reduction. Empirically, the measured covariance term matches the observed entropy differences, and its consistent positivity explains the collapse. Managing entropy through targeted restrictions on high-covariance tokens lets RL for reasoning make better use of additional compute.
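To make the fitted relation concrete, here is a minimal sketch of how a and b could be estimated from logged (H, R) pairs. The data below are synthetic and the function name `fit_entropy_reward` is hypothetical; the paper's own fitting procedure is not described on this page. The model is linear in (a, b) once e^H is used as the regressor, so ordinary least squares is enough.

```python
import numpy as np

def fit_entropy_reward(H, R):
    """Least-squares fit of R = -a * exp(H) + b. Returns (a, b)."""
    H = np.asarray(H, dtype=float)
    R = np.asarray(R, dtype=float)
    # With x = exp(H), the model R = -a * x + b is linear in (a, b).
    X = np.column_stack([-np.exp(H), np.ones_like(H)])
    (a, b), *_ = np.linalg.lstsq(X, R, rcond=None)
    return float(a), float(b)

# Synthetic trajectory (made-up numbers, NOT from the paper):
# entropy falls from 1.2 to 0.05 while reward rises.
rng = np.random.default_rng(0)
H = np.linspace(1.2, 0.05, 40)
R = -0.15 * np.exp(H) + 0.70 + rng.normal(0.0, 0.002, H.size)

a, b = fit_entropy_reward(H, R)
ceiling = b - a  # predicted reward at H = 0, i.e. R = -a + b
print(f"a={a:.3f}  b={b:.3f}  predicted ceiling={ceiling:.3f}")
```

Because the ceiling is just b - a, it can be read off as soon as the early part of a run has been fitted, which is the sense in which the relation is described as predictive.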
What carries the argument
The covariance between action probability and the change in logits, which drives the reduction in policy entropy and is proportional to advantage under policy-gradient updates.
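The first-order calculation behind this statement can be sketched as follows. It assumes a softmax policy over logits z and takes the log-probability as the first argument of the covariance; that choice is an assumption here, since this page only says "action probability". The last line additionally assumes a tabular softmax policy, a vanilla policy-gradient step with step size \eta, and an advantage A centered under \pi. None of these symbols are defined on this page.

```latex
% Entropy of a softmax policy: H(\pi) = -\sum_a \pi(a) \log \pi(a),
% with \pi(a) = e^{z_a} / \sum_{a'} e^{z_{a'}}.
\begin{align}
\frac{\partial H}{\partial z_k}
  &= -\pi(k)\,\Bigl(\log \pi(k) - \mathbb{E}_{a \sim \pi}\bigl[\log \pi(a)\bigr]\Bigr), \\
\Delta H
  &\approx \sum_k \frac{\partial H}{\partial z_k}\,\Delta z_k
   = -\,\mathrm{Cov}_{a \sim \pi}\bigl(\log \pi(a),\, \Delta z_a\bigr), \\
\Delta z_a
  &= \eta\,\pi(a)\,A(a)
  \quad \text{(tabular softmax, vanilla policy gradient, } \mathbb{E}_{\pi}[A] = 0\text{)}.
\end{align}
```

Read this way, a positive covariance (actions that are already likely receive the larger logit increases) lowers entropy, and a negative covariance raises it.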
If this is right
- Without entropy-preserving interventions, RL training for reasoning will reach a hard performance ceiling once entropy is exhausted.
- Restricting updates on high-covariance tokens via clipping or KL penalty maintains exploration and raises final reward.
- The monotonic entropy drop is caused by the persistently positive covariance term.
- The entropy-performance relation makes the training ceiling predictable before entropy collapses.
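The idea of restricting updates on high-covariance tokens can be illustrated with a small sketch. This is a simplified stand-in, not the paper's Clip-Cov or KL-Cov: the per-token score, the top-fraction selection rule, the `frac` parameter, and the function name `high_cov_mask` are all assumptions made for illustration, and the inputs are random numbers.

```python
import numpy as np

def high_cov_mask(logp, adv, frac=0.02):
    """Flag the top `frac` of tokens by centered covariance contribution.

    Per-token score (assumed form): (logp_i - mean(logp)) * (adv_i - mean(adv)).
    Returns (mask, scores), where mask[i] is True for flagged tokens.
    """
    logp = np.asarray(logp, dtype=float)
    adv = np.asarray(adv, dtype=float)
    scores = (logp - logp.mean()) * (adv - adv.mean())
    k = max(1, int(round(frac * scores.size)))
    top = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    mask = np.zeros(scores.size, dtype=bool)
    mask[top] = True
    return mask, scores

# Made-up token log-probabilities and advantages for one batch.
rng = np.random.default_rng(0)
logp = rng.normal(-2.0, 1.0, 50)
adv = rng.normal(0.0, 1.0, 50)

mask, scores = high_cov_mask(logp, adv, frac=0.1)
print("flagged tokens:", int(mask.sum()), "of", mask.size)
```

In a training loop, the flagged tokens would then be excluded from the gradient (a clipping-style treatment) or given an extra KL penalty; which of the two is applied, and how the threshold is chosen, is where the two proposed techniques differ.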
Where Pith is reading between the lines
- The same covariance-driven mechanism may underlie the need for entropy bonuses in other RL settings such as RLHF.
- Covariance-based token selection could be tested in non-language RL domains where entropy collapse also occurs.
- The R-H relation could be checked on larger models or additional reasoning benchmarks to assess its generality.
- These controls might be combined with existing clipping or regularization techniques for additive gains.
Load-bearing premise
The covariance term between token probability and logit change remains mostly positive throughout training.
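Why the sign of this term matters can be checked numerically on a toy softmax policy. The sketch below assumes the logit change is directly proportional to a synthetic advantage, and it deliberately constructs one advantage that favors already-likely actions and one that does the opposite; neither is taken from the paper. It compares the actual entropy change with the first-order estimate -Cov(log π, Δz).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

def cov_term(p, dz):
    """Cov under p between log p(a) and the logit change dz(a)."""
    logp = np.log(p)
    return float((p * (logp - (p * logp).sum()) * (dz - (p * dz).sum())).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=8)          # toy logits over 8 actions
p = softmax(z)
eta = 1e-5                      # small step so the first-order estimate applies

# Synthetic advantage that rewards already-likely actions (assumption).
adv = np.log(p) - (p * np.log(p)).sum()

dz_pos = eta * adv              # reinforces likely actions
dz_neg = -eta * adv             # reinforces unlikely actions

cov_pos, cov_neg = cov_term(p, dz_pos), cov_term(p, dz_neg)
dH_pos = entropy(softmax(z + dz_pos)) - entropy(p)
dH_neg = entropy(softmax(z + dz_neg)) - entropy(p)

print(f"cov>0 case: cov={cov_pos:.3e}, dH={dH_pos:.3e}")
print(f"cov<0 case: cov={cov_neg:.3e}, dH={dH_neg:.3e}")
```

In the first case the covariance is positive and entropy falls; in the second the sign flips and entropy rises. The premise therefore amounts to saying that real training runs stay in the first regime most of the time.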
What would settle it
A training run in which performance continues to rise substantially after entropy has fallen near zero, or in which the observed R versus H curve deviates from the fitted exponential form.
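One way such a test could be operationalized is sketched below: fit the curve on a baseline run, then flag later observations that sit above it by more than a tolerance. The coefficients, the observations, the tolerance, and the function names are all hypothetical; nothing here comes from the paper's data.

```python
import numpy as np

def predicted_reward(H, a, b):
    """Fitted relation R = -a * exp(H) + b."""
    return -a * np.exp(H) + b

def exceeds_curve(H_obs, R_obs, a, b, tol):
    """True where an observed point lies above the fitted curve by more than tol."""
    H_obs = np.asarray(H_obs, dtype=float)
    R_obs = np.asarray(R_obs, dtype=float)
    return (R_obs - predicted_reward(H_obs, a, b)) > tol

# Hypothetical coefficients from a baseline fit.
a, b = 0.15, 0.70
ceiling = predicted_reward(0.0, a, b)   # equals b - a

# Hypothetical later observations: the last one has near-zero entropy
# but a reward well above the predicted ceiling.
H_obs = np.array([0.30, 0.02, 0.01])
R_obs = np.array([0.50, 0.548, 0.62])

flags = exceeds_curve(H_obs, R_obs, a, b, tol=0.02)
print("ceiling:", round(float(ceiling), 3), "flags:", flags.tolist())
```

A flagged point at near-zero entropy would be the kind of evidence described above. The tolerance should be chosen from the residual scatter of the baseline fit rather than fixed in advance.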
Original abstract
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that entropy collapse during RL training of reasoning LLMs limits performance, and establishes an empirical transformation equation R=-a*e^H+b between policy entropy H and downstream reward R, indicating that performance is traded from entropy and bottlenecked by its exhaustion with a predictable ceiling at H=0. It derives theoretically that entropy dynamics are driven by the covariance between action probability and logit change (proportional to advantage under policy-gradient updates), shows empirical match between this covariance and observed entropy differences, notes that the covariance remains mostly positive (explaining monotonic entropy decrease), and proposes Clip-Cov and KL-Cov interventions that restrict updates on high-covariance tokens to encourage exploration and improve final performance.
Significance. If the fitted R-H relation proves robust and the interventions are shown to respect or exceed the predicted curve, the work could meaningfully advance practical entropy management in RL for LLMs, addressing a recurring scaling obstacle. The covariance-based mechanistic account of entropy change is a clear strength that could guide future algorithm design, provided its generality is established beyond the reported runs.
major comments (3)
- The transformation equation R=-a*e^H+b is presented as an 'empirical law' that 'strongly indicates' performance is traded from entropy and 'bottlenecked by its exhaustion.' This relation is obtained by fitting observed (H,R) pairs along standard RL trajectories; no derivation connects the covariance-driven entropy dynamics to the specific exponential form. Moreover, while Clip-Cov and KL-Cov are shown to raise final performance while preserving entropy, the manuscript does not verify whether the new (H,R) points remain on the original fitted curve or exceed the extrapolated ceiling, leaving the causal bottleneck interpretation unsupported by interventional evidence.
- In the theoretical analysis, the covariance term is stated to be proportional to advantage when using Policy Gradient-like algorithms. This creates a risk of circularity because advantage is the quantity that directly drives the policy update; the manuscript should explicitly state the assumptions under which the proportionality holds and clarify whether the observed positivity of the covariance is a derived necessity or an empirical regularity of the specific tasks and runs.
- The empirical study claims an exact match between the covariance term and entropy differences, supporting the theoretical conclusion. However, the claim that the covariance 'stays mostly positive throughout training' (thereby explaining monotonic entropy decrease) is presented as an observation rather than a general result; its dependence on task, model scale, or algorithm variant is not systematically tested, weakening the generality of the entropy-dynamics explanation.
minor comments (2)
- The abstract refers to 'vast RL runs' without providing the number of runs, task diversity, or model scales in the main text or appendix, making it difficult to assess the breadth of the empirical support.
- The parameters a and b in the transformation equation receive no theoretical interpretation beyond being fit coefficients; a brief discussion of their expected range or dependence on task difficulty would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the scope and limitations of our empirical and theoretical claims. We address each major point below, acknowledging where revisions are needed to strengthen the presentation and evidence.
Point-by-point responses
-
Referee: The transformation equation R=-a*e^H+b is presented as an 'empirical law' that 'strongly indicates' performance is traded from entropy and 'bottlenecked by its exhaustion.' This relation is obtained by fitting observed (H,R) pairs along standard RL trajectories; no derivation connects the covariance-driven entropy dynamics to the specific exponential form. Moreover, while Clip-Cov and KL-Cov are shown to raise final performance while preserving entropy, the manuscript does not verify whether the new (H,R) points remain on the original fitted curve or exceed the extrapolated ceiling, leaving the causal bottleneck interpretation unsupported by interventional evidence.
Authors: We agree that the R-H relation is strictly empirical, obtained by fitting trajectories from standard RL runs, and that no derivation from the covariance mechanism to the exponential form is provided. The 'empirical law' phrasing is intended to highlight the observed predictive pattern and its practical implication for a performance ceiling at H=0, rather than a theoretically derived necessity. For the interventions, we acknowledge the manuscript does not include an explicit check of whether the improved (H,R) points from Clip-Cov and KL-Cov lie on or above the original fitted curve. We will revise by adding this analysis, including plots of the new points against the fitted relation and discussion of whether they respect or exceed the predicted ceiling. This will supply the requested interventional evidence.
revision: partial
-
Referee: In the theoretical analysis, the covariance term is stated to be proportional to advantage when using Policy Gradient-like algorithms. This creates a risk of circularity because advantage is the quantity that directly drives the policy update; the manuscript should explicitly state the assumptions under which the proportionality holds and clarify whether the observed positivity of the covariance is a derived necessity or an empirical regularity of the specific tasks and runs.
Authors: We will add explicit clarification in the revised theoretical section. The proportionality arises directly from the policy-gradient update: for a softmax policy, the change in logit for an action is proportional to the advantage estimator times the action probability (under standard REINFORCE or PPO-style estimators with baseline). This holds under the assumptions of the policy-gradient theorem, unbiased advantage estimation, and the specific form of the gradient. The positivity of the covariance is not a mathematical necessity derived from these assumptions alone (negative advantages could in principle produce negative covariance), but rather an empirical regularity observed in our training runs on the reported tasks. We will state this distinction clearly and note that the sign may depend on task structure and the distribution of advantages.
revision: yes
-
Referee: The empirical study claims an exact match between the covariance term and entropy differences, supporting the theoretical conclusion. However, the claim that the covariance 'stays mostly positive throughout training' (thereby explaining monotonic entropy decrease) is presented as an observation rather than a general result; its dependence on task, model scale, or algorithm variant is not systematically tested, weakening the generality of the entropy-dynamics explanation.
Authors: The reported exact numerical match between covariance and entropy change is specific to the experimental setups and figures shown. We agree that the consistent positivity is presented as an observation from those runs rather than a proven general result. We will revise the discussion to explicitly frame the positivity as an empirical finding tied to the tasks and models tested, and to acknowledge that systematic variation across model scales, tasks, or algorithm variants (e.g., different advantage estimators) has not been performed. If space permits, we can include a small number of additional runs with a different model size to illustrate consistency, but a full sweep is beyond the current scope.
revision: partial
- A theoretical derivation connecting the covariance-driven entropy dynamics to the specific exponential form of the R-H relation.
- Comprehensive empirical validation of covariance positivity and the R-H relation across arbitrary tasks, model scales, and algorithm variants.
Circularity Check
Fitted R=-a*e^H+b relation presented as empirical law without derivation from covariance dynamics
specific steps
-
fitted input called prediction
[Abstract]
"In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b."
Parameters a and b are fitted to (H,R) data collected during standard RL training. The claimed trade-off, exhaustion bottleneck, and extrapolated ceiling at H=0 are therefore direct outputs of the fitting process and functional form rather than a derived result from the covariance-based entropy dynamics.
full rationale
The paper derives entropy change as driven by covariance between action probabilities and logit updates (proportional to advantage under PG), with exact empirical match to entropy differences. This part is self-contained. However, the load-bearing claim that performance is 'traded from' entropy with a predictable ceiling at H=0 rests on fitting R=-a*e^H+b to observed (H,R) pairs along standard trajectories and labeling the result an 'empirical law'. No derivation connects the covariance mechanism to this exponential form, so the bottleneck interpretation and ceiling are consequences of the chosen fit rather than independent predictions.
Axiom & Free-Parameter Ledger
free parameters (2)
- a
- b
axioms (1)
- domain assumption: Policy-gradient update direction is proportional to advantage
Lean theorems connected to this paper
-
LawOfExistence: defect_zero_iff_one; existence_economically_inevitable echoes "we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion"
-
Cost: Jcost_pos_of_ne_one; Jcost_symm echoes "the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage"
-
DiscretenessForcing: continuous_no_isolated_zero_defect echoes "the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically"
Forward citations
Cited by 36 Pith papers
-
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem
VaP-CSMV uses a cross-semantic encoder and multi-view decoder to unify DRL solving of HFVRP variants, outperforming prior neural solvers while matching heuristics at much lower inference time and generalizing zero-sho...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
AEM lifts entropy analysis to the response level and uses a derived uncertainty proxy to rescale advantages, enabling better exploration-exploitation balance and consistent gains over RL baselines on agent benchmarks.
-
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.
-
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
-
TIP: Token Importance in On-Policy Distillation
In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
-
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
Covariance-based entropy control selectively regularizes high-covariance tokens in softmax policies and achieves asymptotic unbiasedness upon annealing, unlike traditional regularization which introduces dense bias an...
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.