pith. machine review for the scientific record.

arXiv: 2511.05993 · v3 · submitted 2025-11-08 · 💻 cs.CL · cs.AI · cs.LG

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Pith reviewed 2026-05-17 23:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: entropy collapse, reinforcement learning, large language models, reasoning models, positive advantage, loss reweighting, RLVR, response diversity

The pith

Tokens with positive advantages drive entropy collapse during RLVR training of large reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies why the entropy of large language models tends to collapse during reinforcement learning with verifiable rewards, which limits response diversity and traps models in suboptimal solutions. Experiments across multiple benchmarks reveal that clipping thresholds, the number of off-policy updates, and training data diversity all shape entropy behavior. The central result is that tokens carrying positive advantages account for most of the collapse. To counter this, the authors introduce Positive-Advantage Reweighting, which scales down the loss weight on those tokens to keep entropy higher while preserving competitive reasoning performance.
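
To make "entropy" concrete, here is a minimal sketch of how token-level entropy could be measured for one response. It assumes the standard Shannon entropy of the next-token distribution, averaged over response positions; the function names and the 4-token toy vocabulary are illustrative and not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_response_entropy(per_token_logits):
    """Average token-level entropy over the positions of one response."""
    entropies = [token_entropy(softmax(z)) for z in per_token_logits]
    return sum(entropies) / len(entropies)

# Toy 4-token vocabulary: a flat distribution versus a sharply peaked one.
flat = [[0.0, 0.0, 0.0, 0.0]]
peaked = [[8.0, 0.0, 0.0, 0.0]]
print(mean_response_entropy(flat))    # log(4) nats, the maximum for 4 tokens
print(mean_response_entropy(peaked))  # close to zero: a "collapsed" distribution
```

A policy whose distributions all look like `peaked` samples nearly the same token every time, which is the loss of response diversity the summary describes.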

Core claim

Tokens with positive advantages are the primary drivers of entropy collapse. By reweighting the loss contributions of these tokens during RLVR training, model entropy can be regulated to avoid premature convergence to local minima, with the approach maintaining overall optimization stability and final performance.

What carries the argument

Positive-Advantage Reweighting, a loss adjustment that scales the contribution of tokens with positive advantages to control entropy decay.
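
The idea can be sketched as a plain token-level policy-gradient loss in which only positive-advantage tokens are rescaled. This page does not give the paper's exact formula, so everything below is an assumption made for illustration: the function name, the single scalar `pos_weight`, and the omission of importance ratios and clipping.

```python
def reweighted_pg_loss(log_probs, advantages, pos_weight):
    """Token-averaged policy-gradient loss with positive-advantage tokens rescaled.

    log_probs:  log pi(token) for each sampled token
    advantages: advantage assigned to each token
    pos_weight: hypothetical scalar applied only where advantage > 0
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    total = 0.0
    for logp, adv in zip(log_probs, advantages):
        weight = pos_weight if adv > 0 else 1.0
        # Minimizing -adv * logp raises the log-probability of adv > 0 tokens;
        # pos_weight < 1 shrinks that push, leaving adv <= 0 tokens untouched.
        total += -weight * adv * logp
    return total / len(log_probs)

log_probs = [-0.1, -2.0, -0.5]
advantages = [1.0, -1.0, 0.5]
print(reweighted_pg_loss(log_probs, advantages, pos_weight=1.0))  # unweighted baseline
print(reweighted_pg_loss(log_probs, advantages, pos_weight=0.5))  # positive tokens halved
```

How `pos_weight` would be chosen over training is left open here. The Figure 5 legend names stage-based, epoch-wise, and entropy-guided variants, but this page does not describe how they set the weight.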

If this is right

  • Adjusting clipping thresholds can slow entropy collapse.
  • Limiting off-policy updates helps retain higher entropy.
  • More diverse training data sustains response variety.
  • Reweighting positive-advantage tokens prevents collapse while keeping final performance competitive.
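
For readers unfamiliar with the clipping thresholds mentioned in the first bullet, the sketch below shows the standard PPO-style clipped surrogate for a single token, with separate lower and upper thresholds. It is the generic textbook form, not necessarily the exact objective used in the paper, and the names `eps_low` and `eps_high` are assumptions.

```python
def clipped_token_objective(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped surrogate for one token.

    ratio: pi_new(token) / pi_old(token), the importance ratio
    eps_low, eps_high: lower and upper clipping thresholds
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

def is_clipped(ratio, advantage, eps_low, eps_high):
    """True when the clip removes this token's gradient."""
    if advantage > 0:
        return ratio > 1.0 + eps_high
    if advantage < 0:
        return ratio < 1.0 - eps_low
    return False

# A positive-advantage token whose probability has already grown by 25%.
print(is_clipped(1.25, 1.0, eps_low=0.2, eps_high=0.2))   # True: no further gradient
print(is_clipped(1.25, 1.0, eps_low=0.2, eps_high=0.28))  # False: still updated
```

The thresholds decide which tokens keep receiving gradient after the policy has moved away from the one that generated the samples, which is one way they could influence entropy.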

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reweighting idea might extend to other RL objectives that do not rely on verifiable rewards.
  • Sustained entropy could improve model calibration on tasks requiring uncertainty estimates.
  • Combining this method with data-curation strategies might further increase reasoning diversity.

Load-bearing premise

Reweighting loss contributions for positive-advantage tokens preserves overall optimization stability and final performance without introducing new biases or training instabilities.

What would settle it

An experiment showing that Positive-Advantage Reweighting still produces entropy collapse or lowers benchmark scores compared with standard RLVR would disprove the central claim.
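
One way such a test could be written down is sketched below. The decision rule, the function names, and both thresholds are hypothetical, and the numbers in the usage lines are made up for illustration. The entropy-to-initial-entropy ratio follows the quantity named in the captions of Figures 8 and 20.

```python
def entropy_ratio(entropy_trace):
    """Ratio of each logged entropy value to the entropy before training."""
    initial = entropy_trace[0]
    if initial <= 0:
        raise ValueError("initial entropy must be positive")
    return [h / initial for h in entropy_trace]

def challenges_claim(reweighted_trace, baseline_score, reweighted_score,
                     collapse_ratio=0.1, score_tolerance=0.0):
    """Hypothetical decision rule: True if either failure condition is met.

    collapse_ratio and score_tolerance are illustrative thresholds, not
    values taken from the paper.
    """
    still_collapses = entropy_ratio(reweighted_trace)[-1] <= collapse_ratio
    scores_drop = reweighted_score < baseline_score - score_tolerance
    return still_collapses or scores_drop

# Made-up toy numbers, for illustration only.
print(challenges_claim([0.8, 0.6, 0.5], baseline_score=0.40, reweighted_score=0.41))
print(challenges_claim([0.8, 0.2, 0.04], baseline_score=0.40, reweighted_score=0.41))
```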

Figures

Figures reproduced from arXiv: 2511.05993 by Deyi Xiong, Jian Luan, Pengzhi Gao, Renren Jin, Tongxuan Zhang, Wei Liu, Wuwei Huang, Yuqi Ren, Zhuowen Han.

Figure 1: Evolution of LLM entropy and Avg@64 per …
Figure 2: Distribution of log probabilities for correct …
Figure 3: Evolution of LLM entropy during RLVR train…
Figure 4: Evolution of entropy and training rewards …
Figure 5: Entropy evolution across various methods (axes: Training Steps, Entropy). Legend: GRPO; Ada-Ent-Reg ( = 0.2); Ada-Ent-Reg ( = 0.3657); Clip-Cov; KL-Cov; Entropy-Adv; GRPO (Adv 0); GRPO (Adv 0); Rand-Pos-Clip; Pos-Adv-Reweight (Stage-based); Pos-Adv-Reweight (Epoch-wise); Pos-Adv-Reweight (Entropy-guided) …
Figure 6: Evolution of entropy (solid lines) and N-gram …
Figure 8: Ratio of the prompt entropy of Qwen2.5-Math …
Figure 9: Scatter plot of prompt entropy versus accuracy …
Figure 10: Spearman’s rank correlation coefficients be…
Figure 12: Evolution of entropy and training rewards …
Figure 13: Evolution of the negative entropy change and the covariance terms during GRPO training of Qwen2.5-…
Figure 14: Evolution of the negative entropy change and the covariance terms during GRPO training of Qwen2.5-…
Figure 15: Evolution of the negative entropy change and the covariance terms during GRPO training of Qwen2.5-…
Figure 16: Evolution of the negative entropy change and the covariance terms during GRPO training of Qwen2.5-…
Figure 17: Evolution of the negative entropy change and the covariance terms during GRPO training augmented …
Figure 18: Evolution of entropy (solid lines) and N…
Figure 20: Ratio of the entropy of Llama-3.1-8B-Instruct at different training steps to its initial entropy prior to training, under varying numbers of off-policy updates.
[Plot residue: axes Entropy and Accuracy; Correlation: 0.2078; legend: AIME 2024, AIME 2025, MATH500]
Figure 21: Scatter plot of accuracy versus entropy for …
Figure 23: Distribution of log probabilities for correct …
Figure 25: Evolution of entropy for Llama-3.1-8B-Instruct trained with GRPO and its variants …
Original abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training data. Furthermore, through both theoretical analysis and empirical validation, we demonstrate that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this insight, we propose Positive-Advantage Reweighting, a simple yet effective approach that regulates model entropy by adjusting the loss weights assigned to tokens with positive advantages during RLVR training, while maintaining competitive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines entropy collapse during reinforcement learning with verifiable rewards (RLVR) for large language models. It reports experiments correlating entropy with response diversity, calibration, and benchmark performance, identifies three influencing factors (clipping thresholds, off-policy update count, and training data diversity), and claims via theoretical decomposition and empirical validation that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this, the authors propose Positive-Advantage Reweighting to adjust loss weights on positive-advantage tokens and thereby regulate entropy while preserving competitive performance.

Significance. If the central claim holds after proper isolation of effects, the work supplies a useful mechanistic account of entropy dynamics in RLVR and a lightweight practical intervention. The combination of theoretical analysis with extensive experiments across benchmarks is a strength; the reweighting method is simple enough to be adopted if shown to avoid new instabilities.

major comments (2)
  1. [Empirical validation] Empirical sections: aggregate entropy curves are shown while clipping thresholds and off-policy steps are varied together. The claim that positive-advantage tokens dominate requires a ceteris-paribus ablation that holds clipping and update count fixed while zeroing or sign-flipping advantages on the positive subset; without it the primary-driver designation remains vulnerable to the three confounding factors the paper itself lists.
  2. [Theoretical analysis] Theoretical analysis: the decomposition of the policy-gradient entropy term by sign(advantage) is presented, yet no quantitative comparison (e.g., relative magnitude bounds or per-token contribution ratios) demonstrates that the positive-advantage component exceeds the negative-advantage and clipping/off-policy contributions. This step is load-bearing for the 'primary drivers' assertion.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'extensive experiments' would be clearer if the specific benchmarks, number of runs, and statistical reporting (e.g., standard errors) were mentioned.
  2. [Method] Notation: the definition of advantage and the exact reweighting formula should be stated once in a single equation block rather than scattered across text and pseudocode.
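
The ablation requested in major comment 1 could be set up roughly as follows. This is a hedged sketch: the helper name, the `mode` values, and the reference configuration are hypothetical and only illustrate "hold clipping and update count fixed while zeroing or sign-flipping advantages on the positive subset".

```python
# Hypothetical reference values, held fixed across all ablation arms.
FIXED_CONFIG = {"eps_low": 0.2, "eps_high": 0.2, "off_policy_updates": 1}

def ablate_positive_advantages(advantages, mode):
    """Return a copy of advantages with the positive subset zeroed or sign-flipped.

    mode: "zero" removes the positive-advantage signal entirely;
          "flip" turns each positive advantage into its negative.
    Non-positive advantages are passed through unchanged.
    """
    if mode not in ("zero", "flip"):
        raise ValueError("mode must be 'zero' or 'flip'")
    ablated = []
    for adv in advantages:
        if adv > 0:
            ablated.append(0.0 if mode == "zero" else -adv)
        else:
            ablated.append(adv)
    return ablated

advantages = [1.0, -0.5, 0.0, 2.0]
print(ablate_positive_advantages(advantages, "zero"))  # [0.0, -0.5, 0.0, 0.0]
print(ablate_positive_advantages(advantages, "flip"))  # [-1.0, -0.5, 0.0, -2.0]
```

Each arm would then be trained with `FIXED_CONFIG` unchanged, so that any difference in the entropy trajectory can be attributed to the advantage transform alone.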

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications on our empirical and theoretical support for the role of positive-advantage tokens while committing to targeted revisions that strengthen the isolation of effects and quantitative evidence.

Point-by-point responses
  1. Referee: [Empirical validation] Empirical sections: aggregate entropy curves are shown while clipping thresholds and off-policy steps are varied together. The claim that positive-advantage tokens dominate requires a ceteris-paribus ablation that holds clipping and update count fixed while zeroing or sign-flipping advantages on the positive subset; without it the primary-driver designation remains vulnerable to the three confounding factors the paper itself lists.

    Authors: We agree that simultaneous variation of factors in some aggregate plots leaves room for stronger isolation. Within individual training runs we already condition entropy statistics on advantage sign while holding other hyperparameters constant for that run, which provides partial control. To directly respond to the request, we will add a dedicated ablation section in which clipping thresholds and off-policy update counts are fixed at reference values while we zero or sign-flip the advantages of the positive-advantage token subset only. The resulting entropy trajectories and performance metrics will be reported to confirm the dominant contribution of positive-advantage tokens. These new results will appear in the revised manuscript.

    Revision: yes

  2. Referee: [Theoretical analysis] Theoretical analysis: the decomposition of the policy-gradient entropy term by sign(advantage) is presented, yet no quantitative comparison (e.g., relative magnitude bounds or per-token contribution ratios) demonstrates that the positive-advantage component exceeds the negative-advantage and clipping/off-policy contributions. This step is load-bearing for the 'primary drivers' assertion.

    Authors: The decomposition isolates the entropy term according to sign(advantage) and shows that positive-advantage tokens produce updates that systematically reduce entropy. We acknowledge that explicit numerical comparisons were not included. In the revision we will compute and tabulate the relative magnitudes of the positive- versus negative-advantage components at multiple training checkpoints, together with per-token contribution ratios and direct comparisons against the magnitude of clipping and off-policy effects. These quantitative results will be added to the theoretical section to substantiate the primary-driver claim.

    Revision: yes
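
A toy version of the per-sign comparison discussed in this exchange is sketched below. It assumes a softmax policy over a 4-token vocabulary and applies one gradient-ascent step on advantage × log π(token) directly to the logits, using the identity ∂ log π(a)/∂z_j = 1[j = a] − π_j. The function names, the learning rate, and the samples are illustrative; it is not the paper's decomposition.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_on_token(logits, token, advantage, lr):
    """One gradient-ascent step on advantage * log pi(token) w.r.t. the logits."""
    probs = softmax(logits)
    return [z + lr * advantage * ((1.0 if j == token else 0.0) - p)
            for j, (z, p) in enumerate(zip(logits, probs))]

def entropy_change_by_sign(logits, samples, lr):
    """Sum single-step entropy changes separately for adv > 0 and adv < 0 tokens."""
    base = entropy(softmax(logits))
    totals = {"positive": 0.0, "negative": 0.0}
    for token, adv in samples:
        if adv == 0:
            continue  # zero-advantage tokens produce no update
        new_entropy = entropy(softmax(step_on_token(logits, token, adv, lr)))
        totals["positive" if adv > 0 else "negative"] += new_entropy - base
    return totals

logits = [2.0, 1.0, 0.0, 0.0]  # token 0 is already the most likely
print(entropy_change_by_sign(logits, [(0, 1.0), (0, -1.0)], lr=0.5))
print(entropy_change_by_sign(logits, [(2, 1.0)], lr=0.5))
```

In this toy setup, rewarding the already-likely token lowers entropy and penalizing it raises entropy, while rewarding a rare token (the second call) raises entropy. The sign of the advantage alone therefore does not fix the direction, which is why the per-token contribution ratios the referee asks for matter; the toy says nothing about which case dominates in real training.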

Circularity Check

0 steps flagged

No significant circularity; theoretical decomposition and empirical factors are independently validated

full rationale

The paper's central claim, that positive-advantage tokens drive entropy collapse, is supported by a combination of theoretical analysis of the policy-gradient entropy term and separate empirical experiments that vary clipping thresholds, off-policy update counts, and data diversity. No equation or derivation reduces the result to a fitted parameter or self-referential definition by construction. The three confounding factors are explicitly identified and controlled in the experimental design rather than assumed away, and the Positive-Advantage Reweighting method is presented as a motivated intervention rather than a tautological restatement of the observations. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, background axioms, or newly postulated entities; the reweighting is presented as a loss adjustment rather than an invented construct.

pith-pipeline@v0.9.0 · 5519 in / 1033 out tokens · 35838 ms · 2026-05-17T23:28:58.499397+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

  2. Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    cs.LG 2026-02 unverdicted novelty 6.0

    Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
