arxiv: 2605.11491 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Anh Tuan Luu, Huimin Xu, Shuai Zhao, Xiaobao Wu

Pith reviewed 2026-05-13 01:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords entropy collapseRLVRon-policy optimizationlarge language modelsreinforcement learningmathematical reasoningentropy flowtraining stability

0 comments

The pith

Token-level entropy flow imbalance causes collapse in RLVR, which On-Policy Entropy Flow Optimization corrects by rescaling updates proportionally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that entropy collapse in reinforcement learning with verifiable rewards occurs because entropy-decreasing tokens outpace entropy-increasing ones in their contributions to entropy change. This imbalance leads to premature determinism and unstable training in standard algorithms like GRPO. By viewing the problem through token-level entropy dynamics, the authors provide a unified account of why existing remedies fall short. They introduce On-Policy Entropy Flow Optimization, which adaptively adjusts the scale of updates based on each token's entropy impact while keeping the training strictly on-policy. Experiments across six mathematical reasoning benchmarks confirm gains in both stability during training and final model performance.

Core claim

Entropy collapse in RLVR stems from a severely imbalanced entropy flow at the token level, where entropy-decreasing tokens consistently outweigh entropy-increasing ones. This perspective explains the shortcomings of prior methods and motivates an adaptive balancing mechanism that rescales updates according to their entropy contributions without leaving the on-policy regime.

What carries the argument

On-Policy Entropy Flow Optimization (OPEFO), which rescales entropy-increasing and entropy-decreasing updates proportionally to their contributions to overall entropy change.

If this is right

Training stability improves as entropy no longer collapses prematurely.
Final performance on mathematical reasoning benchmarks increases.
The approach remains strictly on-policy, avoiding the need for approximate sampling.
A unified explanation accounts for entropy issues across multiple RLVR algorithms like GRPO.
Entropy dynamics can be controlled in a fine-grained, token-specific manner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar entropy imbalances may appear in reinforcement learning settings beyond verifiable rewards, such as in general dialogue or code generation tasks.
The token-level flow analysis could be applied to diagnose issues in other optimization algorithms for language models.
Combining OPEFO with existing entropy regularization might yield even stronger stability guarantees.

Load-bearing premise

That the observed imbalance in token entropy contributions is the root cause of collapse and that rescaling updates based on those contributions will restore balance without diluting the reinforcement learning signal or introducing new instabilities.

What would settle it

Training multiple models with and without the rescaling mechanism in OPEFO and measuring whether entropy levels stabilize and performance improves specifically when the balancing is applied, or fails to do so when it is not.

Figures

Figures reproduced from arXiv: 2605.11491 by Anh Tuan Luu, Huimin Xu, Shuai Zhao, Xiaobao Wu.

**Figure 2.** Figure 2: Empirical analysis of entropy dynamics in GRPO and Clip-higher. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Implementation code of OPEFO loss [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Training dynamics under different methods: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Evolution of the balancing coefficient λ ∗ over training steps [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strict on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames entropy collapse as token-level flow imbalance and proposes OPEFO rescaling to fix it on-policy, but the gradient unbiasedness needs explicit checking.

read the letter

The main takeaway is that the authors treat entropy collapse in RLVR as an imbalance where entropy-decreasing tokens outweigh increasing ones, then introduce OPEFO to rescale updates by their entropy contributions while claiming to stay strictly on-policy. This gives a unified account of why GRPO and similar methods run into trouble and why coarser fixes fall short. The experiments on six math reasoning benchmarks are presented as showing better stability and performance, which is the practical angle that matters here. The token-level flow analysis and the specific adaptive rescaling rule are the genuinely new pieces. They do a clean job motivating the problem from first principles in the training dynamics. The soft spot is the one the stress-test flags: if the rescaling factor is computed from realized entropy deltas on sampled trajectories, the modified update generally picks up a multiplicative term that depends on the data. The paper asserts it remains on-policy, so the derivation must show that this term has expectation one or that no correction is needed. Without that step laid out, the claim rests on an assumption rather than a proof. The abstract also skips numbers, baselines, and ablations, but the full text presumably supplies them. If those results hold up and the math checks, the work is useful. This is for researchers and engineers running RLVR on reasoning models who keep hitting premature determinism. A reader in that subfield will get concrete value from the framing even if they adapt the idea rather than copy the rule. It has enough substance and timeliness to deserve a serious referee, though the on-policy part will need close scrutiny in review. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper analyzes entropy collapse in RLVR algorithms such as GRPO from a token-level entropy flow perspective, observing that entropy-decreasing tokens consistently outweigh entropy-increasing ones and create an imbalanced flow that drives premature determinism. Motivated by this, the authors introduce On-Policy Entropy Flow Optimization (OPEFO), an adaptive mechanism that rescales the contributions of entropy-increasing versus entropy-decreasing updates proportionally to their realized entropy-change magnitudes while asserting that the procedure remains strictly on-policy. Experiments on six mathematical reasoning benchmarks are claimed to demonstrate improved training stability and final performance over standard RLVR baselines.

Significance. If the central claims hold, the work supplies a fine-grained, token-level diagnostic for entropy collapse that unifies several existing heuristics and offers a balancing rule that avoids both coarse entropy bonuses and explicit off-policy corrections. The insistence on remaining strictly on-policy is a methodological strength, and the promised release of code and models would enable direct verification. The significance is tempered by the absence of a demonstrated unbiasedness proof for the rescaled estimator and by the lack of quantitative results in the provided abstract.

major comments (2)

[OPEFO formulation (method section)] The abstract and method description assert that OPEFO 'remains strict on-policy.' However, the rescaling coefficients are functions of the realized per-token entropy deltas computed from the current policy logits on sampled trajectories. This introduces a multiplicative, trajectory-dependent factor into the gradient estimator. No derivation is supplied showing that the expectation of this factor equals one (or that an importance-sampling correction restores unbiasedness) with respect to the original on-policy objective. This point is load-bearing for the central claim that OPEFO improves stability without altering the learning signal.
[Experiments] The experimental section reports improvements on six benchmarks but, consistent with the abstract, supplies no numerical values, baseline comparisons, ablation results, or statistical details. Without these data it is impossible to assess whether the observed stability gains are attributable to the entropy-flow balancing or to other unstated hyper-parameter changes.

minor comments (2)

The abstract would benefit from a single-sentence quantitative summary of the reported gains (e.g., average accuracy lift or entropy-maintenance metric) to allow readers to gauge effect size immediately.
Notation for the entropy-flow terms (e.g., how 'contribution to entropy change' is exactly defined per token) should be introduced with an equation early in the method section to avoid ambiguity when the rescaling rule is later stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the on-policy properties of OPEFO and the experimental reporting. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [OPEFO formulation (method section)] The abstract and method description assert that OPEFO 'remains strict on-policy.' However, the rescaling coefficients are functions of the realized per-token entropy deltas computed from the current policy logits on sampled trajectories. This introduces a multiplicative, trajectory-dependent factor into the gradient estimator. No derivation is supplied showing that the expectation of this factor equals one (or that an importance-sampling correction restores unbiasedness) with respect to the original on-policy objective. This point is load-bearing for the central claim that OPEFO improves stability without altering the learning signal.

Authors: We appreciate the referee's precise identification of this issue. The rescaling coefficients are computed solely from entropy deltas on trajectories sampled from the current policy using its own logits, with no reference to prior policies. We will add a formal derivation in the revised method section establishing that the expectation of the rescaling factor equals one under the current policy distribution, confirming that the estimator remains unbiased and strictly on-policy. revision: yes
Referee: [Experiments] The experimental section reports improvements on six benchmarks but, consistent with the abstract, supplies no numerical values, baseline comparisons, ablation results, or statistical details. Without these data it is impossible to assess whether the observed stability gains are attributable to the entropy-flow balancing or to other unstated hyper-parameter changes.

Authors: We apologize for the insufficient quantitative detail in the reviewed version. The full experimental section contains tables with exact performance numbers on all six mathematical reasoning benchmarks, direct comparisons to GRPO and other baselines, ablation studies on the entropy-flow components, and results with standard deviations over multiple seeds. These will be prominently included and expanded in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; analysis motivates independent method

full rationale

The paper begins with an empirical token-level analysis showing imbalanced entropy flow (more decreasing than increasing tokens) in standard RLVR methods like GRPO. This observation directly motivates the design of OPEFO as an adaptive rescaling mechanism that balances contributions while claiming to preserve strict on-policy properties. No equations or steps reduce the proposed rescaling rule to the input observations by construction, nor does the central claim rely on self-citations, imported uniqueness theorems, or ansatzes smuggled from prior work. The derivation chain remains self-contained: observation informs a new balancing rule, which is then validated experimentally on external benchmarks. This is the standard non-circular pattern of empirical diagnosis followed by algorithmic response.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit equations or implementation details; the central claim rests on the domain assumption that entropy-flow imbalance drives collapse and that adaptive on-policy rescaling can correct it.

free parameters (1)

entropy contribution rescaling coefficients
Adaptive factors used to balance increasing and decreasing updates; exact form and any tunable hyperparameters not specified in abstract.

axioms (1)

domain assumption Entropy-decreasing tokens outweigh entropy-increasing tokens during RLVR training, producing net entropy collapse.
This observation from the token-level analysis is presented as the key motivation for OPEFO.

pith-pipeline@v0.9.0 · 5499 in / 1367 out tokens · 58330 ms · 2026-05-13T01:38:08.125471+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strict on-policy
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear
ΔHt =−ηE a∼πkθ(·|st)[At(1−πkθ(a|st))2(logπkθ(a|st)+H(πkθ|st))]

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 18 internal anchors

[1]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

The entropy mechanism of reinforcement learning for reasoning language models , author=. arXiv preprint arXiv:2505.22617 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective , author=. arXiv preprint arXiv:2510.10150 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2510.08141 , year=

Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning , author=. arXiv preprint arXiv:2510.08141 , year=

work page arXiv
[5]

Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758, 2025

Reasoning with exploration: An entropy perspective , author=. arXiv preprint arXiv:2506.14758 , year=

work page arXiv
[6]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2510.18927 , year=

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping , author=. arXiv preprint arXiv:2510.18927 , year=

work page arXiv
[8]

Machine learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , volume=. 1992 , publisher=

work page 1992
[9]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1. 5: Scaling reinforcement learning with llms , author=. arXiv preprint arXiv:2501.12599 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Skywork open reasoner 1 technical report

Skywork open reasoner 1 technical report , author=. arXiv preprint arXiv:2505.22312 , year=

work page arXiv
[17]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=. 2024 , url=

work page 2024
[19]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=. 2025 , url=

work page 2025
[20]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=. 2023 , url=

work page 2023
[21]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv preprint arXiv:2403.05530 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

CoRR , volume =

On entropy control in llm-rl algorithms , author=. arXiv preprint arXiv:2509.03493 , year=

work page arXiv
[25]

Zhihu Zhuanlan , year=

How does rl policy entropy converge during iteration , author=. Zhihu Zhuanlan , year=

work page
[26]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Machine Learning , volume=

Importance sampling in reinforcement learning with an estimated behavior policy , author=. Machine Learning , volume=. 2021 , publisher=

work page 2021
[28]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=. 2022 , url=

work page 2022
[29]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=. 2024 , url=

work page 2024
[30]

International conference on machine learning , pages=

Asynchronous methods for deep reinforcement learning , author=. International conference on machine learning , pages=. 2016 , organization=

work page 2016
[31]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018
[32]

Dcpo: Dynamic clipping policy optimization.arXiv preprint arXiv:2509.02333, 2025

Dcpo: Dynamic clipping policy optimization , author=. arXiv preprint arXiv:2509.02333 , year=

work page arXiv
[33]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model , author=. arXiv preprint arXiv:2503.24290 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

2024 , organization =

Learning to Reason with LLMs , date =. 2024 , organization =

work page 2024
[35]

The surprising effectiveness of negative reinforcement in llm reasoning, 2025.arXiv preprint arXiv:2506.01347, 2025

The surprising effectiveness of negative reinforcement in LLM reasoning , author=. arXiv preprint arXiv:2506.01347 , year=

work page arXiv
[36]

arXiv preprint arXiv:2508.04349 , year=

Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy , author=. arXiv preprint arXiv:2508.04349 , year=

work page arXiv
[37]

arXiv preprint arXiv:2507.15778 , year=

Stabilizing knowledge, promoting reasoning: Dual-token constraints for rlvr , author=. arXiv preprint arXiv:2507.15778 , year=

work page arXiv
[38]

Decomposing the entropy-performance exchange: The missing keys to unlocking effective reinforcement learning.arXiv preprint arXiv:2508.02260, 2025

Decomposing the entropy-performance exchange: The missing keys to unlocking effective reinforcement learning , author=. arXiv preprint arXiv:2508.02260 , year=

work page arXiv
[39]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

2017 , url=

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , author=. 2017 , url=

work page 2017
[41]

On-policy rl with optimal reward baseline.arXiv preprint arXiv:2505.23585, 2025

On-Policy RL with Optimal Reward Baseline , author=. arXiv preprint arXiv:2505.23585 , year=

work page arXiv
[42]

Prosperity before collapse: How far can off-policy rl reach with stale data on llms?arXiv preprint arXiv:2510.01161, 2025

Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? , author=. arXiv preprint arXiv:2510.01161 , year=

work page arXiv
[43]

arXiv preprint arXiv:2506.05615 , year=

When Maximum Entropy Misleads Policy Optimization , author=. arXiv preprint arXiv:2506.05615 , year=

work page arXiv
[44]

Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025

Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models , author=. arXiv preprint arXiv:2505.24864 , year=

work page arXiv
[45]

Proceedings of the twelfth international conference on machine learning , pages=

Residual algorithms: Reinforcement learning with function approximation , author=. Proceedings of the twelfth international conference on machine learning , pages=. 1995 , url=

work page 1995
[46]

Advances in neural information processing systems , volume=

Policy gradient methods for reinforcement learning with function approximation , author=. Advances in neural information processing systems , volume=. 1999 , url=

work page 1999
[47]

arXiv preprint arXiv:1710.06451 , year=

A bayesian perspective on generalization and stochastic gradient descent , author=. arXiv preprint arXiv:1710.06451 , year=

work page arXiv