arxiv: 2505.22617 · v1 · submitted 2025-05-28 · 💻 cs.LG · cs.AI · cs.CL

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Bowen Zhou, Ganqu Cui, Hao Peng, Haozhan Li, Huayu Chen, Jiacheng Chen, Lei Bai, Lifan Yuan, Ning Ding, Wanli Ouyang, Weize Chen, Yuchen Fan, Yu Cheng, Yuchen Zhang, Yuxin Zuo, Zhi Wang, Zhiyuan Liu

Pith reviewed 2026-05-12 12:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · policy entropy · language models · reasoning · entropy collapse · policy gradient · exploration

The pith

Reinforcement learning performance for reasoning language models is traded directly from policy entropy and saturates when entropy reaches zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that policy entropy in RL for reasoning LLMs collapses early in training, and argues that this causes performance to plateau because gains come at the expense of entropy. An empirical fit shows that reward R follows R = -a * e^H + b, so the predicted performance ceiling is R = -a + b, reached when entropy hits zero. The entropy decline is driven by the covariance between token probabilities and logit updates, which remains mostly positive and therefore produces a monotonic entropy loss. The authors then introduce two methods, Clip-Cov and KL-Cov, that restrict updates on high-covariance tokens to keep entropy from collapsing and thereby reach higher final performance.

Core claim

The authors establish that downstream performance in these RL setups is traded from policy entropy according to the equation R = -a * e^H + b, with the covariance between action probabilities and logit changes serving as the driver of entropy reduction. Measured values of the covariance term match the observed entropy differences, and its mostly positive sign explains the observed collapse. Managing entropy through targeted restrictions on high-covariance tokens enables better scaling of compute in RL for reasoning.
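Because the relation R = -a * e^H + b is linear in x = e^H, its two coefficients can be recovered with ordinary least squares. The sketch below is only an illustration of that arithmetic: the function names, the sample entropies, and the values a = 0.2 and b = 0.9 are made up here and do not come from the paper.

```python
import math

def fit_entropy_reward(entropies, rewards):
    """Least-squares fit of R = -a * exp(H) + b.

    The model is linear in x = exp(H): R = b - a * x, so ordinary
    least squares on (x, R) recovers a and b in closed form.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    slope = sxr / sxx                 # slope of R against exp(H), equals -a
    a = -slope
    b = mean_r - slope * mean_x       # intercept of the line in x = exp(H)
    return a, b

def predicted_ceiling(a, b):
    """Reward predicted at H = 0, where exp(H) = 1."""
    return -a + b

# Synthetic, noise-free points generated from made-up coefficients.
true_a, true_b = 0.2, 0.9
hs = [1.2, 0.9, 0.6, 0.3, 0.1]
rs = [-true_a * math.exp(h) + true_b for h in hs]

a, b = fit_entropy_reward(hs, rs)
ceiling = predicted_ceiling(a, b)
print(f"a={a:.4f}  b={b:.4f}  ceiling at H=0: {ceiling:.4f}")
```

On real training logs the points would be noisy, and the quality of the fit (not just the coefficients) is what indicates whether the exponential form is a reasonable description.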

What carries the argument

The covariance between action probability and the change in logits, which drives the reduction in policy entropy and is proportional to advantage under policy-gradient updates.
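The first-order relation behind this statement can be written out for a softmax policy at a single state. What follows is a standard Taylor expansion, offered as a sketch rather than a transcription of the paper's proof. The symbols z (logits), \Delta z (logit change in one update), \eta (step size), and A (advantage) are notation introduced here, and the expansion yields a covariance with \log \pi rather than with \pi itself.

```latex
% Softmax policy over actions a at a fixed state, with logits z_a.
\pi_a = \frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}, \qquad
H(\pi) = -\sum_a \pi_a \log \pi_a .

% Gradient of the entropy with respect to one logit z_j,
% using \partial \pi_a / \partial z_j = \pi_a (\delta_{aj} - \pi_j):
\frac{\partial H}{\partial z_j}
  = -\pi_j \Bigl( \log \pi_j - \mathbb{E}_{a \sim \pi}[\log \pi_a] \Bigr).

% First-order change in entropy for a small logit update \Delta z:
\Delta H \approx \sum_j \frac{\partial H}{\partial z_j}\, \Delta z_j
  = -\operatorname{Cov}_{a \sim \pi}\bigl( \log \pi_a,\ \Delta z_a \bigr).

% If the update gives \Delta z_a = \eta A_a (an assumption about the update rule),
% then \Delta H \approx -\eta \operatorname{Cov}_{a \sim \pi}( \log \pi_a,\ A_a ).
```

Read this way, a positive covariance means likely actions are receiving the larger logit increases, which sharpens the distribution and lowers entropy.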

If this is right

Without entropy-preserving interventions, RL training for reasoning will reach a hard performance ceiling once entropy is exhausted.
Restricting updates on high-covariance tokens via clipping or KL penalty maintains exploration and raises final reward.
The monotonic entropy drop is caused by the persistently positive covariance term.
The entropy-performance relation makes the training ceiling predictable before entropy collapses.
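The idea of restricting updates on high-covariance tokens can be sketched as follows. This is a simplified illustration, not the authors' implementation: the per-token covariance is assumed to be the product of centered log-probability and centered advantage, the selection rule is assumed to be a plain top-fraction cut, and zeroing a token's loss stands in for removing it from the gradient. All function names and numbers are hypothetical.

```python
def token_covariances(logprobs, advantages):
    """Per-token centered product (assumed definition of token covariance)."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv)
            for lp, adv in zip(logprobs, advantages)]

def top_fraction_mask(covs, fraction):
    """Mark the given fraction of tokens with the largest covariance."""
    k = max(1, int(len(covs) * fraction))
    ranked = sorted(range(len(covs)), key=lambda i: covs[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(covs))]

def restricted_loss(token_losses, kl_terms, mask, mode, kl_coef=1.0):
    """Mean loss after restricting the selected tokens.

    mode="clip": selected tokens contribute nothing (stand-in for
                 excluding them from the gradient).
    mode="kl":   selected tokens get an extra KL penalty term.
    """
    total = 0.0
    for loss, kl, selected in zip(token_losses, kl_terms, mask):
        if not selected:
            total += loss
        elif mode == "clip":
            total += 0.0
        elif mode == "kl":
            total += loss + kl_coef * kl
        else:
            raise ValueError(f"unknown mode: {mode}")
    return total / len(token_losses)

# Made-up batch of six tokens.
logprobs = [-0.05, -0.10, -2.30, -1.60, -0.20, -3.00]
advantages = [1.0, 0.8, -0.5, -0.7, 0.9, -1.0]
covs = token_covariances(logprobs, advantages)
mask = top_fraction_mask(covs, fraction=0.5)

token_losses = [0.1] * 6   # placeholder policy-gradient loss per token
kl_terms = [0.2] * 6       # placeholder KL to a reference policy per token
clip_loss = restricted_loss(token_losses, kl_terms, mask, mode="clip")
kl_loss = restricted_loss(token_losses, kl_terms, mask, mode="kl")
print(mask, clip_loss, kl_loss)
```

In an autograd framework the "clip" branch would typically detach the selected tokens instead of zeroing their loss; the fraction and the penalty coefficient are tuning choices that this summary does not specify.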

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same covariance-driven mechanism may underlie the need for entropy bonuses in other RL settings such as RLHF.
Covariance-based token selection could be tested in non-language RL domains where entropy collapse also occurs.
The R-H relation could be checked on larger models or additional reasoning benchmarks to assess its generality.
These controls might be combined with existing clipping or regularization techniques for additive gains.

Load-bearing premise

The covariance term between token probability and logit change remains mostly positive throughout training.
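Why the sign of this term matters can be checked numerically on a toy softmax policy. The sketch below assumes the first-order relation "entropy change ≈ minus the covariance between log-probability and logit change" and a logit update equal to step size times advantage; both are simplifying assumptions for illustration, and the logits and advantages are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs)

def cov_logp_delta(probs, delta):
    """Covariance of log-probability and logit change under the policy."""
    mean_lp = sum(p * math.log(p) for p in probs)
    mean_d = sum(p * d for p, d in zip(probs, delta))
    return sum(p * (math.log(p) - mean_lp) * (d - mean_d)
               for p, d in zip(probs, delta))

def entropy_change(logits, advantages, step):
    """Return (actual entropy change, first-order prediction)."""
    probs = softmax(logits)
    delta = [step * a for a in advantages]      # assumed update rule
    new_probs = softmax([z + d for z, d in zip(logits, delta)])
    actual = entropy(new_probs) - entropy(probs)
    predicted = -cov_logp_delta(probs, delta)
    return actual, predicted

logits = [2.0, 0.5, 0.0, -1.0]

# Likely action has the highest advantage: positive covariance.
sharpen = entropy_change(logits, [1.0, -0.5, -0.5, -1.0], step=0.01)

# Unlikely action has the highest advantage: negative covariance.
flatten = entropy_change(logits, [-1.0, 0.0, 0.5, 1.0], step=0.01)

print("likely action rewarded:   actual=%.6f predicted=%.6f" % sharpen)
print("unlikely action rewarded: actual=%.6f predicted=%.6f" % flatten)
```

The first case lowers entropy and the second raises it, which illustrates that positivity is a property of how advantages line up with probabilities in a given run, not something the update rule guarantees.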

What would settle it

A training run in which performance continues to rise substantially after entropy has fallen near zero, or in which the observed R versus H curve deviates from the fitted exponential form.
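Such a check could be automated along the following lines. This is a sketch under stated assumptions: a and b are taken as already fitted on a baseline run, the tolerance is an arbitrary choice, and every number below is hypothetical.

```python
import math

def curve_reward(entropy, a, b):
    """Reward predicted by R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def audit_points(points, a, b, tolerance):
    """Compare (entropy, reward) pairs against the fitted curve and ceiling."""
    ceiling = curve_reward(0.0, a, b)   # equals -a + b
    report = []
    for entropy, reward in points:
        residual = reward - curve_reward(entropy, a, b)
        report.append({
            "entropy": entropy,
            "reward": reward,
            "residual": residual,
            "off_curve": abs(residual) > tolerance,
            "above_ceiling": reward > ceiling + tolerance,
        })
    return ceiling, report

a, b = 0.2, 0.9   # hypothetical coefficients from a baseline fit
points = [
    (0.50, 0.57),  # on the curve
    (0.05, 0.69),  # near the ceiling, still on the curve
    (0.40, 0.75),  # higher reward than the curve allows at any entropy
]
ceiling, report = audit_points(points, a, b, tolerance=0.02)
for row in report:
    print(row)
```

Points flagged as above the ceiling would count against the hard-bottleneck reading; points that are off the curve but below the ceiling would only question the exponential form.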

Original abstract

This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The covariance derivation for entropy change is the useful piece; the fitted R-H curve is just a descriptive fit without causal checks.

The paper's core contribution is showing that entropy collapse in RL for reasoning LLMs tracks a covariance between token probabilities and logit updates, which stays mostly positive and drives the drop. The authors match this term to observed entropy differences on their runs and then build Clip-Cov and KL-Cov to limit updates on high-covariance tokens. Those interventions keep entropy from crashing and lift final performance on the tasks they test. That part is straightforward and directly actionable for anyone running similar PPO-style loops on LLMs. The empirical match between the covariance term and entropy change is the cleanest result here; it gives a mechanistic handle instead of just adding another entropy bonus.

The R = -a*e^H + b relation, however, is obtained by fitting the observed trajectory points. Nothing in the derivation forces the exponential form, and the paper does not test whether the new points from Clip-Cov or KL-Cov still lie on the same curve or push past the extrapolated H=0 ceiling. Without that check the relation stays correlational rather than evidence of a hard bottleneck. The positivity of the covariance is also reported as consistent in their runs but not shown to be inevitable across tasks or algorithms.

This work is aimed at people who train reasoning models with RL and already see entropy collapse in their logs. It gives two lightweight patches worth trying and a partial explanation for why the collapse happens. The covariance analysis and the interventions are solid enough to merit referee time even if the performance law needs tighter validation.

Referee Report

3 major / 2 minor

Summary. The paper claims that entropy collapse during RL training of reasoning LLMs limits performance, and establishes an empirical transformation equation R=-a*e^H+b between policy entropy H and downstream reward R, indicating that performance is traded from entropy and bottlenecked by its exhaustion with a predictable ceiling at H=0. It derives theoretically that entropy dynamics are driven by the covariance between action probability and logit change (proportional to advantage under policy-gradient updates), shows empirical match between this covariance and observed entropy differences, notes that the covariance remains mostly positive (explaining monotonic entropy decrease), and proposes Clip-Cov and KL-Cov interventions that restrict updates on high-covariance tokens to encourage exploration and improve final performance.

Significance. If the fitted R-H relation proves robust and the interventions are shown to respect or exceed the predicted curve, the work could meaningfully advance practical entropy management in RL for LLMs, addressing a recurring scaling obstacle. The covariance-based mechanistic account of entropy change is a clear strength that could guide future algorithm design, provided its generality is established beyond the reported runs.

major comments (3)

The transformation equation R=-a*e^H+b is presented as an 'empirical law' that 'strongly indicates' performance is traded from entropy and 'bottlenecked by its exhaustion.' This relation is obtained by fitting observed (H,R) pairs along standard RL trajectories; no derivation connects the covariance-driven entropy dynamics to the specific exponential form. Moreover, while Clip-Cov and KL-Cov are shown to raise final performance while preserving entropy, the manuscript does not verify whether the new (H,R) points remain on the original fitted curve or exceed the extrapolated ceiling, leaving the causal bottleneck interpretation unsupported by interventional evidence.
In the theoretical analysis, the covariance term is stated to be proportional to advantage when using Policy Gradient-like algorithms. This creates a risk of circularity because advantage is the quantity that directly drives the policy update; the manuscript should explicitly state the assumptions under which the proportionality holds and clarify whether the observed positivity of the covariance is a derived necessity or an empirical regularity of the specific tasks and runs.
The empirical study claims an exact match between the covariance term and entropy differences, supporting the theoretical conclusion. However, the claim that the covariance 'stays mostly positive throughout training' (thereby explaining monotonic entropy decrease) is presented as an observation rather than a general result; its dependence on task, model scale, or algorithm variant is not systematically tested, weakening the generality of the entropy-dynamics explanation.

minor comments (2)

The abstract refers to 'vast RL runs' without providing the number of runs, task diversity, or model scales in the main text or appendix, making it difficult to assess the breadth of the empirical support.
The parameters a and b in the transformation equation receive no theoretical interpretation beyond being fit coefficients; a brief discussion of their expected range or dependence on task difficulty would improve clarity.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the scope and limitations of our empirical and theoretical claims. We address each major point below, acknowledging where revisions are needed to strengthen the presentation and evidence.

Referee: The transformation equation R=-a*e^H+b is presented as an 'empirical law' that 'strongly indicates' performance is traded from entropy and 'bottlenecked by its exhaustion.' This relation is obtained by fitting observed (H,R) pairs along standard RL trajectories; no derivation connects the covariance-driven entropy dynamics to the specific exponential form. Moreover, while Clip-Cov and KL-Cov are shown to raise final performance while preserving entropy, the manuscript does not verify whether the new (H,R) points remain on the original fitted curve or exceed the extrapolated ceiling, leaving the causal bottleneck interpretation unsupported by interventional evidence.

Authors: We agree that the R-H relation is strictly empirical, obtained by fitting trajectories from standard RL runs, and that no derivation from the covariance mechanism to the exponential form is provided. The 'empirical law' phrasing is intended to highlight the observed predictive pattern and its practical implication for a performance ceiling at H=0, rather than a theoretically derived necessity. For the interventions, we acknowledge the manuscript does not include an explicit check of whether the improved (H,R) points from Clip-Cov and KL-Cov lie on or above the original fitted curve. We will revise by adding this analysis, including plots of the new points against the fitted relation and discussion of whether they respect or exceed the predicted ceiling. This will supply the requested interventional evidence.
revision: partial
Referee: In the theoretical analysis, the covariance term is stated to be proportional to advantage when using Policy Gradient-like algorithms. This creates a risk of circularity because advantage is the quantity that directly drives the policy update; the manuscript should explicitly state the assumptions under which the proportionality holds and clarify whether the observed positivity of the covariance is a derived necessity or an empirical regularity of the specific tasks and runs.

Authors: We will add explicit clarification in the revised theoretical section. The proportionality arises directly from the policy-gradient update: for a softmax policy, the change in logit for an action is proportional to the advantage estimator times the action probability (under standard REINFORCE or PPO-style estimators with baseline). This holds under the assumptions of the policy-gradient theorem, unbiased advantage estimation, and the specific form of the gradient. The positivity of the covariance is not a mathematical necessity derived from these assumptions alone (negative advantages could in principle produce negative covariance), but rather an empirical regularity observed in our training runs on the reported tasks. We will state this distinction clearly and note that the sign may depend on task structure and the distribution of advantages.
revision: yes
Referee: The empirical study claims an exact match between the covariance term and entropy differences, supporting the theoretical conclusion. However, the claim that the covariance 'stays mostly positive throughout training' (thereby explaining monotonic entropy decrease) is presented as an observation rather than a general result; its dependence on task, model scale, or algorithm variant is not systematically tested, weakening the generality of the entropy-dynamics explanation.

Authors: The reported exact numerical match between covariance and entropy change is specific to the experimental setups and figures shown. We agree that the consistent positivity is presented as an observation from those runs rather than a proven general result. We will revise the discussion to explicitly frame the positivity as an empirical finding tied to the tasks and models tested, and to acknowledge that systematic variation across model scales, tasks, or algorithm variants (e.g., different advantage estimators) has not been performed. If space permits, we can include a small number of additional runs with a different model size to illustrate consistency, but a full sweep is beyond the current scope.
revision: partial

standing simulated objections · not resolved

A theoretical derivation connecting the covariance-driven entropy dynamics to the specific exponential form of the R-H relation.
Comprehensive empirical validation of covariance positivity and the R-H relation across arbitrary tasks, model scales, and algorithm variants.

Circularity Check

1 step flagged

Fitted R=-a*e^H+b relation presented as empirical law without derivation from covariance dynamics

specific steps

fitted input called prediction [Abstract]
"In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b."

Parameters a and b are fitted to (H,R) data collected during standard RL training. The claimed trade-off, exhaustion bottleneck, and extrapolated ceiling at H=0 are therefore direct outputs of the fitting process and functional form rather than a derived result from the covariance-based entropy dynamics.

full rationale

The paper derives entropy change as driven by covariance between action probabilities and logit updates (proportional to advantage under PG), with exact empirical match to entropy differences. This part is self-contained. However, the load-bearing claim that performance is 'traded from' entropy with a predictable ceiling at H=0 rests on fitting R=-a*e^H+b to observed (H,R) pairs along standard trajectories and labeling the result an 'empirical law'. No derivation connects the covariance mechanism to this exponential form, so the bottleneck interpretation and ceiling are consequences of the chosen fit rather than independent predictions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance-entropy mapping rests on two fitted scalars a and b; the entropy-dynamics claim rests on the standard policy-gradient assumption that the update direction is proportional to advantage.

free parameters (2)

a
Slope parameter in the fitted transformation R = -a * e^H + b relating entropy to downstream reward.
b
Intercept parameter in the same fitted transformation; together with a it sets the predicted performance ceiling R = -a + b at H=0.

axioms (1)

domain assumption: Policy-gradient update direction is proportional to advantage
Invoked to link the covariance term to entropy change in the theoretical derivation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

LawOfExistence defect_zero_iff_one; existence_economically_inevitable echoes
we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion
Cost Jcost_pos_of_ne_one; Jcost_symm echoes
the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage
DiscretenessForcing continuous_no_isolated_zero_defect echoes
the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG 2026-05 unverdicted novelty 7.0

Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem
cs.LG 2026-04 unverdicted novelty 7.0

VaP-CSMV uses a cross-semantic encoder and multi-view decoder to unify DRL solving of HFVRP variants, outperforming prior neural solvers while matching heuristics at much lower inference time and generalizing zero-sho...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
cs.CL 2026-05 unverdicted novelty 6.0

Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
cs.LG 2026-05 unverdicted novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG 2026-05 unverdicted novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
Gradient Extrapolation-Based Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

AEM lifts entropy analysis to the response level and uses a derived uncertainty proxy to rescale advantages, enabling better exploration-exploitation balance and consistent gains over RL baselines on agent benchmarks.
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
cs.LG 2026-04 unverdicted novelty 6.0

Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
cs.LG 2026-04 unverdicted novelty 6.0

Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
TIP: Token Importance in On-Policy Distillation
cs.LG 2026-04 conditional novelty 6.0

In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
cs.AI 2026-04 unverdicted novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
cs.CL 2026-04 unverdicted novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL 2025-06 unverdicted novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
cs.AI 2026-05 unverdicted novelty 5.0

IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
cs.LG 2026-05 unverdicted novelty 5.0

RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

Covariance-based entropy control selectively regularizes high-covariance tokens in softmax policies and achieves asymptotic unbiasedness upon annealing, unlike traditional regularization which introduces dense bias an...
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 4.0

AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.