hub Canonical reference

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang · 2025 · cs.CL · arXiv 2506.14758

Canonical reference. 75% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LLM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 1 method 1

citation-polarity summary

background 6 baseline 1 use method 1

representative citing papers

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

cs.LG · 2026-01-31 · unverdicted · novelty 7.0

MinervaRL applies reinforcement learning with verifiable rewards from CTI standards to improve LLM structured output performance by 15.8 points over base models across 12 benchmarks.

Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

cs.AI · 2026-01-08 · conditional · novelty 7.0

Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

cs.CL · 2026-04-25 · unverdicted · novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

On the Step Length Confounding in LLM Reasoning Data Selection

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.

Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR

cs.CL · 2026-04-06 · conditional · novelty 6.0 · 2 refs

AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

cs.LG · 2026-02-10 · unverdicted · novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

cs.LG · 2025-12-05 · unverdicted · novelty 6.0

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-11-08 · unverdicted · novelty 6.0

Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

cs.AI · 2025-10-12 · unverdicted · novelty 6.0

UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

cs.LG · 2025-10-09 · unverdicted · novelty 6.0

SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.

Entropy After </Think> for reasoning model early exiting

cs.LG · 2025-09-30 · unverdicted · novelty 6.0

Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

cs.AI · 2025-09-28 · unverdicted · novelty 6.0

RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation cs.AI · 2026-05-18 · unverdicted · none · ref 17 · internal anchor
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models cs.AI · 2026-01-08 · conditional · none · ref 2 · internal anchor
Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 6 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning cs.AI · 2025-10-12 · unverdicted · none · ref 3 · internal anchor
UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.
Emergent Slow Thinking in LLMs as Inverse Tree Freezing cs.AI · 2025-09-28 · unverdicted · none · ref 7 · internal anchor
RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.
Targeted Exploration via Unified Entropy Control for Reinforcement Learning cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 119 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning cs.AI · 2026-04-20 · unreviewed · ref 34 · internal anchor

Reasoning with Exploration: An Entropy Perspective

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer