OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Reasoning with Exploration: An Entropy Perspective
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LLM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
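The "one line" modification described in the abstract, adding an entropy-based term to the advantage, can be illustrated with a minimal sketch. The abstract does not give the exact form of the term, so everything below is an assumption made for illustration: the function names, the coefficients `alpha` and `kappa`, and the choice to clip the bonus at a fraction of the advantage magnitude so that it never flips the advantage's sign. The entropy is treated as a plain number, which stands in for stopping gradients through it in a real training framework.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_shaped_advantages(advantages, token_probs, alpha=0.4, kappa=2.0):
    """Add a clipped entropy bonus to each per-token advantage.

    alpha and kappa are hypothetical coefficients (not taken from the source).
    With kappa > 1 the bonus is at most |advantage| / kappa, so the sign of
    each advantage is preserved.
    """
    shaped = []
    for adv, probs in zip(advantages, token_probs):
        bonus = min(alpha * token_entropy(probs), abs(adv) / kappa)
        shaped.append(adv + bonus)  # the single added term
    return shaped

# Three tokens: one confident prediction, two maximally uncertain binary ones.
advantages = [1.0, 1.0, -1.0]
token_probs = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
shaped = entropy_shaped_advantages(advantages, token_probs)
```

In this sketch the zero-entropy token keeps its advantage unchanged, while the high-entropy tokens receive a positive bonus, which strengthens the update on uncertain tokens in positively rewarded responses and softens the penalty on uncertain tokens in negatively rewarded ones.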
representative citing papers
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
MinervaRL applies reinforcement learning with verifiable rewards from CTI standards to improve LLM structured output performance by 15.8 points over base models across 12 benchmarks.
Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
SAGE reshapes the reverse-KL anchor via a guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.
UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.