UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
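As a rough illustration, here is a minimal sketch of a uniformity penalty folded into GRPO-style group advantages; the token-overlap similarity proxy, the coefficient `lam`, and the function name are illustrative assumptions, not details from the UCPO paper.

```python
import numpy as np

def group_advantages_with_uniformity(rewards, solutions, lam=0.1):
    """GRPO-style group-normalized advantages plus an assumed uniformity
    penalty: a correct solution that overlaps heavily with other correct
    solutions in the group gets its advantage reduced, discouraging collapse
    onto a single solution mode."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # standard GRPO baseline

    correct = [i for i, r in enumerate(rewards) if r > 0]
    for i in correct:
        sims = []
        for j in correct:
            if j == i:
                continue
            # Crude similarity proxy: token-set Jaccard overlap; the paper
            # measures equation-level diversity, which this only approximates.
            a, b = set(solutions[i].split()), set(solutions[j].split())
            sims.append(len(a & b) / max(len(a | b), 1))
        if sims:
            adv[i] -= lam * np.mean(sims)  # penalize redundant correct solutions
    return adv

# Two near-identical correct rollouts are down-weighted relative to a distinct one.
rollouts = ["x = 2 so answer 4", "x = 2 so answer 4", "factor then answer 4", "answer 7"]
print(group_advantages_with_uniformity([1, 1, 1, 0], rollouts))
```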
Reasoning with exploration: An entropy perspective
17 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1)
Citation-polarity summary: background (1)
Representative citing papers
Policy entropy remains constant in flow-matching models during RLHF because of fixed noise schedules, while perceptual diversity still collapses under mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
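A speculative sketch of that rescaling idea: the per-token contribution estimate below (a covariance-style term between log-probability and advantage) and the exponential down-weighting are assumptions, not OPEFO's actual formulation.

```python
import torch

def entropy_rescaled_pg_loss(logps, advantages, alpha=1.0):
    """Policy-gradient loss whose per-token weights shrink updates from tokens
    estimated to push entropy down the most. The contribution estimate and the
    exponential down-weighting are illustrative assumptions."""
    centered_logp = logps - logps.mean()
    centered_adv = advantages - advantages.mean()
    # High when an already-likely token is reinforced further, the usual
    # driver of entropy collapse.
    contribution = centered_logp * centered_adv
    weights = torch.exp(-alpha * contribution.clamp(min=0)).detach()
    return -(weights * logps * advantages).mean()

# Toy usage with random per-token log-probs and advantages.
logps = torch.log(torch.rand(8))
advantages = torch.randn(8)
print(entropy_rescaled_pg_loss(logps, advantages))
```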
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
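A minimal sketch of masking low-entropy tokens with a decaying eligibility trace; the quantile threshold and geometric decay factor are assumptions, not S-trace's published design.

```python
import torch

def sparse_trace(token_entropies, quantile=0.7, decay=0.9):
    """Sparse eligibility trace: only tokens whose predictive entropy exceeds
    a per-sequence quantile stay eligible for credit, and eligibility decays
    geometrically afterwards."""
    thresh = token_entropies.quantile(quantile)        # keep the high-entropy tail
    eligible = (token_entropies >= thresh).float()     # mask out confident tokens

    trace = torch.zeros_like(token_entropies)
    running = 0.0
    for t in range(len(token_entropies)):
        running = decay * running + eligible[t].item()  # accumulate and decay
        trace[t] = running
    return trace

# Low-entropy (confident) tokens receive little or no credit.
entropies = torch.tensor([0.10, 0.05, 1.20, 0.08, 0.90, 0.07])
print(sparse_trace(entropies))
```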
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
MEDS improves LLM RL performance by up to 4.13 points in pass@1 and 4.37 points in pass@128 by dynamically penalizing rollouts that match prevalent historical error clusters, identified from memory-stored representations via density clustering.
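One way such an error-memory penalty could look, assuming rollouts are embedded as vectors and clustered with DBSCAN; the clustering parameters and flat penalty are illustrative, not MEDS's reported configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def error_cluster_penalty(error_memory, rollout_embs, eps=0.5, penalty=0.5):
    """Penalize rollouts that land near dense clusters of previously stored
    incorrect-rollout embeddings. The embedding space, DBSCAN settings, and
    flat penalty are illustrative assumptions."""
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(error_memory)
    centroids = [error_memory[labels == k].mean(axis=0)
                 for k in set(labels) if k != -1]      # -1 marks noise points

    penalties = np.zeros(len(rollout_embs))
    for i, emb in enumerate(rollout_embs):
        if centroids and min(np.linalg.norm(emb - c) for c in centroids) < eps:
            penalties[i] = penalty                     # repeats a prevalent error pattern
    return penalties

# A dense blob of past errors near the origin penalizes the nearby rollout only.
memory = np.random.normal(0.0, 0.1, size=(20, 4))
rollouts = np.vstack([np.zeros(4), np.full(4, 3.0)])
print(error_cluster_penalty(memory, rollouts))
```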
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding stronger visual attention and better performance on multimodal reasoning tasks.
Average log probability selection for LLM reasoning datasets is confounded by step length because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
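A small worked example of the confound: a single low-probability first token is averaged away as the step grows longer, so mean log-probability rises with length even though the step begins from the same unlikely token (numbers are illustrative).

```python
import numpy as np

first_token_logp = np.log(0.01)   # an unlikely first token, log-prob ~ -4.61
filler_logp = np.log(0.90)        # confident continuation tokens, log-prob ~ -0.11

for length in (2, 5, 20):
    logps = [first_token_logp] + [filler_logp] * (length - 1)
    print(f"step length {length:2d}: mean log-prob = {np.mean(logps):.3f}")
# Mean log-prob climbs from roughly -2.4 to -1.0 to -0.3 as the step lengthens,
# purely because the single low-probability token is diluted.
```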
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.
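One plausible instantiation of that decoupling; the multiplicative scaling on normalized token entropy and the two separate coefficients are assumptions rather than AsymGRPO's exact rule.

```python
import torch

def asymmetric_advantage_modulation(advantages, token_entropies,
                                    pos_coef=0.3, neg_coef=0.1):
    """Decoupled modulation sketch: positive advantages are boosted on
    high-entropy tokens (useful exploration), while negative advantages get a
    separate, smaller boost there (suppressing noisy exploration without
    over-penalizing it)."""
    ent = (token_entropies - token_entropies.mean()) / (token_entropies.std() + 1e-6)
    boost = ent.clamp(min=0)                      # only above-average-entropy tokens
    pos_scale = 1.0 + pos_coef * boost
    neg_scale = 1.0 + neg_coef * boost
    return torch.where(advantages >= 0, advantages * pos_scale, advantages * neg_scale)

# Same advantage magnitude, different treatment depending on sign and entropy.
adv = torch.tensor([0.5, 0.5, -0.5, -0.5])
ent = torch.tensor([2.0, 0.1, 2.0, 0.1])
print(asymmetric_advantage_modulation(adv, ent))
```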
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is provably aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
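A minimal sketch of difficulty-gated exploration, assuming exploration is controlled through an entropy-bonus coefficient tied to the prompt's group pass rate; the linear gating and base coefficient are illustrative, not UEC-RL's schedule.

```python
import numpy as np

def exploration_bonus_coef(group_rewards, base_coef=0.01):
    """Difficulty-gated exploration: the entropy-bonus coefficient grows as
    the prompt's group pass rate falls, so hard prompts get more exploration
    and solved prompts get none."""
    pass_rate = float(np.mean(np.asarray(group_rewards) > 0))
    return base_coef * (1.0 - pass_rate)

print(exploration_bonus_coef([0, 0, 0, 0]))   # hard prompt: full bonus, 0.01
print(exploration_bonus_coef([1, 1, 1, 0]))   # mostly solved: 0.0025
print(exploration_bonus_coef([1, 1, 1, 1]))   # solved: 0.0
```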
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.
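A minimal sketch of concentrating credit on high-entropy tokens by weighting per-token policy-gradient terms with normalized entropy; the max-normalization and the low-entropy floor are assumptions, not the paper's exact objective.

```python
import torch

def entropy_weighted_pg_loss(logps, advantages, token_entropies, floor=0.1):
    """Entropy-aware credit assignment sketch: per-token policy-gradient terms
    are weighted by normalized token entropy so most of the update lands on
    high-entropy "fork" tokens."""
    weights = token_entropies / (token_entropies.max() + 1e-6)  # scale into [0, 1]
    weights = weights.clamp(min=floor).detach()                 # keep a small signal everywhere
    return -(weights * logps * advantages).mean()

# High-entropy tokens dominate the loss even with equal advantages.
logps = torch.tensor([-0.10, -0.10, -2.00, -0.10])
advantages = torch.ones(4)
entropies = torch.tensor([0.05, 0.05, 1.50, 0.05])
print(entropy_weighted_pg_loss(logps, advantages, entropies))
```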
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Citing papers explorer
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models