UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
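The summary above names the mechanism: standard GRPO group-normalized advantages, plus a penalty that keeps correct rollouts from collapsing onto a single solution. Below is a minimal sketch of that idea, assuming a pairwise-similarity form for the uniformity penalty; the `similarity` input, the `lambda_unif` weight, and the exact penalty shape are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def grpo_advantages_with_uniformity(rewards, correct_mask, similarities, lambda_unif=0.1):
    """GRPO-style group advantages with an illustrative uniformity penalty.

    rewards:      (G,) verifier rewards for a group of G rollouts of one prompt
    correct_mask: (G,) boolean, True where the rollout's final answer is correct
    similarities: (G, G) pairwise solution similarity in [0, 1]
                  (e.g. equation-level overlap); only used among correct rollouts
    lambda_unif:  assumed penalty weight, chosen here only for illustration
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard GRPO: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Illustrative uniformity penalty: a correct rollout that closely resembles
    # many other correct rollouts has its advantage reduced, so the policy is
    # not rewarded for concentrating on a single solution mode.
    correct_idx = np.where(np.asarray(correct_mask))[0]
    if len(correct_idx) > 1:
        sub = np.asarray(similarities, dtype=float)[np.ix_(correct_idx, correct_idx)]
        # Mean similarity of each correct rollout to the other correct rollouts.
        redundancy = (sub.sum(axis=1) - np.diag(sub)) / (len(correct_idx) - 1)
        adv[correct_idx] -= lambda_unif * redundancy
    return adv
```

Read this way, the penalty touches only correct rollouts, which is consistent with the summary's claim that UCPO targets diversity among correct solutions rather than trading accuracy for raw entropy.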
citation dossier
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, and 1 other
why this work matters in Pith
Pith has found this work cited in 17 reviewed papers. Its strongest current cluster is cs.LG (7 papers), and the largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules, while perceptual diversity collapses under mode-seeking policy gradients; perceptual entropy constraints are therefore introduced to preserve diversity and improve quality.
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% in pass@16 on Qwen3 models while improving sample and token efficiency.
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
MEDS improves LLM RL performance by up to 4.13 points in pass@1 and 4.37 points in pass@128 by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding stronger visual activation and better performance on multimodal reasoning tasks.
Selecting LLM reasoning data by average log probability is confounded by step length, because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization, which outperforms baselines; a minimal sketch of this entropy-gated credit idea follows the list.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
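Several items above (S-trace, AsymGRPO, the polarity-entropy analysis) share one concrete move: let each token's entropy decide how much policy-gradient credit it receives. Below is a minimal sketch of that shared idea, assuming a hard entropy threshold and a plain REINFORCE-style surrogate; the threshold value and the gating rule are illustrative, not any single paper's recipe.

```python
import torch
import torch.nn.functional as F

def entropy_gated_policy_loss(logits, actions, advantages, entropy_threshold=0.7):
    """Token-level policy-gradient loss that routes credit to high-entropy tokens.

    logits:     (T, V) policy logits for one rollout of T tokens
    actions:    (T,)   sampled token ids
    advantages: (T,)   per-token advantages (e.g. a broadcast sequence advantage)
    entropy_threshold: assumed cutoff; tokens below it get no update, loosely
                       mirroring the "mask low-entropy tokens" idea above.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)          # (T,)
    chosen_logp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # (T,)

    gate = (token_entropy > entropy_threshold).float()
    # REINFORCE-style surrogate restricted to the gated (high-entropy) tokens.
    loss = -(gate * advantages * chosen_logp).sum() / gate.sum().clamp(min=1.0)
    return loss
```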
citing papers explorer
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules, while perceptual diversity collapses under mode-seeking policy gradients; perceptual entropy constraints are therefore introduced to preserve diversity and improve quality.
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
- AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% in pass@16 on Qwen3 models while improving sample and token efficiency.
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
- The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 points in pass@1 and 4.37 points in pass@128 by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
- Visually-Guided Policy Optimization for Multimodal Reasoning
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding stronger visual activation and better performance on multimodal reasoning tasks.
- On the Step Length Confounding in LLM Reasoning Data Selection
Selecting LLM reasoning data by average log probability is confounded by step length, because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
- Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance; a minimal sketch of the asymmetric scaling follows this list.
- Policy Improvement Reinforcement Learning
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
- Targeted Exploration via Unified Entropy Control for Reinforcement Learning
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
- Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
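As a companion to the entropy-gating sketch earlier on this page, the asymmetric advantage modulation described for AsymGRPO reduces, in its simplest reading, to two separate scaling coefficients for positive and negative advantages; the coefficient values and the idea of applying them directly to group-normalized advantages are illustrative assumptions, not the paper's stated recipe.

```python
import numpy as np

def asymmetric_advantage_modulation(advantages, pos_scale=1.2, neg_scale=0.8):
    """Scale positive and negative group-normalized advantages separately.

    advantages: array of GRPO-style group-normalized advantages
    pos_scale:  assumed boost for positive advantages (reinforce useful exploration)
    neg_scale:  assumed damping for negative advantages (soften noisy suppression)
    """
    adv = np.asarray(advantages, dtype=float)
    return np.where(adv > 0.0, pos_scale * adv, neg_scale * adv)
```

With pos_scale above 1 and neg_scale below 1, rarely sampled but correct continuations keep more of their gradient signal than incorrect ones lose, which is one way to read "boost useful entropy and suppress noisy entropy."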