UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
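The summary above names the mechanism: standard GRPO group-normalized advantages, plus a penalty that keeps correct rollouts from collapsing onto a single solution. Below is a minimal sketch of that idea, assuming a pairwise-similarity form for the uniformity penalty; the `similarity` input, the `lambda_unif` weight, and the exact penalty shape are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def grpo_advantages_with_uniformity(rewards, correct_mask, similarities, lambda_unif=0.1):
    """GRPO-style group advantages with an illustrative uniformity penalty.

    rewards:      (G,) verifier rewards for a group of G rollouts of one prompt
    correct_mask: (G,) boolean, True where the rollout's final answer is correct
    similarities: (G, G) pairwise solution similarity in [0, 1]
                  (e.g. equation-level overlap); only used among correct rollouts
    lambda_unif:  assumed penalty weight, chosen here only for illustration
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard GRPO: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Illustrative uniformity penalty: a correct rollout that closely resembles
    # many other correct rollouts has its advantage reduced, so the policy is
    # not rewarded for concentrating on a single solution mode.
    correct_idx = np.where(np.asarray(correct_mask))[0]
    if len(correct_idx) > 1:
        sub = np.asarray(similarities, dtype=float)[np.ix_(correct_idx, correct_idx)]
        # Mean similarity of each correct rollout to the other correct rollouts.
        redundancy = (sub.sum(axis=1) - np.diag(sub)) / (len(correct_idx) - 1)
        adv[correct_idx] -= lambda_unif * redundancy
    return adv
```

Read this way, the penalty touches only correct rollouts, which is consistent with the summary's claim that UCPO targets diversity among correct solutions rather than trading accuracy for raw entropy.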
citation dossier
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, and 1 other
why this work matters in Pith
Pith has found this work cited in 17 reviewed papers. Its strongest current cluster is cs.LG (7 papers), and the largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules, while perceptual diversity collapses under mode-seeking policy gradients; perceptual entropy constraints are therefore introduced to preserve diversity and improve quality.
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% in pass@16 on Qwen3 models while improving sample and token efficiency.
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
MEDS improves LLM RL performance by up to 4.13 points in pass@1 and 4.37 points in pass@128 by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding stronger visual activation and better performance on multimodal reasoning tasks.
Selecting LLM reasoning data by average log probability is confounded by step length, because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance.
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization, which outperforms baselines; a minimal sketch of this entropy-gated credit idea follows the list.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
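Several items above (S-trace, AsymGRPO, the polarity-entropy analysis) share one concrete move: let each token's entropy decide how much policy-gradient credit it receives. Below is a minimal sketch of that shared idea, assuming a hard entropy threshold and a plain REINFORCE-style surrogate; the threshold value and the gating rule are illustrative, not any single paper's recipe.

```python
import torch
import torch.nn.functional as F

def entropy_gated_policy_loss(logits, actions, advantages, entropy_threshold=0.7):
    """Token-level policy-gradient loss that routes credit to high-entropy tokens.

    logits:     (T, V) policy logits for one rollout of T tokens
    actions:    (T,)   sampled token ids
    advantages: (T,)   per-token advantages (e.g. a broadcast sequence advantage)
    entropy_threshold: assumed cutoff; tokens below it get no update, loosely
                       mirroring the "mask low-entropy tokens" idea above.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)          # (T,)
    chosen_logp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # (T,)

    gate = (token_entropy > entropy_threshold).float()
    # REINFORCE-style surrogate restricted to the gated (high-entropy) tokens.
    loss = -(gate * advantages * chosen_logp).sum() / gate.sum().clamp(min=1.0)
    return loss
```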
citing papers explorer
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules, while perceptual diversity collapses under mode-seeking policy gradients; perceptual entropy constraints are therefore introduced to preserve diversity and improve quality.
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
- AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% in pass@16 on Qwen3 models while improving sample and token efficiency.
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
Policy Split bifurcates LLM policies into normal and high-entropy modes with dual-mode entropy regularization to enhance exploration while preserving task accuracy.
- The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 points in pass@1 and 4.37 points in pass@128 by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
- Visually-Guided Policy Optimization for Multimodal Reasoning
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding stronger visual activation and better performance on multimodal reasoning tasks.
- On the Step Length Confounding in LLM Reasoning Data Selection
Selecting LLM reasoning data by average log probability is confounded by step length, because longer steps dilute low-probability first tokens; ASLEC-DROP and ASLEC-CASL remove this bias.
- Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
AsymGRPO decouples positive and negative advantage modulation in RLVR to separately boost useful entropy and suppress noisy entropy, improving LLM reasoning performance; a minimal sketch of the asymmetric scaling follows this list.
- Policy Improvement Reinforcement Learning
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
- Targeted Exploration via Unified Entropy Control for Reinforcement Learning
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
- Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
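As a companion to the entropy-gating sketch earlier on this page, the asymmetric advantage modulation described for AsymGRPO reduces, in its simplest reading, to two separate scaling coefficients for positive and negative advantages; the coefficient values and the idea of applying them directly to group-normalized advantages are illustrative assumptions, not the paper's stated recipe.

```python
import numpy as np

def asymmetric_advantage_modulation(advantages, pos_scale=1.2, neg_scale=0.8):
    """Scale positive and negative group-normalized advantages separately.

    advantages: array of GRPO-style group-normalized advantages
    pos_scale:  assumed boost for positive advantages (reinforce useful exploration)
    neg_scale:  assumed damping for negative advantages (soften noisy suppression)
    """
    adv = np.asarray(advantages, dtype=float)
    return np.where(adv > 0.0, pos_scale * adv, neg_scale * adv)
```

With pos_scale above 1 and neg_scale below 1, rarely sampled but correct continuations keep more of their gradient signal than incorrect ones lose, which is one way to read "boost useful entropy and suppress noisy entropy."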