Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie , Ruotong Pan , Xiangyu Wu , Yunfei Zhang , Jiayi Fu , Tingting Gao , Guorui Zhou

Authors on Pith no claims yet

classification 💻 cs.AI

keywords reasoningucasadvantageexplorationrlvracrosscollapseentropy

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using a logit-space self-confidence proxy, and then applies an asymmetric token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse. Code is available at https://github.com/xvolcano02/UCAS.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
cs.CV 2026-04 unverdicted novelty 6.0

A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.
One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement
cs.CL 2026-04 unverdicted novelty 5.0

ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to u...
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.