Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
8 Pith papers cite this work. Polarity classification is still indexing.
abstract
Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$ -- the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO) pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
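The backward correction described in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so this assumes the standard noisy-label inversion for a binary channel: if the observed reward flips 0 to 1 with probability $\rho_0$ and 1 to 0 with probability $\rho_1$, then $(\tilde r - \rho_0)/(1 - \rho_0 - \rho_1)$ has expectation equal to the clean reward whenever $\rho_0 + \rho_1 < 1$. All function names below are hypothetical and are not taken from the paper's code. The forward correction is not sketched because the abstract does not specify its weights.

```python
import random


def noisy_verifier(clean_reward, rho0, rho1, rng):
    """Simulate an imperfect verifier on a clean binary reward.

    rho0: false-positive rate, P(observed = 1 | clean = 0)
    rho1: false-negative rate, P(observed = 0 | clean = 1)
    """
    if clean_reward == 1:
        return 0 if rng.random() < rho1 else 1
    return 1 if rng.random() < rho0 else 0


def backward_corrected_reward(observed, rho0, rho1):
    """Assumed surrogate: invert the noise channel so the mean matches the clean reward."""
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("the inversion requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_corrected(clean_reward, rho0, rho1):
    """Exact expectation of the surrogate over the verifier's noise."""
    p_obs_one = (1.0 - rho1) if clean_reward == 1 else rho0
    return (p_obs_one * backward_corrected_reward(1, rho0, rho1)
            + (1.0 - p_obs_one) * backward_corrected_reward(0, rho0, rho1))


def mean_corrected(clean_reward, rho0, rho1, n=200_000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        observed = noisy_verifier(clean_reward, rho0, rho1, rng)
        total += backward_corrected_reward(observed, rho0, rho1)
    return total / n
```

Note that the surrogate takes values outside $\{0,1\}$ (it is negative when the observed reward is 0 and $\rho_0 > 0$), which is the usual price of unbiasedness: the mean is right, but the variance grows as $\rho_0 + \rho_1$ approaches 1.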
citation-role summary
citation-polarity summary
years
2026: 8
verdicts
UNVERDICTED: 8
roles
background: 2
polarities
background: 2
representative citing papers
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-based policy optimization.
SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees on safety monotonicity, policy convergence, and accountability propagation.
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
Controlled noise injection into GSM8K rewards for Qwen2.5 models shows persistent validation gaps under compute scaling and asymmetric degradation from false negatives versus false positives.
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
This review synthesizes representative advances in high-dimensional statistics, highlights common themes and open problems, and points to key entry works.
citing papers explorer
- Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
  Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
- High-Dimensional Statistics: Reflections on Progress and Open Problems
  This review synthesizes representative advances in high-dimensional statistics, highlights common themes and open problems, and points to key entry works.