Process Reinforcement through Implicit Rewards

Ganqu Cui, Hanbin Wang, Jiacheng Chen, Lifan Yuan, Yuchen Zhang, Zefan Wang · 2025 · cs.LG · arXiv 2502.01456

Canonical reference. 79% of citing Pith papers cite this work as background.

112 Pith papers citing it

abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
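The abstract does not spell out how an implicit process reward is computed, so the following is only a minimal sketch under stated assumptions. It assumes the commonly used log-ratio formulation, in which each token's reward is a scaled difference between the log-probability assigned by the reward model and by a frozen reference model, and the summed reward is trained against the outcome label with a cross-entropy loss. The function names, the `beta` value, and the toy log-probabilities are all hypothetical and chosen for illustration.

```python
import math


def implicit_process_rewards(prm_logps, ref_logps, beta=0.05):
    """Per-token rewards as beta * (log p_prm - log p_ref).

    Assumption: log-ratio form of implicit rewards; beta is illustrative.
    """
    if len(prm_logps) != len(ref_logps):
        raise ValueError("log-probability sequences must have equal length")
    return [beta * (p - r) for p, r in zip(prm_logps, ref_logps)]


def outcome_cross_entropy(token_rewards, outcome_label):
    """Binary cross-entropy between sigmoid(sum of rewards) and a 0/1 label.

    Only the outcome label is needed; no per-step labels are used.
    """
    z = sum(token_rewards)
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        log_sig = -math.log1p(math.exp(-z))
    else:
        log_sig = z - math.log1p(math.exp(z))
    # log(1 - sigmoid(z)) equals log(sigmoid(z)) - z.
    log_one_minus_sig = log_sig - z
    return -(outcome_label * log_sig + (1 - outcome_label) * log_one_minus_sig)


# Toy rollout of two tokens with made-up log-probabilities.
rewards = implicit_process_rewards([-1.0, -2.0], [-1.5, -1.0], beta=1.0)
loss_if_correct = outcome_cross_entropy(rewards, outcome_label=1)
loss_if_wrong = outcome_cross_entropy(rewards, outcome_label=0)
```

In this sketch the dense per-token rewards fall out of the same quantities used for the outcome-level loss, which is one way to read the abstract's claim that the PRM can be updated online from rollouts and outcome labels alone.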

citation-role summary

background: 19 · method: 5

citation-polarity summary

background: 19 · use method: 5

representative citing papers

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.

Reinforcement Learning from Rich Feedback with Distributional DAgger

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CHERRL is a new controllable testbed for reproducing, analyzing, and detecting reward hacking in rubric-based RL by injecting known biases into LLM-as-a-Judge systems.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

Unsupervised Process Reward Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

cs.CL · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

cs.LG · 2026-04-19 · accept · novelty 7.0

The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

cs.LG · 2026-04-12 · unverdicted · novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

Self-Distilled RLVR

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

cs.CL · 2025-12-23 · unverdicted · novelty 7.0

ThinkARM abstracts LLM reasoning traces into Schoenfeld episodes and shows that exploration steps correlate with correctness while efficiency methods selectively suppress evaluative feedback.

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

cs.AI · 2025-12-08 · conditional · novelty 7.0

TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

cs.SE · 2025-10-21 · conditional · novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

LCA frames outcome-supervised PRM training as MIL, introduces SWS pooling for dependent steps, proves Bayes consistency under mild assumptions, and reports consistent gains over prior outcome-supervised baselines.

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

The log-probability ratio from RL post-training recovers the optimal advantage function, providing an effective free signal for test-time scaling, uncertainty estimation, and failure attribution in LLM agents.

Learning with a Single Rollout via Monte Carlo Pass@k Critic

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

SR-PPO trains a Pass@k critic from single-rollout Monte Carlo outcomes to enable token-level advantage estimation in language model RL, yielding stable training and Pass@128 gains on math benchmarks.

Learning Process Rewards via Success Visitation Matching for Efficient RL

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

citing papers explorer

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

cs.AI · 2025-12-08 · conditional · none · ref 1 · internal anchor

TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.

Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

cs.AI · 2026-06-22 · unverdicted · none · ref 3 · internal anchor

DEAR identifies decision tokens via entropy and evidence tokens via cosine similarity plus divergence to improve on-policy reasoning distillation over standard methods.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

cs.AI · 2026-06-10 · unverdicted · none · ref 17 · internal anchor

Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

cs.AI · 2026-05-19 · unverdicted · none · ref 7 · internal anchor

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · none · ref 28 · 2 links · internal anchor

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

cs.AI · 2026-05-02 · unverdicted · none · ref 2 · internal anchor

GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

cs.AI · 2026-04-22 · unverdicted · none · ref 10 · internal anchor

V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

cs.AI · 2025-10-12 · unverdicted · none · ref 4 · internal anchor

UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

cs.AI · 2025-07-31 · unverdicted · none · ref 6 · internal anchor

RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

cs.AI · 2026-05-28 · unverdicted · none · ref 2 · internal anchor

MMPO introduces Belief Entropy as a self-supervised signal to provide fine-grained supervision for memory policies in LLM agents, outperforming outcome-based RL on long-horizon tasks up to 1.75M tokens.

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

cs.AI · 2026-05-27 · unverdicted · none · ref 6 · internal anchor

Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

cs.AI · 2026-04-30 · unverdicted · none · ref 16 · internal anchor

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent

cs.AI · 2026-04-20 · conditional · none · ref 5 · internal anchor

Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

cs.AI · 2025-08-13 · unverdicted · none · ref 50 · internal anchor

LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 146 · internal anchor

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

cs.AI · 2026-06-27 · unverdicted · none · ref 66 · internal anchor

BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

cs.AI · 2026-04-20 · conditional · none · ref 21 · internal anchor

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · none · ref 269 · internal anchor

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

cs.AI · 2026-04-11 · unreviewed · ref 8 · internal anchor

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

cs.AI · 2026-04-04 · unreviewed · ref 6 · internal anchor
