DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
Process Reinforcement through Implicit Rewards
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
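The abstract does not spell out how an implicit process reward is computed. As a rough sketch, assuming the usual implicit-reward formulation (a per-token reward equal to a scaled log-ratio between the reward model and a reference model, trained with a cross-entropy loss against the 0/1 outcome label), the idea might look like the following. The function names, the toy log-probabilities, and the default `beta` are illustrative assumptions, not the authors' code.

```python
import math

def implicit_process_rewards(prm_logprobs, ref_logprobs, beta=0.05):
    """Per-token implicit reward: beta * (log pi_prm - log pi_ref).

    Inputs are per-token log-probabilities of the same sampled response
    under the (hypothetical) implicit PRM and a frozen reference model.
    beta=0.05 is an assumed default for illustration only.
    """
    if len(prm_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    return [beta * (p - r) for p, r in zip(prm_logprobs, ref_logprobs)]

def outcome_ce_loss(token_rewards, outcome_label):
    """Binary cross-entropy between sigmoid(sum of token rewards) and a 0/1 label.

    Only the outcome label is needed, so no step-level annotation is used.
    """
    score = sum(token_rewards)
    # Numerically stable log(sigmoid(score)).
    if score >= 0:
        log_p = -math.log1p(math.exp(-score))
    else:
        log_p = score - math.log1p(math.exp(score))
    # log(1 - sigmoid(s)) = log(sigmoid(s)) - s
    log_not_p = log_p - score
    return -(outcome_label * log_p + (1 - outcome_label) * log_not_p)

# Toy rollout: three tokens, made-up log-probabilities.
rewards = implicit_process_rewards([-0.2, -1.0, -0.5], [-0.4, -0.9, -1.5], beta=1.0)
loss_if_correct = outcome_ce_loss(rewards, 1)
loss_if_wrong = outcome_ce_loss(rewards, 0)
```

In this sketch the dense per-token values in `rewards` fall out of a model that is only ever supervised with outcome labels, which is what makes online updates from policy rollouts feasible without step-level annotation.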
representative citing papers
SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
CHERRL is a new controllable testbed for reproducing, analyzing, and detecting reward hacking in rubric-based RL by injecting known biases into LLM-as-a-Judge systems.
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
ThinkARM abstracts LLM reasoning traces into Schoenfeld episodes and shows that exploration steps correlate with correctness while efficiency methods selectively suppress evaluative feedback.
TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.
LCA frames outcome-supervised PRM training as MIL, introduces SWS pooling for dependent steps, proves Bayes consistency under mild assumptions, and reports consistent gains over prior outcome-supervised baselines.
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
citing papers explorer
- TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.
- Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
- What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
- Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
- GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.
- Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents
MMPO introduces Belief Entropy as a self-supervised signal to provide fine-grained supervision for memory policies in LLM agents, outperforming outcome-based RL on long-horizon tasks up to 1.75M tokens.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
- Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.
- From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
- SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
- OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search