DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
hub Canonical reference
Process Reinforcement through Implicit Rewards
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward model
co-cited works
representative citing papers
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
ThinkARM abstracts LLM reasoning traces into Schoenfeld episodes and shows that exploration steps correlate with correctness while efficiency methods selectively suppress evaluative feedback.
TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.
LCA frames outcome-supervised PRM training as MIL, introduces SWS pooling for dependent steps, proves Bayes consistency under mild assumptions, and reports consistent gains over prior outcome-supervised baselines.
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.
ADWIN adaptively selects training horizons in on-policy distillation via prefix alignment checks, cutting end-to-end cost by up to 4.1x while matching or exceeding full-rollout accuracy on math and code benchmarks.
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
citing papers explorer
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
-
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.