pith. machine review for the scientific record. sign in

arxiv: 2502.01456 · v2 · submitted 2025-02-03 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Process Reinforcement through Implicit Rewards

Bingxiang He, Bowen Zhou, Ganqu Cui, Hanbin Wang, Hao Peng, Huayu Chen, Jiacheng Chen, Jiarui Yuan, Kaiyan Zhang, Lifan Yuan, Maosong Sun, Ning Ding, Qixin Xu, Shuo Wang, Tianyu Yu, Weize Chen, Wendi Li, Xingtai Lv, Xu Han, Yuan Yao, Yuchen Fan, Yu Cheng, Yuchen Zhang, Zefan Wang, Zhiyuan Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 20:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords process reward modelsimplicit rewardsreinforcement learningLLM reasoningoutcome supervisionmath benchmarkscredit assignment
0
0 comments X

The pith

PRIME derives implicit process rewards from policy rollouts and outcome labels alone to enable online training of process reward models for LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that dense process supervision for large language models can be achieved without collecting explicit step-by-step labels by instead computing implicit rewards directly from the model's own rollouts paired with final outcome correctness. This removes the expensive offline labeling step and the separate reward-model training phase required by prior methods, letting the process reward model update online during reinforcement learning. The resulting signals address credit assignment and training efficiency problems that arise with purely sparse outcome rewards. Experiments on mathematical competition problems and coding tasks starting from a base model demonstrate clear gains over supervised fine-tuning and competitive performance against larger instruct models trained on far more data.

Core claim

PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards, combines well with various advantage functions, and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead.

What carries the argument

Implicit process rewards computed from policy rollouts and outcome labels, which supply fine-grained training signals for updating the process reward model directly during reinforcement learning.

Load-bearing premise

Implicit rewards extracted only from full rollouts and final outcome labels can give reliable step-level credit without reward hacking or mis-assignment that would occur if the signals were noisy.

What would settle it

Running the same reinforcement learning loop on math and coding benchmarks and finding no average gain over the supervised fine-tuning baseline or observing clear reward-hacking behaviors such as length exploitation without reasoning improvement.

read the original abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PRIME, a method for online training of process reward models (PRMs) in LLM reinforcement learning that derives implicit process rewards solely from policy rollouts and binary outcome labels, eliminating the need for expensive dedicated process annotations. It combines this with various advantage functions and reports substantial gains on mathematical reasoning and coding benchmarks: starting from Qwen2.5-Math-7B-Base, the approach yields a 15.1% average improvement over the SFT baseline, with the resulting Eurus-2-7B-PRIME model outperforming Qwen2.5-Math-7B-Instruct on seven benchmarks using only 10% of the training data.

Significance. If the implicit-reward mechanism genuinely supplies reliable step-level signals that improve credit assignment over outcome-only RL without introducing new forms of reward hacking, the method could meaningfully reduce the cost and complexity of dense-reward RL for reasoning models. The reported benchmark improvements are large enough to be practically relevant, and the ability to forgo a separate PRM training stage is a clear engineering advantage; however, these benefits hinge on the unverified assumption that outcome-derived implicit rewards meaningfully differentiate correct versus incorrect intermediate steps in long trajectories.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.
  2. [Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.
  3. [§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.
minor comments (2)
  1. [§3] Notation for the implicit reward function is introduced without a clear reference to the advantage estimator used (e.g., which of the 'various advantage functions' is default); a single equation or algorithm box would improve reproducibility.
  2. [Experiments] The abstract states '10% of its training data' for Eurus-2-7B-PRIME versus Qwen2.5-Math-7B-Instruct, but the main text does not specify the exact data volume or composition used for the instruct model, making the comparison harder to interpret.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have prepared revisions to improve clarity and strengthen the experimental support for our claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.

    Authors: We agree that the extraction of per-step implicit rewards from rollouts and terminal labels requires more explicit formalization. In the revised manuscript we will insert the full set of equations and pseudocode in §3 that define the implicit reward computation, showing how the terminal outcome is used to assign differentiated step-level signals via the rollout structure rather than uniform back-propagation. This addition will make clear that the procedure is non-circular and process-sensitive. revision: yes

  2. Referee: [Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.

    Authors: The referee correctly notes the absence of a direct outcome-only ablation and step-level correlation analysis. We will add these controls in the revised experiments section: (1) a head-to-head comparison of PRIME against pure outcome RL using identical data, optimizer, and base model, and (2) correlation metrics between the learned implicit rewards and available step annotations on a held-out set. These additions will allow readers to attribute performance gains more precisely to the implicit process signals. revision: yes

  3. Referee: [§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.

    Authors: We acknowledge that end-to-end benchmark gains alone do not fully rule out credit-assignment ambiguity. In the revision we will include step-level accuracy metrics on trajectories where partial process labels exist and add qualitative examples illustrating reward differentiation between correct and incorrect intermediate steps. While obtaining new large-scale human process annotations is resource-intensive and beyond the scope of the current study, the added quantitative and qualitative analyses will provide direct evidence that the implicit rewards supply non-uniform, process-sensitive signals. revision: partial

Circularity Check

0 steps flagged

No circularity: implicit rewards derived from standard rollout-based advantage estimation without self-referential reduction.

full rationale

The paper's core mechanism computes implicit process rewards directly from policy-generated rollouts paired with terminal outcome labels, then uses these for online PRM updates within existing advantage estimators. This construction is self-contained and does not define the reward signal in terms of the target improvement, fit a parameter on a subset and relabel it as a prediction, or import uniqueness via self-citation chains. Empirical gains on math/coding benchmarks are presented as validation rather than as a definitional consequence of the method itself. No load-bearing step reduces by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that outcome labels suffice to generate useful implicit process signals; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Outcome labels alone can be used to derive reliable implicit process rewards during online training
    This is the core premise that allows PRIME to forgo dedicated process label collection.

pith-pipeline@v0.9.0 · 5635 in / 1210 out tokens · 35223 ms · 2026-05-11T20:17:20.605350+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost uniqueness echoes

    PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards.

  • Foundation.LawOfExistence defect_zero_iff_one echoes

    dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  3. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  4. Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

  5. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  6. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  7. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  8. A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

    cs.LG 2026-04 accept novelty 7.0

    The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.

  9. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  10. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  11. Self-Distilled RLVR

    cs.LG 2026-04 unverdicted novelty 7.0

    RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

  12. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  13. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  14. H\"older Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  15. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  16. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

  17. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  18. AIPO: : Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  19. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  20. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  21. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  22. Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

    cs.CL 2026-05 unverdicted novelty 6.0

    FineStep adds step-level process rewards and credit assignment to tool-augmented Text-to-SQL, achieving 3.25% higher execution accuracy than GRPO on BIRD while cutting redundant tool calls.

  23. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  24. GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

  25. Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

    cs.LG 2026-05 conditional novelty 6.0

    DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

  26. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  27. V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

  28. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG 2026-04 unverdicted novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

  29. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

    cs.CL 2026-04 unverdicted novelty 6.0

    PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...

  30. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  31. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  32. AgentV-RL: Scaling Reward Modeling with Agentic Verifier

    cs.CL 2026-04 unverdicted novelty 6.0

    AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.

  33. Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    IPVRM learns prefix values to produce reliable step rewards from sequence outcomes using TD learning, enabling distribution-level RL that improves reasoning when paired with calibrated rewards.

  34. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  35. OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

    cs.AI 2026-04 unverdicted novelty 6.0

    OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

  36. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  37. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    cs.LG 2025-07 unverdicted novelty 6.0

    RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

  38. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 5.0

    SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.

  39. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  40. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  41. TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

    eess.SP 2026-04 unverdicted novelty 5.0

    TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

  42. SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    cs.LG 2026-04 unverdicted novelty 5.0

    SCOPE routes LLM on-policy rollouts by correctness into teacher-perplexity-weighted KL for errors and student-perplexity-weighted MLE for successes, with group normalization, yielding 11.42% relative Avg@32 gain on re...

  43. SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

  44. A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Covariance-based entropy control selectively regularizes high-covariance tokens in softmax policies and achieves asymptotic unbiasedness upon annealing, unlike traditional regularization which introduces dense bias an...

  45. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  46. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  47. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

133 extracted references · 133 canonical work pages · cited by 44 Pith papers · 24 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [5]

    arXiv preprint arXiv:2310.12036 , year=

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and R \'e mi Munos. A general theoretical paradigm to understand learning from human preferences. International Conference on Artificial Intelligence and Statistics, abs/2310.12036, 2024

  5. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  6. [9]

    Ultrafeedback: Boosting language models with scaled ai feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback. In ICML, 2024

  7. [10]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  8. [11]

    Kto: Model alignment as prospect theoretic optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. ICML, 2024

  9. [12]

    arXiv preprint arXiv:2410.15115 , year=

    Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weiling Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning. ArXiv, abs/2410.15115, 2024

  10. [13]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2022

  11. [15]

    Reinforcement learning with deep energy-based policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research, pp.\ 1352--1361. ...

  12. [22]

    Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019. URL https://api.semanticscholar.org/CorpusID:198489118

  13. [23]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hanna H...

  14. [25]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022

  15. [26]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13: 0 9, 2024

  16. [29]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. ArXiv, abs/2305.20050, 2023

  17. [30]

    Rho-1: Not all tokens are what you need, 2024

    Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. Rho-1: Not all tokens are what you need, 2024

  18. [32]

    Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...

  19. [33]

    Ng, Daishi Harada, and Stuart Russell

    Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Ivan Bratko and Saso Dzeroski (eds.), Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999 , pp.\ 278--287. Morgan Kaufmann, 1999

  20. [35]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

  21. [36]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2023

  22. [37]

    Fromr to Q∗: Your language model is secretly a Q-function,

    Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to q^ * : Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024

  23. [38]

    arXiv preprint arXiv:2404.03715 , year=

    Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. ArXiv, abs/2404.03715, 2024

  24. [39]

    Jordan, and Pieter Abbeel

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings , 2016

  25. [42]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  26. [43]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

  27. [44]

    reasoning-0.01, 2024

    SkunkworksAI. reasoning-0.01, 2024

  28. [45]

    The bitter lesson

    Richard Sutton. The bitter lesson. Incomplete Ideas (blog), 13 0 (1): 0 38, 2019

  29. [46]

    Learning to predict by the methods of temporal differences

    Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3: 0 9--44, 1988

  30. [47]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

  31. [49]

    Qwq: Reflect deeply on the boundaries of the unknown, November 2024

    Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/

  32. [50]

    Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data

    Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. arXiv preprint arXiv:2410.01560, 2024

  33. [52]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Y.Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. ArXiv, abs/2312.08935, 2023

  34. [53]

    Magicoder: Empowering code generation with oss-instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning, 2024

  35. [54]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 0 229--256, 1992

  36. [56]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024 b . URL https://arxiv.org/abs/2409.12122

  37. [57]

    Advancing llm reasoning generalists with preference trees

    Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing llm reasoning generalists with preference trees. ArXiv, 2024 a

  38. [58]

    Free process rewards without process labels.arXiv preprint arXiv:2412.01981, 2024

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels, 2024 b . URL https://arxiv.org/abs/2412.01981

  39. [60]

    Mammoth2: Scaling instructions from the web

    Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the web. ArXiv, abs/2405.03548, 2024

  40. [61]

    Ultramedical: Building specialized generalists in biomedicine, 2024

    Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, and Bowen Zhou. Ultramedical: Building specialized generalists in biomedicine, 2024

  41. [64]

    Ziebart, Andrew L

    Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Dieter Fox and Carla P. Gomes (eds.), Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008 , pp.\ 1433--1438. AAAI Press, 2008. URL http://www.aaai.org/Library/...

  42. [65]

    Ziebart and Andrew L

    Brian D. Ziebart and Andrew L. Maas and J. Andrew Bagnell and Anind K. Dey , editor =. Maximum Entropy Inverse Reinforcement Learning , booktitle =. 2008 , url =

  43. [66]

    Reinforcement Learning with Deep Energy-Based Policies , booktitle =

    Tuomas Haarnoja and Haoran Tang and Pieter Abbeel and Sergey Levine , editor =. Reinforcement Learning with Deep Energy-Based Policies , booktitle =. 2017 , url =

  44. [67]

    Ng and Daishi Harada and Stuart Russell , editor =

    Andrew Y. Ng and Daishi Harada and Stuart Russell , editor =. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , booktitle =

  45. [68]

    ArXiv , year=

    MAmmoTH2: Scaling Instructions from the Web , author=. ArXiv , year=

  46. [69]

    Scalable agent alignment via reward modeling: a research direction

    Scalable agent alignment via reward modeling: a research direction , author=. arXiv preprint arXiv:1811.07871 , year=

  47. [70]

    Advances in neural information processing systems , volume=

    Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=

  48. [71]

    ICML , year=

    ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback , author=. ICML , year=

  49. [72]

    International Conference on Artificial Intelligence and Statistics , year=

    A General Theoretical Paradigm to Understand Learning from Human Preferences , author=. International Conference on Artificial Intelligence and Statistics , year=

  50. [73]

    ICML , year=

    KTO: Model Alignment as Prospect Theoretic Optimization , author=. ICML , year=

  51. [74]

    ArXiv , year=

    Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , author=. ArXiv , year=

  52. [75]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi k1. 5: Scaling Reinforcement Learning with LLMs , author=. arXiv preprint arXiv:2501.12599 , year=

  53. [76]

    ArXiv , year=

    Advancing LLM Reasoning Generalists with Preference Trees , author=. ArXiv , year=

  54. [77]

    OpenAI o1 System Card

    Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

  55. [78]

    Advances in neural information processing systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

  56. [79]

    arXiv preprint arXiv:2406.09760 , year=

    Bootstrapping Language Models with DPO Implicit Rewards , author=. arXiv preprint arXiv:2406.09760 , year=

  57. [80]

    From r to Q^

    Rafailov, Rafael and Hejna, Joey and Park, Ryan and Finn, Chelsea , journal=. From r to Q^

  58. [81]

    arXiv preprint arXiv:2405.19262 , year=

    Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models , author=. arXiv preprint arXiv:2405.19262 , year=

  59. [82]

    arXiv preprint arXiv:2410.01679 , year=

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment , author=. arXiv preprint arXiv:2410.01679 , year=

  60. [83]

    ArXiv , year=

    Let's Verify Step by Step , author=. ArXiv , year=

  61. [84]

    ArXiv , year=

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations , author=. ArXiv , year=

  62. [85]

    International Conference on Machine Learning , year=

    Scaling Laws for Reward Model Overoptimization , author=. International Conference on Machine Learning , year=

  63. [86]

    ArXiv , year=

    Magicoder: Source Code Is All You Need , author=. ArXiv , year=

  64. [87]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling

    AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling , author=. arXiv preprint arXiv:2412.15084 , year=

  65. [88]

    & Kumar, A

    Rewarding progress: Scaling automated process verifiers for llm reasoning , author=. arXiv preprint arXiv:2410.08146 , year=

  66. [89]

    2018 , publisher=

    Reinforcement learning: An introduction , author=. 2018 , publisher=

  67. [90]

    Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Chris Wilhelm and Luca Soldaini and Noah A

    Nathan Lambert and Jacob Daniel Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James Validad Miranda and Alisa Liu and Nouha Dziri and Xinxi Lyu and Yuling Gu and Saumya Malik and Victoria Graf and Jena D. Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Chris Wilhelm and Luca Soldaini and...

  68. [91]

    2025 , eprint=

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. 2025 , eprint=

  69. [92]

    DeepRLStructPred@ICLR , year=

    Buy 4 REINFORCE Samples, Get a Baseline for Free! , author=. DeepRLStructPred@ICLR , year=

  70. [93]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  71. [94]

    Machine learning , volume=

    Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , volume=. 1992 , publisher=

  72. [95]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms , author=. arXiv preprint arXiv:2402.14740 , year=

  73. [96]

    Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , title =. CoRR , volume =. 2024 , url =

  74. [97]

    Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023

    ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models , author =. arXiv preprint arXiv:2310.10505 , year =

  75. [98]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models , author=. arXiv preprint arXiv:2406.13542 , year=

  76. [99]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

  77. [100]

    arxiv preprint arXiv:2404.02823 , year=

    Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models , author=. arxiv preprint arXiv:2404.02823 , year=

  78. [101]

    2024 , eprint=

    GPT-4 Technical Report , author=. 2024 , eprint=

  79. [102]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  80. [103]

    2024 , eprint=

    Large Language Model Instruction Following: A Survey of Progresses and Challenges , author=. 2024 , eprint=

Showing first 80 references.