The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

Aakriti Agrawal; Amrit Singh Bedi; Armin Saghafian; C. Bayan Bruss; Furong Huang; Nam H Nguyen; Nihal Sharma; Rizal Fathony; Souradip Chakraborty

arxiv: 2606.09078 · v1 · pith:QBDVZZSTnew · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Process Reward Models · PRISM · Contrastive Learning · False Positives · Step-level Feedback · Reasoning · Policy Optimization

The pith

PRISM corrects hidden bias in process reward models by shifting from pointwise labels to contrastive step comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Process Reward Models assign step-level credit during reasoning but inherit a bias from severe imbalance in their training data. Standard cross-entropy loss amplifies this imbalance, causing the models to assign high rewards to plausible yet incorrect steps. These false positives produce asymmetric downstream damage: they actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning paths, while false negatives mainly slow exploration. The paper therefore advocates replacing pointwise label fitting with reliable relative comparisons. PRISM implements this shift through policy-aware contrastive training on hard negatives generated by a temporal lookahead strategy, together with a difficulty-aware curriculum, all without requiring new human annotations.
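
The asymmetry between false positives and false negatives can be illustrated with a toy model. This is an editorial sketch, not the paper's proof. It assumes each of N sampled candidates is correct independently with probability `p`, a binary verifier accepts an incorrect candidate with probability `alpha` (false-positive rate) and rejects a correct one with probability `beta` (false-negative rate), and Best-of-N picks uniformly among accepted candidates, falling back to a uniformly random candidate when none is accepted. The function names and the selection rule are hypothetical.

```python
def bon_accuracy(n, p, alpha, beta):
    """Expected Best-of-N accuracy under a toy binary-verifier model.

    n: number of sampled candidates
    p: probability that a single candidate is correct
    alpha: false-positive rate (incorrect candidate accepted)
    beta: false-negative rate (correct candidate rejected)
    """
    accept_correct = p * (1.0 - beta)      # candidate is correct and accepted
    accept_wrong = (1.0 - p) * alpha       # candidate is incorrect but accepted
    q = accept_correct + accept_wrong      # probability a candidate is accepted
    none_accepted = (1.0 - q) ** n
    # If at least one candidate is accepted, the pick is uniform among accepted ones.
    acc_if_accepted = accept_correct / q if q > 0 else 0.0
    # If none is accepted, the pick is uniform among the (all rejected) candidates.
    acc_if_rejected = (p * beta) / (1.0 - q) if q < 1 else 0.0
    return (1.0 - none_accepted) * acc_if_accepted + none_accepted * acc_if_rejected


def bon_ceiling(p, alpha, beta):
    """Limit of bon_accuracy as n grows, assuming correct candidates can be accepted."""
    accept_correct = p * (1.0 - beta)
    return accept_correct / (accept_correct + (1.0 - p) * alpha)
```

In this toy setting, `alpha = 0` lets accuracy approach 1 for any `beta < 1`, only more slowly as `beta` grows, whereas any `alpha > 0` caps accuracy at `p(1 - beta) / (p(1 - beta) + (1 - p) alpha)` no matter how many candidates are sampled.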

Core claim

Standard cross-entropy training on imbalanced step-level data causes PRMs to overcredit plausible but incorrect steps and produce high false-positive rates; these false positives steer downstream search and optimization toward flawed reasoning. PRISM addresses the bias by learning from contrastive step-level comparisons and hard negatives generated by temporal lookahead, using a difficulty-aware curriculum to set the contrastive margin, and requires no additional human labels.
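
A small numeric sketch of why pointwise fitting can tolerate overcrediting under imbalance. It assumes a step dataset in which 73.1% of labels are positive (the PRM800K share quoted in the Figure 2 caption) and a degenerate scorer that gives every step the same value. The pairwise loss is written in a Bradley-Terry style with an optional margin; that form and all function names are illustrative assumptions, not the paper's exact objective.

```python
import math


def bce(prob, label):
    """Pointwise binary cross-entropy for one step."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def pairwise_loss(score_pos, score_neg, margin=0.0):
    """Bradley-Terry-style loss on a (correct, incorrect) step pair.

    The margin term is an illustrative assumption.
    """
    return math.log1p(math.exp(-(score_pos - score_neg - margin)))


def mean_bce_constant_scorer(prob, positive_fraction):
    """Average BCE when every step receives the same probability."""
    return positive_fraction * bce(prob, 1) + (1 - positive_fraction) * bce(prob, 0)


positive_fraction = 0.731
overcrediting = mean_bce_constant_scorer(0.731, positive_fraction)  # always "likely correct"
uninformed = mean_bce_constant_scorer(0.5, positive_fraction)       # always "unsure"
tied_pair = pairwise_loss(score_pos=1.0, score_neg=1.0)             # same score for both steps
separated_pair = pairwise_loss(score_pos=1.0, score_neg=-1.0)       # correct step ranked higher
```

Under these assumptions the constant "likely correct" scorer achieves a lower average BCE than the uninformed one even though it credits every incorrect step, while the pairwise loss stays at log 2 for any tied pair and only drops when the correct step is actually ranked above the incorrect one.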

What carries the argument

PRISM framework: policy-aware contrastive training that generates hard negative steps via temporal lookahead and optimizes a difficulty-aware contrastive margin.
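
For orientation, one plausible way to write the two objectives being contrasted. The notation is assumed for illustration: $r_\theta$ is a scalar step reward, $y_{<t}$ the reasoning prefix, $z_t \in \{0, 1\}$ a pointwise step label, $y_t^{+}$ a correct step, $y_t^{-}$ a hard negative for the same prefix, and $m_k \ge 0$ a margin for curriculum round $k$. The paper's exact loss may differ from this sketch.

```latex
% Pointwise fitting (standard cross-entropy on step labels)
\mathcal{L}_{\mathrm{CE}}(\theta)
  = -\,\mathbb{E}\Big[ z_t \log \sigma\big(r_\theta(x, y_{<t}, y_t)\big)
      + (1 - z_t) \log\big(1 - \sigma\big(r_\theta(x, y_{<t}, y_t)\big)\big) \Big]

% Relative comparison (Bradley-Terry style with an assumed margin m_k)
\mathcal{L}_{\mathrm{pair}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_{<t},\, y_t^{+},\, y_t^{-}) \sim \mathcal{D}}
      \Big[ \log \sigma\big( r_\theta(x, y_{<t}, y_t^{+})
        - r_\theta(x, y_{<t}, y_t^{-}) - m_k \big) \Big]
```

The pairwise form depends only on the score difference within a matched pair, so a uniform upward shift of all scores, the kind of shift label imbalance encourages, does not reduce it.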

If this is right

PRISM reduces false positives by 22 percent on PRMBench while raising macro F1 over strong discriminative baselines.
The same models improve accuracy by up to 22 percent when used for guided decoding.
The same models improve accuracy by up to 33 percent when used for Best-of-N selection.
The approach yields more robust performance across policy optimization and search tasks.
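
A minimal sketch of how a PRM enters Best-of-N selection, and how one overcredited step can flip the choice. The aggregation rule (minimum step score), the function names, and the per-step scores are hypothetical and chosen only for illustration.

```python
def aggregate_min(step_scores):
    """Score a trajectory by its weakest step (an assumed aggregation rule)."""
    return min(step_scores)


def best_of_n(candidates, score_steps, aggregate=aggregate_min):
    """Return the candidate whose aggregated step scores are highest.

    candidates: list of trajectories, each a list of step strings
    score_steps: callable mapping a trajectory to a list of per-step scores
    """
    return max(candidates, key=lambda traj: aggregate(score_steps(traj)))


# Two candidate solutions to 2x + 6 = 10; the second has a plausible but wrong step.
candidates = [
    ["2x + 6 = 10", "2x = 4", "x = 2"],
    ["2x + 6 = 10", "2x = 16", "x = 8"],
]

# Hypothetical step scores from a PRM that overcredits plausible wrong steps ...
overcrediting_prm = {"2x + 6 = 10": 0.95, "2x = 4": 0.90, "x = 2": 0.90,
                     "2x = 16": 0.93, "x = 8": 0.94}
# ... and from a PRM that ranks the wrong steps below the correct ones.
calibrated_prm = {"2x + 6 = 10": 0.95, "2x = 4": 0.90, "x = 2": 0.90,
                  "2x = 16": 0.20, "x = 8": 0.30}

pick_overcrediting = best_of_n(candidates, lambda t: [overcrediting_prm[s] for s in t])
pick_calibrated = best_of_n(candidates, lambda t: [calibrated_prm[s] for s in t])
```

With the overcrediting scores the flawed trajectory wins (minimum 0.93 versus 0.90); with the calibrated scores the correct one is selected.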

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Process supervision may need to prioritize the avoidance of false positives over the maximization of true positives.
The contrastive formulation could be tested on other sequential tasks where step-level labels are naturally imbalanced.
Because no new annotations are required, the method offers a route to improve supervision quality at scale using only existing trajectories.

Load-bearing premise

Hard negatives produced by the temporal lookahead strategy are sufficiently informative and unbiased to support effective contrastive training from existing data alone.

What would settle it

If PRISM applied to the same base models and data yields no reduction in false-positive rate on PRMBench relative to standard cross-entropy PRMs, the central claim would be falsified.
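
The check described above can be sketched as follows. The threshold of 0.5, the function names, and the scores are assumptions made for illustration; PRMBench's own metric definitions may differ, and the summary does not say whether the reported 22% is a relative or an absolute reduction.

```python
def false_positive_rate(scores, labels, threshold=0.5):
    """Share of incorrect steps (label 0) that the PRM scores at or above threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("no incorrect steps to evaluate")
    return sum(s >= threshold for s in negatives) / len(negatives)


def relative_reduction(baseline_fpr, new_fpr):
    """Relative drop in FPR (baseline_fpr must be > 0); values <= 0 mean no reduction."""
    return (baseline_fpr - new_fpr) / baseline_fpr


# Hypothetical step labels (1 = correct, 0 = incorrect) and scores from two PRMs.
labels = [1, 1, 0, 0, 1, 0, 0, 1]
baseline_scores = [0.9, 0.8, 0.8, 0.7, 0.6, 0.3, 0.6, 0.9]
contrastive_scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.2, 0.3, 0.9]

baseline_fpr = false_positive_rate(baseline_scores, labels)
contrastive_fpr = false_positive_rate(contrastive_scores, labels)
reduction = relative_reduction(baseline_fpr, contrastive_fpr)
```

A `reduction` at or below zero on matched base models and data would correspond to the falsifying outcome described above.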

Figures

Figures reproduced from arXiv: 2606.09078 by Aakriti Agrawal, Amrit Singh Bedi, Armin Saghafian, C. Bayan Bruss, Furong Huang, Nam H Nguyen, Nihal Sharma, Rizal Fathony, Souradip Chakraborty.

**Figure 2.** Step-level labels are substantially more skewed toward positives than trajectory-level labels in PRM800K. Data-Imbalance Problem. We observed a key label imbalance in commonly available open-source PRM training datasets such as PRM800K (Lightman et al., 2023). PRM800K contains a much higher proportion of correct steps (73.1%) than incorrect steps, even though only a small fraction of full trajectories …

**Figure 3.** Effect of FP vs. FN on BoN accuracy. FP induces a strict performance ceiling, while FN only slows the rate of convergence. Thus, FN merely delays convergence, whereas FP fundamentally caps final performance under BoN selection. Therefore, the objective is to prioritize minimizing α (reducing FPs) to lift the ceiling on alignment. A detailed proof is provided in appendix section 9.1. Key Insight: Reducing Fals…

**Figure 4.** Pairwise classification accuracy of Pointwise versus Pairwise models. Why Step-Contrastive Loss Is Better Aligned Than Step-BCE For Policy Learning: To show this, we conducted a simulation by training two identical network architectures, one using a CE loss optimized to predict absolute binary labels and one using a BT loss optimized directly on pairwise preferences, on the same matched dataset D = {(x, y_<t, y_t^pos, y …

**Figure 5.** Curriculum and threshold analyses for PRISM. Left: Later curriculum rounds reduce false positives while …

**Figure 6.** Comparison of Best-of-N and guided beam-search alignment on MATH-500, AIME24, and LiveCodeBench. Top: Guided beam-search results on AIME24 (left), MATH-500 (middle), and LiveCodeBench (right). Across all settings, PRISM (blue) consistently outperforms the baseline (orange), highlighting its effectiveness in guided beam-search decoding, where the PRM plays a critical role. Bottom: Best-of-N results on MATH-…

**Figure 7.** Reward distributions for negative labels across different QwenPRM variants. Green denotes the baseline …

**Figure 8.** OOD guided beam-search results on MATH-500. PRISM (blue) substantially outperforms the baseline …

**Figure 9.** Error plot comparing best-of-N alignment on MATH-500 using two generator policies: LLaMA for the OOD policy (left) and Qwen for the ID policy (right). The baseline PRM is shown in orange and PRISM in blue. Across both settings, PRISM delivers a clear performance boost over the baseline, with the improvement being especially pronounced for the OOD policy.

Original abstract

Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PRISM flags a genuine imbalance bias in PRMs and shows a workable contrastive fix, but the gains rest on unproven assumptions about the lookahead negatives.

PRISM points out that standard cross-entropy PRM training amplifies step-level imbalance, leading to over-crediting of wrong but plausible steps. The asymmetric harm is the useful observation: false positives actively derail Best-of-N and guided decoding, while false negatives mostly just slow things down. They respond with a contrastive framework that pulls in hard negatives via temporal lookahead on existing trajectories and adds a difficulty-aware margin curriculum. No new human labels are needed.

The approach is new in its specific combination of policy-aware contrastive pairs and the lookahead generation step. The reported drops in false positives (22% on PRMBench) and lifts in downstream accuracy (up to 33% on Best-of-N) are the concrete results.

The soft spot is exactly the one the stress-test note flags. The paper needs to show that the lookahead negatives are not carrying forward the same bias or introducing fresh artifacts; without ablations or diagnostics on that point, it is hard to know whether the contrastive objective is doing the work or whether the gains are partly from other factors. The abstract numbers are given without protocol details or significance tests, so the strength of the evidence is still moderate.

This is for groups already training or using process reward models on math or code reasoning. It is worth a serious referee because the problem is real, the proposed change is cheap to try, and the claims are falsifiable even if the current write-up leaves some verification steps for the reviewers.

Referee Report

3 major / 2 minor

Summary. The paper identifies a hidden bias in Process Reward Models (PRMs) arising from severe step-level training data imbalance, which standard cross-entropy training amplifies into high false-positive rates that asymmetrically harm downstream tasks such as Best-of-N selection and guided decoding. It proposes PRISM, a policy-aware framework that replaces pointwise fitting with contrastive step-level comparisons using hard negatives generated via a temporal lookahead strategy (no new human labels) and a difficulty-aware curriculum for the contrastive margin. Experiments on PRMBench and ProcessBench report a 22% false-positive reduction and macro-F1 gains over strong discriminative PRMs; downstream applications show accuracy improvements up to 22% (guided decoding) and 33% (Best-of-N).

Significance. If the central claims hold, the work usefully reframes PRM training around reliable relative comparisons rather than absolute label fitting and demonstrates that bias mitigation can be achieved from existing data alone. The explicit separation of false-positive versus false-negative downstream effects and the policy-aware contrastive objective are substantive contributions to process supervision. The no-new-labels design and curriculum are practical strengths.

major comments (3)

[§3] §3 (PRISM framework) and the temporal-lookahead description: the central claim that lookahead-generated negatives are sufficiently unbiased and informative to mitigate the original step-imbalance bias lacks any derivation, correlation diagnostic, or ablation showing that the generated negatives are uncorrelated with the false-positive bias identified in §2. Without this, the reported 22% false-positive reduction cannot be attributed to the contrastive objective alone.
[§4, §5] Experimental protocol (throughout §4 and §5): the abstract and results sections state quantitative gains (22% FP reduction, 22–33% downstream accuracy) but supply no description of base policies, data splits, statistical tests, variance across seeds, or exact baseline implementations. This prevents assessment of whether the improvements support the load-bearing claim that PRISM reliably reduces the identified bias.
[§3.3] Difficulty-aware curriculum (Eq. for contrastive margin): the paper states that the curriculum optimizes the contrastive step margin, yet provides no ablation isolating its contribution versus a fixed margin or versus standard contrastive loss; if the gains collapse without the curriculum, the framework's necessity is overstated.

minor comments (2)

[§3] Notation for the contrastive loss and margin should be unified across equations and text; currently the margin symbol appears inconsistently.
[§4] Figure captions for PRMBench results should explicitly list the exact baselines and whether they are re-implemented or taken from prior work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.

Referee: [§3] §3 (PRISM framework) and the temporal-lookahead description: the central claim that lookahead-generated negatives are sufficiently unbiased and informative to mitigate the original step-imbalance bias lacks any derivation, correlation diagnostic, or ablation showing that the generated negatives are uncorrelated with the false-positive bias identified in §2. Without this, the reported 22% false-positive reduction cannot be attributed to the contrastive objective alone.

Authors: We acknowledge that additional analysis would strengthen attribution of the false-positive reduction to the contrastive objective. The temporal lookahead is designed to generate policy-aware hard negatives from existing trajectories. In the revised manuscript we will add a correlation diagnostic between the lookahead negatives and the step-imbalance bias identified in §2, together with an ablation that isolates the contrastive loss from the original pointwise training. revision: yes
Referee: [§4, §5] Experimental protocol (throughout §4 and §5): the abstract and results sections state quantitative gains (22% FP reduction, 22–33% downstream accuracy) but supply no description of base policies, data splits, statistical tests, variance across seeds, or exact baseline implementations. This prevents assessment of whether the improvements support the load-bearing claim that PRISM reliably reduces the identified bias.

Authors: We agree that the experimental protocol section requires substantially more detail. The revised manuscript will include explicit descriptions of the base policies, data splits, statistical tests, variance across random seeds, and precise baseline implementations to support reproducibility and evaluation of the reported gains. revision: yes
Referee: [§3.3] Difficulty-aware curriculum (Eq. for contrastive margin): the paper states that the curriculum optimizes the contrastive step margin, yet provides no ablation isolating its contribution versus a fixed margin or versus standard contrastive loss; if the gains collapse without the curriculum, the framework's necessity is overstated.

Authors: We will add an ablation study in the revised version that compares the difficulty-aware curriculum against both a fixed-margin contrastive loss and a standard contrastive loss without curriculum, thereby isolating the curriculum's contribution to the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper identifies an empirical bias in standard PRM training and introduces PRISM as a contrastive framework that generates hard negatives via temporal lookahead on existing data, without new labels. Reported gains (false-positive reduction, F1 improvements, downstream accuracy lifts) are measured on external benchmarks (PRMBench, ProcessBench) and task settings (guided decoding, Best-of-N). No equation or claim reduces a prediction to a fitted input by construction, no load-bearing self-citation chain is invoked, and the contrastive signals are derived externally to the final metrics. The central claims rest on experimental validation rather than tautological re-derivation of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Abstract-only access limits visibility into exact parameter counts and background assumptions; the main unverified premises concern the quality of lookahead-generated negatives and the causal role of data imbalance.

free parameters (1)

contrastive step margin
Explicitly optimized via the difficulty-aware curriculum described in the abstract.

axioms (2)

Domain assumption: Severe imbalance in step-level training data is the root cause of the false-positive bias under standard cross-entropy training.
This premise is stated as the starting point for the entire analysis.
Domain assumption: Temporal lookahead on existing trajectories produces hard negatives that are representative and free of new systematic biases.
Required for the claim that no new human labels are needed.

pith-pipeline@v0.9.1-grok · 5827 in / 1424 out tokens · 28861 ms · 2026-06-27T17:05:37.828183+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 18 linked inside Pith

[1] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards …
[2] Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.
[3] Deep learning. 2016.
[4] Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
[6] Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162.
[7] PRMBench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint arXiv:2501.03124.
[8] The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301.
[9] Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935.
[10] Deception in LLMs: Self-preservation and autonomous goals in large language models. arXiv preprint arXiv:2501.16513.
[11] RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? arXiv preprint arXiv:2509.21016.
[12] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models. arXiv preprint arXiv:2512.07783.
[13] Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
[14] Openai o1 system card. arXiv preprint arXiv:2412.16720.
[15] Adversarial training of reward models. arXiv preprint arXiv:2504.06141.
[16] Principled data selection for alignment: The hidden risks of difficult examples. arXiv preprint arXiv:2502.09650.
[17] Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.
[18] From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419.
[19] Reward-robust rlhf in llms. arXiv preprint arXiv:2409.15360.
[20] Bayesian reward models for LLM alignment. arXiv preprint arXiv:2402.13210.
[21] Rrm: Robust reward model training mitigates reward hacking. arXiv preprint arXiv:2409.13156.
[22] Scaling test-time compute with open models.
[23] Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems.
[24] Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
[25] Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
[26] Defining and characterizing reward gaming. Advances in Neural Information Processing Systems.
[27] Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.
[28] Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
[29] Towards robust alignment of language models: Distributionally robustifying direct preference optimization. arXiv preprint arXiv:2407.07880.
[30] Bengio, Y., Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. Journal of the American Podiatry Association.
[31] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design. arXiv preprint arXiv:2506.04734.
[32] Genprm: Scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891.
[33] Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[34] Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
[35] Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592.
[36] Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837.
[37] Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[38] Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.
[39] Process reward models that think. arXiv preprint arXiv:2504.16828.
[40] Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems.
[41] Alphamath almost zero: process supervision without process. Advances in Neural Information Processing Systems.
[42] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
[43] Processbench: Identifying process errors in mathematical reasoning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[44] Evaluating mathematical reasoning beyond accuracy. Proceedings of the AAAI Conference on Artificial Intelligence.
[45] Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
[46] Let's verify step by step. The Twelfth International Conference on Learning Representations.
