VeriGate: Verifier-Gated Step-Level Supervision for GRPO

Aakriti Agrawal; Furong Huang; Minghui Liu

arxiv: 2605.30451 · v1 · pith:UKG6RHDKnew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 08:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords VeriGate · GRPO · process reward model · step-level supervision · reasoning models · reward hacking · token-level advantages · verifier rewards

The pith

VeriGate keeps GRPO gradients informative by switching to process supervision only when every sampled trajectory receives the same verifier reward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VeriGate to fix two problems in Group Relative Policy Optimization (GRPO) for training reasoning models: when all sampled answers get the same verifier reward, the advantage signal vanishes, and outcome rewards give no credit for individual reasoning steps. VeriGate uses the verifier reward whenever it distinguishes between trajectories and falls back to a process reward model (PRM) only in degenerate cases. It turns PRM step scores into future-cumulated rewards and normalizes them into token-level advantages within each group. This design is intended to deliver dense supervision while limiting exposure to reward hacking. A reader would care because the method reportedly raises average accuracy by about 20 percent on a 1.5-billion-parameter model and about 12 percent on a 7-billion-parameter model, and reduces zero-gradient training stalls.
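The stall is easy to reproduce with the standard group-relative advantage, in which each trajectory's reward is centered and scaled by its group's statistics. The sketch below is illustrative only: the function name, the `eps` constant, and the toy rewards are assumptions and are not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale each trajectory reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]

# A group where the verifier separates trajectories: advantages are nonzero.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A degenerate group (every answer wrong): every advantage is exactly zero,
# so this prompt contributes no policy-gradient signal.
degenerate = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(degenerate)
```

A group in which every answer is correct collapses in the same way, since the numerator is zero for every trajectory.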

Core claim

VeriGate keeps the verifier in charge whenever verifier rewards induce a meaningful preference among sampled trajectories, and uses process supervision only when verifier rewards are degenerate; it converts process reward model step scores into future-cumulated rewards to assign continuation-aware credit and transforms these rewards into group-normalized token-level advantages, restoring informative gradients and fine-grained credit assignment while remaining less susceptible to reward hacking than methods that optimize aggregated PRM scores.

What carries the argument

The verifier-gated switch that invokes process rewards only on degenerate verifier cases, together with future-cumulated reward conversion and token-level group normalization.
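As a minimal sketch of the gate alone: the paper's Figure 1 caption describes activating the PRM only for all-zero verifier groups, while the summary above speaks of degenerate groups more generally. The `only_all_zero` flag below exists to show both readings side by side; it and every function name here are hypothetical, not the authors' code.

```python
from typing import Sequence

def use_prm(verifier_rewards: Sequence[float], only_all_zero: bool = True) -> bool:
    """Return True when the group should fall back to process supervision."""
    if only_all_zero:
        # Reading 1: the gate opens only when every trajectory failed verification.
        return all(r == 0 for r in verifier_rewards)
    # Reading 2: the gate opens whenever the verifier cannot rank the trajectories.
    return len(set(verifier_rewards)) <= 1

def supervision_source(verifier_rewards: Sequence[float], only_all_zero: bool = True) -> str:
    """Name the signal that would drive the update for this group."""
    return "prm" if use_prm(verifier_rewards, only_all_zero) else "verifier"

print(supervision_source([0, 0, 0, 0]))                       # all wrong
print(supervision_source([1, 0, 1, 0]))                       # verifier ranks them
print(supervision_source([1, 1, 1, 1]))                       # all right, reading 1
print(supervision_source([1, 1, 1, 1], only_all_zero=False))  # all right, reading 2
```

The two readings differ only on all-correct groups, which is where a reader would want the paper to be explicit.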

If this is right

Training 1.5B and 7B models on MATH yields average accuracy gains of about 20% and 12%, respectively, across six reasoning benchmarks.
Zero-gradient failures are substantially reduced compared with outcome-only GRPO.
Reward-hacking behavior decreases relative to PRM-as-outcome baselines.
Reasoning quality improves while keeping supervision tied to the verifier whenever possible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gating rule could be tested on other policy optimization algorithms that suffer from sparse advantages.
Future experiments might measure whether future-cumulated rewards improve performance on problems requiring longer reasoning chains.
Adopting this hybrid approach may reduce the need for fully annotated step-level datasets in large-scale training.

Load-bearing premise

Invoking the process reward model only on degenerate verifier cases introduces no new reward-hacking surfaces or systematic bias in the advantages.

What would settle it

Running the same training setup but applying process rewards on all cases, including non-degenerate verifier outcomes, and checking whether accuracy gains disappear or reward hacking increases.

Figures

Figures reproduced from arXiv: 2605.30451 by Aakriti Agrawal, Furong Huang, Minghui Liu.

**Figure 1.** Overview of VeriGate. Left: VeriGate integrates process supervision into GRPO through three design choices: S1 gate PRM supervision by verifier informativeness, using standard GRPO when verifier rewards induce a preference and activating the PRM only for all-zero verifier groups; S2 use future-cumulated rewards for continuation-aware credit assignment; and S3 convert PRM feedback into group-normalized toke…

**Figure 2.** Distribution of PRM product scores by correctness.

**Figure 3.** (a): Comparison of verifier and PRM rewards over training steps to highlight reward hacking.

**Figure 4.** Qualitative comparison of hacked and unhacked responses for the same prompt.

**Figure 5.** Word cloud of the top 80 words most associated with reward-hacked traces.

**Figure 6.** Distribution of trace lengths for hacked and unhacked responses.

**Figure 7.** Accuracy across datasets for different outcome-reward weights.

Original abstract

Group Relative Policy Optimization (GRPO) is an effective recipe for training reasoning models with verifier-based outcome rewards, but its supervision is sparse: when all sampled trajectories for a prompt receive the same verifier reward, the group-relative advantage collapses to zero and learning stalls. Outcome-only rewards also provide no step-level credit assignment, limiting exploration and making it harder to learn robust reasoning. We present VeriGate (Verifier-Gated Step-Level GRPO), a verifier-gated extension of GRPO that addresses these limitations with three design choices. First, VeriGate keeps the verifier in charge whenever verifier rewards induce a meaningful preference among sampled trajectories, and uses process supervision only when verifier rewards are degenerate. Second, instead of collapsing Process Reward Model (PRM) step scores into a single trajectory reward, VeriGate converts them into future-cumulated rewards to assign continuation-aware credit. Third, VeriGate transforms these rewards into group-normalized token-level advantages, restoring informative gradients and fine-grained credit assignment while remaining less susceptible to reward hacking than methods that optimize aggregated PRM scores. Empirically, training on MATH with 1.5B and 7B Qwen2.5-Instruct models and evaluating on six reasoning benchmarks, VeriGate improves average accuracy by about 20% and 12% for 1.5B and 7B models respectively, substantially reduces zero-gradient failures, decreases reward-hacking behavior, and improves reasoning quality relative to outcome-only GRPO and PRM-as-outcome baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VeriGate gates PRM use behind degenerate verifier cases in GRPO and reports accuracy lifts, but the abstract leaves the no-new-bias claim untested.

The main takeaway is that VeriGate keeps the verifier in charge unless every trajectory in a group gets the identical outcome reward, then falls back to a process reward model whose step scores are turned into future-cumulated rewards and group-normalized into token-level advantages.

The new pieces are the explicit gating rule and the cumulation step that supplies continuation-aware credit instead of collapsing PRM scores into one trajectory reward. The paper states the zero-advantage stall and the lack of step credit clearly, then gives three design choices that directly target both problems. The reported average accuracy gains of about 20 percent on the 1.5B model and 12 percent on the 7B model after MATH training, plus fewer zero-gradient failures, are the concrete results offered.
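To make the contrast concrete, here is a small sketch of the two treatments of PRM step scores. The abstract does not give the cumulation formula, so the suffix sum with an optional discount `gamma` is an assumption. The product aggregate used for the collapsed baseline is also an assumption, loosely suggested by the Figure 2 caption, which mentions PRM product scores.

```python
import math
from typing import List, Sequence

def collapsed_reward(step_scores: Sequence[float]) -> float:
    """PRM-as-outcome style: one scalar for the whole trajectory (product here)."""
    return math.prod(step_scores)

def future_cumulated_rewards(step_scores: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Reward for step t = score_t + gamma * (reward for step t+1)."""
    rewards: List[float] = []
    running = 0.0
    # Walk backwards so each step accumulates the scores of its continuation.
    for score in reversed(step_scores):
        running = score + gamma * running
        rewards.append(running)
    rewards.reverse()
    return rewards

scores = [0.9, 0.2, 0.8]  # made-up PRM scores for a three-step trace
print(collapsed_reward(scores))          # a single number shared by all steps
print(future_cumulated_rewards(scores))  # one value per step
```

Under this reading, an early step is credited for what follows it, which is one plausible meaning of continuation-aware credit.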

The soft spot is the unexamined premise that restricting the PRM to degenerate cases avoids systematic bias or fresh reward-hacking surfaces in the hybrid advantages. The abstract supplies no argument, ablation, or description of how degeneracy is detected, and it gives no variance numbers or statistical tests. If PRM scores still correlate with verifier signals outside those cases, the claimed reductions in hacking and the quality improvements remain hard to trust.

This is for researchers already running GRPO-style training on reasoning models. Anyone who has watched groups collapse to zero advantage would find the three design choices worth testing.

Send it to peer review. The failure mode is real, the method is straightforward to implement, and the empirical direction is worth checking even if the writeup needs tighter controls on the advantage calculation.

Referee Report

1 major / 1 minor

Summary. The paper introduces VeriGate, a verifier-gated extension of Group Relative Policy Optimization (GRPO) for training reasoning models. It addresses sparse outcome supervision and zero-gradient issues via three design choices: (1) retain verifier control unless all sampled trajectories receive identical rewards (degenerate case), (2) convert PRM step scores into future-cumulated token rewards rather than a single trajectory score, and (3) produce group-normalized token-level advantages. Training on MATH with 1.5B and 7B Qwen2.5-Instruct models yields reported average accuracy gains of ~20% and ~12% across six reasoning benchmarks, plus reductions in zero-gradient failures and reward-hacking relative to outcome-only GRPO and PRM-as-outcome baselines.
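Design choice (3) can be sketched as follows. Two assumptions, neither confirmed by the abstract: normalization statistics are pooled over every step reward in the group, and each token inherits the normalized value of the step it belongs to. All names and numbers are hypothetical.

```python
import statistics
from typing import List, Sequence

def token_level_advantages(
    step_rewards: Sequence[Sequence[float]],     # per trajectory, per step
    step_token_counts: Sequence[Sequence[int]],  # number of tokens in each step
    eps: float = 1e-6,
) -> List[List[float]]:
    """Normalize step rewards across the group, then broadcast to tokens."""
    pooled = [r for traj in step_rewards for r in traj]
    mean = statistics.fmean(pooled)
    std = statistics.pstdev(pooled)
    advantages: List[List[float]] = []
    for rewards, counts in zip(step_rewards, step_token_counts):
        tokens: List[float] = []
        for reward, n_tokens in zip(rewards, counts):
            # Every token in a step shares that step's normalized reward.
            tokens.extend([(reward - mean) / (std + eps)] * n_tokens)
        advantages.append(tokens)
    return advantages

group_rewards = [[1.9, 1.0, 0.8], [0.5, 0.1]]  # made-up future-cumulated rewards
group_counts = [[4, 2, 3], [5, 1]]             # made-up step lengths in tokens
adv = token_level_advantages(group_rewards, group_counts)
print([len(a) for a in adv])
```

Unlike the trajectory-level GRPO advantage, tokens within one trajectory can receive different signs here, which is where the finer credit assignment would come from.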

Significance. If the central empirical claims hold after addressing the noted gaps, VeriGate would supply a practical hybrid supervision recipe that preserves verifier reliability while adding step-level credit assignment only where needed, potentially improving stability and reasoning quality in RL for language models. The evaluation on two model scales and multiple external benchmarks is a positive aspect of the work.

major comments (1)

[Abstract, first design choice] The premise that invoking the PRM only on degenerate verifier cases (all trajectories receive identical outcome reward) safely avoids systematic bias in cumulated advantages or new reward-hacking surfaces is asserted without argument, robustness analysis of degeneracy detection, or ablation; this assumption is load-bearing for the claims of reduced zero-gradient failures and lower reward-hacking relative to baselines, as imperfect gating could still propagate PRM errors into the hybrid advantages.

minor comments (1)

[Abstract] Reported accuracy improvements are stated only as 'about 20%' and '12%' without exact deltas, standard deviations, confidence intervals, or the precise list of the six evaluation benchmarks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for stronger justification of the gating mechanism. We address the major comment below and commit to revisions that strengthen the manuscript.

Referee: [Abstract, first design choice] The premise that invoking the PRM only on degenerate verifier cases (all trajectories receive identical outcome reward) safely avoids systematic bias in cumulated advantages or new reward-hacking surfaces is asserted without argument, robustness analysis of degeneracy detection, or ablation; this assumption is load-bearing for the claims of reduced zero-gradient failures and lower reward-hacking relative to baselines, as imperfect gating could still propagate PRM errors into the hybrid advantages.

Authors: We agree that the current presentation asserts the safety of the verifier-gated design without sufficient supporting argument or analysis. The manuscript motivates the choice by noting that the verifier remains in control whenever it produces a non-degenerate preference among trajectories, thereby limiting PRM exposure to cases where outcome supervision provides no gradient signal. However, we acknowledge the absence of explicit robustness checks on degeneracy detection (e.g., sensitivity to near-degenerate reward distributions) and the lack of an ablation isolating the gate's contribution to the reported reductions in zero-gradient failures and reward-hacking. In the revised version we will add: (1) a dedicated paragraph in Section 3.1 formalizing the rationale and bounding the scope of PRM influence; (2) empirical statistics on degeneracy frequency across training prompts; and (3) an ablation that disables the gate (always using PRM) and measures changes in advantage bias, zero-gradient rate, and downstream accuracy. These additions will directly address the load-bearing nature of the assumption for the stability and hacking claims.

revision: yes
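The degeneracy statistics promised in item (2), and the near-degeneracy sensitivity the rebuttal mentions, could be gathered with something like the sketch below. It is a hypothetical helper, not the authors' tooling: the `tol` parameter stands in for a near-degeneracy threshold, and the reward groups are made up for illustration.

```python
from typing import Iterable, Sequence

def degenerate_fraction(groups: Iterable[Sequence[float]], tol: float = 0.0) -> float:
    """Share of prompt groups whose verifier rewards span at most `tol`."""
    groups = list(groups)
    if not groups:
        return 0.0
    flagged = sum(1 for g in groups if max(g) - min(g) <= tol)
    return flagged / len(groups)

batch = [
    [0.0, 0.0, 0.0, 0.0],   # all wrong: degenerate
    [1.0, 1.0, 1.0, 1.0],   # all right: degenerate
    [1.0, 0.0, 0.0, 1.0],   # mixed: informative
    [0.0, 0.0, 0.05, 0.0],  # near-degenerate under a soft-scored verifier
]
print(degenerate_fraction(batch))           # 0.5, exact ties only
print(degenerate_fraction(batch, tol=0.1))  # 0.75, near-ties counted too
```

Sweeping `tol` over a training run would show how sensitive the gate's firing rate is to the definition of degeneracy, which is the robustness question the referee raises.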

Circularity Check

0 steps flagged

No significant circularity; method is an independent algorithmic extension evaluated on external benchmarks

The paper introduces VeriGate as a verifier-gated extension of GRPO defined by three explicit design choices for handling degenerate verifier rewards and converting PRM scores to cumulated token-level advantages. No equations, fitted parameters, or self-citations are shown that would make the reported accuracy gains or reductions in zero-gradient failures reduce to the inputs by construction. The empirical evaluation on MATH training and six external reasoning benchmarks provides independent content, so the derivation chain remains self-contained against external results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that verifier degeneracy is a reliable trigger for switching to process supervision and that cumulated PRM scores remain less hackable than aggregated scores; no free parameters or invented physical entities are stated in the abstract.

axioms (1)

Domain assumption: Verifier rewards induce a meaningful preference among trajectories whenever they are not all identical.
Invoked in the first design choice to decide when to keep verifier control.

pith-pipeline@v0.9.1-grok · 5803 in / 1290 out tokens · 24103 ms · 2026-06-29T08:45:17.818641+00:00 · methodology


[1] [1]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward.arXiv preprint arXiv:2512.16912,

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward.arXiv preprint arXiv:2512.16912,

work page arXiv

[3] [3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models.arXiv preprint arXiv:2406.10162,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card.arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Process reward models that think.arXiv preprint arXiv:2504.16828,

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think.arXiv preprint arXiv:2504.16828,

work page arXiv

[9] [9]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Unveiling over-memorization in finetuning llms for reasoning tasks.arXiv preprint arXiv:2508.04117,

Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, and Guanhua Chen. Unveiling over-memorization in finetuning llms for reasoning tasks.arXiv preprint arXiv:2508.04117,

work page arXiv

[13] [13]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

RL grokking recipe: How does RL unlock and transfer new algorithms in LLMs?arXiv preprint arXiv:2509.21016,

Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, and Dawn Song. RL grokking recipe: How does RL unlock and transfer new algorithms in LLMs?arXiv preprint arXiv:2509.21016,

work page arXiv

[16] [16]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Solving math word problems with process- and outcome-based feedback

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, and Chris Lee. Spurious rewards paradox: Mechanistically understanding how rlvr activates memorization shortcuts in llms.arXiv preprint arXiv:2601.11061,

[19]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122,

[20]

TreeRPO: Tree relative policy optimization

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. TreeRPO: Tree relative policy optimization. arXiv preprint arXiv:2506.05183,

[21]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

[22]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837,

[23]

On the interplay of pre-training, mid-training, and RL on reasoning language models

Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783, 2025a. Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. REST-MCTS*: LLM self-training via process reward guided tree search. Advances in Neural Information Processing ...

[24]

American Invitational Mathematics Examination (AIME) 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2024,

2024

[25]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025b. Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin...

[26]

A. Limitations and Future Work

Limitations. VeriGate still depends on the availability and quality of process reward models. Although verifier-gating limits over-reliance on imperfect PRMs, systematic PRM biases can still shape updates on degenerate prompts. Our formulation also treats reasoning steps as units defined by the model’s generation format, so su...

[27]

PPO is an actor-critic RL method used for training an LLM policy. It stabilizes learning by constraining the update between the current policy $\pi_\theta$, the behavior policy $\pi_{\text{old}}$ used to generate sampled trajectories, and a fixed reference policy $\pi_{\text{ref}}$ used for KL regularization. Given a prompt $x$, PPO samples a...
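
The excerpt above is cut off before the objective is stated, so the following is only a minimal sketch of the standard clipped surrogate from the PPO paper, written per token. The function name `ppo_token_objective`, the default `clip_eps=0.2`, and the simple log-ratio KL estimate against the reference policy are illustrative assumptions, not details taken from this paper.

```python
import math

def ppo_token_objective(logp_new, logp_old, logp_ref, advantage,
                        clip_eps=0.2, kl_coef=0.0):
    """Clipped surrogate for one token, minus an optional KL penalty.

    logp_new: log-prob of the token under the current policy pi_theta
    logp_old: log-prob under the behavior policy pi_old that sampled it
    logp_ref: log-prob under the fixed reference policy pi_ref
    """
    # Importance ratio between the current and the behavior policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic choice between the unclipped and clipped terms.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Assumption: a single-sample log-ratio estimate of KL(pi_theta || pi_ref).
    kl_estimate = logp_new - logp_ref
    return surrogate - kl_coef * kl_estimate

# Identical policies: the ratio is 1 and the objective equals the advantage.
on_policy = ppo_token_objective(-1.0, -1.0, -1.0, advantage=0.5)
# The ratio is about 2, so a positive advantage is capped at 1 + clip_eps.
capped = ppo_token_objective(math.log(0.5), math.log(0.25), math.log(0.5),
                             advantage=1.0)
```

With a negative advantage the unclipped term is the smaller one, so a large ratio is penalized in full rather than capped.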

[28]

C.3. Outcome Rewards and Process Rewards

Reasoning-focused reinforcement learning uses two common kinds of feedback that differ in granularity. Outcome rewards. Outcome rewards assign a single scalar to the full trajectory, $R_{\text{out}}(x, y)$. In RLVR, this reward typically depends only on whether the final answer is correct. Outcome rewards are attractive because t...
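
The difference in granularity can be illustrated with a short sketch. The paper's summary says VeriGate converts process reward model step scores into future-cumulated rewards, but the exact formula is not visible in this excerpt. The suffix-sum form below, the optional discount `gamma`, the exact-match verifier, and all function names are therefore assumptions made for illustration.

```python
def outcome_reward(final_answer, reference_answer):
    """One scalar for the whole trajectory: 1.0 if the final answer matches."""
    # Assumption: exact string match stands in for a real verifier.
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

def future_cumulated(step_scores, gamma=1.0):
    """Give each step its own score plus the (discounted) scores after it."""
    rewards_to_go = []
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for score in reversed(step_scores):
        running = score + gamma * running
        rewards_to_go.append(running)
    rewards_to_go.reverse()
    return rewards_to_go

# Hypothetical per-step scores from a process reward model.
step_scores = [0.5, 0.25, 1.0]
r_out = outcome_reward(" 42", "42")           # single trajectory-level value
rewards_to_go = future_cumulated(step_scores)  # one value per reasoning step
```

The outcome reward yields one number per trajectory, while the step-level variant yields one number per reasoning step, which is what makes finer credit assignment possible.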

[29]

However, outcome-only supervision is inherently sparse

GRPO builds on this paradigm by using group-relative normalization to eliminate the need for learned reward or value models, yielding strong robustness to reward hacking and favorable scalability. However, outcome-only supervision is inherently sparse. GRPO assigns a single scalar reward to an entire reasoning trajectory and applies the same learning signal un...
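
A small sketch of group-relative normalization shows why identical rewards within a group leave no learning signal, which is the degenerate case that VeriGate's gate targets. The population standard deviation, the `eps` stabilizer, the tolerance, and the function names are assumptions chosen for illustration rather than the paper's exact implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled group to zero mean."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_degenerate(rewards, tol=1e-8):
    """True when the verifier rewards do not separate the trajectories."""
    return max(rewards) - min(rewards) <= tol

mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # informative group
tied = group_relative_advantages([0.0, 0.0, 0.0, 0.0])   # every advantage is zero
```

In the tied group every advantage is zero, so the policy gradient for that prompt vanishes. A check like `is_degenerate` is one simple way to decide when to fall back to step-level supervision.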

[30]

Despite their promise, PRMs are learned models trained on imperfect supervision and are therefore vulnerable to bias and distributional mismatch

have been shown to improve credit assignment and learning efficiency. Despite their promise, PRMs are learned models trained on imperfect supervision and are therefore vulnerable to bias and distributional mismatch. Policies optimized directly using PRM outputs can exploit systematic artifacts in the reward model, leading to reward hacking behaviors that ...
