Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Feng Liu; Gang Niu; Masashi Sugiyama; Tongliang Liu; Wei Wang; Xin-Qiang Cai

arxiv: 2510.00915 · v4 · pith:GFZXOLDPnew · submitted 2025-10-01 · 💻 cs.LG · cs.AI

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

Pith reviewed 2026-05-25 07:37 UTC · model grok-4.3

keywords: reinforcement learning · verifiable rewards · noisy rewards · imperfect verifiers · policy gradient · math reasoning · stochastic channel

The pith

Two corrections from a stochastic reward channel model reduce the impact of imperfect verifiers on RLVR for math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models verifier errors as a memoryless stochastic channel with false-positive rate ρ0 and false-negative rate ρ1, then derives two lightweight fixes for binarized rewards in reinforcement learning. A backward correction produces an unbiased surrogate reward that yields an unbiased policy-gradient estimator in expectation. A forward correction reweights score-function terms so the expected update matches the clean gradient direction and needs only the false-negative rate. Both are implemented as hooks in a group relative policy optimization pipeline and improve results on math reasoning under synthetic and real verifier noise, with the forward version remaining more stable at higher noise levels. An appeals mechanism using a lightweight LLM verifier estimates the false-negative rate online and yields further gains.
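The channel and the backward correction can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the surrogate formula is the one quoted elsewhere on this page, `(R~ - rho0) / (1 - rho0 - rho1)`, and the rates 0.1 and 0.3 are arbitrary example values.

```python
import random

def noisy_verdict(true_reward, rho0, rho1, rng):
    """Pass a clean 0/1 reward through the asymmetric channel."""
    if true_reward == 1:
        return 0 if rng.random() < rho1 else 1  # false negative with probability rho1
    return 1 if rng.random() < rho0 else 0      # false positive with probability rho0

def backward_correct(noisy_reward, rho0, rho1):
    """Surrogate reward (R~ - rho0) / (1 - rho0 - rho1)."""
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("the correction needs rho0 + rho1 < 1")
    return (noisy_reward - rho0) / denom

def mean_surrogate(true_reward, rho0, rho1, n=100_000, seed=0):
    """Monte Carlo average of the surrogate for a fixed clean reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        noisy = noisy_verdict(true_reward, rho0, rho1, rng)
        total += backward_correct(noisy, rho0, rho1)
    return total / n

if __name__ == "__main__":
    # Example rates are arbitrary; the averages should sit near the clean reward.
    for r in (0, 1):
        print(r, round(mean_surrogate(r, rho0=0.1, rho1=0.3), 3))
```

Under the memoryless assumption the averaged surrogate should land close to the clean reward for both classes. Individual surrogate values fall outside [0, 1], which is the price paid for unbiasedness.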

Core claim

From the abstraction of verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ0 and ρ1, two corrections follow: the backward correction yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, while the forward correction reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the false-negative rate. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. An appeals mechanism with a lightweight LLM verifier estimates the false-negative rate online and further improves performance.
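The source does not give the forward weights, so the sketch below uses one choice that satisfies the stated property and should be read as an assumption: weight `rho1` for an accepted answer and `rho1 - 1` for a rejected one. Averaged over the channel, that weight equals `(1 - rho0 - rho1) * (R - 1)`, and the constant part multiplies a zero-mean score function, so the expected update is the clean gradient scaled by `c = 1 - rho0 - rho1`. The toy three-action softmax policy and all names are hypothetical.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def score(pi, a):
    """Gradient of log pi(a) with respect to the softmax logits: e_a - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]

def forward_weight(noisy_reward, rho1):
    """Assumed forward weights: rho1 if accepted, rho1 - 1 if rejected."""
    return rho1 if noisy_reward == 1 else rho1 - 1.0

def expected_weight(true_reward, rho0, rho1):
    """E[w(R~) | R], averaging over the channel exactly."""
    p_accept = 1.0 - rho1 if true_reward == 1 else rho0
    return p_accept * forward_weight(1, rho1) + (1.0 - p_accept) * forward_weight(0, rho1)

def expected_update(theta, rewards, weight_fn):
    """Exact expectation of weight * score over actions drawn from the policy."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, rewards)):
        w = weight_fn(r)
        for i, s in enumerate(score(pi, a)):
            grad[i] += p * w * s
    return grad

def forward_gap(theta, rewards, rho0, rho1):
    """Largest coordinate gap between the forward update and c times the clean gradient."""
    clean = expected_update(theta, rewards, lambda r: float(r))
    fwd = expected_update(theta, rewards, lambda r: expected_weight(r, rho0, rho1))
    c = 1.0 - rho0 - rho1
    return max(abs(f - c * g) for f, g in zip(fwd, clean))

if __name__ == "__main__":
    print(forward_gap([0.2, -0.5, 1.0], [1, 0, 0], rho0=0.15, rho1=0.3))
```

Note that `forward_weight` never reads `rho0`, which is the sense in which the forward variant needs only the false-negative rate. The false-positive rate affects the size of the step through `c` but not its direction.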

What carries the argument

Stochastic reward channel with false-positive rate ρ0 and false-negative rate ρ1, from which backward unbiased estimation and forward score-function reweighting are derived.

If this is right

Both corrections can be added as lightweight hooks inside existing group relative policy optimization pipelines.
Performance on math reasoning tasks improves under both synthetic and real verifier noise.
The forward correction maintains stability when noise rates are increased.
Online estimation of the false-negative rate via an appeals mechanism with a lightweight verifier yields additional gains.
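As a rough picture of what such a hook might look like, the sketch below applies the backward surrogate to one prompt's group of sampled answers before advantages are built. It is an assumption-laden illustration: the function names are hypothetical, and advantages are only mean-centred within the group. A variant that also divides by the group standard deviation would be invariant to any constant affine correction, so where the hook sits relative to normalization matters, and the source does not spell this out.

```python
from statistics import fmean

def backward_correct(noisy_reward, rho0, rho1):
    """Surrogate reward (R~ - rho0) / (1 - rho0 - rho1); assumes rho0 + rho1 < 1."""
    return (noisy_reward - rho0) / (1.0 - rho0 - rho1)

def group_advantages(noisy_rewards, rho0, rho1):
    """Mean-centred advantages for one prompt's group of sampled answers."""
    corrected = [backward_correct(r, rho0, rho1) for r in noisy_rewards]
    baseline = fmean(corrected)  # group-relative baseline
    return [r - baseline for r in corrected]

if __name__ == "__main__":
    # Four sampled answers to one prompt, with the verifier's 0/1 verdicts.
    print(group_advantages([1, 0, 0, 1], rho0=0.1, rho1=0.3))
```

With mean-centring alone, the correction amounts to rescaling the centred noisy rewards by `1 / (1 - rho0 - rho1)`.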

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same channel model and corrections could be applied to other domains that use automated binary verifiers, such as code generation or theorem proving.
If noise rates vary with the policy's outputs, the memoryless assumption would no longer hold and the corrections would need adaptive rate tracking.
A combined backward-forward correction might be derived for cases where both rates are known, potentially offering further robustness.

Load-bearing premise

Verifier errors can be captured by a memoryless stochastic channel whose rates are known or can be estimated online without depending on the current policy.

What would settle it

Run the corrected RLVR pipeline on a verifier whose error rates are deliberately made to depend on the policy's current outputs and check whether the reported performance gains over the uncorrected baseline disappear.
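A small calculation shows why such a test is informative. In the hypothetical setting below, correct long answers are rejected more often than correct short ones, while the backward correction assumes a single average false-negative rate. All numbers and names are illustrative assumptions, not values from the paper.

```python
def expected_surrogate(true_reward, rho0_true, rho1_true, rho0_used, rho1_used):
    """Exact E[(R~ - rho0_used) / (1 - rho0_used - rho1_used) | R]."""
    p_accept = 1.0 - rho1_true if true_reward == 1 else rho0_true
    return (p_accept - rho0_used) / (1.0 - rho0_used - rho1_used)

def surrogate_bias_by_group(rho0, rho1_by_group, rho1_used):
    """Bias of the surrogate for a correct answer in each group."""
    return {
        group: expected_surrogate(1, rho0, rho1_true, rho0, rho1_used) - 1.0
        for group, rho1_true in rho1_by_group.items()
    }

if __name__ == "__main__":
    # Hypothetical rates: the correction uses 0.25 for every answer.
    bias = surrogate_bias_by_group(
        rho0=0.05,
        rho1_by_group={"short": 0.10, "long": 0.40},
        rho1_used=0.25,
    )
    print(bias)
```

In this setting correct short answers are over-rewarded and correct long answers under-rewarded, so the corrected update would push the policy toward shorter answers, which in turn shifts the effective noise rate. When the true and assumed rates agree, the bias vanishes.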

Figures

Figures reproduced from arXiv: 2510.00915 by Feng Liu, Gang Niu, Masashi Sugiyama, Tongliang Liu, Wei Wang, Xin-Qiang Cai.

**Figure 2.** Synthetic-noise results (pass@1) with 16 samples and 5 random seeds on the four backbones. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise BC: training with noise under backward correction; Noise FC: training with noise under forward correction.

**Figure 3.** Synthetic-noise results (pass@8) with 16 samples and 5 random seeds on the four backbones, including Llama-3.2-3B-Instruct and Qwen2.5-Math-7B. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise BC: training with noise under backward correction; Noise FC: training with noise under forward correction.

**Figure 4.** Robustness results. (a) Backward correction (BC) with ρ̂ …

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a forward correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline; both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives two corrections for asymmetric verifier noise in RLVR by treating errors as a fixed memoryless channel, but the independence assumption is a real limit.

The core new piece is modeling verifier mistakes as an asymmetric stochastic channel with rates ρ0 (false positives) and ρ1 (false negatives), then deriving a backward correction that subtracts the known bias to produce an unbiased surrogate reward and a forward correction that reweights the score-function estimator to match the clean gradient direction. The forward version needs only ρ1, which is convenient. Both are plugged in as lightweight hooks inside a GRPO pipeline, and the paper also adds an online FN-rate estimator via a lightweight appeals verifier. This directly targets a scaling issue that already shows up in deployed RLVR systems for math reasoning, and the abstract reports gains under both synthetic and real noise, with the forward correction holding up better when noise is heavy. That is useful engineering work.

The main soft spot is the assumption that ρ0 and ρ1 are constants independent of the sampled answer and the current policy. If error rates actually vary with answer length, syntax, or token distribution, and those properties shift as the policy updates, then the claimed unbiasedness and directional alignment no longer follow from the derivations. The stress-test note flags exactly this, and nothing in the abstract shows they tested or relaxed it. The online estimator inherits the same assumption. Without the full equations, proof details, or quantitative tables it is hard to judge how large the practical effect is or how sensitive the method is to mis-specified rates.

This paper is aimed at people already running RLVR pipelines who want quick fixes for noisy verifiers rather than a full theoretical overhaul. A reader working on math-reasoning RL would get concrete ideas worth trying. It deserves peer review because the problem is real, the proposed fixes are lightweight and derived from a clear model, and the experiments (even if limited) address both synthetic and real noise.

Referee Report

2 major / 2 minor

Summary. The paper models imperfect verifiers in RLVR as a memoryless stochastic reward channel with fixed false-positive rate ρ₀ and false-negative rate ρ₁. From this model it derives (i) a backward correction producing an unbiased surrogate reward (and thus unbiased policy-gradient estimator) and (ii) a forward correction that reweights the score-function estimator so its expectation aligns with the clean gradient (requiring only ρ₁). Both are implemented as lightweight modifications to a GRPO pipeline; experiments on math-reasoning tasks report that both corrections improve performance under synthetic and real verifier noise, with the forward variant more stable under heavier noise. An appeals mechanism using a lightweight LLM verifier is introduced to estimate ρ₁ online.

Significance. If the derivations and empirical gains hold under the stated noise model, the work supplies practical, low-overhead corrections that can be dropped into existing RLVR pipelines without altering the core optimizer. The online FN-rate estimator via appeals is a concrete engineering contribution that addresses a practical deployment issue. The approach is directly relevant to scaling automated-verifier RL for reasoning tasks.

major comments (2)

[§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.
[Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.

minor comments (2)

[§3] Notation for the two rates is introduced as ρ₀ (FP) and ρ₁ (FN) in the abstract but should be restated with a short table or equation block at the start of §3 for readers who skip the abstract.
[Appeals mechanism] The appeals mechanism is described only at a high level; a short pseudocode block or explicit update rule for the online ρ₁ estimator would improve reproducibility.
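To make the second minor comment concrete, here is one way an explicit update rule for the online ρ₁ estimator could look. This is a hypothetical sketch, not the paper's mechanism: it assumes the appeals verifier re-checks only answers the primary verifier rejected, that its verdict is treated as ground truth, and that primary acceptances are correct (that is, ρ₀ is negligible). The class name, fields, and prior values are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class FalseNegativeRateEstimator:
    """Running estimate of rho1 = P(primary rejects | answer is correct)."""
    accepted: float = 0.0      # answers the primary verifier accepted
    overturned: float = 0.0    # rejections the appeals verifier reversed
    prior_rate: float = 0.1    # hypothetical prior guess for rho1
    prior_weight: float = 10.0 # pseudo-count strength of that prior

    def update(self, primary_verdicts, appeal_verdicts):
        """Both are lists of 0/1; appeal_verdicts[i] is None when no appeal ran."""
        for primary, appeal in zip(primary_verdicts, appeal_verdicts):
            if primary == 1:
                self.accepted += 1
            elif appeal == 1:
                self.overturned += 1

    @property
    def rho1(self):
        correct = self.accepted + self.overturned
        denom = correct + self.prior_weight
        if denom == 0:
            return self.prior_rate
        return (self.overturned + self.prior_rate * self.prior_weight) / denom

if __name__ == "__main__":
    est = FalseNegativeRateEstimator()
    est.update([1, 1, 1, 0, 0, 0], [None, None, None, 1, 0, 1])
    print(est.rho1)
```

The pseudo-count prior keeps early estimates from swinging on a handful of appeals. If the appeals verifier has its own error rate, or if ρ₀ is not negligible, this estimate would be biased and would need a further correction.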

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the central role of the independence assumption in our noise model. We address each major comment below and commit to revisions that strengthen the presentation of the assumptions and provide additional validation.

Referee: [§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.

Authors: We agree that the memoryless channel with policy- and answer-independent rates is a load-bearing assumption required for the exact unbiasedness of the backward correction and the directional alignment of the forward correction. The derivations in §3 are stated under this model. While a general proof of robustness to arbitrary dependence is outside the scope of the present work, we will revise the manuscript to (i) explicitly restate the assumption and discuss its practical relevance for math-reasoning verifiers (where error is driven primarily by semantic mismatch rather than policy-induced distributional shifts) and (ii) introduce a concrete diagnostic that bins answers by length and syntactic features, estimates empirical ρ₁ within each bin across training epochs, and flags statistically significant policy dependence. If dependence is observed, the appeals-based estimator can be extended to condition on these features.
Revision: yes
Referee: [Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.

Authors: We acknowledge that the current experimental suite uses stationary noise rates. In the revision we will add a new ablation in which the false-negative rate is made explicitly dependent on answer length (a property that evolves during training). We will generate synthetic data under this length-dependent noise model, re-run the GRPO pipeline with both corrections, and report whether performance gains relative to the uncorrected baseline persist. We will also apply the binning diagnostic described above to the existing real-verifier experiments and include the results.
Revision: yes
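The binning diagnostic promised in the responses could be sketched as follows. This is a simplified, hypothetical version: it uses two length bins instead of several, ignores syntactic features and training epochs, assumes the correctness of each answer is known from some trusted source, and flags dependence with a plain two-proportion z-test at the conventional 1.96 threshold.

```python
import math

def fn_counts_by_length(records, threshold):
    """records: (answer_length, primary_verdict) pairs for answers known to be correct."""
    counts = {"short": [0, 0], "long": [0, 0]}  # [rejections, total] per bin
    for length, verdict in records:
        key = "short" if length <= threshold else "long"
        counts[key][0] += 1 - verdict
        counts[key][1] += 1
    return counts

def two_proportion_z(counts):
    """z statistic for the difference in empirical rho1 between the two bins."""
    (x1, n1), (x2, n2) = counts["short"], counts["long"]
    if n1 == 0 or n2 == 0:
        return 0.0  # nothing to compare
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (x2 / n2 - x1 / n1) / se

def flags_dependence(records, threshold, z_crit=1.96):
    """True when the false-negative rate differs significantly across length bins."""
    return abs(two_proportion_z(fn_counts_by_length(records, threshold))) > z_crit

if __name__ == "__main__":
    # Synthetic example: correct long answers are rejected four times as often.
    records = [(10, 1)] * 90 + [(10, 0)] * 10 + [(500, 1)] * 60 + [(500, 0)] * 40
    print(flags_dependence(records, threshold=100))
```

Running the check once per training epoch and watching whether the flag turns on over time would approximate the "across training epochs" part of the proposal.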

Circularity Check

0 steps flagged

No circularity: derivations are direct mathematical consequences of the explicitly stated noise-channel model

The paper defines a memoryless stochastic reward channel with fixed rates ρ0 (FP) and ρ1 (FN), then algebraically derives the backward correction (unbiased surrogate reward) and forward correction (reweighted score-function estimator) as expectations conditional on the true label y. These steps follow immediately from the channel definition and do not reduce to any fitted quantity on the evaluation data, any self-citation chain, or any renaming of an empirical pattern. The online FN-rate estimator via appeals is presented as a separate practical mechanism under the same independence assumption; it does not feed back into the derivation of the corrections themselves. The central claims therefore remain independent of the results they produce.
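The algebra behind "follow immediately from the channel definition" is short. The block below is a sketch for a single-step score-function estimator, writing $R$ for the clean reward (a symbol assumed here; the rationale speaks of the true label y), $\tilde{R}$ for the verifier's output, $\hat{R} = (\tilde{R} - \rho_0)/(1 - \rho_0 - \rho_1)$ for the backward surrogate, and assuming $\rho_0 + \rho_1 < 1$.

```latex
\begin{align*}
\mathbb{E}[\tilde{R} \mid R]
  &= R\,(1-\rho_1) + (1-R)\,\rho_0
   = \rho_0 + (1-\rho_0-\rho_1)\,R, \\
\mathbb{E}[\hat{R} \mid R]
  &= \frac{\mathbb{E}[\tilde{R} \mid R] - \rho_0}{1-\rho_0-\rho_1}
   = R, \\
\mathbb{E}_{a \sim \pi_\theta}\!\big[\tilde{R}\,\nabla_\theta \log \pi_\theta(a)\big]
  &= (1-\rho_0-\rho_1)\,\nabla_\theta J(\theta)
   + \rho_0\,\underbrace{\mathbb{E}_{a \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a)\big]}_{=\,0}
   = (1-\rho_0-\rho_1)\,\nabla_\theta J(\theta).
\end{align*}
```

The first two lines give the unbiasedness of the surrogate. The third shows where a scaling factor of the form $c = 1 - \rho_0 - \rho_1$ comes from. Every step uses the fact that the rates do not depend on the sampled answer, which is exactly the independence assumption the referee report singles out.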

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating verifier errors as independent draws from a fixed two-parameter channel; the noise rates themselves function as free parameters of the model.

free parameters (2)

ρ0 (false-positive rate): Parameter of the stochastic reward channel; required for the backward correction.
ρ1 (false-negative rate): Parameter of the stochastic reward channel; required for both corrections and the online estimator.

axioms (2)

Domain assumption: Verifier errors are memoryless and independent of the policy being trained.
Invoked when the reward channel is defined and when expectations are taken over the noise.
Standard math: The policy-gradient theorem continues to hold when the observed reward is replaced by the corrected surrogate.
Background assumption needed to claim that the corrected estimator is unbiased or aligned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · contradicts

CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

Paper passage: We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ₀ and ρ₁ … instance-independent class-conditional noise rates (ρ₀, ρ₁) that do not vary with (x, y)

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · contradicts

Paper passage: the estimator R̂ = (R̃ − ρ₀) / (1 − ρ₀ − ρ₁) is an unbiased estimator … E[Δθ] = c ∇θJ(θ) with c = (1 − ρ₀ − ρ₁)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
cs.LG 2026-02 unverdicted novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
On Training in Imagination
cs.LG 2026-05 unverdicted novelty 6.0

The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...
Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
cs.LG 2026-04 unverdicted novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
On Training in Imagination
cs.LG 2026-05 unverdicted novelty 5.0

The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
cs.LG 2026-02 unverdicted novelty 5.0

VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
High-Dimensional Statistics: Reflections on Progress and Open Problems
math.ST 2026-05 unverdicted novelty 2.0

A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 7 Pith papers · 4 internal anchors

[1]

Humans or llms as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement bias. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327, 2024

work page 2024
[2]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Tsang, and Masashi Sugiyama

Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol` o Cesa-Bianchi, and Roman Garnett (eds.),Advances in Neural Information Processing 15 Systems 31: A...

work page 2018
[5]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedin...

work page 2024
[6]

Association for Computational Linguistics, 2024

work page 2024
[7]

Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning.arXiv preprint arXiv:2505.22203, 2025

Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning.arXiv preprint arXiv:2505.22203, 2025

work page arXiv 2025
[8]

Math-verify: A robust mathematical expression evaluator for llm outputs

Hugging Face. Math-verify: A robust mathematical expression evaluator for llm outputs. GitHub repository, 2025. URLhttps://github.com/huggingface/Math-Verify

work page 2025
[9]

Aime 2024 (dataset card)

HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URLhttps:// huggingface.co/datasets/HuggingFaceH4/aime_2024

work page 2024
[10]

Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Jennifer G. Dy and Andreas Krause (eds.),Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm¨ assan, Stockholm, Sweden, July 10-15, 2018, volume 80 ofPr...

work page 2018
[11]

On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference.arXiv preprint arXiv:2312.01234, 2023

Vishesh Karwa and Edoardo M Airoldi. On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference.arXiv preprint arXiv:2312.01234, 2023

work page arXiv 2023
[12]

Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Bel- grave, K. Cho, and A. ...

work page 2022
[13]

Junnan Li, Richard Socher, and Steven C. H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

work page 2020
[14]

Provably end-to- end label-noise learning without anchor points

Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to- end label-noise learning without anchor points. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, pp. 6403–6413. PMLR, 2021

work page 2021
[15]

Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains.arXiv preprint arXiv:2507.09884, 2025

Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang. Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains.arXiv preprint arXiv:2507.09884, 2025

work page arXiv 2025
[16]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[17]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpass- ing o1-preview with a 1.5b model by scaling rl.https://pretty-radio-b75.notion.site/ DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

work page
[18]

Amc 2023 (dataset card)

math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URLhttps://huggingface.co/ datasets/math-ai/amc23

work page 2023
[19]

Reinforcement learning with verifiable rewards: Grpo’s effective loss, dy- namics, and success amplification.arXiv preprint arXiv:2503.06639, 2025

Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dy- namics, and success amplification.arXiv preprint arXiv:2503.06639, 2025

work page arXiv 2025
[20]

Dhillon, Pradeep Ravikumar, and Ambuj Tewari

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learn- ing with noisy labels. In Christopher J. C. Burges, L´ eon Bottou, Zoubin Ghahramani, 17 and Kilian Q. Weinberger (eds.),Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting h...

work page 2013
[21]

Aime 2025 (dataset card)

OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URLhttps://huggingface. co/datasets/opencompass/AIME2025

work page 2025
[22]

Making deep neural networks robust to label noise: A loss correction approach

Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2233–2241. IEEE Computer Society, 2017

work page 2017
[23]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Optimization-based prompt injection attack to llm-as-a-judge

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhen- qiang Gong. Optimization-based prompt injection attack to llm-as-a-judge. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.),Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, Octob...

work page 2024
[25]

Judging the judges: A systematic study of position bias in llm-as-a-judge.arXiv preprint arXiv:2406.07791, 2025

Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge.arXiv preprint arXiv:2406.07791, 2025

work page arXiv 2025
[26]

Learning from noisy labels with deep neural networks: A survey.IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022

Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey.IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022

work page 2022
[27]

Sutton, David A

Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour. Policy gra- dient methods for reinforcement learning with function approximation. In Sara A. Solla, Todd K. Leen, and Klaus-Robert M¨ uller (eds.),Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 10...

work page 1999
[28]

Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges.arXiv preprint arXiv:2406.12624, 2024

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges.arXiv preprint arXiv:2406.12624, 2024

work page arXiv 2024
[29]

Reinforcement learning with perturbed rewards

Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty- Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA...

work page 2020
[30]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

[31]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural I...

[32]

Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025

[33]

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992

[34]

Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Tinyv: Reducing false negatives in verification improves rl for llm reasoning. arXiv preprint arXiv:2505.14625, 2025

[35]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...

[36]

Yulai Zhao, Haolin Liu, Dian Yu, SY Kung, Haitao Mi, and Dong Yu. One token to fool llm-as-a-judge. arXiv preprint arXiv:2507.08794, 2025

[37]

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net...

[38]

The unconditional expectation is zero: $\mathbb{E}[G_t] = 0$ [32, 26]

[39]

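The clean policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}[R^* G_t]$. From property 1, we have $\mathbb{E}[G_t] = \mathbb{E}[(\mathbf{1}_{\{R^*=1\}} + \mathbf{1}_{\{R^*=0\}}) G_t] = \mathbb{E}[R^* G_t] + \mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = 0$. This implies that $\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -\mathbb{E}[R^* G_t] = -\nabla_\theta J(\theta)$. Finally, we substitute this back into our expression for the expected update direction: $\mathbb{E}[h_t] = \mathbb{E}[w_{\tilde{R}} G_t] = -(1-\rho_0-\rho_1)\cdot\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -(1-\rho_0-\rho_1)\cdot(-\nabla_\theta J(\theta)) = (1-\rho_0-$...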
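The chain of equalities in this derivation can be checked numerically on a toy problem. The sketch below is illustrative and not taken from the paper: it assumes a softmax policy over four actions, a hypothetical clean reward vector `r_star`, and forward weights `w[1] = rho1` and `w[0] = rho1 - 1`. Those weights are an assumption chosen because they give the weight a conditional mean of zero when R* = 1 and of -(1 - rho0 - rho1) when R* = 0, which is what the derivation requires. All function names are made up for the example, and the expectation is computed exactly by enumeration rather than by sampling.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def score(pi, a):
    """Score function grad_theta log pi(a) of a softmax policy: e_a - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]


def clean_gradient(pi, r_star):
    """Exact clean policy gradient E[R* G] by enumerating actions."""
    k = len(pi)
    grad = [0.0] * k
    for a in range(k):
        s = score(pi, a)
        for i in range(k):
            grad[i] += pi[a] * r_star[a] * s[i]
    return grad


def forward_expected_update(pi, r_star, rho0, rho1):
    """Exact E[w_{noisy R} G] under a memoryless reward channel.

    rho0 = P(noisy = 1 | clean = 0), rho1 = P(noisy = 0 | clean = 1).
    The weights below are an assumption of this sketch (see lead-in).
    """
    w = {1: rho1, 0: rho1 - 1.0}
    k = len(pi)
    update = [0.0] * k
    for a in range(k):
        # Probability that the verifier reports 1 given the clean reward.
        p_one = 1.0 - rho1 if r_star[a] == 1 else rho0
        # Conditional mean of the weight given the clean reward.
        mean_w = p_one * w[1] + (1.0 - p_one) * w[0]
        s = score(pi, a)
        for i in range(k):
            update[i] += pi[a] * mean_w * s[i]
    return update


theta = [0.3, -1.2, 0.8, 0.1]   # hypothetical logits
r_star = [1, 0, 0, 1]           # hypothetical clean rewards per action
rho0, rho1 = 0.1, 0.2           # hypothetical noise rates

pi = softmax(theta)
clean = clean_gradient(pi, r_star)
fwd = forward_expected_update(pi, r_star, rho0, rho1)
scale = 1.0 - rho0 - rho1

# Largest componentwise gap between E[h] and (1 - rho0 - rho1) * clean gradient.
print(max(abs(f - scale * c) for f, c in zip(fwd, clean)))
```

In this toy setting the expected forward update is the clean gradient scaled by `1 - rho0 - rho1`, so the direction is preserved whenever `rho0 + rho1 < 1`. Only `rho1` enters the weights, while `rho0` affects just the scale.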

[1]

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327, 2024

[2]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[3]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

[4]

Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: A...

[5]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedin...

[6]

Association for Computational Linguistics, 2024

[7]

Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning. arXiv preprint arXiv:2505.22203, 2025

[8]

Hugging Face. Math-verify: A robust mathematical expression evaluator for llm outputs. GitHub repository, 2025. URL https://github.com/huggingface/Math-Verify

[9]

HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024

[10]

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Pr...

[11]

Vishesh Karwa and Edoardo M Airoldi. On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference. arXiv preprint arXiv:2312.01234, 2023

[12]

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. ...

[13]

Junnan Li, Richard Socher, and Steven C. H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

[14]

Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to-end label-noise learning without anchor points. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6403–6413. PMLR, 2021

[15]

Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang. Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains. arXiv preprint arXiv:2507.09884, 2025

[16]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

[17]

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

[18]

math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23

[19]

Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025

[20]

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting h...

[21]

OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025

[22]

Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2233–2241. IEEE Computer Society, 2017

[23]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[24]

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based prompt injection attack to llm-as-a-judge. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.), Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, Octob...
