Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Pith reviewed 2026-05-25 07:37 UTC · model grok-4.3
The pith
Two corrections from a stochastic reward channel model reduce the impact of imperfect verifiers on RLVR for math reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From the abstraction of verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ0 and ρ1, two corrections follow: the backward correction yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, while the forward correction reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the false-negative rate. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. An appeals mechanism with a lightweight LLM verifier estimates the false-negative rate online and further improves performance.
What carries the argument
Stochastic reward channel with false-positive rate ρ0 and false-negative rate ρ1, from which backward unbiased estimation and forward score-function reweighting are derived.
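As an illustration of the backward correction, the sketch below computes the exact expectation of the surrogate reward (R̃ − ρ₀) / (1 − ρ₀ − ρ₁) quoted further down this page, under a channel with fixed rates. The function names are hypothetical and chosen for this example only; exact rational arithmetic is used so the unbiasedness shows up without sampling noise.

```python
from fractions import Fraction


def prob_observed_one(true_reward, rho0, rho1):
    """P(observed reward = 1 | true reward) under the memoryless channel."""
    # A truly wrong answer is accepted with probability rho0 (false positive);
    # a truly correct answer is accepted with probability 1 - rho1.
    return rho0 if true_reward == 0 else 1 - rho1


def backward_correct(observed, rho0, rho1):
    """Surrogate reward (observed - rho0) / (1 - rho0 - rho1)."""
    denom = 1 - rho0 - rho1
    if denom <= 0:
        raise ValueError("the correction needs rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_surrogate(true_reward, rho0, rho1):
    """Exact expectation of the surrogate reward given the true reward."""
    p1 = prob_observed_one(true_reward, rho0, rho1)
    return (p1 * backward_correct(1, rho0, rho1)
            + (1 - p1) * backward_correct(0, rho0, rho1))


# Illustrative rates only; they are not taken from the paper.
rho0, rho1 = Fraction(1, 10), Fraction(3, 10)
print(expected_surrogate(0, rho0, rho1), expected_surrogate(1, rho0, rho1))
```

The expectation recovers the true reward for both classes, which is what makes the surrogate usable in place of the clean reward inside a policy-gradient estimator, as long as the rates really are constant.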
If this is right
- Both corrections can be added as lightweight hooks inside existing group relative policy optimization pipelines.
- Performance on math reasoning tasks improves under both synthetic and real verifier noise.
- The forward correction maintains stability when noise rates are increased.
- Online estimation of the false-negative rate via an appeals mechanism with a lightweight verifier yields additional gains.
Where Pith is reading between the lines
- The same channel model and corrections could be applied to other domains that use automated binary verifiers, such as code generation or theorem proving.
- If noise rates vary with the policy's outputs, the memoryless assumption would no longer hold and the corrections would need adaptive rate tracking.
- A combined backward-forward correction might be derived for cases where both rates are known, potentially offering further robustness.
Load-bearing premise
Verifier errors can be captured by a memoryless stochastic channel whose rates are known or can be estimated online without depending on the current policy.
What would settle it
Run the corrected RLVR pipeline on a verifier whose error rates are deliberately made to depend on the policy's current outputs and check whether the reported performance gains over the uncorrected baseline disappear.
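To make the failure mode concrete, here is a minimal sketch of what happens to the backward correction when the false-negative rate depends on a property of the answer. Answer length is used as that property, and the linear-then-capped rate function and all numeric values are assumptions made up for this example; nothing here reproduces the paper's experiments.

```python
def length_dependent_rho1(length, base=0.1, slope=0.002, cap=0.45):
    """Hypothetical false-negative rate that grows with answer length, then saturates."""
    return min(cap, base + slope * length)


def expected_surrogate_on_correct(length, rho0, assumed_rho1):
    """Exact E[surrogate reward | answer is correct, length].

    The verifier errs at the length-dependent rate, while the backward
    correction is applied with a single constant assumed_rho1.
    """
    p_accept = 1.0 - length_dependent_rho1(length)
    denom = 1.0 - rho0 - assumed_rho1
    accept_value = (1.0 - rho0) / denom   # surrogate when the verifier says 1
    reject_value = (0.0 - rho0) / denom   # surrogate when the verifier says 0
    return p_accept * accept_value + (1.0 - p_accept) * reject_value


for length in (0, 50, 150):
    value = expected_surrogate_on_correct(length, rho0=0.05, assumed_rho1=0.2)
    print(length, round(value, 3))
```

With these made-up numbers the surrogate is unbiased only at the length where the true rate happens to equal the assumed constant; shorter correct answers are over-rewarded and longer ones under-rewarded, so a policy whose answer lengths drift during training would see a drifting bias.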
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$ -- the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper models imperfect verifiers in RLVR as a memoryless stochastic reward channel with fixed false-positive rate ρ₀ and false-negative rate ρ₁. From this model it derives (i) a backward correction producing an unbiased surrogate reward (and thus unbiased policy-gradient estimator) and (ii) a forward correction that reweights the score-function estimator so its expectation aligns with the clean gradient (requiring only ρ₁). Both are implemented as lightweight modifications to a GRPO pipeline; experiments on math-reasoning tasks report that both corrections improve performance under synthetic and real verifier noise, with the forward variant more stable under heavier noise. An appeals mechanism using a lightweight LLM verifier is introduced to estimate ρ₁ online.
Significance. If the derivations and empirical gains hold under the stated noise model, the work supplies practical, low-overhead corrections that can be dropped into existing RLVR pipelines without altering the core optimizer. The online FN-rate estimator via appeals is a concrete engineering contribution that addresses a practical deployment issue. The approach is directly relevant to scaling automated-verifier RL for reasoning tasks.
major comments (2)
- [§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.
- [Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.
minor comments (2)
- [§3] Notation for the two rates is introduced as ρ₀ (FP) and ρ₁ (FN) in the abstract but should be restated with a short table or equation block at the start of §3 for readers who skip the abstract.
- [Appeals mechanism] The appeals mechanism is described only at a high level; a short pseudocode block or explicit update rule for the online ρ₁ estimator would improve reproducibility.
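One possible shape for such an update rule is sketched below. It is not the paper's estimator: the class name, the pseudo-count priors, and the decay factor are all hypothetical. The sketch also assumes that the appeal verifier's verdict on a rejected answer is treated as ground truth and that the primary verifier's acceptances are all correct (ρ₀ ≈ 0), so that ρ₁ can be estimated as overturned rejections divided by all answers judged correct.

```python
class AppealsFNRateEstimator:
    """Hypothetical running estimate of the false-negative rate rho1."""

    def __init__(self, decay=1.0, prior_fn=1.0, prior_tp=1.0):
        self.decay = decay    # values below 1.0 down-weight older rollouts
        self.fn = prior_fn    # pseudo-count: rejected, then overturned on appeal
        self.tp = prior_tp    # pseudo-count: accepted by the primary verifier

    def update(self, verifier_accepts, appeal_accepts=None):
        """Record one rollout; appeal_accepts is only read for rejected answers."""
        self.fn *= self.decay
        self.tp *= self.decay
        if verifier_accepts:
            self.tp += 1.0
        elif appeal_accepts:
            self.fn += 1.0
        # A rejection upheld on appeal is treated as a true negative and
        # carries no information about rho1.

    @property
    def rho1(self):
        """Estimated P(verifier rejects | answer is correct)."""
        return self.fn / (self.fn + self.tp)


estimator = AppealsFNRateEstimator()
for _ in range(7):
    estimator.update(verifier_accepts=True)
for _ in range(3):
    estimator.update(verifier_accepts=False, appeal_accepts=True)
for _ in range(5):
    estimator.update(verifier_accepts=False, appeal_accepts=False)
print(estimator.rho1)
```

Setting the decay below 1.0 would let the estimate track a rate that changes over training, at the cost of higher variance; whether that trade-off is appropriate depends on how stationary the verifier's errors are.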
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the central role of the independence assumption in our noise model. We address each major comment below and commit to revisions that strengthen the presentation of the assumptions and provide additional validation.
-
Referee: [§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.
Authors: We agree that the memoryless channel with policy- and answer-independent rates is a load-bearing assumption required for the exact unbiasedness of the backward correction and the directional alignment of the forward correction. The derivations in §3 are stated under this model. While a general proof of robustness to arbitrary dependence is outside the scope of the present work, we will revise the manuscript to (i) explicitly restate the assumption and discuss its practical relevance for math-reasoning verifiers (where error is driven primarily by semantic mismatch rather than policy-induced distributional shifts) and (ii) introduce a concrete diagnostic that bins answers by length and syntactic features, estimates empirical ρ₁ within each bin across training epochs, and flags statistically significant policy dependence. If dependence is observed, the appeals-based estimator can be extended to condition on these features. revision: yes
-
Referee: [Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.
Authors: We acknowledge that the current experimental suite uses stationary noise rates. In the revision we will add a new ablation in which the false-negative rate is made explicitly dependent on answer length (a property that evolves during training). We will generate synthetic data under this length-dependent noise model, re-run the GRPO pipeline with both corrections, and report whether performance gains relative to the uncorrected baseline persist. We will also apply the binning diagnostic described above to the existing real-verifier experiments and include the results. revision: yes
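The binning diagnostic proposed in the first response could look roughly like the sketch below. It assumes access to an audited sample in which the true correctness of each answer is known; the record layout, the function name, and the bin edges are hypothetical, and a significance test across bins or epochs would still need to be layered on top.

```python
from collections import defaultdict


def fn_rate_by_length_bin(records, bin_edges):
    """Empirical false-negative rate per answer-length bin.

    records: iterable of (answer_length, truly_correct, verifier_accepts).
    bin_edges: ascending thresholds; bin k holds lengths that are >= exactly
    k of the edges.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [false negatives, correct answers]
    for length, truly_correct, verifier_accepts in records:
        if not truly_correct:
            continue  # rho1 is defined only on truly correct answers
        bin_index = sum(length >= edge for edge in bin_edges)
        counts[bin_index][1] += 1
        if not verifier_accepts:
            counts[bin_index][0] += 1
    return {b: fn / total for b, (fn, total) in sorted(counts.items())}


# Toy audited sample, invented for illustration.
sample = [
    (10, True, True), (20, True, False),
    (150, True, False), (160, True, False), (170, True, True),
    (50, False, False),
]
rates = fn_rate_by_length_bin(sample, bin_edges=[100])
print(rates)
```

Running the same function on samples drawn at different training epochs and comparing the per-bin rates is one way to check whether ρ₁ is drifting with the policy.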
Circularity Check
No circularity: derivations are direct mathematical consequences of the explicitly stated noise-channel model
The paper defines a memoryless stochastic reward channel with fixed rates ρ0 (FP) and ρ1 (FN), then algebraically derives the backward correction (unbiased surrogate reward) and forward correction (reweighted score-function estimator) as expectations conditional on the true label y. These steps follow immediately from the channel definition and do not reduce to any fitted quantity on the evaluation data, any self-citation chain, or any renaming of an empirical pattern. The online FN-rate estimator via appeals is presented as a separate practical mechanism under the same independence assumption; it does not feed back into the derivation of the corrections themselves. The central claims therefore remain independent of the results they produce.
Axiom & Free-Parameter Ledger
free parameters (2)
- ρ0 (false-positive rate)
- ρ1 (false-negative rate)
axioms (2)
- domain assumption: Verifier errors are memoryless and independent of the policy being trained.
- standard math: The policy-gradient theorem continues to hold when the observed reward is replaced by the corrected surrogate.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ₀ and ρ₁ … instance-independent class-conditional noise rates (ρ₀, ρ₁) that do not vary with (x, y)
-
IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
the estimator R̂ = (R̃ − ρ₀) / (1 − ρ₀ − ρ₁) is an unbiased estimator … E[Δθ] = c ∇θJ(θ) with c = (1 − ρ₀ − ρ₁)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
-
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
-
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
-
On Training in Imagination
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...
-
Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
On Training in Imagination
The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.
-
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
-
High-Dimensional Statistics: Reflections on Progress and Open Problems
A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
Reference graph
Works this paper leans on
-
[1]
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327, 2024
-
[2]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
-
[3]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024
-
[4]
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: A...
-
[5]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedin...
-
[6]
Association for Computational Linguistics, 2024
-
[7]
Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning. arXiv preprint arXiv:2505.22203, 2025
-
[8]
Hugging Face. Math-verify: A robust mathematical expression evaluator for llm outputs. GitHub repository, 2025. URL https://github.com/huggingface/Math-Verify
-
[9]
HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024
-
[10]
Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Pr...
-
[11]
Vishesh Karwa and Edoardo M Airoldi. On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference. arXiv preprint arXiv:2312.01234, 2023
-
[12]
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. ...
-
[13]
Junnan Li, Richard Socher, and Steven C. H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020
-
[14]
Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to-end label-noise learning without anchor points. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6403–6413. PMLR, 2021
-
[15]
Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang. Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains. arXiv preprint arXiv:2507.09884, 2025
-
[16]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
-
[17]
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,
-
[18]
math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23
-
[19]
Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025
-
[20]
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting h...
-
[21]
OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025
-
[22]
Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2233–2241. IEEE Computer Society, 2017
-
[23]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[24]
Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based prompt injection attack to llm-as-a-judge. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.), Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, Octob...
-
[25]
Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. arXiv preprint arXiv:2406.07791, 2025
-
[26]
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022
-
[27]
Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller (eds.), Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 10...
-
[28]
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges. arXiv preprint arXiv:2406.12624, 2024
-
[29]
Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA...
-
[30]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023
-
[31]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural I...
-
[32]
Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025
-
[33]
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992
-
[34]
Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Tinyv: Reducing false negatives in verification improves rl for llm reasoning. arXiv preprint arXiv:2505.14625, 2025
-
[35]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...
-
[36]
Yulai Zhao, Haolin Liu, Dian Yu, SY Kung, Haitao Mi, and Dong Yu. One token to fool llm-as-a-judge. arXiv preprint arXiv:2507.08794, 2025
-
[37]
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net...
-
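[38]
The unconditional expectation is zero: $\mathbb{E}[G_t] = 0$ [32, 26]
-
[39]
The clean policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}[R^* G_t]$. From property 1, we have $\mathbb{E}[G_t] = \mathbb{E}[(\mathbf{1}_{\{R^*=1\}} + \mathbf{1}_{\{R^*=0\}})\, G_t] = \mathbb{E}[R^* G_t] + \mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = 0$. This implies that $\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -\mathbb{E}[R^* G_t] = -\nabla_\theta J(\theta)$. Finally, we substitute this back into our expression for the expected update direction: $\mathbb{E}[h_t] = \mathbb{E}[w_{\tilde{R}}\, G_t] = -(1-\rho_0-\rho_1)\cdot\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -(1-\rho_0-\rho_1)\cdot(-\nabla_\theta J(\theta)) = (1-\rho_0-$...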
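The directional-alignment identity in the excerpt above can be checked exactly on a toy softmax policy. The sketch below is an illustration under stated assumptions, not the paper's implementation: the weights w_accept = 1 − ρ₁ and w_reject = −ρ₁ are one hypothetical choice that uses only the false-negative rate and satisfies w_accept − w_reject = 1, and the three-action policy and reward labels are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(action, probs):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def clean_gradient(probs, true_rewards):
    """Exact E[R* G] over actions, the clean policy gradient."""
    grad = [0.0] * len(probs)
    for a, p in enumerate(probs):
        g = score(a, probs)
        for i in range(len(grad)):
            grad[i] += p * true_rewards[a] * g[i]
    return grad


def expected_forward_update(probs, true_rewards, rho0, rho1, w_accept, w_reject):
    """Exact E[w G], where w is chosen by the noisy verifier verdict."""
    grad = [0.0] * len(probs)
    for a, p in enumerate(probs):
        p_accept = (1.0 - rho1) if true_rewards[a] == 1 else rho0
        mean_weight = p_accept * w_accept + (1.0 - p_accept) * w_reject
        g = score(a, probs)
        for i in range(len(grad)):
            grad[i] += p * mean_weight * g[i]
    return grad


rho0, rho1 = 0.1, 0.3                 # illustrative rates only
probs = softmax([0.2, -0.5, 1.0])     # toy three-action policy
true_rewards = [1, 0, 0]              # only action 0 is truly correct
update = expected_forward_update(
    probs, true_rewards, rho0, rho1, w_accept=1.0 - rho1, w_reject=-rho1
)
clean = clean_gradient(probs, true_rewards)
scale = 1.0 - rho0 - rho1
print(update, [scale * c for c in clean])
```

The two printed vectors agree up to floating-point error: the expected update is the clean gradient scaled by 1 − ρ₀ − ρ₁. The false-positive rate only changes the step size, which is consistent with the claim that the forward correction needs only the false-negative rate.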