RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Pith reviewed 2026-05-21 22:06 UTC · model grok-4.3
The pith
Extracting binary yes-no principles from human feedback lets reward models beat traditional preference models on alignment benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposing natural language feedback into binary principles that a response either satisfies or does not satisfy, then training reward models to judge entailment against those principles, produces reward models that surpass Bradley-Terry models trained on matched data and reach 86.2 percent on RM-Bench and 81.4 percent on JudgeBench while allowing principle selection at inference time.
What carries the argument
Binary Flexible Feedback extraction that turns natural language comments into yes-no principles and frames reward modeling as an entailment task between a response and each principle.
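To make the entailment framing concrete, here is a minimal Python sketch of how one piece of feedback might be expanded into per-principle training examples. The names `Principle`, `EntailmentExample`, and `to_examples` are hypothetical, and the principles are hand-written stand-ins for the extraction step, whose model and prompts are not specified in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    """A yes/no criterion extracted from natural language feedback."""
    name: str        # e.g. "accuracy of information"
    satisfied: bool  # True = "yes", False = "no"


@dataclass(frozen=True)
class EntailmentExample:
    """One training example: does `response` satisfy `principle`?"""
    prompt: str
    response: str
    principle: str
    label: int  # 1 if satisfied, 0 otherwise


def to_examples(prompt, response, principles):
    """Expand one (prompt, response) pair into one example per principle."""
    return [
        EntailmentExample(prompt, response, p.name, int(p.satisfied))
        for p in principles
    ]


# Hand-written stand-in for the output of the extraction step (assumption).
principles = [
    Principle("accuracy of information", True),
    Principle("code readability", False),
]
examples = to_examples(
    "Write a sort function.", "def s(a): return sorted(a)", principles
)
for ex in examples:
    print(ex.principle, "->", "yes" if ex.label else "no")
```

Each response yields as many binary examples as it has extracted principles, which is what lets a single model be queried later against any one principle.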
If this is right
- Reward models achieve 86.2 percent on RM-Bench and 81.4 percent on JudgeBench.
- An aligned Qwen3-32B model matches or exceeds o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at under five percent of the inference cost.
- Users can specify any set of principles at inference time to steer the reward model toward chosen quality aspects.
- The same data produces stronger results than Bradley-Terry training because the binary format supplies explicit criteria.
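The inference-time steering described in the list above can be pictured with a short sketch. It assumes a reward model that returns, for each principle, a probability that the response satisfies it; the weighted-average aggregation and the function name `select_and_score` are illustrative assumptions, not the paper's stated method, and the probabilities are made up.

```python
def select_and_score(principle_probs, selected, weights=None):
    """Weighted average of 'satisfied' probabilities over selected principles.

    principle_probs: dict mapping principle name -> P(response satisfies it),
        as some reward model might produce (stubbed below).
    selected: principle names the user cares about at inference time.
    weights: optional dict of per-principle weights; defaults to uniform.
    """
    if not selected:
        raise ValueError("select at least one principle")
    missing = [p for p in selected if p not in principle_probs]
    if missing:
        raise KeyError(f"no score for principles: {missing}")
    weights = weights or {p: 1.0 for p in selected}
    total = sum(weights[p] for p in selected)
    return sum(weights[p] * principle_probs[p] for p in selected) / total


# Stubbed reward-model outputs for a single response (made-up numbers).
probs = {
    "accuracy of information": 0.9,
    "code readability": 0.2,
    "conciseness": 0.6,
}
accuracy_only = select_and_score(probs, ["accuracy of information"])
style_focused = select_and_score(probs, ["code readability", "conciseness"])
print(accuracy_only, style_focused)
```

The same response receives a different reward depending on which principles are selected, which is the property a single scalar Bradley-Terry score does not expose.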
Where Pith is reading between the lines
- Custom principle selection at inference time could support domain-specific alignment without new training runs.
- Making feedback criteria explicit may reduce reward hacking by limiting the reward model to stated principles.
- The binary decomposition approach might extend to other feedback sources such as automated verifiers or multi-turn conversations.
Load-bearing premise
Natural language feedback can be split into binary principles that keep the main aspects of response quality without losing important detail or adding extraction mistakes.
What would settle it
A head-to-head test on identical data where the binary-principle reward models score no higher than Bradley-Terry models on RM-Bench or JudgeBench would falsify the performance claim.
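A head-to-head test of this kind reduces to comparing pairwise accuracy for two scorers on the same chosen/rejected pairs. The sketch below assumes that setup; both scoring functions are toy stand-ins picked only so the two accuracies differ, and they say nothing about how real principle-based or Bradley-Terry models behave.

```python
def pairwise_accuracy(score_fn, pairs):
    """Fraction of (chosen, rejected) pairs where chosen scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for chosen, rejected in pairs if score_fn(chosen) > score_fn(rejected)
    )
    return correct / len(pairs)


# Toy stand-ins for the two reward models (assumptions, not real models).
def principle_rm(text):
    return float("correct" in text)


def baseline_rm(text):
    return float(len(text))  # deliberately length-biased toy scorer


pairs = [
    ("short correct answer", "a much longer answer that is wrong"),
    ("correct and detailed explanation", "wrong"),
]
acc_principle = pairwise_accuracy(principle_rm, pairs)
acc_baseline = pairwise_accuracy(baseline_rm, pairs)
print(acc_principle, acc_baseline)
```

The claim would be falsified if, with real models and identical training data, the first number were no higher than the second on RM-Bench or JudgeBench.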
Original abstract
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RLBFF, which extracts binary principles (e.g., 'accuracy of information: yes') from natural language human feedback to train reward models as an entailment task rather than standard Bradley-Terry ranking. This enables interpretable rewards and inference-time customization by specifying principles of interest. The authors claim RLBFF reward models outperform matched Bradley-Terry models, achieve state-of-the-art results on RM-Bench (86.2%) and JudgeBench (81.4%, #1 as of September 24, 2025), and provide a fully open recipe (data and code) to align Qwen3-32B via RLBFF to match or exceed o3-mini and DeepSeek-R1 on MT-Bench, WildBench, and Arena Hard v2 at <5% inference cost.
Significance. If the results hold under scrutiny, RLBFF offers a practical bridge between the flexibility of RLHF and the precision of RLVR, with added benefits of interpretability and user-specified customization at inference time. The open release of models, data, and training recipe is a clear strength that supports reproducibility and adoption. The approach could influence post-training practices if the binary decomposition reliably captures nuanced preferences without systematic loss.
major comments (3)
- [§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.
- [§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor for the contribution of the entailment training versus data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.
- [§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.
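One agreement metric of the kind the last major comment asks for is Cohen's kappa between the extracted yes/no labels and a human reading of the same feedback. The sketch below is a plain-Python implementation of the standard formula; the label lists are made-up illustrations, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels: extracted yes/no vs. a human reading of the same feedback.
extracted = ["yes", "no", "yes", "yes", "no", "no"]
human = ["yes", "no", "yes", "no", "no", "no"]
kappa = cohens_kappa(extracted, human)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters here because yes/no labels are often imbalanced.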
minor comments (2)
- [Abstract] The abstract's reference to the JudgeBench leaderboard position 'as of September 24, 2025' would benefit from a direct link or archived snapshot for independent verification.
- [§3] Notation for the entailment task (response satisfies principle or not) could be formalized with a short equation or pseudocode for clarity, especially when contrasting with Bradley-Terry loss.
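As an illustration of the contrast the last minor comment asks for, the sketch below writes out the textbook forms of the two objectives: a pairwise Bradley-Terry loss over two scalar rewards, and a binary cross-entropy loss over a single "satisfies the principle" logit. These are standard formulations assumed for illustration; the paper's exact objective and notation may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))


def entailment_loss(logit_yes, label):
    """Binary cross-entropy on 'response satisfies principle' (label 1 or 0)."""
    p_yes = sigmoid(logit_yes)
    return -(label * math.log(p_yes) + (1 - label) * math.log(1.0 - p_yes))


bt = bradley_terry_loss(reward_chosen=1.5, reward_rejected=0.5)
ent_yes = entailment_loss(logit_yes=2.0, label=1)
ent_no = entailment_loss(logit_yes=2.0, label=0)
print(bt, ent_yes, ent_no)
```

The Bradley-Terry loss needs a pair of responses and never names a criterion, whereas the entailment loss is defined for a single response against one explicit principle.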
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify areas where the manuscript can be strengthened. We address each major comment point by point below, indicating the revisions we will make in the next version of the paper.
- Referee: [§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.
  Authors: We agree that greater specificity on the extraction procedure is needed to support the central claims and enable reproducibility. While §3 outlines the conceptual approach, the revised manuscript will add a dedicated subsection detailing the exact LLM employed for decomposition, the complete prompts, all filtering steps, observed error rates, and direct validation comparing the extracted binary principles against the original natural language feedback distributions. This addition will allow readers to evaluate potential loss of nuance or introduction of artifacts.
  Revision: yes
- Referee: [§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor for the contribution of the entailment training versus data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.
  Authors: We acknowledge that additional controls and ablations are required to confidently attribute performance to the entailment formulation rather than data curation or extraction choices. The current version describes the data-matching procedure, but the revised manuscript will incorporate expanded ablation studies, full hyperparameter details, and targeted controls that isolate the contribution of entailment training from the extraction process. These will be presented in §4 and §5 alongside the main results.
  Revision: yes
- Referee: [§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.
  Authors: The referee correctly notes the absence of quantitative fidelity assessment. To directly evaluate whether binary principles preserve key aspects of the original feedback, the revised manuscript will add agreement metrics and human validation results in §3. These will include inter-annotator agreement scores and human ratings of how faithfully the extracted principles capture the original natural language feedback.
  Revision: yes
Circularity Check
No circularity: central claims rest on external benchmark evaluation independent of training inputs
The paper introduces RLBFF by extracting binary principles from natural-language feedback and training reward models as an entailment task. Performance is reported on independent public benchmarks (RM-Bench 86.2%, JudgeBench 81.4%) and alignment suites (MT-Bench, WildBench, Arena Hard v2) rather than on any quantity defined from the training data or fitted parameters. No equations, derivations, or self-citations are shown that reduce the claimed improvements to the input feedback or extraction process by construction. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Principle extraction procedure
axioms (1)
- Domain assumption: Binary yes/no answers to extracted principles preserve the essential information in human feedback for reward modeling.