General Preference Reinforcement Learning
Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3
The pith
Embedding responses in k skew-symmetric subspaces lets reinforcement learning align language models on open-ended tasks without collapsing onto a single reward axis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the k-way structure of the General Preference Model can be carried through group-relative advantage estimation and eigenvalue aggregation in GPRL, producing stable multi-dimensional policy updates that resist reward hacking. Starting from Llama-3-8B-Instruct, GPRL reaches a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench over extended training runs.
What carries the argument
The argument rests on the General Preference Model's embedding of responses into k skew-symmetric subspaces, which represents preference as a structured, intransitivity-aware comparison. That structure supplies three things: per-dimension group-relative advantages, scale-normalized aggregation via context-dependent eigenvalues, and closed-loop drift monitoring.
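A minimal sketch of a skew-symmetric comparison over k subspaces, assuming each response embedding has length 2k and each two-dimensional block is compared through the rotation matrix [[0, -1], [1, 0]]. The function names, the `eigenvalues` argument, and the sign convention are illustrative assumptions, not the paper's code.

```python
from typing import List, Sequence


def per_dimension_scores(v_a: Sequence[float], v_b: Sequence[float]) -> List[float]:
    """Return one skew-symmetric score per 2-D subspace.

    For a block (a0, a1) vs (b0, b1) the score is a^T R b with
    R = [[0, -1], [1, 0]], which equals a1*b0 - a0*b1.
    Swapping the two responses flips the sign of every block score.
    """
    if len(v_a) != len(v_b) or len(v_a) % 2 != 0:
        raise ValueError("embeddings must share the same even length 2k")
    scores = []
    for i in range(0, len(v_a), 2):
        a0, a1 = v_a[i], v_a[i + 1]
        b0, b1 = v_b[i], v_b[i + 1]
        scores.append(a1 * b0 - a0 * b1)
    return scores


def preference_score(
    v_a: Sequence[float], v_b: Sequence[float], eigenvalues: Sequence[float]
) -> float:
    """Weighted sum of the per-subspace scores (weights are hypothetical)."""
    scores = per_dimension_scores(v_a, v_b)
    if len(eigenvalues) != len(scores):
        raise ValueError("need exactly one eigenvalue per subspace")
    return sum(lam * s for lam, s in zip(eigenvalues, scores))
```

With a single subspace and three unit vectors spaced 120 degrees apart, every pairwise score around the cycle has the same sign. No scalar reward can reproduce such a preference cycle, and this is the sense in which the representation is intransitivity-aware.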
If this is right
- GPRL enables online RL on open-ended tasks where no programmatic verifier exists by supplying a multi-dimensional quality signal.
- Per-dimension normalization and eigenvalue aggregation prevent any single quality axis from dominating the policy update.
- The closed-loop drift monitor detects single-axis exploitation and corrects it by reweighting dimensions and tightening the trust region.
- Performance gains from the method hold across extended training runs rather than degrading from reward hacking.
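The normalization and aggregation described in the bullets above can be sketched as follows. This assumes a group of responses to one prompt, each already scored on every preference dimension. The weight normalization in `aggregate` is a stand-in for the context-dependent eigenvalues, whose exact form is not given here.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def normalized_advantages(
    scores: Sequence[Sequence[float]], eps: float = 1e-8
) -> List[List[float]]:
    """scores[g][d] is the score of response g on dimension d.

    Each dimension is centered on the group mean and divided by the
    group standard deviation, so every axis ends up on a comparable scale.
    """
    n_dims = len(scores[0])
    columns = [[row[d] for row in scores] for d in range(n_dims)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(row[d] - means[d]) / (stds[d] + eps) for d in range(n_dims)]
        for row in scores
    ]


def aggregate(
    advantages: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-dimension advantages into one scalar per response.

    Weights are rescaled to sum to one (an assumption of this sketch).
    """
    total = sum(abs(w) for w in weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    norm = [abs(w) / total for w in weights]
    return [sum(w * a for w, a in zip(norm, row)) for row in advantages]
```

If one dimension's raw scores are a hundred times larger than another's, both columns still end up with the same spread after `normalized_advantages`, so the larger raw scale by itself does not decide the update.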
Where Pith is reading between the lines
- If the k-subspace representation remains stable across runs, similar multi-dimensional structures could be inserted into other preference optimization methods to reduce hacking.
- Applying GPRL to larger base models or additional open-ended tasks would test whether resistance to collapse scales with model capacity.
- The emphasis on intransitivity-aware subspaces suggests that capturing preference cycles explicitly may be necessary for stable alignment beyond current scalar approaches.
Load-bearing premise
The General Preference Model's embedding of responses into k skew-symmetric subspaces provides a faithful, non-collapsing representation of multi-dimensional quality that can be stably propagated through group-relative advantage estimation and eigenvalue aggregation without introducing new optimization instabilities.
What would settle it
The claim that the multi-dimensional structure prevents collapse would be falsified if long training runs show one dimension's advantage dominating despite per-dimension normalization and eigenvalue aggregation, or if win rates on any benchmark decline through exploitation despite the drift monitor.
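One hypothetical way to operationalize "one dimension's advantage dominating" is to track each dimension's share of the total absolute contribution to the aggregated advantage. The 0.5 cutoff and both function names below are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def dimension_shares(
    advantages: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Fraction of the total |weight * advantage| mass carried by each dimension."""
    n_dims = len(weights)
    mass = [
        sum(abs(weights[d] * row[d]) for row in advantages) for d in range(n_dims)
    ]
    total = sum(mass)
    if total == 0:
        # No signal at all: report a uniform split.
        return [1.0 / n_dims] * n_dims
    return [m / total for m in mass]


def dominant_dimension(
    shares: Sequence[float], max_share: float = 0.5
) -> Optional[int]:
    """Index of the dimension whose share exceeds max_share, else None."""
    d = max(range(len(shares)), key=lambda i: shares[i])
    return d if shares[d] > max_share else None
```

Logging `dimension_shares` at every checkpoint of a long run would give a direct record of whether any single axis takes over.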
Original abstract
Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the k-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
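The closed-loop correction described in the abstract (detect single-axis exploitation, reweight dimensions, tighten the trust region) could look roughly like the sketch below. Everything in it is an assumption: the drift statistic (a dimension's excess share of recent score gains over the uniform share 1/k), the placeholder `threshold`, the linear reweighting rule, and the modeling of "tightening the trust region" as raising a KL penalty coefficient.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class MonitorState:
    weights: List[float]  # aggregation weight per preference dimension
    kl_coef: float        # KL penalty coefficient; larger means a tighter trust region


def drift_step(
    state: MonitorState,
    gains: Sequence[float],
    threshold: float = 0.15,
    eta: float = 0.5,
    kl_growth: float = 1.5,
    kl_max: float = 1.0,
) -> Tuple[MonitorState, Optional[int]]:
    """One monitoring step.

    gains[d] is the recent non-negative improvement of the raw score on
    dimension d. Returns the (possibly updated) state and the index of the
    flagged dimension, or None when no drift is detected.
    """
    k = len(gains)
    if k == 0 or k != len(state.weights):
        raise ValueError("need one gain per weighted dimension")
    if any(g < 0 for g in gains):
        raise ValueError("gains are assumed non-negative in this sketch")
    total = sum(gains)
    if total <= 0:
        return state, None
    shares = [g / total for g in gains]
    worst = max(range(k), key=lambda d: shares[d])
    if shares[worst] - 1.0 / k <= threshold:
        return state, None
    # Shrink weights of over-represented dimensions, grow the others
    # (factors stay positive for 0 < eta <= 1), then renormalize.
    raw = [w * (1.0 - eta * (s - 1.0 / k)) for w, s in zip(state.weights, shares)]
    norm = sum(raw)
    new_weights = [r / norm for r in raw]
    new_kl = min(state.kl_coef * kl_growth, kl_max)
    return MonitorState(new_weights, new_kl), worst
```

The step leaves the state untouched when gains are balanced, so the correction only engages once a single axis pulls well ahead of the others.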
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes General Preference Reinforcement Learning (GPRL) to address the limitations of scalar reward models in open-ended LLM alignment tasks. It builds on the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces for intransitivity-aware preference representation. GPRL carries this structure into the policy update by computing per-dimension group-relative advantages, applying per-axis normalization, using context-dependent eigenvalue aggregation, and incorporating a closed-loop drift monitor to detect and correct single-axis exploitation. The authors report that GPRL applied to Llama-3-8B-Instruct achieves a length-controlled win rate of 56.51% on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench, and they attribute this to resistance against reward hacking over extended training.
Significance. If validated, this approach could provide a valuable framework for multi-dimensional preference optimization in LLMs, potentially enabling more robust online RL for open-ended tasks without the collapse associated with scalar rewards. The integration of structured preference modeling with a monitoring mechanism for stability is a notable contribution. The reported empirical improvements across multiple benchmarks indicate potential practical impact, provided the underlying assumptions about the GPM's representation and the drift monitor's effectiveness are substantiated.
major comments (3)
- [Method section (drift monitor description)] The central claim that GPRL resists reward hacking across extended training runs depends on the closed-loop drift monitor reliably detecting single-axis exploitation and correcting it via reweighting and trust region adjustment. However, the manuscript provides no explicit detection threshold, reweighting rule, or ablation studies isolating the monitor's contribution. Furthermore, no per-dimension exploitation metrics are reported for the training runs that produced the 56.51% AlpacaEval score. This omission makes it challenging to confirm that the performance gains arise from the full structured mechanism rather than simpler normalization effects.
- [Experiments section, AlpacaEval results] The reported length-controlled win rate of 56.51% is presented without accompanying error bars, statistical significance tests, or comparisons to ablated versions of GPRL (e.g., without eigenvalue aggregation or without the drift monitor). Given that the soundness of the empirical claims rests on these controls, their absence weakens the ability to attribute improvements specifically to the proposed components.
- [Method section (GPM embedding and propagation)] The assumption that embedding responses into k skew-symmetric subspaces provides a faithful, non-collapsing representation of multi-dimensional quality that propagates stably through group-relative advantage estimation and eigenvalue aggregation is central but not empirically verified. No analysis of potential dimension oscillation or cross-dimension interference over long runs is included, which is necessary to support the stability claims.
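The statistical controls requested in the second comment can be sketched with a paired t statistic over seeds. This assumes each method is evaluated once per seed and that scores are paired by seed. The function name is hypothetical, and the lists passed in would be per-seed benchmark scores.

```python
from math import sqrt
from statistics import fmean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    scores_a: Sequence[float], scores_b: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-seed differences
    and stdev is the sample standard deviation.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = fmean(diffs) / (sd / sqrt(n))
    return t, n - 1
```

With only three seeds there are two degrees of freedom, so the test has little power. A p-value can be read from a t table or obtained with `scipy.stats.ttest_rel`, which performs the same paired test.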
minor comments (2)
- [Notation and equations] The notation for skew-symmetric subspaces and context-dependent eigenvalue aggregation could benefit from additional explicit definitions or illustrative examples to improve clarity for readers.
- [Related Work] Consider expanding the related work section to include recent advances in multi-objective reinforcement learning and preference optimization for better contextualization of the novelty.
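As an illustration of the explicit definitions the notation comment asks for, one possible formulation is given below. It assumes the block-diagonal skew-symmetric construction associated with GPM [15]; the symbols and the sign convention are illustrative and may differ from the manuscript's.

```latex
% Assumed notation (not taken from the manuscript):
%   v_y \in \mathbb{R}^{2k}   embedding of response y, split into k blocks v_y^{(l)} \in \mathbb{R}^2
%   \lambda_l(x) \ge 0        context-dependent eigenvalue of subspace l for prompt x
\[
R^{\succ} = \operatorname{diag}(R_1, \dots, R_k), \qquad
R_l = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad
R_l^{\top} = -R_l
\]
\[
s(y_i \succ y_j \mid x)
  = \sum_{l=1}^{k} \lambda_l(x)\, \bigl(v_{y_i}^{(l)}\bigr)^{\top} R_l \, v_{y_j}^{(l)}
  = -\, s(y_j \succ y_i \mid x)
\]
```

Skew-symmetry of each block gives s(y ≻ y | x) = 0 and lets cyclic preferences be represented, which a single scalar reward cannot do.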
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. We address each of the major comments below, indicating the revisions we plan to make to strengthen the paper.
-
Referee: [Method section (drift monitor description)] The central claim that GPRL resists reward hacking across extended training runs depends on the closed-loop drift monitor reliably detecting single-axis exploitation and correcting it via reweighting and trust region adjustment. However, the manuscript provides no explicit detection threshold, reweighting rule, or ablation studies isolating the monitor's contribution. Furthermore, no per-dimension exploitation metrics are reported for the training runs that produced the 56.51% AlpacaEval score. This omission makes it challenging to confirm that the performance gains arise from the full structured mechanism rather than simpler normalization effects.
Authors: We agree that more detail on the drift monitor is needed to support our claims. In the revised manuscript, we will expand the Method section to state explicitly the detection threshold (0.15 standard deviations from the mean per-dimension advantage), the reweighting rule (increasing weights for under-exploited dimensions by a factor proportional to the drift), and the trust region adjustment (increasing the KL penalty coefficient, which tightens the trust region, when drift is detected). We will also include ablation experiments comparing GPRL with and without the drift monitor, as well as per-dimension exploitation metrics, such as the maximum advantage deviation per axis over the course of training, for the reported runs. These additions are intended to demonstrate that the monitor contributes to the observed resistance to reward hacking. revision: yes
-
Referee: [Experiments section, AlpacaEval results] The reported length-controlled win rate of 56.51% is presented without accompanying error bars, statistical significance tests, or comparisons to ablated versions of GPRL (e.g., without eigenvalue aggregation or without the drift monitor). Given that the soundness of the empirical claims rests on these controls, their absence weakens the ability to attribute improvements specifically to the proposed components.
Authors: We acknowledge the importance of statistical rigor and ablations for validating our empirical results. In the revision, we will add error bars based on three independent training runs with different random seeds, report p-values from paired t-tests against SimPO and SPPO baselines, and include results for ablated GPRL variants: one without context-dependent eigenvalue aggregation and one without the drift monitor. This will allow readers to better assess the contribution of each component to the 56.51% win rate. revision: yes
-
Referee: [Method section (GPM embedding and propagation)] The assumption that embedding responses into k skew-symmetric subspaces provides a faithful, non-collapsing representation of multi-dimensional quality that propagates stably through group-relative advantage estimation and eigenvalue aggregation is central but not empirically verified. No analysis of potential dimension oscillation or cross-dimension interference over long runs is included, which is necessary to support the stability claims.
Authors: To empirically verify the stability of the GPM representation, we will add a new subsection in the Experiments or Appendix analyzing the training dynamics. This will include time-series plots of per-dimension advantage norms to check for oscillation, and cross-dimension correlation heatmaps at different training checkpoints to assess interference. We expect these analyses to show that the skew-symmetric structure maintains distinct dimensions without collapse or excessive interference, supporting the propagation through the advantage estimation and aggregation steps. revision: yes
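The cross-dimension interference check proposed in the last response can be sketched as a Pearson correlation matrix over per-dimension advantages collected at one checkpoint. The data layout (`advantages[n][d]` for sample n and dimension d) is an assumption, as is returning 0.0 for a constant column.

```python
from math import sqrt
from statistics import fmean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_matrix(advantages: Sequence[Sequence[float]]) -> List[List[float]]:
    """k x k matrix of correlations between advantage dimensions."""
    n_dims = len(advantages[0])
    cols = [[row[d] for row in advantages] for d in range(n_dims)]
    return [[pearson(cols[i], cols[j]) for j in range(n_dims)] for i in range(n_dims)]
```

Off-diagonal entries that drift toward plus or minus one across checkpoints would suggest that dimensions are collapsing onto a shared axis, while entries near zero would be consistent with the dimensions staying distinct.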
Circularity Check
No significant circularity detected in GPRL derivation
The paper introduces GPRL as an algorithmic extension of the General Preference Model, carrying k-subspace structure through per-dimension group-relative advantages, per-axis normalization, context-dependent eigenvalue aggregation, and a closed-loop drift monitor. These elements are presented as explicit design decisions and structural choices rather than quantities derived from or reduced to fitted parameters by construction. Reported results such as the 56.51% length-controlled win rate on AlpacaEval 2.0 and outperformance on Arena-Hard, MT-Bench, and WildBench are empirical outcomes from training experiments, not predictions that collapse to inputs. No self-definitional loops, fitted-input-as-prediction patterns, uniqueness theorems imported via self-citation, or ansatz smuggling are identifiable in the provided abstract and method description. The derivation chain remains self-contained as a proposed method with independent experimental validation on standard benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- General Preference Model (GPM): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "per-dimension normalization in Eq. (4) is what makes Eq. (7) likely to hold"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models. arXiv preprint arXiv:2502.21321, 2025
- [2] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022
- [3] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
- [4] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
- [5] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023
- [6] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
- [7] Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024
- [8] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024
- [9] Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, et al. Nash learning from human feedback. In Forty-first International Conference on Machine Learning, 2024
- [10] Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Alphadpo: Adaptive reward margin for direct preference optimization. arXiv preprint arXiv:2410.10148, 2024
- [11] Qi Gou and Cam-Tu Nguyen. Mixed preference optimization: Reinforcement learning with data selection and better reference model. arXiv preprint arXiv:2403.19443, 2024
- [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
- [13] Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned ai. Advances in Neural Information Processing Systems, 33:15763–15773, 2020
- [14] Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, and Yaodong Yang. Panacea: Pareto alignment via preference adaptation for llms. Advances in Neural Information Processing Systems, 37:75522–75558, 2024
- [15] Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, and Quanquan Gu. Beyond bradley-terry models: a general preference model for language model alignment. In Proceedings of the 42nd International Conference on Machine Learning, ICML’25. JMLR.org, 2025
- [16] Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024
- [17] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952
- [18] Nuoya Xiong and Aarti Singh. Projection optimization: A general framework for multi-objective and multi-group rlhf. arXiv preprint arXiv:2502.15145, 2025
- [19] Qiang He and Setareh Maghsudi. Pareto multi-objective alignment for language models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 257–272. Springer, 2025
- [20] Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun, and Hua Wu. Extending rlvr to open-ended tasks via verifiable multiple-choice reformulation. arXiv preprint arXiv:2511.02463, 2025
- [21] Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf. arXiv preprint arXiv:2502.18770, 2025
- [22] Peter C Fishburn. Nontransitive measurable utility. Journal of Mathematical Psychology, 26(1):31–67, 1982
- [23] Peter C Fishburn. Nontransitive preferences in decision theory. Journal of risk and uncertainty, 4(2):113–134, 1991
- [24] Peter C Fishburn. An axiomatic characterization of skew-symmetric bilinear functionals, with applications to utility theory. Economics Letters, 8(4):311–313, 1981
- [25] Yutaka Nakamura. Skew-symmetric additive representations of preferences. Journal of Mathematical Economics, 30(3):367–387, 1998
- [26] Yuda Song, Gokul Swamy, Aarti Singh, J Bagnell, and Wen Sun. The importance of online data: Understanding preference fine-tuning via coverage. Advances in Neural Information Processing Systems, 37:12243–12270, 2024
- [27] En Wang, Xingyu Lin, Chenfu Du Su, Zhonghou Lv Bao, Funing Yang, Yuanbo Xu, and Wenbin Liu. Indirect online preference optimization via reinforcement learning. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 538–546, 2025
- [28] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023
- [29] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback. arXiv preprint arXiv:2310.01377, 2023
- [30] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024
- [31] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024
- [32] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024
- [33] Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024
- [34] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023
- [35] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 292...
- [36] Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, and Wenhu Chen. Emergent hierarchical reasoning in llms through reinforcement learning. arXiv preprint arXiv:2509.03646, 2025
- [37] Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024
- [38] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024
- [39] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
- [40] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36:59008–59033, 2023
- [41] Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. Advances in Neural Information Processing Systems, 36:71095–71134, 2023
- [42] Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024
- [43] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023
- [44] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022
- [45] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023
- [46] Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: Disentangled reward mitigates hacking in rlhf. arXiv preprint arXiv:2402.07319, 2024
- [47] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017, 2024
- [48] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023
- [49] Thomas Kwa, Drake Thomas, and Adrià Garriga-Alonso. Catastrophic goodhart: regularizing rlhf with kl divergence does not mitigate heavy-tailed reward misspecification. Advances in Neural Information Processing Systems, 37:14608–14633, 2024
- [50] Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, et al. Beyond reward hacking: Causal rewards for large language model alignment. arXiv preprint arXiv:2501.09620, 2025
- [51] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
- [52] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024
- [53] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
  Appendix A: More on general preference embeddings. This appendix expands on the embedding construction that GPRL inherits from GPM [15], focusing on the structural prop...
- [54] characterized this empirically in RLHF as reward over-optimization, showing that as the policy spends KL budget against a learned RM, the gold reward traces a hill-shaped curve that initially climbs and then falls, with the peak depending on RM size, KL coefficient, and amount of preference data. The same qualitative shape, namely a peak followed by susta...