General Preference Reinforcement Learning

Ahsan Bilal; Andreas Haupt; Arslan Chaudhry; Emily Fox; John M. Cioffi; Muhammad Ahmed Mohsin; Muhammad Umer; Sanmi Koyejo

arxiv: 2605.18721 · v1 · pith:MRW6BJ2Inew · submitted 2026-05-18 · 💻 cs.LG · cs.CL

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords preference optimization · reinforcement learning · LLM alignment · reward hacking · multi-dimensional preferences · policy gradient · general preference model · drift monitoring

The pith

Structuring preferences across multiple skew-symmetric dimensions lets reinforcement learning align LLMs on open-ended tasks without single-axis reward hacking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that scalar reward models in online RL force collapse onto whichever axis the score rewards most, limiting their use for open-ended generation. Instead, the General Preference Model embeds responses into k skew-symmetric subspaces to capture intransitive, multi-dimensional preferences. GPRL extends this structure to the policy update by computing per-dimension group-relative advantages, normalizing each axis independently, and aggregating them with context-dependent eigenvalues. A closed-loop drift monitor detects single-axis exploitation and corrects it by reweighting dimensions and tightening the trust region. If the approach holds, it connects the continuous exploration of verifiable-reward RL with the flexibility of preference optimization, producing more stable alignment across extended runs.

Core claim

GPRL carries the k-way structure of the General Preference Model through to policy updates by computing per-dimension group-relative advantages, normalizing each on its own scale so no axis dominates, and aggregating them via context-dependent eigenvalues; the same structure powers a drift monitor that detects and corrects single-axis exploitation on the fly. Starting from Llama-3-8B-Instruct, this yields a length-controlled win rate of 56.51% on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across longer training.

What carries the argument

The General Preference Model that embeds responses into k skew-symmetric subspaces for intransitivity-aware comparisons; GPRL propagates this structure into per-dimension normalized advantages aggregated by context-dependent eigenvalues plus an on-the-fly drift monitor.
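To make the skew-symmetric comparison concrete, here is a minimal sketch. It assumes each of the k subspaces is two-dimensional and uses the block J = [[0, -1], [1, 0]], and that the context-dependent eigenvalues act as per-subspace weights on the score. These specifics, and the function names, are illustrative assumptions and are not stated in this summary.

```python
import numpy as np

def per_subspace_scores(v_i, v_j, k):
    """Score of response i over response j in each of k 2-D subspaces.

    Assumption: every subspace uses the skew-symmetric block
    J = [[0, -1], [1, 0]], so the score is v_i^T J v_j per subspace.
    """
    a = np.asarray(v_i, dtype=float).reshape(k, 2)
    b = np.asarray(v_j, dtype=float).reshape(k, 2)
    return a[:, 1] * b[:, 0] - a[:, 0] * b[:, 1]

def preference_score(v_i, v_j, eigenvalues):
    """Eigenvalue-weighted sum of the per-subspace scores (hypothetical form)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return float(lam @ per_subspace_scores(v_i, v_j, lam.size))

def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    r = np.deg2rad(deg)
    return np.array([np.cos(r), np.sin(r)])

# Three embeddings spaced 120 degrees apart in a single subspace (k = 1).
a, b, c = unit(0), unit(120), unit(240)
lam = [1.0]
cycle = [
    preference_score(a, b, lam),
    preference_score(b, c, lam),
    preference_score(c, a, lam),
]
```

All three entries of `cycle` have the same sign, so the comparisons form a cycle rather than a ranking. A single scalar reward cannot represent that pattern, because scalar scores always induce a transitive order.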

If this is right

GPRL starting from Llama-3-8B-Instruct reaches 56.51% length-controlled win rate on AlpacaEval 2.0.
It outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench.
The method resists reward hacking across extended training runs by balancing multiple preference dimensions.
Per-dimension normalization and eigenvalue aggregation prevent any single axis from dominating the policy update.
The drift monitor enables on-the-fly correction without requiring post-hoc fixes.
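The per-dimension normalization and aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the raw advantage of a response is taken to be its mean pairwise score against the group, normalization is a z-score within each dimension, and the eigenvalues are passed in as a fixed weight vector. All names are hypothetical.

```python
import numpy as np

def per_dimension_advantages(scores, eps=1e-8):
    """Normalize group-relative advantages separately in each dimension.

    scores: array of shape (k, G, G); scores[l, i, j] is the score of
    response i over response j in dimension l (assumed skew-symmetric).
    Returns an array of shape (k, G).
    """
    raw = scores.mean(axis=2)                       # mean score against the group
    centered = raw - raw.mean(axis=1, keepdims=True)
    std = raw.std(axis=1, keepdims=True)
    return centered / (std + eps)                   # unit variance per dimension

def aggregate(advantages, eigenvalues):
    """Combine per-dimension advantages with eigenvalue weights. Shape (G,)."""
    return np.asarray(eigenvalues, dtype=float) @ advantages

rng = np.random.default_rng(0)
k, G = 3, 6
m = rng.normal(size=(k, G, G))
scores = m - m.transpose(0, 2, 1)                   # make each matrix skew-symmetric

inflated = scores.copy()
inflated[0] *= 100.0                                # one axis grows 100x in magnitude

lam = np.array([0.5, 0.3, 0.2])
adv = aggregate(per_dimension_advantages(scores), lam)
adv_inflated = aggregate(per_dimension_advantages(inflated), lam)
```

Because each dimension is rescaled before aggregation, `adv` and `adv_inflated` agree up to the small `eps` term: inflating the magnitude of one axis does not increase its share of the update.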

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-dimensional structure could support hybrid training that mixes verifiable rewards for math and code with preference signals for open-ended tasks.
If the skew-symmetric embedding generalizes beyond the tested models, it might reduce reliance on scalar reward models in other multi-objective RL settings such as robotics or dialogue systems.
A natural extension would test whether the drift monitor still prevents collapse when the number of subspaces k is varied or when preference data contains more intransitive cycles.
Longer runs on additional open-ended benchmarks could reveal whether the balanced updates produce qualitatively different generations than scalar baselines.

Load-bearing premise

Embedding responses into k skew-symmetric subspaces produces intransitivity-aware comparisons whose per-dimension group-relative advantages, when normalized separately and aggregated via context-dependent eigenvalues, yield policy updates that resist single-axis exploitation without new collapse modes.

What would settle it

Run GPRL for substantially longer than the reported schedules on the same base model and benchmarks; if length-controlled win rates drop or single-axis exploitation (such as excessive length or stylistic artifacts) reappears despite the drift monitor, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.18721 by Ahsan Bilal, Andreas Haupt, Arslan Chaudhry, Emily Fox, John M. Cioffi, Muhammad Ahmed Mohsin, Muhammad Umer, Sanmi Koyejo.

**Figure 1.** Landscape of LLM post-training methods, organized by supervision source and training regime. Online RL with a scalar RM reaches open-ended tasks but suffers reward hacking; GPRL fills the gap with a structured, multi-dimensional reward. In response, the field has split into two largely disconnected tracks. The first avoids explicit reward modeling and optimizes the policy directly on preference data. Offl…

**Figure 2.** Overview of GPRL. The policy πθ samples G responses per prompt, GPM embeds them, and R≻ produces k pairwise score matrices that yield per-dimension advantages. The aggregate drives the GRPO-style clipped surrogate, while a drift monitor D(t) adapts the dimensional weights and β to suppress reward hacking. responses per prompt, estimates $\hat{s}(y_i \succ \mu \mid x) = \frac{1}{K}\sum_{k=1}^{K} s(y_i \succ y_k \mid x)$, and regresses log πθ/πθt o…

**Figure 3.** Dimensional drift distinguishes healthy training from reward hacking. (a) The variance profile α(t) holds its initial shape on a healthy GPRL run. (b) Under hacking, it collapses onto a single dimension l⋆. (c) D(t) stays near zero on the healthy run and crosses τ at t′ on the hacked one, allowing the corrected trajectory to engage the controller at t′ and pull back as the profile rebalances, while a …

**Figure 4.** Scaling and per-category breakdown. (a) AlpacaEval 2.0 LC WR across five training epochs at both reward-model scales, with the controller enabled holding near its peak through epoch 5 and the controller disabled degrading once drift develops. (b, c) Per-category scores on MT-Bench and WildBench, where GPRL leads on the categories that match the supervision and on structural categories while remaining with…
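The drift signal in Figure 3 can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the variance profile is taken as each dimension's share of total advantage variance, D(t) as the KL divergence from the initial profile, and the correction as halving the weight of the most inflated dimension while doubling β (read here as a KL-penalty coefficient, so a larger value means a tighter trust region). The paper's actual definitions and constants are not given in this summary.

```python
import numpy as np

def variance_profile(raw_adv):
    """Share of total variance carried by each dimension. raw_adv: (k, G)."""
    var = raw_adv.var(axis=1)
    return var / var.sum()

def drift(alpha_t, alpha_0, eps=1e-12):
    """KL divergence of the current profile from the initial one (assumed form)."""
    alpha_t = np.asarray(alpha_t, dtype=float)
    alpha_0 = np.asarray(alpha_0, dtype=float)
    return float(np.sum(alpha_t * np.log((alpha_t + eps) / (alpha_0 + eps))))

def controller_step(alpha_t, alpha_0, weights, beta,
                    tau=0.1, shrink=0.5, beta_scale=2.0):
    """Return (weights, beta, D). Intervenes only when D exceeds tau.

    tau, shrink, and beta_scale are hypothetical constants.
    """
    d = drift(alpha_t, alpha_0)
    if d <= tau:
        return weights, beta, d
    # Dimension whose variance share grew the most relative to the start.
    l_star = int(np.argmax(np.asarray(alpha_t) - np.asarray(alpha_0)))
    new_w = np.array(weights, dtype=float)
    new_w[l_star] *= shrink
    new_w *= np.sum(weights) / new_w.sum()   # keep the total weight unchanged
    return new_w, beta * beta_scale, d

alpha_0 = np.full(4, 0.25)
w0, beta0 = np.ones(4), 0.05

healthy = controller_step([0.26, 0.24, 0.25, 0.25], alpha_0, w0, beta0)
hacked = controller_step([0.85, 0.05, 0.05, 0.05], alpha_0, w0, beta0)
```

On the balanced profile the controller leaves the weights and β untouched. On the collapsed profile D exceeds the threshold, the first dimension is downweighted, and β is raised.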

Original abstract

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces General Preference Reinforcement Learning (GPRL) for LLM post-training alignment. It builds on the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces to capture intransitive, multi-dimensional preferences. The policy update computes per-dimension group-relative advantages, normalizes each dimension independently, aggregates them via context-dependent eigenvalues, and incorporates a closed-loop drift monitor to detect and correct single-axis exploitation on the fly. Starting from Llama-3-8B-Instruct, the method reports a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench, with the gains attributed to resistance to reward hacking over extended training runs.

Significance. If the empirical results and the underlying mechanism hold, GPRL could help close the gap between verifiable-reward online RL and preference optimization for open-ended tasks. The structured multi-dimensional treatment of preferences offers a concrete way to mitigate the single-axis collapse that scalar reward models permit, which is a recurring practical problem in current LLM alignment pipelines. The combination of subspace embeddings, independent normalization, eigenvalue aggregation, and an online drift monitor constitutes a distinctive technical contribution whose validation would be of interest to the preference-optimization and RL-for-LLM communities.

major comments (2)

[Abstract] Abstract: the central empirical claim (56.51% length-controlled win rate plus resistance to reward hacking across extended runs) is presented without any reported ablation, statistical test, or hyper-parameter table; it is therefore impossible to determine whether the performance is driven by the k-subspace structure, the per-dimension normalization, the eigenvalue aggregation, or other unstated implementation choices.
[Method] Method description (as summarized in the abstract): the context-dependent eigenvalues and per-dimension normalization are described as part of the update rule, yet it is unclear from the provided text whether these quantities are computed from the current batch in a manner that avoids circular dependence on the very policy being optimized; this circularity risk directly affects the load-bearing claim that the method prevents single-axis exploitation without introducing new instabilities.

minor comments (2)

The abstract would be strengthened by an explicit statement of the value of k used in the reported experiments and by a one-sentence description of the drift-monitor correction rule.
Notation for the skew-symmetric subspaces and the eigenvalue aggregation could be introduced earlier and used consistently to improve readability for readers outside the immediate sub-area.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications where possible and indicating planned revisions to improve the presentation of our results and method.

Referee: [Abstract] Abstract: the central empirical claim (56.51% length-controlled win rate plus resistance to reward hacking across extended runs) is presented without any reported ablation, statistical test, or hyper-parameter table; it is therefore impossible to determine whether the performance is driven by the k-subspace structure, the per-dimension normalization, the eigenvalue aggregation, or other unstated implementation choices.

Authors: We agree that the abstract, due to length constraints, does not include ablations, statistical details, or hyperparameter tables. The full manuscript reports ablations on k and normalization in Section 4, hyperparameters in Table 3, and standard errors in the main results tables. To address the concern directly, we will revise the abstract to briefly note the key ablation outcomes supporting the contribution of the multi-dimensional structure and add an explicit pointer to the hyperparameter and statistical details in the main text. revision: yes
Referee: [Method] Method description (as summarized in the abstract): the context-dependent eigenvalues and per-dimension normalization are described as part of the update rule, yet it is unclear from the provided text whether these quantities are computed from the current batch in a manner that avoids circular dependence on the very policy being optimized; this circularity risk directly affects the load-bearing claim that the method prevents single-axis exploitation without introducing new instabilities.

Authors: The General Preference Model is held fixed after pretraining and is not updated during policy optimization. Context-dependent eigenvalues are computed from the preference embeddings of the current batch using this fixed model, while per-dimension normalization is applied to group-relative advantages derived from the same batch before any policy parameter update occurs. The drift monitor relies on running statistics from prior batches. This ordering is specified in Algorithm 1 and Equations (4)–(7). We will add a clarifying paragraph and a flowchart in Section 3 to make the non-circular computation explicit and remove any ambiguity. revision: yes
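The ordering in this response implies that the aggregated advantages are fixed numbers by the time the policy loss is evaluated. A minimal sketch of a clipped surrogate in the standard PPO/GRPO form shows where they enter. It works at the sequence level for brevity, and the function name and the clip range of 0.2 are assumptions rather than values taken from the paper.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped policy-gradient objective (to be maximized).

    logp_new, logp_old: log-probabilities of each sampled response under
    the current and the sampling policy. advantages: precomputed constants,
    e.g. the eigenvalue-aggregated per-dimension advantages.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the minimum removes any incentive to push the ratio past the clip range.
    return float(np.mean(np.minimum(unclipped, clipped)))

# A response whose probability rose by 50% with a positive advantage of 1.0.
gain = clipped_surrogate(np.log([1.5]), np.log([1.0]), [1.0])
```

Only `logp_new` depends on the parameters being updated. The scores, eigenvalues, and normalization statistics all sit inside `advantages`, which is consistent with the non-circular ordering the authors describe.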

Circularity Check

0 steps flagged

No significant circularity

The paper's derivation chain starts from the General Preference Model (GPM) embedding responses into k skew-symmetric subspaces and carries this structure into the GPRL policy update via per-dimension group-relative advantages, independent normalization, aggregation with context-dependent eigenvalues, and a closed-loop drift monitor. None of these steps are shown in the abstract or described mechanism to be defined in terms of the final performance metrics or to reduce by construction to fitted inputs from the same run. The empirical results on AlpacaEval 2.0, Arena-Hard, MT-Bench, and WildBench are presented as external validation of the mechanism's ability to resist reward hacking, rather than as a tautological outcome of the update rule itself. No self-citation chain, ansatz smuggling, or renaming of known results is invoked as load-bearing in the provided text. The construction remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim rests on the validity of representing preferences via k skew-symmetric subspaces and on the effectiveness of the per-dimension normalization plus drift-monitor correction; both are introduced in this work without external benchmarks or independent verification cited in the abstract.

free parameters (1)

number of subspaces k
The dimension k that determines how many skew-symmetric subspaces are used to embed responses is a modeling choice whose specific value is not reported in the abstract.

axioms (2)

domain assumption: Quality judgments are multi-dimensional, and any scalar proxy allows collapse onto the most sensitive axis.
Stated in the abstract as the motivation for moving beyond scalar reward models.
domain assumption: Preferences admit structured intransitive comparisons that can be captured by skew-symmetric subspace embeddings.
Invoked when defining the General Preference Model.

invented entities (2)

General Preference Model (GPM) · no independent evidence
purpose: Embed responses into k skew-symmetric subspaces to represent multi-dimensional, intransitivity-aware preferences.
Structure the paper builds on to replace scalar reward models.
closed-loop drift monitor · no independent evidence
purpose: Detect single-axis exploitation and correct it by reweighting dimensions and tightening the trust region.
Introduced as part of GPRL to maintain balance across dimensions during training.

pith-pipeline@v0.9.0 · 5828 in / 1814 out tokens · 55766 ms · 2026-05-20T12:39:49.888933+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. ... per-dimension normalization in Eq. (4) is what makes Eq. (7) likely to hold, since rescaling every $\hat{A}^{(i)}_l$ to unit variance bounds how much any single dimension can grow its contribution to the aggregate.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

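If $\sum_{l \neq l^\star} \lambda_l(x)\,(\hat{A}^{*}_l - \hat{A}^{\dagger}_l) > \lambda_{l^\star}(x)\,(\hat{A}^{\dagger}_{l^\star} - \hat{A}^{*}_{l^\star})$ then $\hat{A}^{\dagger}(x) < \hat{A}^{*}(x)$. ... the per-dimension normalization ... prevents any one axis from inflating its share of the aggregate by simply growing in magnitude.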
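The condition quoted above follows in one step if the aggregate advantage is the eigenvalue-weighted sum of the per-dimension advantages, $\hat{A}(x) = \sum_l \lambda_l(x)\,\hat{A}_l$. That form is an assumption here, inferred from the description of eigenvalue aggregation; the excerpt does not state it. Reading $\hat{A}^{\dagger}$ as the advantage of a response that exploits dimension $l^\star$ and $\hat{A}^{*}$ as that of a balanced response is likewise an interpretation.

```latex
\begin{align*}
\hat{A}^{*}(x) - \hat{A}^{\dagger}(x)
  &= \sum_{l} \lambda_l(x)\,\bigl(\hat{A}^{*}_l - \hat{A}^{\dagger}_l\bigr) \\
  &= \sum_{l \neq l^\star} \lambda_l(x)\,\bigl(\hat{A}^{*}_l - \hat{A}^{\dagger}_l\bigr)
     \;-\; \lambda_{l^\star}(x)\,\bigl(\hat{A}^{\dagger}_{l^\star} - \hat{A}^{*}_{l^\star}\bigr).
\end{align*}
```

The right-hand side is positive exactly when the first term exceeds the second, which is the stated condition. In words: the exploiting response is ranked lower whenever what it loses on the other dimensions outweighs what it gains on $l^\star$.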

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 13 internal anchors

[1]

Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

work page arXiv 2025
[2]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[3]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

work page 2023
[6]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[7]

Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

work page arXiv 2024
[8]

Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

work page 2024
[9]

Nash learning from human feedback

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, et al. Nash learning from human feedback. InForty-first International Conference on Machine Learning, 2024. 10

work page 2024
[10]

Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

work page arXiv 2024
[11]

Mixed preference optimization: Reinforcement learning with data selection and better reference model.arXiv preprint arXiv:2403.19443, 2024

Qi Gou and Cam-Tu Nguyen. Mixed preference optimization: Reinforcement learning with data selection and better reference model.arXiv preprint arXiv:2403.19443, 2024

work page arXiv 2024
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Consequences of misaligned ai.Advances in Neural Information Processing Systems, 33:15763–15773, 2020

Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned ai.Advances in Neural Information Processing Systems, 33:15763–15773, 2020

work page 2020
[14]

Panacea: Pareto alignment via preference adaptation for llms

Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, and Yaodong Yang. Panacea: Pareto alignment via preference adaptation for llms. Advances in Neural Information Processing Systems, 37:75522–75558, 2024

work page 2024
[15]

Beyond bradley-terry models: a general preference model for language model alignment

Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, and Quanquan Gu. Beyond bradley-terry models: a general preference model for language model alignment. InProceedings of the 42nd International Conference on Machine Learning, ICML’25. JMLR.org, 2025

work page 2025
[16]

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms.arXiv preprint arXiv:2410.18451, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

work page 1952
[18]

Projection optimization: A general framework for multi-objective and multi-group rlhf.arXiv preprint arXiv:2502.15145, 2025

Nuoya Xiong and Aarti Singh. Projection optimization: A general framework for multi-objective and multi-group rlhf.arXiv preprint arXiv:2502.15145, 2025

work page arXiv 2025
[19]

Pareto multi-objective alignment for language models

Qiang He and Setareh Maghsudi. Pareto multi-objective alignment for language models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 257–272. Springer, 2025

work page 2025
[20]

Extending rlvr to open-ended tasks via verifiable multiple-choice reformulation.arXiv preprint arXiv:2511.02463, 2025

Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun, and Hua Wu. Extending rlvr to open-ended tasks via verifiable multiple-choice reformulation.arXiv preprint arXiv:2511.02463, 2025

work page arXiv 2025
[21]

Reward shaping to mitigate reward hacking in rlhf.arXiv preprint arXiv:2502.18770, 2025

Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf.arXiv preprint arXiv:2502.18770, 2025

work page arXiv 2025
[22]

Nontransitive measurable utility.Journal of Mathematical Psychology, 26(1): 31–67, 1982

Peter C Fishburn. Nontransitive measurable utility.Journal of Mathematical Psychology, 26(1): 31–67, 1982

work page 1982
[23]

Nontransitive preferences in decision theory.Journal of risk and uncertainty, 4(2):113–134, 1991

Peter C Fishburn. Nontransitive preferences in decision theory.Journal of risk and uncertainty, 4(2):113–134, 1991

work page 1991
[24]

An axiomatic characterization of skew-symmetric bilinear functionals, with applications to utility theory.Economics Letters, 8(4):311–313, 1981

Peter C Fishburn. An axiomatic characterization of skew-symmetric bilinear functionals, with applications to utility theory.Economics Letters, 8(4):311–313, 1981

work page 1981
[25]

Skew-symmetric additive representations of preferences.Journal of Mathe- matical Economics, 30(3):367–387, 1998

Yutaka Nakamura. Skew-symmetric additive representations of preferences.Journal of Mathe- matical Economics, 30(3):367–387, 1998

work page 1998
[26]

The importance of online data: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270, 2024

Yuda Song, Gokul Swamy, Aarti Singh, J Bagnell, and Wen Sun. The importance of online data: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270, 2024

work page 2024
[27]

Indirect online preference optimization via reinforcement learning

En Wang, Xingyu Lin, Chenfu Du Su, Zhonghou Lv Bao, Funing Yang, Yuanbo Xu, and Wenbin Liu. Indirect online preference optimization via reinforcement learning. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 538–546, 2025. 11

work page 2025
[28]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[29]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback.arXiv preprint arXiv:2310.01377, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

A survey on llm-as-a-judge.The Innovation, 2024

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

work page 2024
[32]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline.arXiv preprint arXiv:2406.11939, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Wildbench: Benchmarking llms with challenging tasks from real users in the wild.arXiv preprint arXiv:2406.04770, 2024

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild.arXiv preprint arXiv:2406.04770, 2024

work page arXiv 2024
[34]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

work page 2023
[35]

Judging the judges: A systematic study of position bias in llm-as-a-judge

Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush V osoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 292...

work page 2025
[36]

Emergent hierarchical reasoning in llms through reinforcement learning.arXiv preprint arXiv:2509.03646, 2025

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, and Wenhu Chen. Emergent hierarchical reasoning in llms through reinforcement learning.arXiv preprint arXiv:2509.03646, 2025

work page arXiv 2025
[37]

A general theoretical paradigm to understand learning from human preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

work page 2024
[38]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36: 59008–59033, 2023

Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36: 59008–59033, 2023

[41] Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: Towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. Advances in Neural Information Processing Systems, 36:71095–71134, 2023.

[42] Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024.

[43] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023.

[44] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.

[45] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716, 2023.

[46] Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: Disentangled reward mitigates hacking in RLHF. arXiv preprint arXiv:2402.07319, 2024.

[47] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017, 2024.

[48] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.

[49] Thomas Kwa, Drake Thomas, and Adrià Garriga-Alonso. Catastrophic Goodhart: Regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification. Advances in Neural Information Processing Systems, 37:14608–14633, 2024.

[50] Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, et al. Beyond reward hacking: Causal rewards for large language model alignment. arXiv preprint arXiv:2501.09620, 2025.

[51] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[52] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

[53] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Appendix A. More on general preference embeddings

This appendix expands on the embedding construction that GPRL inherits from GPM [15], focusing on the structural prop...

[54]

Limitations

characterized this empirically in RLHF as reward over-optimization, showing that as the policy spends KL budget against a learned RM, the gold reward traces a hill-shaped curve that initially climbs and then falls, with the peak depending on RM size, KL coefficient, and amount of preference data. The same qualitative shape, namely a peak followed by susta...

[55]

All preference data and prompts used come from previously released public corpora

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...


