Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

Jing Li; Kaiqi Zhao; Xiucheng Li; Yucong Huang

arxiv: 2605.17342 · v1 · pith:72WCD6QSnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li

Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3

keywords preference decomposition · cyclic preferences · LLM alignment · RLHF · game-theoretic decomposition · dynamic self-play · Nash equilibrium · transitive preferences

The pith

Explicitly decomposing human preferences into orthogonal transitive scalar and cyclic vector components enables more effective large language model alignment than implicit models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLHF methods model preferences as transitive scalars, but human preferences frequently contain cyclic elements that a single reward cannot represent. Prior implicit approaches like the General Preference Model entangle hierarchy with cyclicity and therefore cannot guarantee dominant solutions. The paper introduces the Hybrid Reward-Cyclic model, which applies game-theoretic decomposition to separate preferences into an orthogonal transitive scalar part and a cyclic vector part. A complementary Dynamic Self-Play Preference Optimization procedure then treats alignment as a time-varying game that steers the policy toward Nash equilibrium. Experiments on synthetic mixed settings and real benchmarks show faster convergence and higher accuracy when both components are modeled explicitly.

Core claim

The Hybrid Reward-Cyclic (HRC) model utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components, addressing the limitation of implicit formulations in prior models like GPM that fail to guarantee dominant solutions. Complementing this, Dynamic Self-Play Preference Optimization (DSPPO) treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments validate HRC's structural superiority in mixed transitive-cyclic settings, while evaluations on RewardBench 2, AlpacaEval 2.0, Arena-Hard, and MT-Bench confirm consistent gains over BT and GPM.
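
The self-play half of this claim can be illustrated with a generic multiplicative-weights loop on a small symmetric zero-sum game. This is a stand-in, not DSPPO itself: the page does not specify DSPPO's time-varying schedule, and the function names, step size, starting policy, and rock-paper-scissors payoff matrix below are all assumptions chosen for illustration. The sketch shows only that the time-averaged policy of such a loop approaches a Nash equilibrium, measured by how much a best response can still gain against it.

```python
import numpy as np

def self_play_mwu(A, x0, eta=0.05, steps=4000):
    """Multiplicative-weights self-play on a symmetric zero-sum game.

    A[i, j] is the payoff of action i against action j (A is skew-symmetric).
    Returns the time-averaged policy, which is the quantity that converges.
    """
    x = np.asarray(x0, dtype=float)
    x = x / x.sum()
    avg = np.zeros_like(x)
    for _ in range(steps):
        avg += x
        payoff = A @ x                  # payoff of each action vs. the current policy
        x = x * np.exp(eta * payoff)    # upweight actions that beat the current policy
        x = x / x.sum()                 # renormalize to a probability vector
    return avg / steps

def exploitability(A, x):
    """Best-response gain against policy x; zero at a Nash equilibrium."""
    return float(np.max(A @ x))

# Hypothetical toy game: rock-paper-scissors, a purely cyclic preference structure.
rps = np.array([[0, 1, -1],
                [-1, 0, 1],
                [1, -1, 0]], dtype=float)

x_bar = self_play_mwu(rps, [0.6, 0.3, 0.1])
gap = exploitability(rps, x_bar)
```

For rock-paper-scissors the unique equilibrium is the uniform policy, so `gap` should shrink toward zero as `steps` grows. The last iterate, by contrast, need not settle, which is one reason convergence claims for self-play procedures are usually stated for averaged or regularized policies.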

What carries the argument

The Hybrid Reward-Cyclic (HRC) model, which uses game-theoretic decomposition to separate preferences into an orthogonal transitive scalar component and a cyclic vector component.
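
A minimal sketch of what such a split looks like on a finite set of responses, assuming preferences are stored as a skew-symmetric matrix `A` with `A[v, w]` the preference margin of `v` over `w`. The row-mean scoring rule and the names `decompose`, `ranking`, and `cycle` are illustrative assumptions; the paper's learned model operates on LLM outputs, not on an explicit matrix.

```python
import numpy as np

def decompose(A):
    """Split a skew-symmetric preference matrix into transitive and cyclic parts."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, -A.T):
        raise ValueError("expected a skew-symmetric matrix")
    f = A.mean(axis=1)               # one scalar score per response
    T = f[:, None] - f[None, :]      # transitive part: T[v, w] = f[v] - f[w]
    C = A - T                        # cyclic residual
    return f, T, C

# Toy input: a strict ranking 0 > 1 > 2 mixed with a rock-paper-scissors cycle.
ranking = np.array([[0, 1, 2],
                    [-1, 0, 1],
                    [-2, -1, 0]], dtype=float)
cycle = np.array([[0, 1, -1],
                  [-1, 0, 1],
                  [1, -1, 0]], dtype=float)
A = ranking + 0.5 * cycle

f, T, C = decompose(A)
inner = float(np.sum(T * C))         # Frobenius inner product of the two parts
```

On this input the transitive part recovers `ranking`, the residual recovers the scaled cycle, and `inner` is zero, so the two parts are orthogonal under the Frobenius inner product and sum back to `A`.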

If this is right

HRC converges faster and reaches higher accuracy than GPM on synthetic data containing both transitive and cyclic preferences.
HRC improves over both BT and GPM baselines on RewardBench 2, with particular gains in the Ties domain that tests complex non-strict preferences.
When paired with DSPPO, HRC produces higher length-controlled win rates on AlpacaEval 2.0 and Arena-Hard than SPPO baselines trained with BT or GPM.
The explicit separation enables robust handling of preferences that violate strict transitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same orthogonal decomposition could be applied to preference data in recommendation systems or multi-agent coordination where cycles commonly appear.
Isolating the cyclic vector component may allow targeted diagnostics for inconsistent outputs in deployed language models.
Extending the dynamic self-play procedure to other time-varying preference settings could produce more stable training trajectories.

Load-bearing premise

Human preferences admit an orthogonal decomposition into transitive scalar and cyclic vector components that preserves all relevant information and that the resulting game admits a dominant solution reachable by the proposed procedure.
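
One standard way to state this premise precisely, offered here as an assumption and not as the paper's own derivation, is for an antisymmetric preference function on a finite set of n responses under a uniformly weighted inner product. The symbols ϕ, ϕ_T, ϕ_C, and f follow the passage quoted from the paper further down this page; the row-mean choice of f and the inner product are assumed.

```latex
\begin{align*}
\phi(v,w) &= \phi_T(v,w) + \phi_C(v,w), \qquad \phi(v,w) = -\phi(w,v), \\
\phi_T(v,w) &= f(v) - f(w), \qquad f(v) = \frac{1}{n}\sum_{w} \phi(v,w), \\
\sum_{w} \phi_C(v,w) &= \sum_{w}\phi(v,w) - n f(v) + \sum_{w} f(w) = 0
  \quad \text{for every } v, \\
\langle \phi_T, \phi_C \rangle
  &= \sum_{v,w} \bigl(f(v) - f(w)\bigr)\,\phi_C(v,w) \\
  &= \sum_{v} f(v) \sum_{w} \phi_C(v,w) \;-\; \sum_{w} f(w) \sum_{v} \phi_C(v,w) = 0 .
\end{align*}
```

The third line uses the fact that the scores f sum to zero when ϕ is antisymmetric, and the last line uses the antisymmetry of ϕ_C, which makes its column sums vanish along with its row sums. Whether the learned scalar and vector heads of HRC satisfy these conditions is exactly what the premise asserts and what the referee report below asks the authors to prove.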

What would settle it

A controlled experiment on data with known cyclic preferences in which HRC fails to converge faster or reach higher accuracy than GPM would falsify the claim of structural superiority.
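
A toy version of such a controlled setting, assuming a three-response preference matrix built from a known ranking plus a known cycle of adjustable strength. The generator `make_mixed`, the row-mean scoring rule, and the accuracy metric are hypothetical illustrations, not the paper's synthetic benchmark. The sketch shows why a scalar-only model has a hard ceiling once the cyclic part is strong enough to flip a pairwise preference.

```python
import numpy as np

def make_mixed(strength):
    """Known ranking 0 > 1 > 2 plus a rock-paper-scissors cycle of given strength."""
    ranking = np.array([[0, 1, 2],
                        [-1, 0, 1],
                        [-2, -1, 0]], dtype=float)
    cycle = np.array([[0, 1, -1],
                      [-1, 0, 1],
                      [1, -1, 0]], dtype=float)
    return ranking + strength * cycle

def transitive_part(A):
    """Best scalar-score explanation of A under the row-mean rule (an assumption)."""
    f = A.mean(axis=1)
    return f[:, None] - f[None, :]

def pairwise_accuracy(pred, truth):
    """Fraction of unordered pairs whose preferred side is predicted correctly."""
    iu = np.triu_indices_from(truth, k=1)
    return float(np.mean(np.sign(pred[iu]) == np.sign(truth[iu])))

weak = make_mixed(0.5)     # cycle too weak to flip any pair
strong = make_mixed(3.0)   # cycle flips the (0, 2) pair, creating a true 3-cycle

acc_weak = pairwise_accuracy(transitive_part(weak), weak)
acc_strong = pairwise_accuracy(transitive_part(strong), strong)
```

With the weak cycle the scalar scores predict every pair correctly; with the strong cycle they get two of three pairs right, which is the best any scalar ranking can do on a three-way cycle. A falsification test along the lines described above would check whether a model with an explicit cyclic component closes that gap faster than GPM on data of this kind.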

Figures

Figures reproduced from arXiv: 2605.17342 by Jing Li, Kaiqi Zhao, Xiucheng Li, Yucong Huang.

**Figure 1.** Comparison of the Bradley-Terry (BT) model and the proposed Hybrid Reward-Cyclic (HRC) model. (a) The BT model maps each instruction-response pair to a scalar reward, assuming transitive preferences. (b) The HRC model explicitly decomposes preferences into a transitive scalar component (via BT) and a cyclic vector component (via GPM), combining them to produce the final preference signal s_HRC.
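
The combination in panel (b) can be sketched as follows. The exact form of s_HRC is not given on this page, so the plain sum of a BT-style reward difference and a GPM-style skew-symmetric bilinear term is an assumption, as are the names `skew_operator`, `hybrid_score`, and `preference_prob`, the embedding dimension, and the sigmoid link.

```python
import numpy as np

def skew_operator(dim):
    """Block-diagonal skew-symmetric matrix made of 2x2 rotation blocks."""
    if dim % 2:
        raise ValueError("dim must be even")
    R = np.zeros((dim, dim))
    for k in range(0, dim, 2):
        R[k, k + 1] = 1.0
        R[k + 1, k] = -1.0
    return R

def hybrid_score(r_a, r_b, v_a, v_b, R):
    """Assumed hybrid signal: scalar reward gap plus a cyclic bilinear term."""
    transitive = r_a - r_b           # BT-style component
    cyclic = v_a @ R @ v_b           # antisymmetric in (a, b) because R is skew
    return transitive + cyclic

def preference_prob(score):
    """Probability that the first response is preferred (sigmoid link)."""
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical rewards and embeddings standing in for model outputs.
rng = np.random.default_rng(0)
R = skew_operator(4)
r_a, r_b = 0.3, -0.1
v_a, v_b = rng.normal(size=4), rng.normal(size=4)

s_ab = hybrid_score(r_a, r_b, v_a, v_b, R)
s_ba = hybrid_score(r_b, r_a, v_b, v_a, R)
```

Because both terms change sign when the two responses are swapped, `s_ab == -s_ba` and the two preference probabilities sum to one, which is the minimum consistency any pairwise preference signal needs.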

**Figure 2.** Iteration 3 alignment performance across three benchmarks. We compare BT+SPPO, GPM+SPPO, HRC+SPPO, and HRC+DSPPO at Iteration 3 on AlpacaEval 2.0 (LC Win Rate), MT-Bench (Average Score), and Arena-Hard-v0.1 (Win Rate). Each panel reports results using both Gemma-2B-it and Llama-3.1-8B-Instruct as preference models.

Original abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive-cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tries an explicit game-theoretic split of preferences into transitive scalar and cyclic vector parts with some benchmark gains, but the orthogonality and invertibility of that split are not derived.

The main thing to know is that this work claims to fix a limitation in models like GPM by explicitly decomposing preferences into an orthogonal transitive scalar reward and a cyclic vector component via game theory, then using dynamic self-play to reach Nash equilibrium. It reports faster convergence on synthetic data and modest lifts on RewardBench 2 plus downstream win rates on AlpacaEval and Arena-Hard. The code is public, which helps.

What is actually new is the HRC decomposition and the DSPPO procedure framed as a time-varying game. That framing is distinct from the implicit entanglement in prior preference models, and the experiments do show the method handling the Ties domain better than the BT and GPM baselines they compare against. The gains are small but consistent across the reported settings.

The soft spots are in the foundations and the evidence. The central claim needs the decomposition to be invertible and to preserve all information while guaranteeing a dominant solution, yet the abstract and stress-test note give no derivation or proof that the cyclic residual is orthogonal to the transitive projection. Without that, the asserted structural superiority could be an artifact of how the synthetic data was generated rather than a general property. The experimental write-up mentions accuracy improvements but supplies no error bars, ablation controls, or statistical tests, so it is hard to tell how robust the results are. The self-play loop also risks becoming circular if there are no fixed external anchors.

This paper is aimed at people working on preference modeling and RLHF for LLMs. A reader who cares about cyclic preferences and game-theoretic alignment will find the ideas worth examining even if the current version needs more rigor on the math. It deserves a serious referee to verify the derivations and tighten the experiments rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard RLHF with transitive scalar rewards fails to capture cyclic human preferences, and that implicit models like GPM entangle hierarchy with cyclicity without guaranteeing dominant solutions. It introduces the Hybrid Reward-Cyclic (HRC) model, which applies game-theoretic decomposition to explicitly separate preferences into orthogonal transitive (scalar) and cyclic (vector) components, and Dynamic Self-Play Preference Optimization (DSPPO) to treat alignment as a time-varying game converging to Nash equilibrium. Synthetic experiments show faster convergence and higher accuracy for HRC in mixed settings; benchmark results on RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench report consistent gains over BT and GPM baselines (e.g., +1.23% on Gemma-2B-it, 44.75% length-controlled win-rate on AlpacaEval). Code is released publicly.

Significance. If the claimed orthogonal decomposition is formally invertible and information-preserving, and if DSPPO reliably reaches dominant strategies, the framework could improve LLM alignment in domains with non-transitive preferences such as Ties. Public code availability aids reproducibility. The reported empirical improvements on multiple downstream tasks provide initial evidence of practical value, though attribution to the structural innovation requires further verification.

major comments (3)

[Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.
[DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.
[Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.

minor comments (2)

The abstract states results on RewardBench 2 and downstream tasks but does not specify the exact preference model architecture or training hyperparameters used for the HRC+DSPPO runs, hindering direct replication.
Notation for the cyclic vector component and the game payoff matrix should be introduced with explicit definitions of the inner product used to enforce orthogonality.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We have addressed each of the major comments point by point below, making revisions to the manuscript where appropriate to enhance the theoretical rigor and experimental validation.

Referee: [Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.

Authors: We agree that the presentation would benefit from an explicit derivation. The revised manuscript expands the theoretical framework section to define the inner product on preference relations and includes a proof that the decomposition operator is invertible, with the transitive and cyclic components orthogonal by construction and their combination exactly recovering the input preference. This addition also clarifies the distinction from GPM's implicit approach. revision: yes
Referee: [DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.

Authors: We note that synthetic experiments use known ground-truth preferences to directly measure convergence to the Nash equilibrium, providing external grounding. The revised manuscript adds a discussion of fixed-point properties and observed convergence behavior across initializations. A full theoretical convergence proof for arbitrary time-varying games is challenging and noted as future work. revision: partial
Referee: [Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.

Authors: We have updated the synthetic experiments section to report error bars over multiple random seeds, include ablations isolating the decomposition operator, and add statistical significance tests. These revisions support that observed gains arise from the explicit orthogonal structure rather than data synthesis details. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

The paper proposes the HRC decomposition and DSPPO procedure as new constructs, then validates them via synthetic data experiments (where the data generator is external to the model) and downstream benchmarks including RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1 and MT-Bench. These evaluations are independent of the fitted parameters and do not reduce to self-citation chains or tautological redefinitions. The orthogonality claim is presented as part of the model definition rather than derived from prior results by the same authors, and the Nash-equilibrium guidance is tested against external win-rate metrics rather than being self-referential by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central modeling step rests on an unproven domain assumption that preferences admit an orthogonal transitive-cyclic split; no numerical free parameters are named in the abstract, and the two new constructs (HRC and DSPPO) are introduced without external falsifiable handles.

axioms (1)

domain assumption: Human preferences can be orthogonally decomposed into transitive scalar and cyclic vector components without loss of relevant structure.
This premise is required for the HRC model to disentangle hierarchy from cyclicity as described in the abstract.

invented entities (2)

Hybrid Reward-Cyclic (HRC) model (no independent evidence)
purpose: Explicit game-theoretic decomposition of preferences into orthogonal transitive and cyclic parts.
Newly defined model introduced to overcome limitations of implicit formulations such as GPM.

Dynamic Self-Play Preference Optimization (DSPPO) (no independent evidence)
purpose: Treats alignment as a time-varying game that converges to Nash equilibrium.
New optimization algorithm proposed to guide policy under the HRC decomposition.

pith-pipeline@v0.9.0 · 5849 in / 1371 out tokens · 53114 ms · 2026-05-20T14:21:58.732990+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "any preference function ϕ(v,w) can be uniquely decomposed into the sum of a transitive component ϕ_T and a cyclic component ϕ_C: ϕ(v,w) = ϕ_T(v,w) + ϕ_C(v,w), where ϕ_T(v,w) = f(v) − f(w)"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "HRC model explicitly disentangles human preferences into two orthogonal components: a transitive scalar component ... and a cyclic vector component"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 15 internal anchors

[1]

the method of paired comparisons , author=

Rank analysis of incomplete block designs: I. the method of paired comparisons , author=. Biometrika , volume=. 1952 , publisher=

work page 1952
[2]

Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS) , year=

Deep Reinforcement Learning from Human Preferences , author=. Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS) , year=

work page
[3]

Proceedings of the 40th International Conference on Machine Learning (ICML) , pages=

Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons , author=. Proceedings of the 40th International Conference on Machine Learning (ICML) , pages=

work page
[4]

Proceedings of the 42nd International Conference on Machine Learning (ICML) , year=

Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model , author=. Proceedings of the 42nd International Conference on Machine Learning (ICML) , year=

work page
[5]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

work page 2024
[6]

, author=

Intransitivity of preferences. , author=. Psychological review , volume=. 1969 , publisher=

work page 1969
[7]

Proceedings of the 41st International Conference on Machine Learning (ICML) , year=

Nash Learning From Human Feedback , author=. Proceedings of the 41st International Conference on Machine Learning (ICML) , year=

work page
[8]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

Llm-blender: Ensembling large language models with pairwise ranking and generative fusion , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

work page
[9]

Transactions on Machine Learning Research , volume=

RLHF Workflow: From Reward Modeling to Online RLHF , author=. Transactions on Machine Learning Research , volume=. 2024 , publisher=

work page 2024
[10]

Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

A General Theoretical Paradigm to Understand Learning from Human Preferences , author=. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

work page
[11]

Proceedings of the 41st International Conference on Machine Learning (ICML) , pages=

A Minimaximalist Approach to Reinforcement Learning from Human Feedback , author=. Proceedings of the 41st International Conference on Machine Learning (ICML) , pages=

work page
[13]

Proceedings of the 13th International Conference on Learning Representations (ICLR) , volume=

Self-Play Preference Optimization for Language Model Alignment , author=. Proceedings of the 13th International Conference on Learning Representations (ICLR) , volume=

work page
[14]

Proceedings of the 13th International Conference on Learning Representations (ICLR) , volume=

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning , author=. Proceedings of the 13th International Conference on Learning Representations (ICLR) , volume=

work page
[15]

Proceedings of the 42nd International Conference on Machine Learning (ICML) , pages=

Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment , author=. Proceedings of the 42nd International Conference on Machine Learning (ICML) , pages=

work page
[17]

Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) , pages=

Rewardbench: Evaluating Reward Models for Language Modeling , author=. Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) , pages=

work page 2025
[18]

Proceedings of the 36th International Conference on Machine Learning (ICML) , pages=

Open-ended Learning in Symmetric Zero-sum Games , author=. Proceedings of the 36th International Conference on Machine Learning (ICML) , pages=

work page
[23]

Games and Economic Behavior , volume=

Adaptive game playing using multiplicative weights , author=. Games and Economic Behavior , volume=. 1999 , publisher=

work page 1999
[26]

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20) , pages=

Zero: Memory optimizations toward training trillion parameter models , author=. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20) , pages=. 2020 , organization=

work page 2020
[27]

Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

PyTorch: an imperative style, high-performance deep learning library , author=. Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[28]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

work page 2025
[29]

Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages=

Transformers: State-of-the-art natural language processing , author=. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages=

work page 2020
[30]

Proceedings of the 41st International Conference on Machine Learning (ICML) , pages=

ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback , author=. Proceedings of the 41st International Conference on Machine Learning (ICML) , pages=

work page
[31]

Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Training language models to follow instructions with human feedback , author=. Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[32]

Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Re-evaluating evaluation , author=. Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[33]

Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Real world games look like spinning tops , author=. Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[36]

Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Language models are few-shot learners , author=. Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[37]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

Towards better value principles for large language model alignment: a systematic evaluation and enhancement , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

work page
[39]

Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Reward learning from human preferences and demonstrations in Atari , author=. Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[40]

The American Mathematical Monthly , volume=

The paradox of nontransitive dice , author=. The American Mathematical Monthly , volume=. 1994 , publisher=

work page 1994
[41]

Proceedings of the 26th International Conference on Machine Learning (ICML) , pages=

Curriculum learning , author=. Proceedings of the 26th International Conference on Machine Learning (ICML) , pages=

work page
[43]

Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

Direct preference optimization: Your language model is secretly a reward model , author=. Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS) , pages=

work page
[44]

2006 , publisher=

Condorcet’s paradox , author=. 2006 , publisher=

work page 2006
[45]

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

On the limitations of the elo, real-world games are transitive, not additive , author=. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) , pages=

work page
[46]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM) , pages=

Modeling intransitivity in matchup and comparison data , author=. Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM) , pages=

work page
[48]

Proceedings of the 36th International Conference on Machine Learning (ICML) , pages=

On the power of curriculum learning in training deep networks , author=. Proceedings of the 36th International Conference on Machine Learning (ICML) , pages=

work page
[49]

2012 , publisher=

Matrix analysis , author=. 2012 , publisher=

work page 2012
[50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

Fundamental capabilities of large language models and their applications in domain scenarios: A survey , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , pages=

work page
[52]

Proceedings of the 42nd International Conference on Machine Learning (ICML) , pages=

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline , author=. Proceedings of the 42nd International Conference on Machine Learning (ICML) , pages=. 2025 , organization=

work page 2025
[57]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

G., Guo, Z

Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.\ 4447--4455, 2024

work page 2024
[59]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[60]

Re-evaluating evaluation

Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp.\ 3272--3283, 2018

work page 2018
[61]

Open-ended learning in symmetric zero-sum games

Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp.\ 434--443, 2019

work page 2019
[62]

Curriculum learning

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp.\ 41--48, 2009

work page 2009
[63]

M., and Gidel, G

Bertrand, Q., Czarnecki, W. M., and Gidel, G. On the limitations of the elo, real-world games are transitive, not additive. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.\ 2905--2921, 2023

work page 2023
[64]

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952

[65] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pp. 1877–1901, 2020.

[66] Chen, S. and Joachims, T. Modeling intransitivity in matchup and comparison data. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM), pp. 227–236, 2016.

[67] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

[68] Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., et al. UltraFeedback: Boosting language models with scaled AI feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pp. 9722–9744, 2024.

[69] Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M. Real world games look like spinning tops. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pp. 17443–17454, 2020.

[70] Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T. RLHF workflow: From reward modeling to online RLHF. Transactions on Machine Learning Research, 2024.

[71] Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

[72] Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.

[73] Gehrlein, W. V. Condorcet’s paradox. Springer, 2006.

[74] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[75] Hacohen, G. and Weinshall, D. On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 2535–2544, 2019.

[76] Hong, Y., Zhang, H., Bao, J., Jiang, H., et al. Energy-based preference model offers better offline alignment than the Bradley-Terry preference model. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.

[77] Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012.

[78] Hu, J., Wu, X., Shen, W., Liu, J. K., Wang, W., Jiang, S., Wang, H., Chen, H., Chen, B., Fang, W., et al. OpenRLHF: A Ray-based easy-to-use, scalable and high-performance RLHF framework. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 656–666, 2025.

[79] Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in Atari. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 8022–8034, 2018.

[80] Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023.

[81] Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 14165–14178, 2023.

[82] Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. RewardBench: Evaluating reward models for language modeling. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 1755–1797, 2025.

[83] Li, J., Yang, Y., Bai, Y., Zhou, X., Li, Y., Sun, H., Liu, Y., Si, X., Ye, Y., Wu, Y., et al. Fundamental capabilities of large language models and their applications in domain scenarios: A survey. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 11116–11141, 2024.

[84] Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pp. 34209–34231. PMLR, 2025.

[85] Liu, C. Y., Zeng, L., Liu, J., Yan, R., He, J., Wang, C., Yan, S., Liu, Y., and Zhou, Y. Skywork-Reward: Bag of tricks for reward modeling in LLMs. arXiv preprint arXiv:2410.18451, 2024.

[86] Malik, S., Pyatkin, V., Land, S., Morrison, J., Smith, N. A., Hajishirzi, H., and Lambert, N. RewardBench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937, 2025.

[87] Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Fiegel, C., et al. Nash learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.

[88] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), pp. 8026–8037, 2019.

[89] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS), pp. 53728–53741, 2023.

[90] Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), pp. 1–16. IEEE, 2020.

[91] Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T. Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024.

[92] Savage Jr, R. P. The paradox of nontransitive dice. The American Mathematical Monthly, 101(5):429–436, 1994.

[93] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[94] Swamy, G., Dann, C., Kidambi, R., Wu, S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pp. 47345–47377, 2024.

[95] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024a.

[96] Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

[97] Tversky, A. Intransitivity of preferences. Psychological Review, 76(1):31, 1969.


