arxiv: 2605.17342 · v1 · pith:72WCD6QSnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords preference decomposition · cyclic preferences · LLM alignment · RLHF · game-theoretic decomposition · dynamic self-play · Nash equilibrium · transitive preferences

The pith

Explicitly decomposing human preferences into orthogonal transitive scalar and cyclic vector components enables more effective large language model alignment than implicit models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLHF methods model preferences as transitive scalars, but human preferences frequently contain cyclic elements that a single reward cannot represent. Prior implicit approaches such as the General Preference Model (GPM) entangle hierarchy with cyclicity and therefore cannot guarantee dominant solutions. The paper introduces the Hybrid Reward-Cyclic (HRC) model, which applies game-theoretic decomposition to separate preferences into an orthogonal transitive scalar part and a cyclic vector part. A complementary procedure, Dynamic Self-Play Preference Optimization (DSPPO), then treats alignment as a time-varying game that steers the policy toward Nash equilibrium. On synthetic data that mixes transitive and cyclic preferences, HRC converges faster and reaches higher accuracy than GPM, and evaluations on real benchmarks show consistent gains when both components are modeled explicitly.
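To make the limitation concrete, the following minimal sketch (not taken from the paper) shows why a scalar reward cannot encode a cycle while a vector component can. The function names `bt_score` and `cyclic_score`, the 2-D rotation pairing, and the rock-paper-scissors embeddings are all illustrative assumptions, not the paper's parameterization.

```python
import math

def bt_score(r_i, r_j):
    """Bradley-Terry style margin: a difference of scalar rewards."""
    return r_i - r_j

def cyclic_score(v_i, v_j):
    """Skew-symmetric pairing v_i^T R v_j with R = [[0, -1], [1, 0]].
    Assumed form for illustration; swapping the arguments flips the sign."""
    return v_i[1] * v_j[0] - v_i[0] * v_j[1]

def unit(angle_deg):
    """Point on the unit circle at the given angle in degrees."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

# (winner, loser) pairs forming a cycle: paper > rock, scissors > paper, rock > scissors.
cycle = [("paper", "rock"), ("scissors", "paper"), ("rock", "scissors")]

# Hypothetical 2-D embeddings spaced 120 degrees apart.
embed = {"rock": unit(0), "paper": unit(120), "scissors": unit(240)}
cyclic_margins = [cyclic_score(embed[w], embed[l]) for w, l in cycle]

# Any scalar rewards give margins that sum to zero around the cycle,
# so at least one of the three preferences must be scored the wrong way.
rewards = {"rock": 2, "paper": 5, "scissors": 3}
bt_margins = [bt_score(rewards[w], rewards[l]) for w, l in cycle]

print("vector margins:", cyclic_margins)
print("scalar margins:", bt_margins, "sum =", sum(bt_margins))
```

With these embeddings every margin around the cycle is positive, whereas the scalar margins always cancel out, whatever reward values are chosen.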

Core claim

The Hybrid Reward-Cyclic (HRC) model uses game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. This addresses a limitation of implicit formulations in prior models such as the General Preference Model (GPM), which fail to guarantee dominant solutions. Complementing this, Dynamic Self-Play Preference Optimization (DSPPO) treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments validate HRC's structural superiority in mixed transitive-cyclic settings, while evaluations on RewardBench 2, AlpacaEval 2.0, Arena-Hard, and MT-Bench confirm consistent gains over Bradley-Terry (BT) and GPM baselines.
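The notion of self-play steering a strategy toward Nash equilibrium can be illustrated on a toy symmetric zero-sum game. The sketch below is not DSPPO and involves no time-varying game. It is a generic multiplicative-weights self-play loop on rock-paper-scissors, where the time-averaged strategy approaches the uniform Nash equilibrium. The function names, step size `eta`, step count, and starting strategy are assumptions chosen for illustration.

```python
import math

# Rock-paper-scissors payoff for the row player; skew-symmetric (A[i][j] == -A[j][i]).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def payoffs(A, p):
    """Expected payoff of each pure action against the mixed strategy p."""
    return [sum(a_ij * p_j for a_ij, p_j in zip(row, p)) for row in A]

def self_play_mwu(A, p0, eta=0.02, steps=5000):
    """Multiplicative-weights self-play; returns the time-averaged strategy."""
    p = list(p0)
    avg = [0.0] * len(p)
    for _ in range(steps):
        # Accumulate the running average of the iterates.
        avg = [a + x / steps for a, x in zip(avg, p)]
        # Reweight each action by how well it does against the current strategy.
        u = payoffs(A, p)
        w = [x * math.exp(eta * u_i) for x, u_i in zip(p, u)]
        z = sum(w)
        p = [x / z for x in w]
    return avg

avg = self_play_mwu(A, [0.6, 0.3, 0.1])
# Exploitability: the best payoff any pure action achieves against the average strategy.
exploitability = max(payoffs(A, avg))
print("average strategy:", avg)
print("exploitability:", exploitability)
```

The exploitability of the averaged strategy is bounded by the average regret of the update rule, which is why it shrinks as the number of steps grows. The individual iterates need not settle, so the average is what gets reported.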

What carries the argument

The Hybrid Reward-Cyclic (HRC) model, which uses game-theoretic decomposition to separate preferences into an orthogonal transitive scalar component and a cyclic vector component.
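One standard way to realize such a split on a finite set of responses is to decompose an antisymmetric preference matrix into a rating-difference part and a remainder. The sketch below assumes that construction and the Frobenius inner product. Whether HRC uses exactly this operator is not stated in the abstract, and the names `decompose` and `frobenius_inner` and the example matrix are hypothetical.

```python
def decompose(A):
    """Split a skew-symmetric matrix A into a transitive part T[i][j] = r[i] - r[j]
    and a cyclic remainder C = A - T, where r[i] is the mean of row i."""
    n = len(A)
    r = [sum(row) / n for row in A]
    T = [[r[i] - r[j] for j in range(n)] for i in range(n)]
    C = [[A[i][j] - T[i][j] for j in range(n)] for i in range(n)]
    return r, T, C

def frobenius_inner(X, Y):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))

# Example margins: a mild hierarchy 0 > 1 > 2 plus a strong 0 -> 1 -> 2 -> 0 cycle.
A = [[0, 4, -1],
     [-4, 0, 4],
     [1, -4, 0]]

r, T, C = decompose(A)
print("scalar ratings r:", r)
print("transitive part T:", T)
print("cyclic part C:", C)
print("<T, C> =", frobenius_inner(T, C))
```

Under this construction the two parts sum back to `A` exactly, and every row of `C` sums to zero, which is what makes the Frobenius inner product of `T` and `C` vanish.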

If this is right

  • HRC converges faster and reaches higher accuracy than GPM on synthetic data containing both transitive and cyclic preferences.
  • HRC improves over both BT and GPM baselines on RewardBench 2, with particular gains in the Ties domain that tests complex non-strict preferences.
  • When paired with DSPPO, HRC produces higher length-controlled win rates on AlpacaEval 2.0 and Arena-Hard than SPPO baselines trained with BT or GPM.
  • The explicit separation enables robust handling of preferences that violate strict transitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same orthogonal decomposition could be applied to preference data in recommendation systems or multi-agent coordination where cycles commonly appear.
  • Isolating the cyclic vector component may allow targeted diagnostics for inconsistent outputs in deployed language models.
  • Extending the dynamic self-play procedure to other time-varying preference settings could produce more stable training trajectories.

Load-bearing premise

Human preferences admit an orthogonal decomposition into transitive scalar and cyclic vector components that preserves all relevant information, and the resulting game admits a dominant solution reachable by the proposed procedure.

What would settle it

A controlled experiment on data with known cyclic preferences in which HRC fails to converge faster or reach higher accuracy than GPM would falsify the claim of structural superiority.
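Such an experiment presupposes that the cyclic content of the data is known. A minimal sketch of one way to check this is to count intransitive triads in a set of pairwise outcomes. The function name `count_cyclic_triads` and the toy data are assumptions for illustration and are not part of the paper's protocol.

```python
from itertools import combinations

def count_cyclic_triads(beats, items):
    """beats: set of (winner, loser) pairs.
    Returns (cyclic, total) over triads in which all three pairs were compared."""
    cyclic = total = 0
    for a, b, c in combinations(items, 3):
        pairs = [(a, b), (b, c), (a, c)]
        if not all((x, y) in beats or (y, x) in beats for x, y in pairs):
            continue  # skip triads with a missing comparison
        total += 1
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            cyclic += 1
    return cyclic, total

# Toy data: A > B > C > A forms a cycle; D loses to everyone.
items = ["A", "B", "C", "D"]
beats = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}

cyclic, total = count_cyclic_triads(beats, items)
print(f"{cyclic} of {total} triads are cyclic")
```

A dataset with a cyclic-triad count of zero is fully transitive on the compared triads, so it could not distinguish HRC from a scalar model in the way the proposed test requires.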

Figures

Figures reproduced from arXiv: 2605.17342 by Jing Li, Kaiqi Zhao, Xiucheng Li, Yucong Huang.

Figure 1: Comparison of the Bradley-Terry (BT) model and the proposed Hybrid Reward-Cyclic (HRC) model. (a) The BT model maps each instruction-response pair to a scalar reward, assuming transitive preferences. (b) The HRC model explicitly decomposes preferences into a transitive scalar component (via BT) and a cyclic vector component (via GPM), combining them to produce the final preference signal s_HRC.
Figure 2: Iteration 3 alignment performance across three benchmarks. We compare BT+SPPO, GPM+SPPO, HRC+SPPO, and HRC+DSPPO at Iteration 3 on AlpacaEval 2.0 (LC win rate), MT-Bench (average score), and Arena-Hard-v0.1 (win rate). Each panel reports results using both Gemma-2B-it and Llama-3.1-8B-Instruct as preference models.
Original abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive-cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard RLHF with transitive scalar rewards fails to capture cyclic human preferences, and that implicit models like GPM entangle hierarchy with cyclicity without guaranteeing dominant solutions. It introduces the Hybrid Reward-Cyclic (HRC) model, which applies game-theoretic decomposition to explicitly separate preferences into orthogonal transitive (scalar) and cyclic (vector) components, and Dynamic Self-Play Preference Optimization (DSPPO) to treat alignment as a time-varying game converging to Nash equilibrium. Synthetic experiments show faster convergence and higher accuracy for HRC in mixed settings; benchmark results on RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench report consistent gains over BT and GPM baselines (e.g., +1.23% on Gemma-2B-it, 44.75% length-controlled win-rate on AlpacaEval). Code is released publicly.

Significance. If the claimed orthogonal decomposition is formally invertible and information-preserving, and if DSPPO reliably reaches dominant strategies, the framework could improve LLM alignment in domains with non-transitive preferences such as Ties. Public code availability aids reproducibility. The reported empirical improvements on multiple downstream tasks provide initial evidence of practical value, though attribution to the structural innovation requires further verification.

major comments (3)
  1. [Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.
  2. [DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.
  3. [Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.
minor comments (2)
  1. The abstract states results on RewardBench 2 and downstream tasks but does not specify the exact preference model architecture or training hyperparameters used for the HRC+DSPPO runs, hindering direct replication.
  2. Notation for the cyclic vector component and the game payoff matrix should be introduced with explicit definitions of the inner product used to enforce orthogonality.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We have addressed each of the major comments point by point below, making revisions to the manuscript where appropriate to enhance the theoretical rigor and experimental validation.

Point-by-point responses
  1. Referee: [Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.

    Authors: We agree that the presentation would benefit from an explicit derivation. The revised manuscript expands the theoretical framework section to define the inner product on preference relations and includes a proof that the decomposition operator is invertible, with the transitive and cyclic components orthogonal by construction and their combination exactly recovering the input preference. This addition also clarifies the distinction from GPM's implicit approach.
    Revision: yes

  2. Referee: [DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.

    Authors: We note that synthetic experiments use known ground-truth preferences to directly measure convergence to the Nash equilibrium, providing external grounding. The revised manuscript adds a discussion of fixed-point properties and observed convergence behavior across initializations. A full theoretical convergence proof for arbitrary time-varying games is challenging and noted as future work.
    Revision: partial

  3. Referee: [Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.

    Authors: We have updated the synthetic experiments section to report error bars over multiple random seeds, include ablations isolating the decomposition operator, and add statistical significance tests. These revisions support that observed gains arise from the explicit orthogonal structure rather than data synthesis details.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

Full rationale

The paper proposes the HRC decomposition and DSPPO procedure as new constructs, then validates them via synthetic data experiments (where the data generator is external to the model) and downstream benchmarks including RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1 and MT-Bench. These evaluations are independent of the fitted parameters and do not reduce to self-citation chains or tautological redefinitions. The orthogonality claim is presented as part of the model definition rather than derived from prior results by the same authors, and the Nash-equilibrium guidance is tested against external win-rate metrics rather than being self-referential by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central modeling step rests on an unproven domain assumption that preferences admit an orthogonal transitive-cyclic split; no numerical free parameters are named in the abstract, and the two new constructs (HRC and DSPPO) are introduced without external falsifiable handles.

axioms (1)
  • [domain assumption] Human preferences can be orthogonally decomposed into transitive scalar and cyclic vector components without loss of relevant structure.
    This premise is required for the HRC model to disentangle hierarchy from cyclicity as described in the abstract.
invented entities (2)
  • Hybrid Reward-Cyclic (HRC) model · no independent evidence
    purpose: Explicit game-theoretic decomposition of preferences into orthogonal transitive and cyclic parts
    Newly defined model introduced to overcome limitations of implicit formulations such as GPM.
  • Dynamic Self-Play Preference Optimization (DSPPO) · no independent evidence
    purpose: Treats alignment as a time-varying game that converges to Nash equilibrium
    New optimization algorithm proposed to guide policy under the HRC decomposition.

pith-pipeline@v0.9.0 · 5849 in / 1371 out tokens · 53114 ms · 2026-05-20T14:21:58.732990+00:00 · methodology



Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 15 internal anchors

  1. [1]

    Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952.

  2. [2]

    Deep Reinforcement Learning from Human Preferences. Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS).

  3. [3]

    Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. Proceedings of the 40th International Conference on Machine Learning (ICML).

  4. [4]

    Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model. Proceedings of the 42nd International Conference on Machine Learning (ICML).

  5. [5]

    Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  6. [6]

    Intransitivity of preferences. Psychological Review, 1969.

  7. [7]

    Nash Learning From Human Feedback. Proceedings of the 41st International Conference on Machine Learning (ICML).

  8. [8]

    LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).

  9. [9]

    RLHF Workflow: From Reward Modeling to Online RLHF. Transactions on Machine Learning Research, 2024.

  10. [10]

    A General Theoretical Paradigm to Understand Learning from Human Preferences. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS).

  11. [11]

    A Minimaximalist Approach to Reinforcement Learning from Human Feedback. Proceedings of the 41st International Conference on Machine Learning (ICML).

  12. [13]

    Self-Play Preference Optimization for Language Model Alignment. Proceedings of the 13th International Conference on Learning Representations (ICLR).

  13. [14]

    Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning. Proceedings of the 13th International Conference on Learning Representations (ICLR).

  14. [15]

    Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment. Proceedings of the 42nd International Conference on Machine Learning (ICML).

  15. [17]

    RewardBench: Evaluating Reward Models for Language Modeling. Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

  16. [18]

    Open-ended Learning in Symmetric Zero-sum Games. Proceedings of the 36th International Conference on Machine Learning (ICML).

  17. [23]

    Adaptive game playing using multiplicative weights. Games and Economic Behavior, 1999.

  18. [26]

    ZeRO: Memory optimizations toward training trillion parameter models. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), 2020.

  19. [27]

    PyTorch: An imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS).

  20. [28]

    OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  21. [29]

    Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  22. [30]

    UltraFeedback: Boosting Language Models with Scaled AI Feedback. Proceedings of the 41st International Conference on Machine Learning (ICML).

  23. [31]

    Training language models to follow instructions with human feedback. Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS).

  24. [32]

    Re-evaluating evaluation. Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS).

  25. [33]

    Real world games look like spinning tops. Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS).

  26. [36]

    Language models are few-shot learners. Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS).

  27. [37]

    Towards better value principles for large language model alignment: A systematic evaluation and enhancement. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL).

  28. [39]

    Reward learning from human preferences and demonstrations in Atari. Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS).

  29. [40]

    The paradox of nontransitive dice. The American Mathematical Monthly, 1994.

  30. [41]

    Curriculum learning. Proceedings of the 26th International Conference on Machine Learning (ICML).

  31. [43]

    Direct preference optimization: Your language model is secretly a reward model. Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS).

  32. [44]

    Condorcet’s paradox. 2006.

  33. [45]

    On the limitations of the Elo, real-world games are transitive, not additive. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS).

  34. [46]

    GPT-4o System Card. arXiv preprint arXiv:2410.21276.

  35. [47]

    Modeling intransitivity in matchup and comparison data. Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM).

  36. [48]

    On the power of curriculum learning in training deep networks. Proceedings of the 36th International Conference on Machine Learning (ICML).

  37. [49]

    Matrix analysis. 2012.

  38. [50]

    Fundamental capabilities of large language models and their applications in domain scenarios: A survey. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL).

  39. [52]

    From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline. Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.

  40. [57]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  41. [58]

    G., Guo, Z

    Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.\ 4447--4455, 2024

  42. [59]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  43. [60]

    Re-evaluating evaluation

    Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp.\ 3272--3283, 2018

  44. [61]

    Open-ended learning in symmetric zero-sum games

    Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp.\ 434--443, 2019

  45. [62]

    Curriculum learning

    Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp.\ 41--48, 2009

  46. [63]

    M., and Gidel, G

    Bertrand, Q., Czarnecki, W. M., and Gidel, G. On the limitations of the elo, real-world games are transitive, not additive. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.\ 2905--2921, 2023

  47. [64]

    Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952

  48. [65]

    D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pp.\ 1877--1901, 2020

  49. [66]

    and Joachims, T

    Chen, S. and Joachims, T. Modeling intransitivity in matchup and comparison data. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM), pp.\ 227--236, 2016

  50. [67]

    F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D

    Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017

  51. [68]

    Ultrafeedback: Boosting language models with scaled ai feedback

    Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., et al. Ultrafeedback: Boosting language models with scaled ai feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pp.\ 9722--9744, 2024

  52. [69]

    M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M

    Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M. Real world games look like spinning tops. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pp.\ 17443--17454, 2020

  53. [70]

    Rlhf workflow: From reward modeling to online rlhf

    Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T. Rlhf workflow: From reward modeling to online rlhf. Transactions on Machine Learning Research, 2024, 2024

  54. [71]

    Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

  55. [72]

    and Schapire, R

    Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29 0 (1-2): 0 79--103, 1999

  56. [73]

    Gehrlein, W. V. Condorcet’s paradox. Springer, 2006

  57. [74]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  58. [75]

    and Weinshall, D

    Hacohen, G. and Weinshall, D. On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp.\ 2535--2544, 2019

  59. [76]

    Energy-based preference model offers better offline alignment than the Bradley-Terry preference model

    Hong, Y., Zhang, H., Bao, J., Jiang, H., et al. Energy-based preference model offers better offline alignment than the Bradley-Terry preference model. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

  60. [77]

    Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012

  61. [78]

    OpenRLHF: A Ray-based easy-to-use, scalable and high-performance RLHF framework

    Hu, J., Wu, X., Shen, W., Liu, J. K., Wang, W., Jiang, S., Wang, H., Chen, H., Chen, B., Fang, W., et al. OpenRLHF: A Ray-based easy-to-use, scalable and high-performance RLHF framework. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 656--666, 2025

  62. [79]

    Reward learning from human preferences and demonstrations in Atari

    Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in Atari. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 8022--8034, 2018

  63. [80]

    AI Alignment: A Comprehensive Survey

    Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023

  64. [81]

    Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 14165--14178, 2023

  65. [82]

    Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. RewardBench: Evaluating reward models for language modeling. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 1755--1797, 2025

  66. [83]

    Fundamental capabilities of large language models and their applications in domain scenarios: A survey

    Li, J., Yang, Y., Bai, Y., Zhou, X., Li, Y., Sun, H., Liu, Y., Si, X., Ye, Y., Wu, Y., et al. Fundamental capabilities of large language models and their applications in domain scenarios: A survey. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 11116--11141, 2024

  67. [84]

    From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline

    Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pp. 34209--34231. PMLR, 2025

  68. [85]

    Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

    Liu, C. Y., Zeng, L., Liu, J., Yan, R., He, J., Wang, C., Yan, S., Liu, Y., and Zhou, Y. Skywork-Reward: Bag of tricks for reward modeling in LLMs. arXiv preprint arXiv:2410.18451, 2024

  69. [86]

    RewardBench 2: Advancing Reward Model Evaluation

    Malik, S., Pyatkin, V., Land, S., Morrison, J., Smith, N. A., Hajishirzi, H., and Lambert, N. RewardBench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937, 2025

  70. [87]

    Nash learning from human feedback

    Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Fiegel, C., et al. Nash learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  71. [88]

    PyTorch: An imperative style, high-performance deep learning library

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), pp. 8026--8037, 2019

  72. [89]

    Direct preference optimization: Your language model is secretly a reward model

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS), pp. 53728--53741, 2023

  73. [90]

    ZeRO: Memory optimizations toward training trillion parameter models

    Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), pp. 1--16. IEEE, 2020

  74. [91]

    Direct Nash optimization: Teaching language models to self-improve with general preferences

    Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T. Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024

  75. [92]

    Savage Jr, R. P. The paradox of nontransitive dice. The American Mathematical Monthly, 101(5): 429--436, 1994

  76. [93]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  77. [94]

    A minimaximalist approach to reinforcement learning from human feedback

    Swamy, G., Dann, C., Kidambi, R., Wu, S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pp. 47345--47377, 2024

  78. [95]

    Gemma: Open Models Based on Gemini Research and Technology

    Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024a

  79. [96]

    Gemma 2: Improving Open Language Models at a Practical Size

    Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b

  80. [97]

    Intransitivity of preferences

    Tversky, A. Intransitivity of preferences. Psychological Review, 76(1): 31, 1969
