Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3
The pith
Explicitly decomposing human preferences into orthogonal transitive scalar and cyclic vector components enables more effective large language model alignment than implicit models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Hybrid Reward-Cyclic (HRC) model utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components, addressing the limitation of implicit formulations in prior models like GPM that fail to guarantee dominant solutions. Complementing this, Dynamic Self-Play Preference Optimization (DSPPO) treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments validate HRC's structural superiority in mixed transitive-cyclic settings, while evaluations on RewardBench 2, AlpacaEval 2.0, Arena-Hard, and MT-Bench confirm consistent gains over BT and GPM.
What carries the argument
The Hybrid Reward-Cyclic (HRC) model, which uses game-theoretic decomposition to separate preferences into an orthogonal transitive scalar component and a cyclic vector component.
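To make the idea of a transitive/cyclic split concrete, here is a minimal sketch on a small finite set of responses. It is not the paper's learned HRC model: it assumes preferences are given as a skew-symmetric matrix `A` (with `A[i, j] > 0` meaning response i is preferred to response j) and uses the standard row-mean construction for the scalar part. The function name `decompose_preferences` and the choice of the Frobenius inner product are illustrative assumptions.

```python
import numpy as np


def decompose_preferences(A):
    """Split a skew-symmetric preference matrix into transitive and cyclic parts.

    Assumption (not from the paper): A[i, j] = -A[j, i] scores how strongly
    item i is preferred to item j over a finite, fully compared item set.
    """
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, -A.T):
        raise ValueError("A must be skew-symmetric")
    f = A.mean(axis=1)                      # scalar score per item
    transitive = f[:, None] - f[None, :]    # T[i, j] = f[i] - f[j]
    cyclic = A - transitive                 # residual; each row sums to zero
    return f, transitive, cyclic


# A rock-paper-scissors cycle added to a ranking in which item 0 is strongest.
rps = np.array([[0.0, 1.0, -1.0],
                [-1.0, 0.0, 1.0],
                [1.0, -1.0, 0.0]])
scores = np.array([2.0, 1.0, 0.0])
A = (scores[:, None] - scores[None, :]) + rps

f, T, C = decompose_preferences(A)
print("scalar scores:", f)
print("Frobenius inner product <T, C>:", float(np.sum(T * C)))
```

In this toy case the scalar part recovers the ranking up to an additive constant and the residual is the pure cycle, which is the behaviour the explicit split is meant to provide.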
If this is right
- HRC converges faster and reaches higher accuracy than GPM on synthetic data containing both transitive and cyclic preferences.
- HRC improves over both BT and GPM baselines on RewardBench 2, with particular gains in the Ties domain that tests complex non-strict preferences.
- When paired with DSPPO, HRC produces higher length-controlled win rates on AlpacaEval 2.0 and Arena-Hard than SPPO baselines trained with BT or GPM.
- The explicit separation enables robust handling of preferences that violate strict transitivity.
Where Pith is reading between the lines
- The same orthogonal decomposition could be applied to preference data in recommendation systems or multi-agent coordination where cycles commonly appear.
- Isolating the cyclic vector component may allow targeted diagnostics for inconsistent outputs in deployed language models.
- Extending the dynamic self-play procedure to other time-varying preference settings could produce more stable training trajectories.
Load-bearing premise
Human preferences admit an orthogonal decomposition into transitive scalar and cyclic vector components that preserves all relevant information and that the resulting game admits a dominant solution reachable by the proposed procedure.
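One standard way to state this premise precisely on a finite response set is sketched below. The paper's own operator and inner product may differ; the symbols $A$, $f$, $T$, $C$, and $n$ are assumptions introduced here for illustration, with $A$ a skew-symmetric matrix of pairwise preference margins and $\langle\cdot,\cdot\rangle_F$ the Frobenius inner product.

```latex
\begin{aligned}
A_{ij} &= -A_{ji}, \qquad f_i = \frac{1}{n}\sum_{j=1}^{n} A_{ij}, \\
T_{ij} &= f_i - f_j, \qquad C_{ij} = A_{ij} - T_{ij}, \\
\sum_{j} C_{ij} &= \sum_{j} A_{ij} - n f_i + \sum_{j} f_j = 0
  \quad\text{since } \sum_{j} f_j = \frac{1}{n}\sum_{i,j} A_{ij} = 0, \\
\langle T, C\rangle_F &= \sum_{i,j} (f_i - f_j)\, C_{ij}
  = \sum_{i} f_i \sum_{j} C_{ij} - \sum_{j} f_j \sum_{i} C_{ij} = 0 .
\end{aligned}
```

Under these assumptions the split is lossless by construction ($A = T + C$) and the two parts are orthogonal. Whether human preference data fit this finite, fully compared setting is exactly what the premise leaves open.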
What would settle it
A controlled experiment on data with known cyclic preferences in which HRC fails to converge faster or reach higher accuracy than GPM would falsify the claim of structural superiority.
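A controlled test of this kind needs data whose cyclic content is known in advance. As a hedged illustration, the sketch below builds preference matrices with a tunable cyclic weight and reports how often a scalar-only ranking recovers the preferred item. The generator `mixed_preferences` and the row-mean baseline are hypothetical stand-ins, not the paper's synthetic setup or the BT, GPM, or HRC models.

```python
import numpy as np


def transitive_only_accuracy(A):
    """Fraction of pairs whose preference direction a scalar score recovers.

    Assumption (illustrative): A is a skew-symmetric matrix of ground-truth
    preference margins, and the scalar baseline scores each item by its
    row mean.
    """
    A = np.asarray(A, dtype=float)
    f = A.mean(axis=1)
    predicted = np.sign(f[:, None] - f[None, :])
    upper = np.triu_indices(A.shape[0], k=1)   # each unordered pair once
    return float(np.mean(predicted[upper] == np.sign(A[upper])))


def mixed_preferences(n_items, cyclic_weight, seed=0):
    """Hypothetical generator: random ranking plus a random cyclic residual."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(size=n_items)
    transitive = scores[:, None] - scores[None, :]
    noise = rng.normal(size=(n_items, n_items))
    skew = noise - noise.T
    row_mean = skew.mean(axis=1)
    # Subtract the ranking-like part so only the cyclic residual remains.
    cyclic = skew - (row_mean[:, None] - row_mean[None, :])
    return transitive + cyclic_weight * cyclic


for weight in (0.0, 0.5, 2.0):
    acc = transitive_only_accuracy(mixed_preferences(8, weight))
    print(f"cyclic_weight={weight}: scalar-only accuracy={acc:.2f}")
```

With zero cyclic weight a scalar score explains every pair, and on a pure cycle it explains none, so any comparison between models on such data should report the cyclic weight alongside accuracy.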
Original abstract
Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard RLHF with transitive scalar rewards fails to capture cyclic human preferences, and that implicit models like GPM entangle hierarchy with cyclicity without guaranteeing dominant solutions. It introduces the Hybrid Reward-Cyclic (HRC) model, which applies game-theoretic decomposition to explicitly separate preferences into orthogonal transitive (scalar) and cyclic (vector) components, and Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game and progressively guides the policy toward the Nash equilibrium. Synthetic experiments show faster convergence and higher accuracy for HRC than for GPM in mixed transitive-cyclic settings; results on RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench report consistent gains over BT and GPM baselines (e.g., +1.23% on RewardBench 2 with Gemma-2B-it, and a peak length-controlled win rate of 44.75% on AlpacaEval 2.0 with HRC+DSPPO). Code is released publicly.
Significance. If the claimed orthogonal decomposition is formally invertible and information-preserving, and if DSPPO reliably reaches dominant strategies, the framework could improve LLM alignment in settings with non-transitive or non-strict preferences, such as the Ties domain of RewardBench 2. Public code availability aids reproducibility. The reported empirical improvements on multiple downstream tasks provide initial evidence of practical value, though attributing them to the structural innovation requires further verification.
major comments (3)
- [Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.
- [DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.
- [Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.
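The reporting requested in the last comment can be sketched as follows, using only the Python standard library. The helper names and the per-seed numbers are placeholders for illustration and are NOT results from the paper; a full analysis would also convert the t statistic to a p-value with `len(a) - 1` degrees of freedom.

```python
import math
import statistics


def mean_and_stderr(values):
    """Sample mean and standard error of the mean across seeds."""
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


def paired_t_statistic(a, b):
    """t statistic for per-seed differences a[i] - b[i] (df = len(a) - 1)."""
    if len(a) != len(b):
        raise ValueError("runs must be paired by seed")
    diffs = [x - y for x, y in zip(a, b)]
    mean, stderr = mean_and_stderr(diffs)
    return mean / stderr


# Placeholder accuracies, one per seed; NOT results from the paper.
run_a = [0.81, 0.79, 0.83, 0.80, 0.82]
run_b = [0.78, 0.77, 0.80, 0.78, 0.79]

mean_a, se_a = mean_and_stderr(run_a)
mean_b, se_b = mean_and_stderr(run_b)
print(f"model A: {mean_a:.3f} +/- {se_a:.3f}")
print(f"model B: {mean_b:.3f} +/- {se_b:.3f}")
print(f"paired t statistic: {paired_t_statistic(run_a, run_b):.2f}")
```

Pairing runs by seed matters here because both models would be trained on the same synthesized data, so the per-seed difference removes variance that comes from the data generator rather than from the model.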
minor comments (2)
- The abstract states results on RewardBench 2 and downstream tasks but does not specify the exact preference model architecture or training hyperparameters used for the HRC+DSPPO runs, hindering direct replication.
- Notation for the cyclic vector component and the game payoff matrix should be introduced with explicit definitions of the inner product used to enforce orthogonality.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We have addressed each of the major comments point by point below, making revisions to the manuscript where appropriate to enhance the theoretical rigor and experimental validation.
- Referee: [Abstract / Theoretical Framework] Abstract / HRC model description: the claim that game-theoretic decomposition yields an 'orthogonal' split into transitive scalar and cyclic vector components that 'exactly' recovers the original preference is asserted without a derivation showing invertibility of the operator or orthogonality under a specified inner product. This directly underpins the asserted superiority over GPM's implicit entanglement and the guarantee of dominant solutions; without it, faster synthetic convergence could be an artifact of the data generator rather than the decomposition property.
  Authors: We agree that the presentation would benefit from an explicit derivation. The revised manuscript expands the theoretical framework section to define the inner product on preference relations and includes a proof that the decomposition operator is invertible, with the transitive and cyclic components orthogonal by construction and their combination exactly recovering the input preference. This addition also clarifies the distinction from GPM's implicit approach.
  Revision: yes
- Referee: [DSPPO Description] DSPPO section: framing alignment as iterative self-play toward Nash equilibrium in a time-varying game risks circularity if no external fixed benchmarks or grounding are used to validate convergence; the abstract supplies no convergence proof or fixed-point analysis showing that the dynamic procedure reaches a dominant strategy independent of the self-generated data.
  Authors: We note that synthetic experiments use known ground-truth preferences to directly measure convergence to the Nash equilibrium, providing external grounding. The revised manuscript adds a discussion of fixed-point properties and observed convergence behavior across initializations. A full theoretical convergence proof for arbitrary time-varying games is challenging and noted as future work.
  Revision: partial
- Referee: [Experiments / Synthetic Validation] Synthetic data experiments: the reported faster convergence and higher accuracy for HRC versus GPM lack error bars, ablation controls on the decomposition operator, or statistical tests, making it impossible to confirm that gains stem from the explicit orthogonal structure rather than from how the mixed transitive-cyclic data was synthesized.
  Authors: We have updated the synthetic experiments section to report error bars over multiple random seeds, include ablations isolating the decomposition operator, and add statistical significance tests. These revisions support that observed gains arise from the explicit orthogonal structure rather than data synthesis details.
  Revision: yes
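The convergence question raised in the DSPPO exchange above can be illustrated on a game whose equilibrium is known. The sketch below runs generic multiplicative-weights self-play on rock-paper-scissors and measures how exploitable the time-averaged strategy is. It is an assumption-laden toy, not DSPPO: the function `self_play_mwu`, the fixed (not time-varying) payoff matrix, and the step size are all chosen here for illustration.

```python
import numpy as np


def self_play_mwu(payoff, steps=20000, eta=0.02, init=None):
    """Multiplicative-weights self-play on a symmetric zero-sum game.

    Illustrative only: generic Hedge-style self-play, not DSPPO.
    payoff[i, j] is the (skew-symmetric) payoff of action i against action j.
    Returns the time-averaged strategy.
    """
    payoff = np.asarray(payoff, dtype=float)
    n = payoff.shape[0]
    p = np.full(n, 1.0 / n) if init is None else np.asarray(init, dtype=float)
    log_w = np.log(p)                        # init must be strictly positive
    average = np.zeros(n)
    for _ in range(steps):
        p = np.exp(log_w - log_w.max())      # subtract max for numerical stability
        p /= p.sum()
        average += p
        log_w = log_w + eta * (payoff @ p)   # upweight actions that beat the current policy
    return average / steps


rps = np.array([[0.0, 1.0, -1.0],
                [-1.0, 0.0, 1.0],
                [1.0, -1.0, 0.0]])
avg = self_play_mwu(rps, init=[0.6, 0.3, 0.1])
exploitability = float(np.max(rps @ avg))    # best-response payoff against avg
print("average strategy:", avg)
print("exploitability:", exploitability)
```

Because the equilibrium of this game is the uniform strategy, exploitability gives an external check that does not depend on the self-generated play, which is the kind of grounding the referee asks for.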
Circularity Check
No significant circularity detected; derivation remains self-contained
The paper proposes the HRC decomposition and DSPPO procedure as new constructs, then validates them via synthetic data experiments (where the data generator is external to the model) and downstream benchmarks including RewardBench 2, AlpacaEval 2.0, Arena-Hard-v0.1 and MT-Bench. These evaluations are independent of the fitted parameters and do not reduce to self-citation chains or tautological redefinitions. The orthogonality claim is presented as part of the model definition rather than derived from prior results by the same authors, and the Nash-equilibrium guidance is tested against external win-rate metrics rather than being self-referential by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human preferences can be orthogonally decomposed into transitive scalar and cyclic vector components without loss of relevant structure
invented entities (2)
- Hybrid Reward-Cyclic (HRC) model: no independent evidence
- Dynamic Self-Play Preference Optimization (DSPPO): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear). Paper passage: any preference function ϕ(v,w) can be uniquely decomposed into the sum of a transitive component ϕ_T and a cyclic component ϕ_C: ϕ(v,w) = ϕ_T(v,w) + ϕ_C(v,w), where ϕ_T(v,w) = f(v) − f(w)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear). Paper passage: HRC model explicitly disentangles human preferences into two orthogonal components: a transitive scalar component ... and a cyclic vector component
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.