Pith · machine review for the scientific record

arxiv: 2605.08037 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords GraphDPO · preference graphs · direct preference optimization · Plackett-Luce · transitivity · language model alignment · rollout rankings · RLHF

The pith

Language models align more effectively by optimizing over full preference graphs from multiple rollouts instead of isolated pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct preference optimization reduces multiple responses per prompt to independent pairs, which discards transitivity and can create conflicting signals. The paper introduces GraphDPO to represent these responses as a directed acyclic graph with dominance edges and optimize a Plackett-Luce-inspired objective over graph neighborhoods. This aggregates richer supervision while recovering standard DPO as the special case of isolated pairs. Readers care because the approach uses the same rollout data more efficiently, avoids redundant loss terms, and maintains linear complexity, yielding stronger results on reasoning and program synthesis.

Core claim

Graph Direct Preference Optimization generalizes DPO to directed acyclic preference graphs induced by rollout rankings, encoding dominance as edges and optimizing a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods while enforcing transitivity. Equivalence classes group identical-preference responses into layers that contribute zero loss, and optional ground-truth anchoring with an annealed schedule stabilizes training.
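The core claim is stated in prose only. In assumed notation (not taken from the paper), write the implicit DPO reward as \(s_\theta(y \mid x) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\) and let \(N(v)\) be the set of responses dominated by node \(v\) in the preference DAG \(G_x\). One plausible form of the neighborhood-aggregated objective is then:

```latex
% Sketch only: s_\theta and N(v) are assumed notation, not the paper's.
\mathcal{L}_{\mathrm{GraphDPO}}(x)
  \;=\; -\sum_{v \in G_x}
    \left[\, s_\theta(v \mid x)
      \;-\; \log\!\Big( e^{\,s_\theta(v \mid x)}
        + \sum_{u \in N(v)} e^{\,s_\theta(u \mid x)} \Big) \right]
```

Under this form, a node with empty \(N(v)\) contributes zero, and a graph consisting of the single edge \(y_w \succ y_l\) collapses to \(-\log \sigma\big(s_\theta(y_w \mid x) - s_\theta(y_l \mid x)\big)\), the standard DPO loss, consistent with the stated special-case recovery.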

What carries the argument

Directed acyclic preference graph with a Plackett-Luce-inspired objective aggregated over neighborhoods, which enforces transitivity and sets intra-layer loss to zero for equivalence classes.

Load-bearing premise

Rollout rankings can be turned into a directed acyclic graph without cycles or irresolvable conflicts, and the neighborhood aggregation supplies unbiased training signals.
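Whether raw pairwise preferences actually admit a DAG is mechanically checkable. A sketch of a standard diagnostic (Kahn's algorithm, not the paper's procedure; function names are hypothetical):

```python
from collections import defaultdict, deque

def has_cycle(num_nodes, edges):
    """Kahn's algorithm: the preference relation over num_nodes responses
    forms a DAG iff a topological order consumes every node."""
    indeg = [0] * num_nodes
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(num_nodes) if indeg[i] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen < num_nodes  # leftover nodes lie on a cycle

def cycle_rate(prompt_edge_lists, num_nodes):
    """Fraction of prompts whose raw pairwise preferences contain a
    cycle -- the DAG-validity diagnostic the premise depends on."""
    flagged = sum(has_cycle(num_nodes, e) for e in prompt_edge_lists)
    return flagged / len(prompt_edge_lists)
```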

What would settle it

Train the same model with GraphDPO and with standard pairwise DPO on identical multi-response datasets, then compare win rates or accuracy on held-out reasoning benchmarks; absence of consistent gains would refute the benefit of graph structure.

Figures

Figures reproduced from arXiv: 2605.08037 by Chuanneng Sun, Kristina Klinkner, Ning Liu, Shervin Malmasi.

Figure 1: GraphDPO pipeline for LLM alignment. For each prompt, the policy samples … (image: figures/full_fig_p004_1.png)
Figure 2: Sensitivity of GraphDPO to the initial anchoring weight (image: figures/full_fig_p017_2.png)
read the original abstract

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Graph Direct Preference Optimization (GraphDPO) as a generalization of DPO for language model alignment. It constructs directed acyclic preference graphs from multiple rollout rankings per prompt, encodes dominance as edges, and optimizes a Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods to enforce transitivity. The method recovers standard DPO when the graph reduces to pairs, uses an equivalence-class layer construction for identical-preference responses (intra-layer edges contribute zero loss), maintains linear per-prompt complexity via log-sum-exp, and optionally anchors with ground-truth solutions under an annealed schedule. Experiments on reasoning and program synthesis tasks report superior performance over pairwise and listwise baselines.

Significance. If the claims hold, GraphDPO offers a principled way to exploit richer multi-rollout preference data without quadratic complexity or loss of transitivity, potentially improving alignment stability and performance on tasks where pairwise DPO discards structure. The explicit reduction to DPO, linear aggregation, and optional oracle anchoring are positive features that could make the approach practical.

major comments (3)
  1. [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.
  2. [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.
  3. [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.
minor comments (2)
  1. [Abstract] The abstract states that the method 'recovers standard DPO as a special case' but does not specify the exact graph configuration (e.g., whether isolated pairs or a collection of pairs) under which the reduction is exact.
  2. [Method] Notation for the Plackett-Luce-inspired objective and the annealing schedule parameters should be introduced with explicit equations rather than descriptive prose only.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below. Revisions have been made to incorporate additional derivations, proofs, pseudocode, and experimental diagnostics as appropriate.

read point-by-point responses
  1. Referee: [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.

    Authors: We agree that an explicit derivation strengthens the central claim. In the revised manuscript we add a formal proof sketch (new Appendix A) showing that the layer construction assigns zero loss to intra-layer edges by definition, so the Plackett-Luce neighborhood aggregation remains unbiased with respect to the observed ranking data. Transitivity is enforced because the topological order of layers precludes cycles from contributing to the gradient. The reduction to standard DPO follows immediately when each equivalence class has size one and the graph contains only pairwise edges. We also note that the DAG assumption is induced from the input rankings; any residual cycles are handled by the layer grouping of ties. revision: yes

  2. Referee: [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.

    Authors: We accept that the original manuscript lacked sufficient detail. The revision now includes the fully expanded loss expression (Equation 4) and a new Algorithm 1 that implements the log-sum-exp aggregation. We prove that the per-prompt cost remains linear in the number of responses because the DAG admits a topological traversal and each neighborhood sum is computed once via a single forward pass of log-sum-exp; no quadratic terms appear even for dense graphs. A brief error analysis bounding the difference from the exact Plackett-Luce likelihood is added in Section 3.3. revision: yes

  3. Referee: [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.

    Authors: This is a fair request for additional rigor. We have added a new subsection (5.4) and Table 3 that report cycle-detection rates (average 2.8 % across tasks) and intransitivity frequencies for the collected rollouts. We also include a controlled sensitivity study in which ranking noise is injected at varying levels; GraphDPO retains its advantage over pairwise DPO even under moderate noise. These diagnostics support that the observed gains are not artifacts of unusually clean data. revision: yes
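The rebuttal's Equation 4 and Algorithm 1 are not reproduced on this page. A minimal sketch of how a layered log-sum-exp aggregation can stay linear per prompt (the layering, scores `s`, and all function names are assumptions, not the authors' code): traverse layers from least to most preferred, maintaining one running log-sum-exp of all strictly dominated scores, so each response is touched a constant number of times.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def layered_pl_loss(layers, s):
    """Plackett-Luce-style loss over a layered preference DAG in O(n).
    layers: equivalence classes ordered most- to least-preferred;
    s: per-response implicit reward (e.g. beta * log-ratio to a reference).
    Ties within a layer never enter each other's normalizer, so
    intra-layer comparisons contribute zero loss."""
    loss = 0.0
    z = -math.inf  # running log-sum-exp of all lower-layer scores
    for layer in reversed(layers):
        for v in layer:
            if z != -math.inf:  # the bottom layer dominates nothing
                loss -= s[v] - logaddexp(s[v], z)
        for v in layer:  # fold this layer in only after scoring it
            z = logaddexp(z, s[v])
    return loss
```

As a sanity check, a two-response chain reduces to `-log(sigmoid(s_w - s_l))`, the standard pairwise DPO loss, and a single all-tied layer yields exactly zero.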

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from standard Plackett-Luce model

full rationale

The paper presents GraphDPO as a direct generalization of DPO by extending the Plackett-Luce ranking model to operate over directed acyclic preference graphs induced by multiple rollouts. The objective aggregates log-sum-exp terms over graph neighborhoods and is explicitly constructed to recover pairwise DPO when the graph degenerates to independent edges; this is a designed special case rather than an input being renamed as output. No load-bearing step relies on self-citation chains, imported uniqueness theorems, or ansatzes from prior author work. The equivalence-class construction for intra-layer zero loss is introduced as a new handling mechanism for sparse signals, not fitted from the target result. The overall derivation chain remains independent of its claimed predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that preference rankings form usable DAGs and on the Plackett-Luce model for aggregating neighborhood supervision; the annealed anchoring schedule introduces at least one tunable hyperparameter.

free parameters (1)
  • annealing schedule parameters
    The optional ground-truth anchoring uses an annealed schedule whose rate and strength are not derived from first principles and must be chosen for training stability.
axioms (2)
  • domain assumption Rollout rankings induce a directed acyclic graph without cycles
    Invoked when the paper states that preference data consists of multiple rollouts inducing rich preference structure that can be represented as DAGs.
  • domain assumption Plackett-Luce model accurately captures dominance relations over graph neighborhoods
    The objective is explicitly Plackett-Luce-inspired and aggregates over graph neighborhoods.
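The ledger's one free parameter can be made concrete. The paper's actual schedule is not specified in this review, so exponential decay toward a floor is an assumed example, with `lam0`, `tau`, and `floor` as the tunable quantities the ledger flags:

```python
import math

def anchoring_weight(step, lam0=1.0, tau=1000.0, floor=0.0):
    """Annealed ground-truth anchoring weight: starts at lam0 and decays
    toward `floor` with time constant tau, gradually relaxing oracle
    supervision. The functional form is an assumption, not the paper's."""
    return floor + (lam0 - floor) * math.exp(-step / tau)
```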

pith-pipeline@v0.9.0 · 5551 in / 1371 out tokens · 49560 ms · 2026-05-11T02:22:33.045353+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

