Pith · machine review for the scientific record

arxiv: 2605.08037 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords GraphDPO · preference graphs · direct preference optimization · Plackett-Luce · transitivity · language model alignment · rollout rankings · RLHF

The pith

Language models align more effectively by optimizing over full preference graphs from multiple rollouts instead of isolated pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct preference optimization reduces multiple responses per prompt to independent pairs, which discards transitivity and can create conflicting signals. The paper introduces GraphDPO to represent these responses as a directed acyclic graph with dominance edges and optimize a Plackett-Luce-inspired objective over graph neighborhoods. This aggregates richer supervision while recovering standard DPO as the special case of isolated pairs. Readers care because the approach uses the same rollout data more efficiently, avoids redundant loss terms, and maintains linear complexity, yielding stronger results on reasoning and program synthesis.

Core claim

Graph Direct Preference Optimization generalizes DPO to directed acyclic preference graphs induced by rollout rankings, encoding dominance as edges and optimizing a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods while enforcing transitivity. Equivalence classes group identical-preference responses into layers that contribute zero loss, and optional ground-truth anchoring with an annealed schedule stabilizes training.
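The core claim is stated in prose only. In assumed notation (not taken from the paper), write the implicit DPO reward as \(s_\theta(y \mid x) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\) and let \(N(v)\) be the set of responses dominated by node \(v\) in the preference DAG \(G_x\). One plausible form of the neighborhood-aggregated objective is then:

```latex
% Sketch only: s_\theta and N(v) are assumed notation, not the paper's.
\mathcal{L}_{\mathrm{GraphDPO}}(x)
  \;=\; -\sum_{v \in G_x}
    \left[\, s_\theta(v \mid x)
      \;-\; \log\!\Big( e^{\,s_\theta(v \mid x)}
        + \sum_{u \in N(v)} e^{\,s_\theta(u \mid x)} \Big) \right]
```

Under this form, a node with empty \(N(v)\) contributes zero, and a graph consisting of the single edge \(y_w \succ y_l\) collapses to \(-\log \sigma\big(s_\theta(y_w \mid x) - s_\theta(y_l \mid x)\big)\), the standard DPO loss, consistent with the stated special-case recovery.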

What carries the argument

Directed acyclic preference graph with a Plackett-Luce-inspired objective aggregated over neighborhoods, which enforces transitivity and sets intra-layer loss to zero for equivalence classes.

Load-bearing premise

Rollout rankings can be turned into a directed acyclic graph without cycles or irresolvable conflicts, and the neighborhood aggregation supplies unbiased training signals.
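Whether raw pairwise preferences actually admit a DAG is mechanically checkable. A sketch of a standard diagnostic (Kahn's algorithm, not the paper's procedure; function names are hypothetical):

```python
from collections import defaultdict, deque

def has_cycle(num_nodes, edges):
    """Kahn's algorithm: the preference relation over num_nodes responses
    forms a DAG iff a topological order consumes every node."""
    indeg = [0] * num_nodes
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(num_nodes) if indeg[i] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen < num_nodes  # leftover nodes lie on a cycle

def cycle_rate(prompt_edge_lists, num_nodes):
    """Fraction of prompts whose raw pairwise preferences contain a
    cycle -- the DAG-validity diagnostic the premise depends on."""
    flagged = sum(has_cycle(num_nodes, e) for e in prompt_edge_lists)
    return flagged / len(prompt_edge_lists)
```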

What would settle it

Train the same model with GraphDPO and with standard pairwise DPO on identical multi-response datasets, then compare win rates or accuracy on held-out reasoning benchmarks; absence of consistent gains would refute the benefit of graph structure.

Figures

Figures reproduced from arXiv: 2605.08037 by Chuanneng Sun, Kristina Klinkner, Ning Liu, Shervin Malmasi.

Figure 1: GraphDPO pipeline for LLM alignment. For each prompt, the policy samples … (image: figures/full_fig_p004_1.png)
Figure 2: Sensitivity of GraphDPO to the initial anchoring weight (image: figures/full_fig_p017_2.png)
read the original abstract

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Graph Direct Preference Optimization (GraphDPO) as a generalization of DPO for language model alignment. It constructs directed acyclic preference graphs from multiple rollout rankings per prompt, encodes dominance as edges, and optimizes a Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods to enforce transitivity. The method recovers standard DPO when the graph reduces to pairs, uses an equivalence-class layer construction for identical-preference responses (intra-layer edges contribute zero loss), maintains linear per-prompt complexity via log-sum-exp, and optionally anchors with ground-truth solutions under an annealed schedule. Experiments on reasoning and program synthesis tasks report superior performance over pairwise and listwise baselines.

Significance. If the claims hold, GraphDPO offers a principled way to exploit richer multi-rollout preference data without quadratic complexity or loss of transitivity, potentially improving alignment stability and performance on tasks where pairwise DPO discards structure. The explicit reduction to DPO, linear aggregation, and optional oracle anchoring are positive features that could make the approach practical.

major comments (3)
  1. [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.
  2. [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.
  3. [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.
minor comments (2)
  1. [Abstract] The abstract states that the method 'recovers standard DPO as a special case' but does not specify the exact graph configuration (e.g., whether isolated pairs or a collection of pairs) under which the reduction is exact.
  2. [Method] Notation for the Plackett-Luce-inspired objective and the annealing schedule parameters should be introduced with explicit equations rather than descriptive prose only.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below. Revisions have been made to incorporate additional derivations, proofs, pseudocode, and experimental diagnostics as appropriate.

read point-by-point responses
  1. Referee: [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.

    Authors: We agree that an explicit derivation strengthens the central claim. In the revised manuscript we add a formal proof sketch (new Appendix A) showing that the layer construction assigns zero loss to intra-layer edges by definition, so the Plackett-Luce neighborhood aggregation remains unbiased with respect to the observed ranking data. Transitivity is enforced because the topological order of layers precludes cycles from contributing to the gradient. The reduction to standard DPO follows immediately when each equivalence class has size one and the graph contains only pairwise edges. We also note that the DAG assumption is induced from the input rankings; any residual cycles are handled by the layer grouping of ties. revision: yes

  2. Referee: [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.

    Authors: We accept that the original manuscript lacked sufficient detail. The revision now includes the fully expanded loss expression (Equation 4) and a new Algorithm 1 that implements the log-sum-exp aggregation. We prove that the per-prompt cost remains linear in the number of responses because the DAG admits a topological traversal and each neighborhood sum is computed once via a single forward pass of log-sum-exp; no quadratic terms appear even for dense graphs. A brief error analysis bounding the difference from the exact Plackett-Luce likelihood is added in Section 3.3. revision: yes

  3. Referee: [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.

    Authors: This is a fair request for additional rigor. We have added a new subsection (5.4) and Table 3 that report cycle-detection rates (average 2.8 % across tasks) and intransitivity frequencies for the collected rollouts. We also include a controlled sensitivity study in which ranking noise is injected at varying levels; GraphDPO retains its advantage over pairwise DPO even under moderate noise. These diagnostics support that the observed gains are not artifacts of unusually clean data. revision: yes
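The rebuttal's Equation 4 and Algorithm 1 are not reproduced on this page. A minimal sketch of how a layered log-sum-exp aggregation can stay linear per prompt (the layering, scores `s`, and all function names are assumptions, not the authors' code): traverse layers from least to most preferred, maintaining one running log-sum-exp of all strictly dominated scores, so each response is touched a constant number of times.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def layered_pl_loss(layers, s):
    """Plackett-Luce-style loss over a layered preference DAG in O(n).
    layers: equivalence classes ordered most- to least-preferred;
    s: per-response implicit reward (e.g. beta * log-ratio to a reference).
    Ties within a layer never enter each other's normalizer, so
    intra-layer comparisons contribute zero loss."""
    loss = 0.0
    z = -math.inf  # running log-sum-exp of all lower-layer scores
    for layer in reversed(layers):
        for v in layer:
            if z != -math.inf:  # the bottom layer dominates nothing
                loss -= s[v] - logaddexp(s[v], z)
        for v in layer:  # fold this layer in only after scoring it
            z = logaddexp(z, s[v])
    return loss
```

As a sanity check, a two-response chain reduces to `-log(sigmoid(s_w - s_l))`, the standard pairwise DPO loss, and a single all-tied layer yields exactly zero.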

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from standard Plackett-Luce model

full rationale

The paper presents GraphDPO as a direct generalization of DPO by extending the Plackett-Luce ranking model to operate over directed acyclic preference graphs induced by multiple rollouts. The objective aggregates log-sum-exp terms over graph neighborhoods and is explicitly constructed to recover pairwise DPO when the graph degenerates to independent edges; this is a designed special case rather than an input being renamed as output. No load-bearing step relies on self-citation chains, imported uniqueness theorems, or ansatzes from prior author work. The equivalence-class construction for intra-layer zero loss is introduced as a new handling mechanism for sparse signals, not fitted from the target result. The overall derivation chain remains independent of its claimed predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that preference rankings form usable DAGs and on the Plackett-Luce model for aggregating neighborhood supervision; the annealed anchoring schedule introduces at least one tunable hyperparameter.

free parameters (1)
  • annealing schedule parameters
    The optional ground-truth anchoring uses an annealed schedule whose rate and strength are not derived from first principles and must be chosen for training stability.
axioms (2)
  • domain assumption Rollout rankings induce a directed acyclic graph without cycles
    Invoked when the paper states that preference data consists of multiple rollouts inducing rich preference structure that can be represented as DAGs.
  • domain assumption Plackett-Luce model accurately captures dominance relations over graph neighborhoods
    The objective is explicitly Plackett-Luce-inspired and aggregates over graph neighborhoods.
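The ledger's one free parameter can be made concrete. The paper's actual schedule is not specified in this review, so exponential decay toward a floor is an assumed example, with `lam0`, `tau`, and `floor` as the tunable quantities the ledger flags:

```python
import math

def anchoring_weight(step, lam0=1.0, tau=1000.0, floor=0.0):
    """Annealed ground-truth anchoring weight: starts at lam0 and decays
    toward `floor` with time constant tau, gradually relaxing oracle
    supervision. The functional form is an assumption, not the paper's."""
    return floor + (lam0 - floor) * math.exp(-step / tau)
```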

pith-pipeline@v0.9.0 · 5551 in / 1371 out tokens · 49560 ms · 2026-05-11T02:22:33.045353+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

