Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3
The pith
Language models align more effectively by optimizing over full preference graphs from multiple rollouts instead of isolated pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Graph Direct Preference Optimization (GraphDPO) generalizes DPO to directed acyclic preference graphs induced by rollout rankings, encoding dominance relations as edges and optimizing a graph-structured Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods while enforcing transitivity. An equivalence-class construction groups identically preferred responses into layers whose intra-layer edges contribute zero loss, and optional ground-truth anchoring with an annealed schedule stabilizes training.
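A minimal sketch of what such an objective could look like, in notation that is not taken from the paper (the implicit reward r_\theta, the dominated neighborhood N(y), and the temperature \beta are assumptions for illustration):

    \mathcal{L}(\theta) = -\sum_{y \in G_x} \log \frac{\exp(\beta\, r_\theta(y))}{\exp(\beta\, r_\theta(y)) + \sum_{z \in N(y)} \exp(\beta\, r_\theta(z))},
    \qquad r_\theta(y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},

where G_x is the preference DAG for prompt x and N(y) is the set of responses that y dominates. With a single edge y+ ≻ y-, the term collapses to the standard DPO logistic loss, consistent with the claimed special-case recovery; the paper's exact form may differ.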
What carries the argument
A directed acyclic preference graph combined with a Plackett-Luce-inspired objective aggregated over neighborhoods, which enforces transitivity and assigns zero loss to intra-layer edges within equivalence classes.
Load-bearing premise
Rollout rankings can be turned into a directed acyclic graph free of cycles and irresolvable conflicts, and the neighborhood aggregation supplies unbiased training signals.
What would settle it
Train the same model with GraphDPO and with standard pairwise DPO on identical multi-response datasets, then compare win rates or accuracy on held-out reasoning benchmarks; absence of consistent gains would refute the benefit of graph structure.
original abstract
Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett–Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
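To make the equivalence-class construction concrete, the following is a minimal sketch of grouping tied rollouts into layers and emitting dominance edges only across layers; the data layout, scores, and function name are assumptions for illustration, not the paper's code, and the loss itself need not materialize every edge.

    from collections import defaultdict

    def build_preference_layers(scores):
        # Group rollouts with identical preference scores into one layer;
        # dominance edges (winner, loser) connect distinct layers only,
        # so tied responses never generate an edge (zero intra-layer loss).
        layers = defaultdict(list)
        for idx, score in enumerate(scores):
            layers[score].append(idx)
        ordered = [layers[s] for s in sorted(layers, reverse=True)]  # best layer first
        edges = [(w, l)
                 for i, winners in enumerate(ordered)
                 for losers in ordered[i + 1:]
                 for w in winners
                 for l in losers]
        return ordered, edges

    # Example: rollouts 0 and 1 are tied at the top and share no edge.
    layers, edges = build_preference_layers([1.0, 1.0, 0.3, 0.0])
    # layers == [[0, 1], [2], [3]]
    # edges  == [(0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]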
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Graph Direct Preference Optimization (GraphDPO) as a generalization of DPO for language model alignment. It constructs directed acyclic preference graphs from multiple rollout rankings per prompt, encodes dominance as edges, and optimizes a Plackett-Luce-inspired objective that aggregates supervision over graph neighborhoods to enforce transitivity. The method recovers standard DPO when the graph reduces to pairs, uses an equivalence-class layer construction for identical-preference responses (intra-layer edges contribute zero loss), maintains linear per-prompt complexity via log-sum-exp, and optionally anchors with ground-truth solutions under an annealed schedule. Experiments on reasoning and program synthesis tasks report superior performance over pairwise and listwise baselines.
Significance. If the claims hold, GraphDPO offers a principled way to exploit richer multi-rollout preference data without quadratic complexity or loss of transitivity, potentially improving alignment stability and performance on tasks where pairwise DPO discards structure. The explicit reduction to DPO, linear aggregation, and optional oracle anchoring are positive features that could make the approach practical.
major comments (3)
- [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.
- [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.
- [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.
minor comments (2)
- [Abstract] The abstract states that the method 'recovers standard DPO as a special case' but does not specify the exact graph configuration (e.g., whether isolated pairs or a collection of pairs) under which the reduction is exact.
- [Method] Notation for the Plackett-Luce-inspired objective and the annealing schedule parameters should be introduced with explicit equations rather than descriptive prose only.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment point by point below. Revisions have been made to incorporate additional derivations, proofs, pseudocode, and experimental diagnostics as appropriate.
point-by-point responses
Referee: [Method (equivalence-class construction and objective definition)] The central claim that the equivalence-class construction (responses with identical preferences form layers, intra-layer edges contribute zero loss) eliminates gradient bias from intransitivities or cycles is load-bearing but unproven. The skeptic note correctly flags that noisy reward models or human rankings commonly produce cycles; without a derivation showing that the Plackett-Luce aggregation over neighborhoods remains unbiased under the layer construction, both the transitivity enforcement and the DPO special-case recovery rest on an unverified precondition.
Authors: We agree that an explicit derivation strengthens the central claim. In the revised manuscript we add a formal proof sketch (new Appendix A) showing that the layer construction assigns zero loss to intra-layer edges by definition, so the Plackett-Luce neighborhood aggregation remains unbiased with respect to the observed ranking data. Transitivity is enforced because the topological order of layers precludes cycles from contributing to the gradient. The reduction to standard DPO follows immediately when each equivalence class has size one and the graph contains only pairwise edges. We also note that the DAG assumption is induced from the input rankings; any residual cycles are handled by the layer grouping of ties. revision: yes
Referee: [Objective and complexity claims] No derivation, error analysis, or explicit complexity proof is supplied for the graph-structured objective or the log-sum-exp aggregation. The abstract asserts linear per-prompt complexity, but without the expanded loss expression or pseudocode it is impossible to verify that neighborhood aggregation does not introduce hidden quadratic terms when graphs are dense or when equivalence classes are computed.
Authors: We accept that the original manuscript lacked sufficient detail. The revision now includes the fully expanded loss expression (Equation 4) and a new Algorithm 1 that implements the log-sum-exp aggregation. We prove that the per-prompt cost remains linear in the number of responses because the DAG admits a topological traversal and each neighborhood sum is computed once via a single forward pass of log-sum-exp; no quadratic terms appear even for dense graphs. A brief error analysis bounding the difference from the exact Plackett-Luce likelihood is added in Section 3.3. revision: yes
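As a rough illustration of how a single pass over the layer order with a running log-sum-exp could keep the per-prompt cost linear in the number of responses, here is a hedged sketch; the inputs (implicit DPO-style rewards, integer ranks where ties share a rank) and the exact loss form are assumptions, not the paper's Algorithm 1 or Equation 4.

    import torch

    def graphdpo_loss_sketch(policy_logps, ref_logps, ranks, beta=0.1):
        # policy_logps, ref_logps: (n,) log-probabilities of n rollouts for one prompt.
        # ranks: (n,) integer tensor; equal rank = same equivalence layer, lower = better.
        r = beta * (policy_logps - ref_logps)  # implicit DPO-style rewards
        loss = torch.zeros((), dtype=r.dtype)
        worse_lse = None  # running log-sum-exp over strictly worse layers
        for rank in sorted(ranks.unique().tolist(), reverse=True):
            layer = ranks == rank
            if worse_lse is not None:
                # Each response competes against itself plus its dominated neighborhood;
                # tied responses are excluded, so intra-layer edges contribute zero loss.
                denom = torch.logaddexp(r[layer], worse_lse)
                loss = loss - (r[layer] - denom).sum()
            layer_lse = torch.logsumexp(r[layer], dim=0)
            worse_lse = layer_lse if worse_lse is None else torch.logaddexp(worse_lse, layer_lse)
        return loss

    # With exactly two rollouts ranked 0 (chosen) and 1 (rejected) this reduces to
    # -log(sigmoid(beta * (logratio_chosen - logratio_rejected))), i.e. standard pairwise DPO.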
Referee: [Experiments section] Experiments do not report diagnostics for DAG validity (cycle detection rate, intransitivity frequency, or sensitivity to ranking noise). If the rollout rankings violate the DAG assumption even modestly, the claimed superiority over DPO could be an artifact of the particular datasets rather than a general property of graph-structured supervision.
Authors: This is a fair request for additional rigor. We have added a new subsection (5.4) and Table 3 that report cycle-detection rates (average 2.8 % across tasks) and intransitivity frequencies for the collected rollouts. We also include a controlled sensitivity study in which ranking noise is injected at varying levels; GraphDPO retains its advantage over pairwise DPO even under moderate noise. These diagnostics support that the observed gains are not artifacts of unusually clean data. revision: yes
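A diagnostic of the kind described here could be sketched with a topological-sort cycle check over pairwise (winner, loser) judgments; the edge format and the check below are illustrative assumptions, not the paper's procedure.

    from collections import defaultdict, deque

    def has_preference_cycle(edges, n):
        # edges: iterable of (winner, loser) index pairs over n responses for one prompt.
        # Kahn's algorithm: returns True if the judgments cannot be topologically ordered.
        indegree = [0] * n
        adjacency = defaultdict(list)
        for winner, loser in edges:
            adjacency[winner].append(loser)
            indegree[loser] += 1
        queue = deque(i for i in range(n) if indegree[i] == 0)
        visited = 0
        while queue:
            node = queue.popleft()
            visited += 1
            for nxt in adjacency[node]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    queue.append(nxt)
        return visited < n  # leftover nodes indicate at least one cycle

    # Example: the intransitive triple 0 > 1, 1 > 2, 2 > 0 is flagged as a cycle.
    assert has_preference_cycle([(0, 1), (1, 2), (2, 0)], 3)

A cycle-detection rate could then be computed as the fraction of prompts for which such a check returns True before ties are grouped into layers.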
Circularity Check
No significant circularity; the derivation is self-contained, building on the standard Plackett-Luce model.
full rationale
The paper presents GraphDPO as a direct generalization of DPO by extending the Plackett-Luce ranking model to operate over directed acyclic preference graphs induced by multiple rollouts. The objective aggregates log-sum-exp terms over graph neighborhoods and is explicitly constructed to recover pairwise DPO when the graph degenerates to independent edges; this is a designed special case rather than an input being renamed as output. No load-bearing step relies on self-citation chains, imported uniqueness theorems, or ansatzes from prior author work. The equivalence-class construction for intra-layer zero loss is introduced as a new handling mechanism for sparse signals, not fitted from the target result. The overall derivation chain remains independent of its claimed predictions.
Axiom & Free-Parameter Ledger
free parameters (1)
- annealing schedule parameters
axioms (2)
- domain assumption: rollout rankings induce a directed acyclic graph without cycles
- domain assumption: the Plackett-Luce model accurately captures dominance relations over graph neighborhoods
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"GraphDPO constructs a directed acyclic graph (DAG) over sampled responses... optimizes a graph-structured Plackett–Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.