Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Bo Han; Dong Fang; Wei Xue; Yike Guo; Yonggang Zhang; Zhiqin Yang

arxiv: 2605.20834 · v1 · pith:QYVVT3Y4new · submitted 2026-05-20 · 💻 cs.AI · cs.LG

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

Pith reviewed 2026-05-21 04:56 UTC · model grok-4.3

keywords: Direct Preference Optimization · Reinforcement Learning from Human Feedback · AI alignment · preference optimization · Constrained Preference Optimization · conditional equivalence · soft margin ranking

The pith

DPO is equivalent to RLHF only when the optimal policy prefers human-chosen responses, an assumption often violated in practice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the claimed equivalence between DPO and RLHF holds only under an implicit assumption: the RLHF-optimal policy must prefer the responses that humans prefer. This matters in practice because, when the assumption is violated, DPO no longer pushes toward human alignment. It only improves the policy relative to the reference, which can leave a model that still favors the worse answer. The authors diagnose this failure, prove that the two objectives differ, and propose a constrained method that restores alignment.

Core claim

The equivalence of DPO and RLHF is conditional on the RLHF-optimal policy preferring human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. The analysis characterizes violation cases, identifies an undesirable solution space, proves differing objectives, and introduces Constrained Preference Optimization to ensure alignment.
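For reference, the quantities in this claim can be written in margin form. The block below is a sketch built from the standard DPO loss and the standard closed form of the KL-regularized RLHF optimum; it is not taken from the paper. The margins δ_θ and δ_ref match the notation in the passage quoted under "Lean theorems" on this page, while r, Z(x), and δ* are symbols assumed here for illustration.

```latex
\begin{align*}
\delta_\theta &= \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x), \qquad
\delta_{\mathrm{ref}} = \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x), \\
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \left[ \log \sigma\big( \beta\, (\delta_\theta - \delta_{\mathrm{ref}}) \big) \right], \\
\pi^\star(y \mid x) &= \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\big( r(x, y) / \beta \big)}{Z(x)}
  \;\Longrightarrow\;
  \delta^\star = \delta_{\mathrm{ref}} + \frac{r(x, y_w) - r(x, y_l)}{\beta}.
\end{align*}
```

The loss depends on θ only through δ_θ − δ_ref, which is the "relative advantage over the reference policy"; absolute alignment would instead require δ_θ > 0. Under this reading, the implicit assumption amounts to δ* > 0 on the preference pairs.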

What carries the argument

The implicit assumption that the RLHF-optimal policy must prefer human-preferred responses, which determines whether DPO aligns with RLHF or instead optimizes relative advantage over a reference policy.

If this is right

DPO can exhibit pathological convergence to policies that prefer dispreferred responses while still reducing the loss.
An undesirable solution space exists for DPO when the key assumption does not hold.
DPO and RLHF optimize fundamentally different objectives in cases where the assumption fails.
CPO augments RLHF with constraints to achieve provable alignment while preserving implementation simplicity.
DPO implements soft margin ranking with potentially negative targets from a geometric perspective.
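The soft margin ranking view in the last item can be checked numerically. The sketch below assumes the per-pair margin form of the DPO loss, softplus(β(δ_ref − δ_θ)); the function names `dpo_loss` and `hinge_limit` and the sample margins are hypothetical, chosen only for illustration.

```python
import numpy as np

def dpo_loss(delta_theta, delta_ref, beta):
    """Per-pair DPO loss in margin form: softplus(beta * (delta_ref - delta_theta))."""
    z = beta * (np.asarray(delta_ref, dtype=float) - np.asarray(delta_theta, dtype=float))
    return np.logaddexp(0.0, z)  # stable log(1 + exp(z))

def hinge_limit(delta_theta, delta_ref):
    """Margin ranking loss whose target margin is delta_ref."""
    return np.maximum(0.0, np.asarray(delta_ref, dtype=float) - np.asarray(delta_theta, dtype=float))

# Assumed toy values: the reference prefers the dispreferred response (negative target).
delta_ref = -2.0
delta_theta = np.array([-3.0, -1.0, 0.5])

for beta in (1.0, 10.0, 100.0):
    print(beta, dpo_loss(delta_theta, delta_ref, beta) / beta)
print("hinge", hinge_limit(delta_theta, delta_ref))
```

The middle entry is the interesting one: with δ_θ = −1 the hinge term is already zero because the policy clears the negative target δ_ref = −2, yet the policy still assigns more probability to the dispreferred response.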

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might add checks to verify whether the reference policy satisfies the preference assumption before applying DPO.
The soft margin ranking view could guide design of new losses that enforce positive margins for alignment.
Similar conditional analyses may reveal hidden failure modes in other preference-based alignment methods.
CPO could be tested as a drop-in replacement in existing RLHF pipelines to measure gains in robustness.
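A pre-training check of the kind suggested in the first item could look like the sketch below. It assumes the standard closed form of the KL-regularized optimum, under which the optimal policy's log-probability margin equals δ_ref plus the reward gap divided by β. The helper names, the toy arrays, and the availability of a per-pair reward gap are all assumptions; the paper's own violation statistic may be defined differently.

```python
import numpy as np

def optimal_margin(delta_ref, reward_gap, beta):
    """Margin of the KL-regularized optimum: delta_ref + (r_w - r_l) / beta."""
    return np.asarray(delta_ref, dtype=float) + np.asarray(reward_gap, dtype=float) / beta

def violation_rate(delta_ref, reward_gap, beta):
    """Fraction of pairs where the optimum would not prefer the chosen response."""
    return float(np.mean(optimal_margin(delta_ref, reward_gap, beta) <= 0.0))

# Hypothetical per-pair statistics: reference log-prob margins and reward gaps.
delta_ref = np.array([0.4, -1.5, -0.2, -3.0])
reward_gap = np.array([1.0, 0.1, 0.05, 0.2])

rate = violation_rate(delta_ref, reward_gap, beta=0.1)
print(f"violation rate at beta=0.1: {rate:.2f}")
```

In this form, violations concentrate on pairs where the reference strongly prefers the rejected response and the reward gap is small relative to β.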

Load-bearing premise

That the policy which is optimal under RLHF would select responses that humans prefer over those they do not.

What would settle it

A demonstration of a trained policy that achieves low DPO loss but selects dispreferred responses more often than human-preferred ones, or a counterexample where the RLHF optimum does not favor human preferences.
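A toy version of such a demonstration can be sketched with a two-response softmax policy. Everything here is an assumption for illustration (one prompt, two responses, plain gradient descent, the chosen δ_ref, β, learning rate, and step count); it is not the paper's experiment and it does not use the paper's definition of the undesirable solution space.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_toy_dpo(delta_ref=-3.0, beta=5.0, lr=0.1, steps=200):
    """Gradient descent on the per-pair DPO loss for a 2-response softmax policy.

    The policy logits are (theta_w, theta_l), so its log-prob margin is
    delta_theta = theta_w - theta_l. The policy starts at the reference.
    """
    theta_w, theta_l = delta_ref / 2.0, -delta_ref / 2.0
    history = []
    for _ in range(steps):
        delta_theta = theta_w - theta_l
        history.append((softplus(beta * (delta_ref - delta_theta)), delta_theta))
        # dL/d(delta_theta) = -beta * sigmoid(beta * (delta_ref - delta_theta))
        grad = -beta * sigmoid(beta * (delta_ref - delta_theta))
        theta_w -= lr * grad   # d(delta_theta)/d(theta_w) = +1
        theta_l += lr * grad   # d(delta_theta)/d(theta_l) = -1
    delta_theta = theta_w - theta_l
    history.append((softplus(beta * (delta_ref - delta_theta)), delta_theta))
    return history

history = train_toy_dpo()
first_loss, first_delta = history[0]
last_loss, last_delta = history[-1]
print(f"loss {first_loss:.4f} -> {last_loss:.4f}, margin {first_delta:.2f} -> {last_delta:.2f}")
print("still prefers dispreferred response:", last_delta < 0.0)
```

In this setup the loss falls from log 2 to near zero while the margin stays negative, because the gradient vanishes once the policy clears the reference margin. With unlimited steps the margin would keep creeping upward, so the sketch illustrates a finite-training pathology only.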

Figures

Figures reproduced from arXiv: 2605.20834 by Bo Han, Dong Fang, Wei Xue, Yike Guo, Yonggang Zhang, Zhiqin Yang.

**Figure 1.** Measurement of violation frequency on Llama-3-8B-Instruct under Llama3 ultrafeedback armorn. We compute the violation statistics.

**Figure 2.** Fraction of training samples in the undesirable solution space U (Definition 3.3) over training steps under different corruption ratios R ∈ {0.2, 0.3, 0.4}.

Original abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DPO matches RLHF only when the optimal policy already prefers the human-chosen responses, and the paper supplies a proof plus CPO as a constrained fix when that assumption breaks.

The key takeaway is that DPO matches RLHF only when we assume the RLHF-optimal policy already likes the human-preferred answers more than the dispreferred ones. If that does not hold, DPO ends up optimizing something different and can settle on policies that like the wrong responses even as the DPO loss goes down. The authors prove this conditional equivalence, characterize the bad solution space, and introduce Constrained Preference Optimization to add constraints and restore alignment.

What stands out as new is the precise condition for equivalence and the CPO approach, which augments the standard setup without much added complexity. The geometric view through soft margin ranking is also a nice way to see why DPO can have negative targets. They credit the prior DPO work and build directly on it. The paper does a solid job laying out the assumption and showing how it leads to different objectives in some cases. The experiments report that CPO reaches state-of-the-art results on the usual benchmarks, which suggests the fix is at least competitive in practice.

The main soft spot is the lack of a direct test for the claimed pathology. The experiments highlight CPO's wins rather than running DPO in a setting where the assumption is violated and showing the loss decreasing while the policy prefers dispreferred outputs. That demonstration would make the failure mode more convincing. The assumption itself is presented as frequently violated, but more evidence on how common the violation is would strengthen the case.

This paper is aimed at people who train or analyze aligned language models using preference optimization. Anyone thinking about the theory behind DPO or looking for alternatives will get something out of the conditional result and the CPO proposal. It deserves a serious referee because the theoretical point is worth verifying and the method is straightforward enough to be useful if the claims hold up. I would recommend putting it through peer review. The derivations need checking, and the empirical section could use more targeted tests for the failure case, but the overall direction is worth the time.

Referee Report

1 major / 0 minor

Summary. The paper claims that the equivalence between DPO and RLHF is conditional rather than universal, hinging on the implicit assumption that the RLHF-optimal policy prefers human-preferred responses. When violated, DPO optimizes relative advantage over the reference policy instead of absolute alignment, leading to pathological convergence where DPO loss decreases while the policy prefers dispreferred responses. The authors characterize violation conditions, prove differing objectives, introduce Constrained Preference Optimization (CPO) with constraints for provable alignment, offer a geometric soft-margin ranking interpretation, and report CPO achieving state-of-the-art results on standard benchmarks.

Significance. If the conditional equivalence, failure-mode characterization, and CPO guarantees hold with supporting derivations, the work would clarify important limitations in current preference optimization methods for alignment and provide a practical fix that preserves implementation simplicity. The geometric interpretation and explicit handling of the assumption violation could guide refinements in DPO-style algorithms.

major comments (1)

[Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive comments. The observation regarding the need for more direct empirical isolation of the pathological convergence is valid and will be addressed in revision.

Referee: [Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.

Authors: We agree that explicitly demonstrating the pathological convergence under assumption violation would strengthen the practical significance of the failure-mode diagnosis. The manuscript currently provides theoretical characterization and proofs (Sections 3-4) showing DPO optimizes relative advantage rather than absolute alignment when the assumption is violated, along with the existence of undesirable solution spaces. However, the experiments focus on CPO benchmark performance. To address this, we will add a controlled synthetic experiment in the revised manuscript: construct preference datasets violating the RLHF-optimal policy preference assumption, train DPO, and report metrics showing DPO loss decreasing while probability mass shifts toward dispreferred responses. We will also include CPO results under identical conditions for comparison. This addition preserves the paper's core claims while directly illustrating the diagnosed failure mode.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a conditional proof on an explicitly stated assumption

The paper's central derivation establishes conditional equivalence between DPO and RLHF by proving that equivalence holds only when the RLHF-optimal policy prefers human-preferred responses, and that DPO instead optimizes relative advantage (leading to pathological convergence) when the assumption is violated. This is presented as an explicit characterization rather than a self-definitional loop, fitted prediction, or load-bearing self-citation. No equations or claims reduce by construction to inputs from the same data or prior author work; the assumption is named and analyzed as frequently violated in practice, with CPO introduced to enforce alignment. Experiments focus on CPO benchmarks rather than circularly validating the pathology via fitted quantities. The proof chain is self-contained and externally falsifiable via the stated assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one key domain assumption about the RLHF-optimal policy and on the modeling choice that DPO loss can be analyzed via relative advantage; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: The RLHF-optimal policy must prefer human-preferred responses.
Stated explicitly in the abstract as the implicit assumption that is frequently violated and on which equivalence depends.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · unclear

We prove this equivalence is conditional rather than universal, depending on an implicit assumption... the RLHF-optimal policy must prefer human-preferred responses... DPO optimizes relative advantage over the reference policy rather than absolute alignment

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear

DPO is a smooth approximation to margin ranking loss... lim β→∞ (1/β) L_DPO = max(0, δ_ref − δ_θ)
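The limit in this passage follows from a standard softplus identity. The derivation below is a sketch that assumes the per-pair margin form of the DPO loss; the shorthand m = δ_ref − δ_θ is introduced here and is not the paper's notation.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma\big( \beta (\delta_\theta - \delta_{\mathrm{ref}}) \big)
  = \log\big( 1 + e^{\beta m} \big), \qquad m = \delta_{\mathrm{ref}} - \delta_\theta, \\
\frac{1}{\beta}\, \mathcal{L}_{\mathrm{DPO}}
  &= \max(0, m) + \frac{1}{\beta} \log\big( 1 + e^{-\beta |m|} \big), \\
0 &< \frac{1}{\beta} \log\big( 1 + e^{-\beta |m|} \big) \le \frac{\log 2}{\beta}
  \;\xrightarrow[\beta \to \infty]{}\; 0
  \quad\Longrightarrow\quad
  \lim_{\beta \to \infty} \frac{1}{\beta}\, \mathcal{L}_{\mathrm{DPO}} = \max(0,\, \delta_{\mathrm{ref}} - \delta_\theta).
\end{align*}
```

The target margin of the resulting ranking loss is δ_ref, so the target is negative whenever the reference policy assigns more probability to the dispreferred response.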

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

