Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Pith reviewed 2026-05-21 04:56 UTC · model grok-4.3
The pith
DPO is equivalent to RLHF only when the optimal policy prefers human-chosen responses, an assumption often violated in practice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The equivalence of DPO and RLHF is conditional on the RLHF-optimal policy preferring human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. The analysis characterizes violation cases, identifies an undesirable solution space, proves differing objectives, and introduces Constrained Preference Optimization to ensure alignment.
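To make the failure mode concrete, here is a minimal single-pair sketch. It assumes the standard DPO loss -log sigmoid(beta * (delta_theta - delta_ref)), where each delta is the log-probability gap between the chosen and the rejected response. The function name and all numbers are illustrative assumptions, not values from the paper.

```python
import math

def dpo_pair_loss(delta_theta, delta_ref, beta):
    """DPO loss for one preference pair.

    delta_theta: log pi_theta(y_w|x) - log pi_theta(y_l|x)
    delta_ref:   same gap under the reference policy
    """
    z = beta * (delta_theta - delta_ref)
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

beta = 0.1        # assumed temperature
delta_ref = -3.0  # assumed: the reference strongly favors the rejected response

loss_at_ref = dpo_pair_loss(delta_ref, delta_ref, beta)  # policy == reference
loss_after = dpo_pair_loss(-1.0, delta_ref, beta)        # policy gap moved to -1.0
still_prefers_rejected = -1.0 < 0.0

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss after update: {loss_after:.4f}")
print(f"policy still prefers the rejected response: {still_prefers_rejected}")
```

In this toy setting the loss falls below its starting value of log 2 even though the policy gap stays negative, which is the "relative advantage over the reference" behavior described above.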
What carries the argument
The implicit assumption that the RLHF-optimal policy must prefer human-preferred responses, which determines whether DPO aligns with RLHF or instead optimizes relative advantage over a reference policy.
If this is right
- DPO can exhibit pathological convergence to policies that prefer dispreferred responses while still reducing the loss.
- An undesirable solution space exists for DPO when the key assumption does not hold.
- DPO and RLHF optimize fundamentally different objectives in cases where the assumption fails.
- CPO augments RLHF with constraints to achieve provable alignment while preserving implementation simplicity.
- DPO implements soft margin ranking with potentially negative targets from a geometric perspective.
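The soft margin ranking view in the last bullet can be checked numerically. The sketch below assumes the per-pair DPO loss softplus(beta * (delta_ref - delta_theta)) and compares (1/beta) times that loss with the hinge max(0, delta_ref - delta_theta) as beta grows. The helper names and the sample values are assumptions for illustration.

```python
import math

def softplus(x):
    # log(1 + exp(x)) in a numerically stable form
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def scaled_dpo_loss(delta_theta, delta_ref, beta):
    # (1/beta) * L_DPO for one pair
    return softplus(beta * (delta_ref - delta_theta)) / beta

def hinge(delta_theta, delta_ref):
    # margin ranking loss whose target margin is delta_ref
    return max(0.0, delta_ref - delta_theta)

delta_ref = -2.0  # assumed negative target margin
gaps = {}
for delta_theta in (-3.0, -1.0):
    for beta in (1.0, 10.0, 100.0):
        gaps[(delta_theta, beta)] = abs(
            scaled_dpo_loss(delta_theta, delta_ref, beta)
            - hinge(delta_theta, delta_ref)
        )

for key, gap in sorted(gaps.items()):
    print(key, f"{gap:.3e}")

# With a negative target, the hinge is already zero at delta_theta = -1.0,
# although that policy still ranks the rejected response higher.
hinge_at_negative_gap = hinge(-1.0, delta_ref)
```

The gap shrinks as beta increases, and the last line illustrates why a negative target matters: the margin constraint can be fully satisfied by a policy whose own gap is still negative.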
Where Pith is reading between the lines
- Developers might add checks to verify whether the reference policy satisfies the preference assumption before applying DPO.
- The soft margin ranking view could guide design of new losses that enforce positive margins for alignment.
- Similar conditional analyses may reveal hidden failure modes in other preference-based alignment methods.
- CPO could be tested as a drop-in replacement in existing RLHF pipelines to measure gains in robustness.
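One way to read the first bullet is as a pre-training audit of reference margins. The sketch below is a hypothetical check, not something taken from the paper or its code: it assumes you already have per-pair sequence log-probabilities under the reference policy and reports how many pairs have a negative gap. The function name, the report keys, and the sample numbers are all made up for illustration.

```python
def reference_margin_report(ref_logp_chosen, ref_logp_rejected):
    """Summarize how often the reference policy ranks the rejected response higher."""
    if len(ref_logp_chosen) != len(ref_logp_rejected):
        raise ValueError("chosen and rejected lists must have the same length")
    if not ref_logp_chosen:
        raise ValueError("need at least one preference pair")
    # margin > 0 means the reference already prefers the human-chosen response
    margins = [c - r for c, r in zip(ref_logp_chosen, ref_logp_rejected)]
    n_negative = sum(1 for m in margins if m < 0)
    return {
        "n_pairs": len(margins),
        "violating_fraction": n_negative / len(margins),
        "min_margin": min(margins),
    }

# Hypothetical log-probabilities for four preference pairs
ref_logp_chosen = [-12.0, -8.5, -20.0, -5.0]
ref_logp_rejected = [-10.0, -9.0, -15.0, -7.5]

report = reference_margin_report(ref_logp_chosen, ref_logp_rejected)
print(report)
```

A high violating fraction would only be a warning sign. The paper's assumption concerns the RLHF-optimal policy, so this reference-side check is a proxy rather than a test of the assumption itself.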
Load-bearing premise
That the policy which is optimal under RLHF would select responses that humans prefer over those they do not.
What would settle it
A demonstration of a trained policy that achieves low DPO loss but selects dispreferred responses more often than human-preferred ones, or a counterexample where the RLHF optimum does not favor human preferences.
Original abstract
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the equivalence between DPO and RLHF is conditional rather than universal, hinging on the implicit assumption that the RLHF-optimal policy prefers human-preferred responses. When violated, DPO optimizes relative advantage over the reference policy instead of absolute alignment, leading to pathological convergence where DPO loss decreases while the policy prefers dispreferred responses. The authors characterize violation conditions, prove differing objectives, introduce Constrained Preference Optimization (CPO) with constraints for provable alignment, offer a geometric soft-margin ranking interpretation, and report CPO achieving state-of-the-art results on standard benchmarks.
Significance. If the conditional equivalence, failure-mode characterization, and CPO guarantees hold with supporting derivations, the work would clarify important limitations in current preference optimization methods for alignment and provide a practical fix that preserves implementation simplicity. The geometric interpretation and explicit handling of the assumption violation could guide refinements in DPO-style algorithms.
major comments (1)
- [Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. The observation regarding the need for more direct empirical isolation of the pathological convergence is valid and will be addressed in revision.
Referee: [Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.
Authors: We agree that explicitly demonstrating the pathological convergence under assumption violation would strengthen the practical significance of the failure-mode diagnosis. The manuscript currently provides theoretical characterization and proofs (Sections 3-4) showing DPO optimizes relative advantage rather than absolute alignment when the assumption is violated, along with the existence of undesirable solution spaces. However, the experiments focus on CPO benchmark performance. To address this, we will add a controlled synthetic experiment in the revised manuscript: construct preference datasets violating the RLHF-optimal policy preference assumption, train DPO, and report metrics showing DPO loss decreasing while probability mass shifts toward dispreferred responses. We will also include CPO results under identical conditions for comparison. This addition preserves the paper's core claims while directly illustrating the diagnosed failure mode.
Revision: yes
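A heavily simplified version of the proposed synthetic experiment can be sketched as follows. This is an assumption-laden toy, not the authors' protocol: it treats the policy's log-probability gap delta_theta as a single free parameter, starts it at a strongly negative reference gap, and runs plain gradient descent on the standard per-pair DPO loss. All hyperparameters are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(delta_theta, delta_ref, beta):
    # per-pair DPO loss: -log sigmoid(beta * (delta_theta - delta_ref))
    return -math.log(sigmoid(beta * (delta_theta - delta_ref)))

def dpo_grad(delta_theta, delta_ref, beta):
    # derivative of the loss with respect to delta_theta
    return -beta * (1.0 - sigmoid(beta * (delta_theta - delta_ref)))

beta, lr, steps = 0.1, 1.0, 50  # assumed hyperparameters
delta_ref = -5.0                # assumed: reference favors the rejected response
delta_theta = delta_ref         # the policy is initialized at the reference

history = []  # (loss, delta_theta) recorded before each update
for _ in range(steps):
    history.append((dpo_loss(delta_theta, delta_ref, beta), delta_theta))
    delta_theta -= lr * dpo_grad(delta_theta, delta_ref, beta)

first_loss, _ = history[0]
last_loss, last_gap = history[-1]
print(f"loss: {first_loss:.4f} -> {last_loss:.4f}")
print(f"policy gap after training: {last_gap:.3f}")
```

Under these assumptions the loss decreases at every step while the gap never becomes positive within the training budget, which mirrors the metric the authors say they will report. A real experiment would train an actual policy on constructed preference data rather than a scalar.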
Circularity Check
No significant circularity; derivation is a conditional proof on an explicitly stated assumption
The paper's central derivation establishes conditional equivalence between DPO and RLHF by proving that equivalence holds only when the RLHF-optimal policy prefers human-preferred responses, and that DPO instead optimizes relative advantage (leading to pathological convergence) when the assumption is violated. This is presented as an explicit characterization rather than a self-definitional loop, fitted prediction, or load-bearing self-citation. No equations or claims reduce by construction to inputs from the same data or prior author work; the assumption is named and analyzed as frequently violated in practice, with CPO introduced to enforce alignment. Experiments focus on CPO benchmarks rather than circularly validating the pathology via fitted quantities. The proof chain is self-contained and externally falsifiable via the stated assumption.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The RLHF-optimal policy must prefer human-preferred responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We prove this equivalence is conditional rather than universal, depending on an implicit assumption... the RLHF-optimal policy must prefer human-preferred responses... DPO optimizes relative advantage over the reference policy rather than absolute alignment
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: DPO is a smooth approximation to margin ranking loss... lim β→∞ (1/β) L_DPO = max(0, δ_ref − δ_θ)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.