arxiv: 2605.20834 · v1 · pith:QYVVT3Y4new · submitted 2026-05-20 · 💻 cs.AI · cs.LG

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Pith reviewed 2026-05-21 04:56 UTC · model grok-4.3

keywords: Direct Preference Optimization · Reinforcement Learning from Human Feedback · AI alignment · preference optimization · Constrained Preference Optimization · conditional equivalence · soft margin ranking

The pith

DPO is equivalent to RLHF only when the optimal policy prefers human-chosen responses, an assumption often violated in practice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the claimed equivalence between DPO and RLHF is not universal: it holds only under the implicit assumption that the RLHF-optimal policy prefers the responses humans prefer. This matters because, when the assumption is violated in practice, DPO no longer pushes toward human alignment. It only improves the policy relative to the reference, which can yield models that still favor the dispreferred answers. The authors diagnose this failure, prove that the two objectives differ, and propose a constrained method that restores provable alignment.

Core claim

The equivalence of DPO and RLHF is conditional on the RLHF-optimal policy preferring human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. The analysis characterizes violation cases, identifies an undesirable solution space, proves differing objectives, and introduces Constrained Preference Optimization to ensure alignment.
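
To make the "relative advantage" point concrete, here is a minimal numeric sketch using the standard per-pair DPO loss. The log-probabilities and `beta` are hypothetical values chosen for illustration, not numbers from the paper. The reference policy is assumed to strongly prefer the rejected response, so a policy can lower the loss just by being less wrong than the reference.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    z = beta * (policy_margin - ref_margin)
    # softplus(-z) equals -log(sigmoid(z)); this form is numerically stable
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical sequence log-probabilities (assumption): the reference
# assigns much more mass to the rejected response y_l than to y_w.
ref_logp_w, ref_logp_l = -12.0, -4.0   # reference margin = -8

# Loss when the policy equals the reference (implicit margin 0, loss = log 2).
loss_at_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)

# A policy that narrows the gap but still ranks y_l above y_w.
logp_w, logp_l = -9.0, -5.0            # policy margin = -4
loss_after = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss after update: {loss_after:.4f}")
print("policy still prefers rejected response:", logp_w < logp_l)
```

The loss falls below log 2 even though the policy continues to assign more probability to the rejected response, which is the kind of behavior the core claim describes.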

What carries the argument

The implicit assumption that the RLHF-optimal policy must prefer human-preferred responses, which determines whether DPO aligns with RLHF or instead optimizes relative advantage over a reference policy.

If this is right

  • DPO can exhibit pathological convergence to policies that prefer dispreferred responses while still reducing the loss.
  • An undesirable solution space exists for DPO when the key assumption does not hold.
  • DPO and RLHF optimize fundamentally different objectives in cases where the assumption fails.
  • CPO augments RLHF with constraints to achieve provable alignment while preserving implementation simplicity.
  • DPO implements soft margin ranking with potentially negative targets from a geometric perspective.
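
One way to read the soft margin ranking bullet is to rewrite the standard DPO loss in terms of log-probability margins. The notation below (the symbols \Delta_\theta and \Delta_{\mathrm{ref}}) is introduced here for illustration and may not match the paper's own formulation.

```latex
\[
\Delta_\theta = \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x),
\qquad
\Delta_{\mathrm{ref}} = \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)
\]
\[
\mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l)
= -\log \sigma\!\big(\beta\,(\Delta_\theta - \Delta_{\mathrm{ref}})\big)
= \log\!\big(1 + e^{-\beta(\Delta_\theta - \Delta_{\mathrm{ref}})}\big)
\]
```

The right-hand side is a softplus, a smooth surrogate for the hinge \beta \max(0, \Delta_{\mathrm{ref}} - \Delta_\theta), so \Delta_{\mathrm{ref}} plays the role of the target margin. When the reference prefers the rejected response, \Delta_{\mathrm{ref}} < 0, and the loss drops below \log 2 as soon as \Delta_\theta > \Delta_{\mathrm{ref}}, even while \Delta_\theta remains negative.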

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might add checks to verify whether the reference policy satisfies the preference assumption before applying DPO.
  • The soft margin ranking view could guide design of new losses that enforce positive margins for alignment.
  • Similar conditional analyses may reveal hidden failure modes in other preference-based alignment methods.
  • CPO could be tested as a drop-in replacement in existing RLHF pipelines to measure gains in robustness.
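
A pre-training check of the kind suggested in the first bullet could look like the sketch below. It relies on the standard closed form of the KL-regularized RLHF optimum, π*(y|x) ∝ π_ref(y|x) · exp(r(x, y)/β), under which π* prefers the chosen response exactly when the reference log-probability gap plus the scaled reward gap is positive. The function names, the tuple layout, and the example numbers are hypothetical, and the paper's own violation statistic may be defined differently.

```python
def rlhf_optimum_prefers_chosen(ref_logp_w, ref_logp_l, reward_w, reward_l, beta):
    """True if pi*(y_w|x) > pi*(y_l|x) for pi* proportional to pi_ref * exp(r / beta)."""
    margin = (ref_logp_w - ref_logp_l) + (reward_w - reward_l) / beta
    return margin > 0

def violation_rate(pairs, beta):
    """Fraction of pairs where the RLHF-optimal policy would prefer the rejected response.

    Each pair is (ref_logp_w, ref_logp_l, reward_w, reward_l); rewards are assumed
    to come from some external reward model (an assumption of this sketch).
    """
    violations = sum(not rlhf_optimum_prefers_chosen(*p, beta) for p in pairs)
    return violations / len(pairs)

# Hypothetical data: the second pair has a reference so biased toward the
# rejected response that the reward gap cannot compensate at this beta.
pairs = [
    (-10.0, -12.0, 1.0, 0.5),
    (-15.0, -5.0, 1.0, 0.5),
    (-8.0, -8.5, 0.2, 0.1),
]
rate = violation_rate(pairs, beta=0.1)
print(f"violation rate: {rate:.3f}")
```

A nonzero rate on a real dataset would suggest that the conditional equivalence cannot be taken for granted for that reference model and reward signal.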

Load-bearing premise

That the policy which is optimal under RLHF would select responses that humans prefer over those they do not.

What would settle it

A demonstration of a trained policy that achieves low DPO loss but selects dispreferred responses more often than human-preferred ones, or a counterexample where the RLHF optimum does not favor human preferences.

Figures

Figures reproduced from arXiv: 2605.20834 by Bo Han, Dong Fang, Wei Xue, Yike Guo, Yonggang Zhang, Zhiqin Yang.

Figure 1: Measurement of violation frequency on Llama-3-8B-Instruct under Llama3 ultrafeedback armorn. We compute the violation statistics.
Figure 2: Fraction of training samples in the undesirable solution space U (Definition 3.3) over training steps under different corruption ratios R ∈ {0.2, 0.3, 0.4}.

Original abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that the equivalence between DPO and RLHF is conditional rather than universal, hinging on the implicit assumption that the RLHF-optimal policy prefers human-preferred responses. When violated, DPO optimizes relative advantage over the reference policy instead of absolute alignment, leading to pathological convergence where DPO loss decreases while the policy prefers dispreferred responses. The authors characterize violation conditions, prove differing objectives, introduce Constrained Preference Optimization (CPO) with constraints for provable alignment, offer a geometric soft-margin ranking interpretation, and report CPO achieving state-of-the-art results on standard benchmarks.

Significance. If the conditional equivalence, failure-mode characterization, and CPO guarantees hold with supporting derivations, the work would clarify important limitations in current preference optimization methods for alignment and provide a practical fix that preserves implementation simplicity. The geometric interpretation and explicit handling of the assumption violation could guide refinements in DPO-style algorithms.

major comments (1)
  1. [Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive comments. The observation regarding the need for more direct empirical isolation of the pathological convergence is valid and will be addressed in revision.

  1. Referee: [Experiments] Experiments section: the reported results focus on CPO's SOTA performance but do not isolate or directly demonstrate the claimed pathological convergence (DPO loss decreasing while assigning higher probability to dispreferred responses) under explicit violation of the RLHF-optimal policy preferring human-preferred responses. This demonstration is load-bearing for the practical significance of the failure-mode diagnosis.

    Authors: We agree that explicitly demonstrating the pathological convergence under assumption violation would strengthen the practical significance of the failure-mode diagnosis. The manuscript currently provides theoretical characterization and proofs (Sections 3-4) showing DPO optimizes relative advantage rather than absolute alignment when the assumption is violated, along with the existence of undesirable solution spaces. However, the experiments focus on CPO benchmark performance. To address this, we will add a controlled synthetic experiment in the revised manuscript: construct preference datasets violating the RLHF-optimal policy preference assumption, train DPO, and report metrics showing DPO loss decreasing while probability mass shifts toward dispreferred responses. We will also include CPO results under identical conditions for comparison. This addition preserves the paper's core claims while directly illustrating the diagnosed failure mode. revision: yes
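
The controlled experiment described in the response could be set up along the following lines. This is a sketch under stated assumptions: corruption is modeled as swapping chosen and rejected labels in a fixed fraction of pairs, and the diagnostic counts pairs whose DPO loss is below log 2 while the policy still ranks the rejected response higher. Both `corrupt_pairs` and `pathology_rate` are hypothetical helpers, not functions from the authors' repository, and the authors' actual construction may differ.

```python
import random

def corrupt_pairs(pairs, ratio, seed=0):
    """Swap (chosen, rejected) in a `ratio` fraction of pairs (label flipping is an assumption)."""
    rng = random.Random(seed)
    n_flip = round(ratio * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [(l, w) if i in flip_idx else (w, l) for i, (w, l) in enumerate(pairs)]

def pathology_rate(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Fraction of pairs with DPO loss below log 2 while the policy prefers the rejected response."""
    count = 0
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        implicit_margin = beta * ((pw - pl) - (rw - rl))
        # The per-pair loss -log sigmoid(implicit_margin) is below log 2 iff the margin is positive.
        if implicit_margin > 0 and pw < pl:
            count += 1
    return count / len(logp_w)

# Toy dataset of (chosen, rejected) response identifiers.
clean = [(f"good_{i}", f"bad_{i}") for i in range(10)]
corrupted = corrupt_pairs(clean, ratio=0.3)
n_flipped = sum(1 for w, _ in corrupted if w.startswith("bad"))

# Hypothetical log-probabilities for two pairs, logged at some training step.
rate = pathology_rate(
    logp_w=[-9.0, -3.0], logp_l=[-5.0, -6.0],
    ref_logp_w=[-12.0, -4.0], ref_logp_l=[-4.0, -5.0],
)
print(n_flipped, rate)
```

Logging `pathology_rate` alongside the mean DPO loss at each step, for both DPO and CPO, would give the side-by-side comparison the response promises.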

Circularity Check

0 steps flagged

No significant circularity; derivation is a conditional proof on an explicitly stated assumption

The paper's central derivation establishes conditional equivalence between DPO and RLHF by proving that equivalence holds only when the RLHF-optimal policy prefers human-preferred responses, and that DPO instead optimizes relative advantage (leading to pathological convergence) when the assumption is violated. This is presented as an explicit characterization rather than a self-definitional loop, fitted prediction, or load-bearing self-citation. No equations or claims reduce by construction to inputs from the same data or prior author work; the assumption is named and analyzed as frequently violated in practice, with CPO introduced to enforce alignment. Experiments focus on CPO benchmarks rather than circularly validating the pathology via fitted quantities. The proof chain is self-contained and externally falsifiable via the stated assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one key domain assumption about the RLHF-optimal policy and on the modeling choice that DPO loss can be analyzed via relative advantage; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The RLHF-optimal policy must prefer human-preferred responses.
    Stated explicitly in the abstract as the implicit assumption that is frequently violated and on which equivalence depends.

pith-pipeline@v0.9.0 · 5753 in / 1276 out tokens · 26316 ms · 2026-05-21T04:56:33.784290+00:00 · methodology
