arxiv: 2602.06239 · v2 · submitted 2026-02-05 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla , Emanuele Nevali , Luca Viano , Volkan Cevher

Authors on Pith no claims yet

Pith reviewed 2026-05-16 06:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords direct preference optimizationover-optimizationpessimistic ensemblesample complexitypreference learningtabular setting

0 comments

The pith

PEPO mitigates DPO over-optimization by achieving sample complexity bounds that depend only on single-policy concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PEPO, a single-step algorithm similar to Direct Preference Optimization that trains an ensemble of policies on disjoint data subsets and aggregates them through a worst-case construction favoring agreement across models. This produces pessimism that mitigates over-optimization without requiring knowledge of the data-generating distribution or learning an explicit reward model. In the tabular setting the approach delivers sample complexity guarantees that scale with a single-policy concentrability coefficient rather than the stricter all-policy concentrability that affects DPO. The method keeps the simplicity of DPO-style training while empirical results support the theoretical claims.

Core claim

By training preference-optimized policies on disjoint subsets and aggregating them pessimistically via worst-case construction, PEPO attains sample complexity guarantees depending only on a single-policy concentrability coefficient, thereby avoiding the all-policy concentrability required by algorithms prone to over-optimization such as DPO.

What carries the argument

Pessimistic ensemble aggregation of DPO policies trained on disjoint data subsets, using worst-case construction that favors agreement across models.

If this is right

Sample complexity scales only with single-policy concentrability rather than all-policy concentrability.
No explicit reward model or data distribution knowledge is required.
Single-step training simplicity of DPO is preserved.
Empirical performance remains competitive while theoretical guarantees improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The disjoint-subset ensemble idea may extend to continuous or high-dimensional settings through approximate aggregation.
Partitioning preference data could serve as a practical substitute for explicit pessimism penalties in large-scale preference tuning.
Similar constructions might apply to other single-step preference methods beyond DPO.

Load-bearing premise

An ensemble of policies trained on disjoint subsets can be aggregated via worst-case construction to produce pessimism without access to the data-generating distribution or an explicit reward model.

What would settle it

An experiment in a tabular MDP showing that PEPO's performance degrades to the level of DPO when single-policy concentrability is low, or that the ensemble fails to prevent over-optimization on held-out preference data.

Figures

Figures reproduced from arXiv: 2602.06239 by Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher.

**Figure 2.** Figure 2: Experiment in 3 arms bandit setting for the situation of known and unknown πdata. experimental setting where the preference dataset is generated directly from πref (i.e., πdata = πref), as this would restore the conditions under which χ 2PO and SFT+DPO are theoretically sound. Known vs unknown πdata in a toy environment. For our toy experiment, we consider an empty prompt and 3 possible actions. The prefer… view at source ↗

**Figure 3.** Figure 3: Ablation for L in a bandit setting. A.2 Details for the experiments in the controlled setting For Figure 2a, we used πdata = πref = [0.04, 0.93, 0.03]. That is, the largest mass is on the central action, which is the optimal one. This makes the optimal action well-covered, i.e. small C ⋆ , therefore χ 2PO and PEPO can work reliably. At the same time, the all-policy concentrability is large because the two … view at source ↗

**Figure 4.** Figure 4: Result in the controlled setting without regularization, i.e. with [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 6.** Figure 6: Illustration of the tie probability (missing mass) resulting from a right shifting of the sigmoid [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

read the original abstract

We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring the knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction that favors the agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by a convincing practical performance, while retaining the simplicity and the practicality of DPO-style training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PEPO's ensemble pessimism is a clean practical tweak on DPO, but the single-policy concentrability guarantee probably still needs an extra disagreement bound to hold up.

read the letter

The core move is training separate DPO policies on disjoint data chunks and then aggregating them with a worst-case rule that favors agreement. That construction is new in the DPO literature and keeps the method as simple as standard DPO while trying to inject pessimism without an explicit reward model or knowledge of the data distribution. The tabular sample-complexity claim is the main theoretical hook: they say the bound depends only on single-policy concentrability instead of the all-policy version that usually shows up in over-optimization analyses. Experiments are described as corroborating this, which is worth checking if the numbers are clean. The soft spot is exactly the one the stress-test flags. When the ensemble members disagree on low-coverage state-action pairs, the worst-case selection can effectively optimize over the union of their supports, which reintroduces something close to all-policy concentrability unless they prove a uniform bound on how much the policies can diverge. The abstract does not give that bound or the precise aggregation operator, so the reduction is not yet verified. If the full proof closes this gap with a clean argument, the result strengthens; otherwise the guarantee is weaker than stated. This paper is for people who run preference tuning at scale and want a drop-in change that might reduce over-optimization without extra machinery. It deserves a serious referee because the algorithmic idea is straightforward to implement and the concentrability question is central, even if the current write-up leaves the key step under-specified.

Referee Report

2 major / 2 minor

Summary. The paper introduces PEPO, a single-step DPO-style algorithm that mitigates over-optimization in preference learning by training an ensemble of policies on disjoint data subsets and aggregating them via a worst-case construction that favors model agreement. In the tabular setting, it claims sample-complexity guarantees that depend only on single-policy concentrability coefficients, avoiding the all-policy concentrability required by standard DPO. The approach requires neither knowledge of the data-generating distribution nor an explicit reward model, and the authors report that experiments corroborate the theory while preserving DPO's practicality.

Significance. If the concentrability reduction holds, the result would be significant for preference optimization methods: it supplies a practical, distribution-free mechanism for pessimism that weakens a key assumption limiting theoretical guarantees in DPO and related algorithms. The retention of single-step training and the explicit tabular bound are strengths; the ensemble construction for pessimism without reward modeling is a clean idea that could influence follow-up work on robust RLHF.

major comments (2)

[§4.2, Theorem 1] §4.2, Theorem 1 (tabular sample-complexity bound): the proof that worst-case aggregation yields a bound depending solely on single-policy concentrability is not fully explicit. When ensemble members disagree on low-coverage state-action pairs, the worst-case selection can effectively optimize over the union of supports; without a separate uniform bound on policy disagreement (or an explicit definition of the aggregation operator), the reduction from all-policy to single-policy concentrability does not automatically follow and requires additional argument.
[§3.2] §3.2 (algorithm definition): the precise mathematical form of the 'worst-case construction that favors agreement across models' is not stated as an equation or pseudocode step. This makes it impossible to verify whether the aggregation operator preserves the single-policy concentrability property claimed in the tabular analysis.

minor comments (2)

[Experiments section] The abstract states that 'tabular sample-complexity results exist and are corroborated by experiments,' yet the experimental section provides only high-level performance curves without reporting the precise concentrability coefficients measured or the exact ensemble size and subset sizes used.
[§3] Notation for the ensemble policies and the aggregation operator should be introduced once in §3 and used consistently thereafter to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We are glad that the significance of the concentrability reduction and the practical appeal of PEPO are recognized. We have revised the manuscript to address both major comments by supplying the missing explicit mathematical definitions and expanding the proof with the required intermediate arguments.

read point-by-point responses

Referee: [§4.2, Theorem 1] §4.2, Theorem 1 (tabular sample-complexity bound): the proof that worst-case aggregation yields a bound depending solely on single-policy concentrability is not fully explicit. When ensemble members disagree on low-coverage state-action pairs, the worst-case selection can effectively optimize over the union of supports; without a separate uniform bound on policy disagreement (or an explicit definition of the aggregation operator), the reduction from all-policy to single-policy concentrability does not automatically follow and requires additional argument.

Authors: We agree that the original proof sketch was too concise and left the handling of disagreement implicit. In the revision we have inserted a new Lemma 3 that bounds the total-variation disagreement between any pair of ensemble members by a term controlled solely by the single-policy concentrability coefficient (using the fact that the policies are trained on disjoint data subsets). The worst-case aggregation operator is now defined explicitly as selecting, for each state, the action whose minimum estimated value across the ensemble is taken; this choice penalizes disagreement on low-coverage pairs rather than allowing optimization over their union. The revised proof of Theorem 1 chains the new lemma directly into the concentrability argument, yielding the claimed single-policy bound. The full expanded proof appears in the main text with supporting calculations moved to the appendix. revision: yes
Referee: [§3.2] §3.2 (algorithm definition): the precise mathematical form of the 'worst-case construction that favors agreement across models' is not stated as an equation or pseudocode step. This makes it impossible to verify whether the aggregation operator preserves the single-policy concentrability property claimed in the tabular analysis.

Authors: We accept the criticism that the aggregation step was described only in prose. Section 3.2 now contains the explicit definition: given an ensemble of M policies {π_i} trained on disjoint subsets, the aggregated policy is obtained by π_PEPO(·|s) = argmin_i Q̂_i(s,·) where Q̂_i denotes the implicit value estimate induced by each DPO-trained policy. We have also added Algorithm 1 (pseudocode) that lists the three stages—disjoint subset partitioning, independent DPO training, and pointwise worst-case selection—together with the precise tie-breaking rule that favors agreement. With this operator made formal, the single-policy concentrability preservation follows immediately from the new Lemma 3 mentioned above. revision: yes

Circularity Check

0 steps flagged

No significant circularity; sample bound derived from new ensemble construction

full rationale

The paper defines PEPO as an explicit algorithmic procedure (ensemble policies on disjoint subsets, aggregated by worst-case construction favoring agreement) and then states a sample-complexity result for the tabular case that depends only on single-policy concentrability. No equation or claim in the abstract or description reduces this bound to a fitted parameter defined by the algorithm itself, nor does the central guarantee rest on a self-citation chain or ansatz smuggled from prior work. The derivation is therefore self-contained: the claimed reduction from all-policy to single-policy concentrability is presented as a consequence of the new construction rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP assumptions for the tabular setting plus the domain assumption that disjoint data subsets suffice to induce effective pessimism via worst-case aggregation. No free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption Tabular MDP with finite states and actions
Required for the stated sample-complexity result to hold with concentrability coefficients.
domain assumption Disjoint data subsets allow construction of a pessimistic ensemble without knowledge of the data-generating distribution
This is the key modeling assumption that enables the single-policy concentrability bound.

pith-pipeline@v0.9.0 · 5437 in / 1386 out tokens · 24065 ms · 2026-05-16T06:32:34.967797+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we choose the worst case reward as r−(x,a)=β minℓ log(π̃ℓ(a|x)/πref(a|x))−B ptie(x,a)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

[1]

Design considerations in offline preference-based rl

Alekh Agarwal, Christoph Dann, and Teodor V Marinov. Design considerations in offline preference-based rl. arXiv preprint arXiv:2502.06861,

work page arXiv
[2]

Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

work page arXiv
[3]

Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, and Bo Dai. Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

work page arXiv
[4]

On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, and Bill Byrne. On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

work page arXiv
[5]

AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

Mingyu Chen, Yiding Chen, Wen Sun, and Xuezhou Zhang. AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

work page arXiv
[6]

UCB Exploration via Q-Ensembles

Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles.arXiv preprint arXiv:1706.01502,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

work page arXiv
[8]

Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

work page arXiv
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Fei Ding, Baiqiao Wang, Zijian Zeng, and Youwei Wang. Multi-layer grpo: Enhancing reasoning and self-correction in large language models.arXiv preprint arXiv:2506.04746,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

work page arXiv
[11]

Learn your reference model for real good alignment

13 Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656,

work page arXiv
[12]

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan

URLhttps://arxiv.org/abs/2402.04792. Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan. Multi-preference optimization: Generalizing dpo via set-level contrasts.arXiv preprint arXiv:2412.04628,

work page arXiv
[13]

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster

URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi- squared preference optimization.arXiv preprint arXiv:2407.13399,

work page arXiv
[14]

Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

Haque Ishfaq, Qingfeng Lan, Pan Xu, A Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

work page arXiv
[15]

More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

work page arXiv
[16]

Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

work page arXiv
[17]

Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

work page arXiv
[18]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models.https: //github.com/tatsu-lab/alpaca_eval, 5 2023a. Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices ...

work page arXiv
[19]

Understanding R1-Zero-Like Training: A Critical Perspective

URLhttps://arxiv. org/abs/2503.20783. Simon Matrenok, Skander Moalla, and Caglar Gulcehre. Quantile reward policy optimization: Alignment with pointwise regression and exact partition functions.arXiv preprint arXiv:2507.08068,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

14 Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

work page arXiv
[21]

Brendan O’Donoghue

URLhttps://arxiv.org/abs/2312.00886. Brendan O’Donoghue. Variational bayesian reinforcement learning with regret bounds.Advances in Neural Information Processing Systems, 34:28208–28221,

work page arXiv
[22]

Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

Brendan O’Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

work page arXiv
[23]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017. Association for Computational Linguistics,

work page 2024
[24]

Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour

URLhttps://proceedings.mlr.press/v119/rosenberg20a.html. Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour. Efficient preference-based reinforcement learning: Randomized exploration meets experimental design.arXiv preprint arXiv:2506.09508,

work page arXiv
[25]

Online apprenticeship learning.arXiv:2102.06924,

Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924,

work page arXiv
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URLhttps://arxiv.org/abs/2402.03300. YudaSong, GokulSwamy, AartiSingh, JBagnell, andWenSun. Theimportanceofonlinedata: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

work page arXiv
[28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

work page arXiv
[30]

Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh

URL https://openreview.net/forum?id=DChQpB4AJy. Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh. Direct preference optimization with rating information: Practical algorithms and provable gains.arXiv preprint arXiv:2602.00603,

work page arXiv
[31]

Il-soar: Imitation learning with soft optimistic actor critic

Stefano Viel, Luca Viano, and Volkan Cevher. Il-soar: Imitation learning with soft optimistic actor critic. arXiv preprint arXiv:2502.19859,

work page arXiv
[32]

α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

work page arXiv
[33]

Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

work page arXiv
[34]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

work page arXiv
[35]

To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

work page arXiv
[36]

Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

work page arXiv
[37]

Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,

work page arXiv
[38]

Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

work page arXiv
[39]

In this case, we can not derive DPO-like algorithms but we compute the maximum likelihood estimator ˆrfor the reward function from the preference dataset and then consider three approaches: (i) RL outputs the action with highest estimated reward, (ii) a method that we nameχ2RL which outputs πout = argmax π∈Π ⟨π,ˆr⟩ −γDχ2(π, πref)where we set γ = 40and (ii...

work page 2000
[40]

The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate

A.3 Other worst case aggregation rules for the ensemble. The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate. One possible alternative to reduce the generation time of the model is to consider this alternative numerator, fout(x, a) = "PL ℓ=1 ...

work page 2025
[41]

binarizing

A.4 On the token level implementation ofPEPO As mentioned in the main text, we also introduce a version forPEPO which implements the pessimistic aggregation rule given by Theorem 3.2 at the token level rather than the trajectory level. This approach is faster at generation time because it does not reject generations, while Algorithm 2 does. In practice, t...

work page 2022
[42]

It is easy to notice that for any value along the horizontal axis the two values sum up to1

The standard sigmoid is plotted in blue and σ(−x) = 1 −σ (x)is plotted with the dotted blue line. It is easy to notice that for any value along the horizontal axis the two values sum up to1. In addition, we show with a solid red line the sigmoid shifted on the right, i.e.σ(x−γ )by some margin γ and in dotted red the reflected version, i.e.σ(−x + γ), which...

work page 2023
[43]

Achieving those two improvements simultaneously, i.e

In other words, Table 3 shows thatPEPO, DPO+SFT andχ2PO all applies beyond the easiest case of tabular setting with knownπdata butalong different axes. Achieving those two improvements simultaneously, i.e. designing an approach which applies for general policy classes without knowledge about πdata, is in our opinion the most important open question rising...

work page 2024
[44]

Examples of the first kind are the emerging family of algorithms for reasoning such as GRPO Shao et al

Unknownπ data PEPO(Ours)Open Question Optimism via EnsembleIn the context of LLM post-training there are applications in which at training time it is possible to generate sentences from the model which is currently learning and getting feedback for this response under the form of a reward signal or a preference bit. Examples of the first kind are the emer...

work page 2024
[45]

and Nash Learning from Human Feedback Munos et al. [2024]. While promising at an empirical level, they all miss an exploration mechanism that should be included to ensure that the learning model queries the reward or preference oracle for response which are uncertain, i.e. for which new information should really be acquired. Such lack of exploration has b...

work page 2024
[46]

∃x, a∈ X × As.t.−Pessimism(x, a)≥ − X b πdata(b|x) h (λ(x, a, b))2eRmax +λ(x, a, b)e Rmax/2 i# ≤ |X | |A|e −2L/7 Therefore, choosingL= 7 2 log |X ||A| δ ensures that P

The condition1 −p ℓ tie(x, a, b) ≤ N ℓ(x,a,b) N ℓ(x,a,b)+2 holds true setting the tie weightsλℓ(x, a, b)to a large enough value. In particular, we requireλℓ(x, a, b)to satisfy for allx, a, b∈ X × A × Athat N ℓ(x, a, b) N ℓ(x, a, b) + 2 ≥ exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 +λ ℓ(x,...

work page 2025
[47]

Finally, to have the result holding for all x, a∈ X × A , ˜N(x, a,·) : A → [N]and k∈ [N], we conclude via a union bound. Then, choosingk = N(x, a) and ˜N(x, a, b)as the minimum count observed across the ensemble denoted asNmin(x, a, b), we have that X b πdata(b|x)pℓ tie(x, a)≤ 1 N(x, a) +γ NX n=1 4eRmax/21{Xn,A+ n =x,a} 2eRmax/2 +N min(x, a, A−n ) + 4|A|l...

work page 2020
[48]

Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the interval[−Rmax, Rmax]is of ordere Rmax, i.e.L=O eRmax

−1 + 4|X |log(|X |LN 3/δ) N ≤ 6|X |log(|X |LN 3/δ) N ,(26) therefore, plugging in (26) into (25), it holds that with probability1−δ X x ν0(x) X a πdata(a|x) X b πdata(b|x) ∆r⋆(x, a, b)− b∆(x, a, b) !2 ≤ eO L2R2 max |X | |A|2 Llog(1/δ) N ! . Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the in...

work page 2020
[49]

Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2

Therefore, λ+ |D′(y)| 2 √ D(y) λy+ p D(y) ≤ λ+ λ2+6 2a 2a , 1 1−y ≤ 1 a . Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2 . Sincea −1 =e 2Rmax + 1, the stated bound follows. 42 F.1 Importance Sampling Routine: Proof of Theorem 3.3 Proof.Let us denote fout(x, a) = min ℓ∈[L] ˜πℓ(a|x) exp(−Bptie(x, a)/β) At each...

work page 2024