Recognition: 2 theorem links
Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
Pith reviewed 2026-05-16 06:32 UTC · model grok-4.3
The pith
PEPO mitigates DPO over-optimization via pessimistic ensembling, achieving sample complexity bounds that depend only on single-policy concentrability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training preference-optimized policies on disjoint subsets and aggregating them pessimistically via worst-case construction, PEPO attains sample complexity guarantees depending only on a single-policy concentrability coefficient, thereby avoiding the all-policy concentrability required by algorithms prone to over-optimization such as DPO.
What carries the argument
Pessimistic ensemble aggregation of DPO policies trained on disjoint data subsets, using worst-case construction that favors agreement across models.
If this is right
- Sample complexity scales only with single-policy concentrability rather than all-policy concentrability.
- No explicit reward model or data distribution knowledge is required.
- Single-step training simplicity of DPO is preserved.
- Empirical performance remains competitive while theoretical guarantees improve.
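For orientation, the two coverage notions the bullets contrast can be written out. The following is a sketch in conventional offline-RL notation, not quoted from the paper, which may use a preference-specific variant:

```latex
% mu: data distribution; d^pi: occupancy measure of policy pi; pi^*: comparator.
C^{\pi^*} = \max_{s,a} \frac{d^{\pi^*}(s,a)}{\mu(s,a)}
\quad \text{(single-policy: only the comparator must be covered)},
\qquad
C_{\mathrm{all}} = \max_{\pi \in \Pi}\, \max_{s,a} \frac{d^{\pi}(s,a)}{\mu(s,a)}
\quad \text{(all-policy: every candidate policy must be covered)}.
```

Since a fixed dataset can cover one good comparator policy far more easily than every policy in the class, bounds in terms of $C^{\pi^*}$ are much weaker assumptions than bounds in terms of $C_{\mathrm{all}}$.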
Where Pith is reading between the lines
- The disjoint-subset ensemble idea may extend to continuous or high-dimensional settings through approximate aggregation.
- Partitioning preference data could serve as a practical substitute for explicit pessimism penalties in large-scale preference tuning.
- Similar constructions might apply to other single-step preference methods beyond DPO.
Load-bearing premise
An ensemble of policies trained on disjoint subsets can be aggregated via worst-case construction to produce pessimism without access to the data-generating distribution or an explicit reward model.
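The premise can be made concrete with a toy tabular sketch. Everything below is illustrative and not from the paper: the contextual-bandit setup, the noise model standing in for per-subset DPO training, and the use of the DPO log-ratio as the implicit reward are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular contextual bandit: S states, A actions, M ensemble members.
S, A, M = 4, 3, 5
beta = 0.1                            # DPO temperature (assumed)
pi_ref = np.full((S, A), 1.0 / A)     # uniform reference policy

# Stand-in for "DPO trained on one disjoint subset": each member is a noisy
# softmax policy around a shared ground-truth preference score.
true_score = rng.normal(size=(S, A))

def train_member(noise_scale=0.5):
    logits = true_score + rng.normal(scale=noise_scale, size=(S, A))
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

members = [train_member() for _ in range(M)]

# Worst-case aggregation: treat beta * log(pi_m / pi_ref) as each member's
# implicit reward, take the minimum across the ensemble (pessimism), then
# act greedily. Disagreement on poorly covered pairs drags the minimum down.
implicit_rewards = np.stack([beta * np.log(pi / pi_ref) for pi in members])
pessimistic_reward = implicit_rewards.min(axis=0)      # shape (S, A)
aggregated_action = pessimistic_reward.argmax(axis=1)  # one action per state
print(aggregated_action.shape)
```

Note how no knowledge of the data-generating distribution enters: pessimism arises purely from inter-member disagreement.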
What would settle it
An experiment in a tabular MDP showing that PEPO's performance degrades to the level of DPO even when single-policy concentrability is low, or that the ensemble fails to prevent over-optimization on held-out preference data.
Figures
Original abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring the knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction that favors the agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by a convincing practical performance, while retaining the simplicity and the practicality of DPO-style training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PEPO, a single-step DPO-style algorithm that mitigates over-optimization in preference learning by training an ensemble of policies on disjoint data subsets and aggregating them via a worst-case construction that favors model agreement. In the tabular setting, it claims sample-complexity guarantees that depend only on single-policy concentrability coefficients, avoiding the all-policy concentrability required by standard DPO. The approach requires neither knowledge of the data-generating distribution nor an explicit reward model, and the authors report that experiments corroborate the theory while preserving DPO's practicality.
Significance. If the concentrability reduction holds, the result would be significant for preference optimization methods: it supplies a practical, distribution-free mechanism for pessimism that weakens a key assumption limiting theoretical guarantees in DPO and related algorithms. The retention of single-step training and the explicit tabular bound are strengths; the ensemble construction for pessimism without reward modeling is a clean idea that could influence follow-up work on robust RLHF.
major comments (2)
- [§4.2, Theorem 1] Tabular sample-complexity bound: the proof that worst-case aggregation yields a bound depending solely on single-policy concentrability is not fully explicit. When ensemble members disagree on low-coverage state-action pairs, the worst-case selection can effectively optimize over the union of supports; without a separate uniform bound on policy disagreement (or an explicit definition of the aggregation operator), the reduction from all-policy to single-policy concentrability does not follow automatically and requires additional argument.
- [§3.2] Algorithm definition: the precise mathematical form of the 'worst-case construction that favors agreement across models' is not stated as an equation or pseudocode step. This makes it impossible to verify whether the aggregation operator preserves the single-policy concentrability property claimed in the tabular analysis.
minor comments (2)
- [Experiments section] The abstract claims that the tabular sample-complexity results are corroborated by experiments, yet the experimental section provides only high-level performance curves, without reporting the measured concentrability coefficients or the exact ensemble size and subset sizes used.
- [§3] Notation for the ensemble policies and the aggregation operator should be introduced once in §3 and used consistently thereafter to improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We are glad that the significance of the concentrability reduction and the practical appeal of PEPO are recognized. We have revised the manuscript to address both major comments by supplying the missing explicit mathematical definitions and expanding the proof with the required intermediate arguments.
Point-by-point responses
-
Referee: [§4.2, Theorem 1] Tabular sample-complexity bound: the proof that worst-case aggregation yields a bound depending solely on single-policy concentrability is not fully explicit. When ensemble members disagree on low-coverage state-action pairs, the worst-case selection can effectively optimize over the union of supports; without a separate uniform bound on policy disagreement (or an explicit definition of the aggregation operator), the reduction from all-policy to single-policy concentrability does not follow automatically and requires additional argument.
Authors: We agree that the original proof sketch was too concise and left the handling of disagreement implicit. In the revision we have inserted a new Lemma 3 that bounds the total-variation disagreement between any pair of ensemble members by a term controlled solely by the single-policy concentrability coefficient (using the fact that the policies are trained on disjoint data subsets). The worst-case aggregation operator is now defined explicitly: for each state, it selects the action that maximizes the minimum estimated value across the ensemble. This choice penalizes disagreement on low-coverage pairs rather than allowing optimization over their union. The revised proof of Theorem 1 chains the new lemma directly into the concentrability argument, yielding the claimed single-policy bound. The full expanded proof appears in the main text with supporting calculations moved to the appendix. revision: yes
-
Referee: [§3.2] Algorithm definition: the precise mathematical form of the 'worst-case construction that favors agreement across models' is not stated as an equation or pseudocode step. This makes it impossible to verify whether the aggregation operator preserves the single-policy concentrability property claimed in the tabular analysis.
Authors: We accept the criticism that the aggregation step was described only in prose. Section 3.2 now contains the explicit definition: given an ensemble of M policies {π_i} trained on disjoint subsets, the aggregated policy acts greedily with respect to the pessimistic value estimate, π_PEPO(s) = argmax_a min_{i∈[M]} Q̂_i(s,a), where Q̂_i denotes the implicit value estimate induced by each DPO-trained policy. We have also added Algorithm 1 (pseudocode) that lists the three stages, namely disjoint subset partitioning, independent DPO training, and pointwise worst-case selection, together with the precise tie-breaking rule that favors agreement. With this operator made formal, the single-policy concentrability preservation follows immediately from the new Lemma 3 mentioned above. revision: yes
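The pointwise worst-case selection described in the rebuttal can be sketched directly. The Q̂ values below are synthetic stand-ins for the implicit estimates induced by DPO-trained policies, purely to illustrate the operator:

```python
import numpy as np

def worst_case_select(q_hats: np.ndarray) -> np.ndarray:
    """q_hats has shape (M, S, A); returns one greedy action index per state.

    Taking the minimum over members first means an action only scores well
    if the whole ensemble agrees it is good; disagreement is penalized.
    """
    pessimistic_q = q_hats.min(axis=0)   # (S, A)
    return pessimistic_q.argmax(axis=1)  # (S,)

q_hats = np.array([
    [[1.0, 0.2], [0.5, 0.9]],   # member 1
    [[0.9, 0.8], [0.4, 0.1]],   # member 2: disagrees with member 1 on action 1
])
print(worst_case_select(q_hats))  # [0 0]
```

In both states the ensemble agrees that action 0 is decent, while action 1 looks good only to one member, so pessimism selects action 0.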
Circularity Check
No significant circularity; sample bound derived from new ensemble construction
Full rationale
The paper defines PEPO as an explicit algorithmic procedure (ensemble policies on disjoint subsets, aggregated by worst-case construction favoring agreement) and then states a sample-complexity result for the tabular case that depends only on single-policy concentrability. No equation or claim in the abstract or description reduces this bound to a fitted parameter defined by the algorithm itself, nor does the central guarantee rest on a self-citation chain or ansatz smuggled from prior work. The derivation is therefore self-contained: the claimed reduction from all-policy to single-policy concentrability is presented as a consequence of the new construction rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Tabular MDP with finite states and actions
- domain assumption Disjoint data subsets allow construction of a pessimistic ensemble without knowledge of the data-generating distribution
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "we choose the worst-case reward as r⁻(x,a) = β min_ℓ log(π̃_ℓ(a|x)/π_ref(a|x)) − B·p_tie(x,a)"
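A worked numeric instance of this worst-case reward may help; all values below are illustrative, not taken from the paper:

```python
import math

# Worst-case reward from the quoted passage:
#   r_minus(x, a) = beta * min_l log(pi_l(a|x) / pi_ref(a|x)) - B * p_tie(x, a)
beta, B = 0.1, 2.0                 # temperature and tie penalty (assumed)
pi_ref = 0.25                      # reference probability of action a at x
pi_members = [0.40, 0.30, 0.55]    # ensemble probabilities of a at x
p_tie = 0.05                       # estimated tie probability

# The minimum log-ratio comes from the most pessimistic ensemble member
# (here the one assigning probability 0.30), and the tie term subtracts
# a penalty proportional to the estimated tie probability.
log_ratios = [math.log(p / pi_ref) for p in pi_members]
r_minus = beta * min(log_ratios) - B * p_tie
print(round(r_minus, 4))
```

Even though every member prefers the action over the reference, the tie penalty can push the pessimistic reward negative, which is the conservative behavior the construction is after.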
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.