On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Pith reviewed 2026-05-08 19:33 UTC · model grok-4.3 · 2 Lean theorem links
The pith
Offline KL-regularized MABs admit sample complexity Õ(η S A C^{π*}/ε) under large regularization (η = Õ(ε^{-1})) and Õ(S A C^{π*}/ε²) under small regularization (η = Ω̃(ε^{-1})), with matching lower bounds across the full range.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of Õ(η S A C^π*/ε) under large regularization η = Õ(ε^{-1}), and a sample complexity of Ω̃(S A C^π*/ε²) under small regularization η = Ω̃(ε^{-1}). We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths.
Load-bearing premise
The analysis relies on the offline dataset having a bounded policy coverage coefficient C^{π*} at the optimal policy, and on the regularization parameter η falling in one of two extreme regimes relative to 1/ε; the precise dependence of the bounds on these quantities is not derived for intermediate values η = Θ(ε^{-1}).
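To pin down the objects in play, here is a sketch of the KL-regularized objective and the coverage coefficient, reconstructed from the excerpt quoted in the Lean ledger below and from the standard single-policy concentrability definition; the paper's exact normalizations may differ, and the behavior-policy symbol μ is our notation.

```latex
% Sketch of the setup, reconstructed from the ledger excerpt and standard
% offline-bandit definitions; normalizations may differ from the paper's.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  J(\pi) := \mathbb{E}_{s \sim \rho,\; a \sim \pi(\cdot \mid s)}
  \Bigl[\, r(s,a) - \eta^{-1} \log \tfrac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} \,\Bigr],
  \qquad
  C^{\pi^*} := \max_{s,a} \frac{\pi^*(a \mid s)}{\mu(a \mid s)},
\]
where $\rho$ is the context distribution, $\mu$ is the behavior policy that
generated the offline data, and an $\epsilon$-optimal output $\hat{\pi}$ must
satisfy $J(\pi^*) - J(\hat{\pi}) \le \epsilon$.
\end{document}
```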
Original abstract
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(\eta SAC^{\pi^*}/\epsilon)$ under large regularization $\eta = \tilde{O}(\epsilon^{-1})$, and a sample complexity of $\tilde{\Omega}(SAC^{\pi^*}/\epsilon^2)$ under small regularization $\eta = \tilde{\Omega}(\epsilon^{-1})$, where $\eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{\pi^*}$ policy coverage coefficient at the optimal policy $\pi^*$, $\epsilon$ is the desired sub-optimality, and $\tilde{O}$ and $\tilde{\Omega}$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the sample complexity of offline multi-armed bandits under KL regularization. It provides a sharp analysis of the KL-PCB algorithm, deriving an upper bound of Õ(η S A C^{π*}/ε) in the large-regularization regime η = Õ(ε^{-1}) and a rate of Ω̃(S A C^{π*}/ε²) in the small-regularization regime η = Ω̃(ε^{-1}), along with a pair of lower bounds that are claimed to match these upper bounds over the entire range of η values, yielding a nearly complete characterization.
Significance. If the claimed matching bounds hold without hidden restrictions, the work would deliver a precise, regime-dependent understanding of how KL regularization strength governs sample complexity in offline MABs. This extends prior analyses of KL-PCB and supplies falsifiable lower bounds, which is a positive contribution to the theory of offline decision-making. However, the absence of an explicit upper bound or interpolation for the transitional regime η ≈ Θ(ε^{-1}) limits the completeness of the characterization.
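One arithmetic observation sharpens the transitional-regime concern: the two stated rates already agree at the boundary η = Θ(ε^{-1}), so what is missing is an upper bound covering η near that boundary, not a reconciliation of divergent rates. A one-line check (the min-envelope on the right is our extrapolation, not a theorem from the paper):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  \tilde{O}\!\Bigl(\frac{\eta\, S A\, C^{\pi^*}}{\epsilon}\Bigr)\Bigr|_{\eta = \epsilon^{-1}}
  = \tilde{O}\!\Bigl(\frac{S A\, C^{\pi^*}}{\epsilon^{2}}\Bigr),
  \qquad\text{suggesting the envelope}\quad
  \tilde{O}\!\Bigl(\min\Bigl\{\tfrac{\eta}{\epsilon},\, \tfrac{1}{\epsilon^{2}}\Bigr\}\, S A\, C^{\pi^*}\Bigr).
\]
\end{document}
```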
major comments (1)
- [Abstract] The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.
minor comments (2)
- [Abstract] The sentence 'achieves a sample complexity of Ω̃(S A C^{π*}/ε²) under small regularization' is imprecise because Ω̃ conventionally denotes lower bounds; clarify whether this rate is achieved by the algorithm or is instead the matching lower bound.
- [Abstract] The phrase 'remains largely from fully characterized' appears to be a typographical error; it should presumably read 'remains far from fully characterized.'
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address the major comment regarding the abstract and the completeness of the characterization below.
Point-by-point responses
- Referee: [Abstract] The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.
Authors: We thank the referee for this observation. Our manuscript provides upper bounds on the sample complexity of the KL-PCB algorithm in two specific regimes of the regularization parameter η: Õ(η S A C^{π*}/ε) for η = Õ(ε^{-1}) and Õ(S A C^{π*}/ε²) for η = Ω̃(ε^{-1}). We note that the abstract contains a typographical error in the second regime, where Õ should be used instead of Ω̃ to indicate the achieved upper bound. The pair of lower bounds we present match these upper bounds within their respective regimes and are valid across the full spectrum of η. Nevertheless, we do not supply an upper bound or an interpolating expression for the transitional regime η = Θ(ε^{-1}). We therefore agree that this prevents a fully closed characterization for all values of η. In the revised version, we will amend the abstract to accurately reflect the regimes where matching occurs and include a brief discussion of the transitional case as an open question. This revision will ensure our claims are precisely aligned with the theorems and proofs in the paper.
Revision: partial
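A minimal numeric sketch of the regime structure as corrected in the rebuttal (Õ in both regimes); the function name, the hard cutoff at η = 1/ε, and the suppressed constants and polylog factors are all our illustrative assumptions, not the paper's definitions.

```python
import math

def kl_pcb_upper_bound(eta: float, eps: float, S: int, A: int, C: float) -> float:
    """Illustrative (hypothetical) sketch of the claimed KL-PCB upper bounds,
    with constants and polylog factors suppressed and a hard cutoff at
    eta = 1/eps standing in for the paper's regime conditions."""
    if eta <= 1.0 / eps:                 # large regularization: eta = O~(1/eps)
        return eta * S * A * C / eps
    return S * A * C / eps ** 2          # small regularization: eta = Omega~(1/eps)

# The two regime formulas meet at the boundary eta = 1/eps, so the open
# transitional-regime question concerns constants and polylogs, not the rate.
eps, S, A, C = 0.01, 10, 5, 2.0
boundary = 1.0 / eps
large = boundary * S * A * C / eps       # large-regularization formula at eta = 1/eps
small = S * A * C / eps ** 2             # small-regularization formula
assert math.isclose(large, small)
```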
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation and washburn_uniqueness_aczel (for contrast): RS's forced cost is J(x) = ½(x + x⁻¹) − 1, not the Shannon-style KL used here; no overlap. Match: unclear. Paper excerpt: "the reverse Kullback–Leibler (KL) divergence regularizer KL(π̂ ∥ π_ref) ... J(π) := E[r(s,a) − η⁻¹ log(π/π_ref)]"
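For readers cross-checking the ledger entry, a minimal Lean sketch of the RS-style cost it contrasts with the paper's KL regularizer; the declaration names here are ours, not those of the linked formalization.

```lean
import Mathlib

-- Minimal sketch (declaration names are ours, not the linked repository's):
-- the RS-style cost contrasted in the ledger, J(x) = (x + x⁻¹)/2 − 1.
noncomputable def Jcost (x : ℝ) : ℝ := (x + x⁻¹) / 2 - 1

-- J vanishes at x = 1 ...
example : Jcost 1 = 0 := by norm_num [Jcost]

-- ... and is symmetric under x ↦ x⁻¹, unlike a KL-style penalty.
example (x : ℝ) : Jcost x⁻¹ = Jcost x := by
  unfold Jcost
  rw [inv_inv]
  ring
```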
Reference graph
Works this paper leans on
- [1] Agarwal, A., Kakade, S. and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory. PMLR.
- [2] Assouad, P. (1983). Deux remarques sur l'estimation. Comptes rendus des séances de l'Académie des sciences, Série 1, Mathématique 296 1021–1024.
- [3] Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In COLT 2010, the 23rd Conference on Learning Theory.
- [4] Aziz, M., Anderton, J., Kaufmann, E. and Aslam, J. (2018). Pure exploration in infinitely-armed bandit models with fixed-confidence. In Algorithmic Learning Theory. PMLR.
- [5] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal.
- [6] Bondy, J. A. and Murty, U. S. R. (1979). Graph Theory with Applications. North-Holland.
- [7] Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17 122–142.
- [8] Chaudhuri, A. R. and Kalyanakrishnan, S. (2017). PAC identification of a bandit arm relative to a reward quantile. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
- [9] Chaudhuri, A. R. and Kalyanakrishnan, S. (2018). Quantile-regret minimisation in infinitely many-armed bandits. In UAI.
- [10] Chen, F., Foster, D. J., Han, Y., Qian, J., Rakhlin, A. and Xu, Y. (2024). Assouad, Fano, and Le Cam with interaction: A unifying lower bound framework and characterization for bandit learnability. Advances in Neural Information Processing Systems 37 75585–75641.
- [11] Cheng, C.-A., Xie, T., Jiang, N. and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning. PMLR.
- [12] Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.
- De Heide, R., Cheshire, J., Ménard, P. and Carpentier, A. (2021). Bandits with many optimal arms. Advances in Neural Information Processing Systems 34 22457–22469.
- [13] Degenne, R. and Koolen, W. M. (2019). Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems 32.
- [14] Di, Q., Zhao, H., He, J. and Gu, Q. (2024). Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. In The Twelfth International Conference on Learning Representations.
- [15] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D. and Kiela, D. (2024). Model alignment as prospect theoretic optimization. In International Conference on Machine Learning. PMLR.
- [16] Foster, D. J., Kakade, S. M., Qian, J. and Rakhlin, A. (2021). The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487.
- [17] Foster, D. J., Mhammedi, Z. and Rohatgi, D. (2025). Is a good foundation necessary for efficient reinforcement learning? The computational role of the base model in exploration. In The Thirty Eighth Annual Conference on Learning Theory. PMLR.
- [18] Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability 100–118.
- [19] Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory. PMLR.
- [20] Garivier, A. and Kaufmann, E. (2021). Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models. Sequential Analysis 40 61–96.
- [21] Garivier, A., Ménard, P. and Stoltz, G. (2019). Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research 44 377–399.
- [22] Gilbert, E. N. (1952). A comparison of signalling alphabets. The Bell System Technical Journal 31 504–522.
- [23] Gu, Y., Han, Y. and Qian, J. (2025). Evolution of information in interactive decision making: A case study for multi-armed bandits. arXiv preprint arXiv:2503.00273.
- [24] Guruswami, V., Rudra, A. and Sudan, M. (2019). Essential Coding Theory. Draft available at http://cse.buffalo.edu/faculty/atri/courses/coding-theory/book.
- [25] Ji, K., Zhao, Q., Zhao, H., Di, Q. and Gu, Q. (2026). Near-optimal regret for KL-regularized multi-armed bandits. arXiv preprint arXiv:2603.02155.
- [26] Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning. PMLR.
- [27] Kumar, A., Zhou, A., Tucker, G. and Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 1179–1191.
- [28] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 4–22.
- [29] Lai, X., Tian, Z., Chen, Y., Yang, S., Peng, X. and Jia, J. (2024). Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629.
- [30] Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
- Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics 38–53.
- [31] Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- [32] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. and Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 421–436.
- [33] Li, G., Shi, L., Chen, Y., Chi, Y. and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics 52 233–260.
- [34] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 623–648.
- [35] Meng, Y., Xia, M. and Chen, D. (2024). SimPO: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37 124198–124235.
- [36] Nayak, A., Yang, T., Yagan, O., Joshi, G. and Chi, Y. (2025). Achieving logarithmic regret in KL-regularized zero-sum Markov games. arXiv preprint arXiv:2510.13060.
- [37] Ozdaglar, A. E., Pattathil, S., Zhang, J. and Zhang, K. (2023). Revisiting the linear-programming framework for offline RL with general function approximation. In International Conference on Machine Learning. PMLR.
- [38] Polyanskiy, Y. and Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press.
- [39] Rafailov, R., Hejna, J., Park, R. and Finn, C. (2024). From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.
- [40] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S. and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36.
- [41] Rashidinejad, P., Zhu, B., Ma, C., Jiao, J. and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems 34 11702–11716.
- [42] Rashidinejad, P., Zhu, H., Yang, K., Russell, S. and Jiao, J. (2022). Optimal conservative offline RL with general function approximation via augmented Lagrangian. arXiv preprint arXiv:2211.00716.
- [43] Ren, T., Li, J., Dai, B., Du, S. S. and Sanghavi, S. (2021). Nearly horizon-free offline reinforcement learning. Advances in Neural Information Processing Systems 34 15621–15634.
- [44] Shamir, O. (2011). A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392.
- [45] Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory. PMLR.
- [46] Shi, L., Li, G., Wei, Y., Chen, Y. and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning. PMLR.
- [47] Sidford, A., Wang, M., Wu, X., Yang, L. F. and Ye, Y. (2018). Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. arXiv preprint arXiv:1806.01492.
- [48] Tiapkin, D., Belomestny, D., Calandriello, D., Moulines, E., Munos, R., Naumov, A., Perrault, P., Tang, Y., Valko, M. and Menard, P. (2023). Fast rates for maximum entropy exploration. In International Conference on Machine Learning. PMLR.
- [49] Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer Series in Statistics.
- [50] Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
- [51] Varshamov, R. R. (1957). Estimate of the number of signals in error correcting codes. Doklady Akad. Nauk SSSR 117 739–741.
- [52] Wang, R., Foster, D. and Kakade, S. M. (2021). What are the statistical limits of offline RL with linear function approximation? In International Conference on Learning Representations.
- [53] Wang, Z., Zhou, D., Lui, J. C. and Sun, W. (2025). Model-based RL as a minimalist approach to horizon-free and second-order bounds. In The Thirteenth International Conference on Learning Representations.
- [54] Weng, W., He, Y. and Zhou, X. (2025). Improved bounds for private and robust alignment. arXiv preprint arXiv:2512.23816.
- [55] Wu, Y., Thareja, R., Vepakomma, P. and Orabona, F. (2025b). Offline and online KL-regularized RLHF under differential privacy. arXiv preprint arXiv:2510.13512.
- [56] Wu, Y., Tucker, G. and Nachum, O. (2019). Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- [57] Xie, T., Foster, D. J., Krishnamurthy, A., Rosset, C., Awadallah, A. H. and Rakhlin, A. (2025). Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF. In The Thirteenth International Conference on Learning Representations.
- [58] Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N. and Zhang, T. (2024). Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In Forty-first International Conference on Machine Learning.
- [59] Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L. and Zhang, T. (2023). Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game. In International Conference on Learning Representations (ICLR).
- [60] Yan, Y., Li, G., Chen, Y. and Fan, J. (2023). The efficacy of pessimism in asynchronous Q-learning. IEEE Transactions on Information Theory 69 7185–7219.
- [61] Yin, M., Duan, Y., Wang, M. and Wang, Y.-X. (2022). Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. In International Conference on Learning Representations.
- [62] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics. Springer, 423–435.
- [63] Zhan, W., Huang, B., Huang, A., Jiang, N. and Lee, J. (2022). Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory. PMLR.
- [64] Zhang, T. (2023). Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press.
- [65] Zhang, Y., Chen, C. and Jiang, N. (2026). Beyond pessimism: Offline learning in KL-regularized games. arXiv preprint arXiv:2604.06738.
- [66] Zhao, Q., Ji, K., Zhao, H., Zhang, T. and Gu, Q. (2026). Towards a sharp analysis of offline policy learning for f-divergence-regularized contextual bandits. In The Fourteenth International Conference on Learning Representations.
- [67] Zhu, Y. and Nowak, R. (2020). On regret with multiple best arms. Advances in Neural Information Processing Systems 33 9050–9060.