On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Pith reviewed 2026-05-08 19:33 UTC · model grok-4.3 · 2 Lean theorem links
The pith
Offline KL-regularized MABs admit sample complexity Õ(η S A C^{π*}/ε) under large regularization (η = Õ(ε^{-1})) and Õ(S A C^{π*}/ε²) under small regularization (η = Ω̃(ε^{-1})), with matching lower bounds across the full range.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of Õ(η S A C^π*/ε) under large regularization η = Õ(ε^{-1}), and a sample complexity of Ω̃(S A C^π*/ε²) under small regularization η = Ω̃(ε^{-1}). We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths.
Load-bearing premise
The analysis relies on the offline dataset having a bounded policy coverage coefficient C^{π*} at the optimal policy, and on the regularization parameter η falling in one of two extreme regimes relative to 1/ε; the precise dependence of the bounds on these quantities is not derived for intermediate values η = Θ(ε^{-1}).
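To pin down the objects in play, here is a sketch of the KL-regularized objective and the coverage coefficient, reconstructed from the excerpt quoted in the Lean ledger below and from the standard single-policy concentrability definition; the paper's exact normalizations may differ, and the behavior-policy symbol μ is our notation.

```latex
% Sketch of the setup, reconstructed from the ledger excerpt and standard
% offline-bandit definitions; normalizations may differ from the paper's.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  J(\pi) := \mathbb{E}_{s \sim \rho,\; a \sim \pi(\cdot \mid s)}
  \Bigl[\, r(s,a) - \eta^{-1} \log \tfrac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} \,\Bigr],
  \qquad
  C^{\pi^*} := \max_{s,a} \frac{\pi^*(a \mid s)}{\mu(a \mid s)},
\]
where $\rho$ is the context distribution, $\mu$ is the behavior policy that
generated the offline data, and an $\epsilon$-optimal output $\hat{\pi}$ must
satisfy $J(\pi^*) - J(\hat{\pi}) \le \epsilon$.
\end{document}
```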
Original abstract
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(\eta SAC^{\pi^*}/\epsilon)$ under large regularization $\eta = \tilde{O}(\epsilon^{-1})$, and a sample complexity of $\tilde{\Omega}(SAC^{\pi^*}/\epsilon^2)$ under small regularization $\eta = \tilde{\Omega}(\epsilon^{-1})$, where $\eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{\pi^*}$ policy coverage coefficient at the optimal policy $\pi^*$, $\epsilon$ is the desired sub-optimality, and $\tilde{O}$ and $\tilde{\Omega}$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the sample complexity of offline multi-armed bandits under KL regularization. It provides a sharp analysis of the KL-PCB algorithm, deriving an upper bound of Õ(η S A C^{π*}/ε) in the large-regularization regime η = Õ(ε^{-1}) and a rate of Ω̃(S A C^{π*}/ε²) in the small-regularization regime η = Ω̃(ε^{-1}), along with a pair of lower bounds that are claimed to match these upper bounds over the entire range of η values, yielding a nearly complete characterization.
Significance. If the claimed matching bounds hold without hidden restrictions, the work would deliver a precise, regime-dependent understanding of how KL regularization strength governs sample complexity in offline MABs. This extends prior analyses of KL-PCB and supplies falsifiable lower bounds, which is a positive contribution to the theory of offline decision-making. However, the absence of an explicit upper bound or interpolation for the transitional regime η ≈ Θ(ε^{-1}) limits the completeness of the characterization.
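One arithmetic observation sharpens the transitional-regime concern: the two stated rates already agree at the boundary η = Θ(ε^{-1}), so what is missing is an upper bound covering η near that boundary, not a reconciliation of divergent rates. A one-line check (the min-envelope on the right is our extrapolation, not a theorem from the paper):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  \tilde{O}\!\Bigl(\frac{\eta\, S A\, C^{\pi^*}}{\epsilon}\Bigr)\Bigr|_{\eta = \epsilon^{-1}}
  = \tilde{O}\!\Bigl(\frac{S A\, C^{\pi^*}}{\epsilon^{2}}\Bigr),
  \qquad\text{suggesting the envelope}\quad
  \tilde{O}\!\Bigl(\min\Bigl\{\tfrac{\eta}{\epsilon},\, \tfrac{1}{\epsilon^{2}}\Bigr\}\, S A\, C^{\pi^*}\Bigr).
\]
\end{document}
```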
major comments (1)
- [Abstract] The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.
minor comments (2)
- [Abstract] The sentence 'achieves a sample complexity of Ω̃(S A C^{π*}/ε²) under small regularization' is imprecise because Ω̃ conventionally denotes lower bounds; clarify whether this rate is achieved by the algorithm or is instead the matching lower bound.
- [Abstract] The phrase 'remains largely from fully characterized' appears to be a typographical error; it should presumably read 'remains far from fully characterized.'
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address the major comment regarding the abstract and the completeness of the characterization below.
Point-by-point responses
- Referee: [Abstract] The upper bound for KL-PCB is stated only for the two extreme regimes (η = Õ(ε^{-1}) yielding Õ(η S A C^{π*}/ε) and η = Ω̃(ε^{-1}) yielding Ω̃(S A C^{π*}/ε²)), yet the central claim is a 'nearly complete characterization' via matching bounds 'over the entire range of regularization strengths.' No general upper bound or continuous interpolation is supplied for the transitional regime η = Θ(ε^{-1}), so the lower bounds cannot close the characterization for all η as asserted.
Authors: We thank the referee for this observation. Our manuscript provides upper bounds on the sample complexity of the KL-PCB algorithm in two specific regimes of the regularization parameter η: Õ(η S A C^{π*}/ε) for η = Õ(ε^{-1}) and Õ(S A C^{π*}/ε²) for η = Ω̃(ε^{-1}). We note that the abstract contains a typographical error in the second regime, where Õ should be used instead of Ω̃ to indicate the achieved upper bound. The pair of lower bounds we present match these upper bounds within their respective regimes and are valid across the full spectrum of η. Nevertheless, we do not supply an upper bound or an interpolating expression for the transitional regime η = Θ(ε^{-1}). We therefore agree that this prevents a fully closed characterization for all values of η. In the revised version, we will amend the abstract to accurately reflect the regimes where matching occurs and include a brief discussion of the transitional case as an open question. This revision will ensure our claims are precisely aligned with the theorems and proofs in the paper.
Revision: partial
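A minimal numeric sketch of the regime structure as corrected in the rebuttal (Õ in both regimes); the function name, the hard cutoff at η = 1/ε, and the suppressed constants and polylog factors are all our illustrative assumptions, not the paper's definitions.

```python
import math

def kl_pcb_upper_bound(eta: float, eps: float, S: int, A: int, C: float) -> float:
    """Illustrative (hypothetical) sketch of the claimed KL-PCB upper bounds,
    with constants and polylog factors suppressed and a hard cutoff at
    eta = 1/eps standing in for the paper's regime conditions."""
    if eta <= 1.0 / eps:                 # large regularization: eta = O~(1/eps)
        return eta * S * A * C / eps
    return S * A * C / eps ** 2          # small regularization: eta = Omega~(1/eps)

# The two regime formulas meet at the boundary eta = 1/eps, so the open
# transitional-regime question concerns constants and polylogs, not the rate.
eps, S, A, C = 0.01, 10, 5, 2.0
boundary = 1.0 / eps
large = boundary * S * A * C / eps       # large-regularization formula at eta = 1/eps
small = S * A * C / eps ** 2             # small-regularization formula
assert math.isclose(large, small)
```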
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation and washburn_uniqueness_aczel (for contrast): RS's forced cost is J(x) = ½(x + x⁻¹) − 1, not the Shannon-style KL used here; no overlap. Match: unclear. Paper excerpt: "the reverse Kullback–Leibler (KL) divergence regularizer KL(π̂ ∥ π_ref) ... J(π) := E[r(s,a) − η⁻¹ log(π/π_ref)]"
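For readers cross-checking the ledger entry, a minimal Lean sketch of the RS-style cost it contrasts with the paper's KL regularizer; the declaration names here are ours, not those of the linked formalization.

```lean
import Mathlib

-- Minimal sketch (declaration names are ours, not the linked repository's):
-- the RS-style cost contrasted in the ledger, J(x) = (x + x⁻¹)/2 − 1.
noncomputable def Jcost (x : ℝ) : ℝ := (x + x⁻¹) / 2 - 1

-- J vanishes at x = 1 ...
example : Jcost 1 = 0 := by norm_num [Jcost]

-- ... and is symmetric under x ↦ x⁻¹, unlike a KL-style penalty.
example (x : ℝ) : Jcost x⁻¹ = Jcost x := by
  unfold Jcost
  rw [inv_inv]
  ring
```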
Reference graph
Works this paper leans on
- [1] Agarwal, A., Kakade, S. and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory. PMLR.
- [2] Assouad, P. (1983). Deux remarques sur l'estimation. Comptes rendus des séances de l'Académie des sciences, Série 1, Mathématique 296 1021–1024.
- [3] Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In COLT 2010, the 23rd Conference on Learning Theory.
- [4] Aziz, M., Anderton, J., Kaufmann, E. and Aslam, J. (2018). Pure exploration in infinitely-armed bandit models with fixed-confidence. In Algorithmic Learning Theory. PMLR.
- [5] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal.
- [6] Bondy, J. A. and Murty, U. S. R. (1979). Graph Theory with Applications. North-Holland.
- [7] Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17 122–142.
- [8] Chaudhuri, A. R. and Kalyanakrishnan, S. (2017). PAC identification of a bandit arm relative to a reward quantile. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
- [9] Chaudhuri, A. R. and Kalyanakrishnan, S. (2018). Quantile-regret minimisation in infinitely many-armed bandits. In UAI.
- [10] Chen, F., Foster, D. J., Han, Y., Qian, J., Rakhlin, A. and Xu, Y. (2024). Assouad, Fano, and Le Cam with interaction: A unifying lower bound framework and characterization for bandit learnability. Advances in Neural Information Processing Systems 37 75585–75641.
- [11] Cheng, C.-A., Xie, T., Jiang, N. and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning. PMLR.
- [12] Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.
- De Heide, R., Cheshire, J., Ménard, P. and Carpentier, A. (2021). Bandits with many optimal arms. Advances in Neural Information Processing Systems 34 22457–22469.
- [13] Degenne, R. and Koolen, W. M. (2019). Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems 32.
- [14] Di, Q., Zhao, H., He, J. and Gu, Q. (2024). Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. In The Twelfth International Conference on Learning Representations.
- [15] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D. and Kiela, D. (2024). Model alignment as prospect theoretic optimization. In International Conference on Machine Learning. PMLR.
- [16] Foster, D. J., Kakade, S. M., Qian, J. and Rakhlin, A. (2021). The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487.
- [17] Foster, D. J., Mhammedi, Z. and Rohatgi, D. (2025). Is a good foundation necessary for efficient reinforcement learning? The computational role of the base model in exploration. In The Thirty Eighth Annual Conference on Learning Theory. PMLR.
- [18] Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability 100–118.
- [19] Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory. PMLR.
- [20] Garivier, A. and Kaufmann, E. (2021). Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models. Sequential Analysis 40 61–96.
- [21] Garivier, A., Ménard, P. and Stoltz, G. (2019). Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research 44 377–399.
- [22] Gilbert, E. N. (1952). A comparison of signalling alphabets. The Bell System Technical Journal 31 504–522.
- [23] Gu, Y., Han, Y. and Qian, J. (2025). Evolution of information in interactive decision making: A case study for multi-armed bandits. arXiv preprint arXiv:2503.00273.
- [24] Guruswami, V., Rudra, A. and Sudan, M. (2019). Essential Coding Theory. Draft available at http://cse.buffalo.edu/faculty/atri/courses/coding-theory/book.
- [25] Ji, K., Zhao, Q., Zhao, H., Di, Q. and Gu, Q. (2026). Near-optimal regret for KL-regularized multi-armed bandits. arXiv preprint arXiv:2603.02155.
- [26] Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning. PMLR.
- [27] Kumar, A., Zhou, A., Tucker, G. and Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 1179–1191.
- [28] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 4–22.
- [29] Lai, X., Tian, Z., Chen, Y., Yang, S., Peng, X. and Jia, J. (2024). Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629.
- [30] Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
- Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics 38–53.
- [31] Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- [32] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. and Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 421–436.
- [33] Li, G., Shi, L., Chen, Y., Chi, Y. and Wei, Y. (2024). Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics 52 233–260.
- [34] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 623–648.
- [35] Meng, Y., Xia, M. and Chen, D. (2024). SimPO: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37 124198–124235.
- [36] Nayak, A., Yang, T., Yagan, O., Joshi, G. and Chi, Y. (2025). Achieving logarithmic regret in KL-regularized zero-sum Markov games. arXiv preprint arXiv:2510.13060.
- [37] Ozdaglar, A. E., Pattathil, S., Zhang, J. and Zhang, K. (2023). Revisiting the linear-programming framework for offline RL with general function approximation. In International Conference on Machine Learning. PMLR.
- [38] Polyanskiy, Y. and Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press.
- [39] Rafailov, R., Hejna, J., Park, R. and Finn, C. (2024). From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.
- [40] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S. and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36.
- [41] Rashidinejad, P., Zhu, B., Ma, C., Jiao, J. and Russell, S. (2021). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems 34 11702–11716.
- [42] Rashidinejad, P., Zhu, H., Yang, K., Russell, S. and Jiao, J. (2022). Optimal conservative offline RL with general function approximation via augmented Lagrangian. arXiv preprint arXiv:2211.00716.
- [43] Ren, T., Li, J., Dai, B., Du, S. S. and Sanghavi, S. (2021). Nearly horizon-free offline reinforcement learning. Advances in Neural Information Processing Systems 34 15621–15634.
- [44] Shamir, O. (2011). A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392.
- [45] Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory. PMLR.
- [46] Shi, L., Li, G., Wei, Y., Chen, Y. and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning. PMLR.
- [47] Sidford, A., Wang, M., Wu, X., Yang, L. F. and Ye, Y. (2018). Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. arXiv preprint arXiv:1806.01492.
- [48] Tiapkin, D., Belomestny, D., Calandriello, D., Moulines, E., Munos, R., Naumov, A., Perrault, P., Tang, Y., Valko, M. and Menard, P. (2023). Fast rates for maximum entropy exploration. In International Conference on Machine Learning. PMLR.
- [49] Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer Series in Statistics.
- [50] Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage. arXiv preprint arXiv:2107.06226.
- [51] Varshamov, R. R. (1957). Estimate of the number of signals in error correcting codes. Doklady Akad. Nauk SSSR 117 739–741.
- [52] Wang, R., Foster, D. and Kakade, S. M. (2021). What are the statistical limits of offline RL with linear function approximation? In International Conference on Learning Representations.
- [53] Wang, Z., Zhou, D., Lui, J. C. and Sun, W. (2025). Model-based RL as a minimalist approach to horizon-free and second-order bounds. In The Thirteenth International Conference on Learning Representations.
- [54] Weng, W., He, Y. and Zhou, X. (2025). Improved bounds for private and robust alignment. arXiv preprint arXiv:2512.23816.
- [55] Wu, Y., Thareja, R., Vepakomma, P. and Orabona, F. (2025b). Offline and online KL-regularized RLHF under differential privacy. arXiv preprint arXiv:2510.13512.
- [56] Wu, Y., Tucker, G. and Nachum, O. (2019). Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- [57] Xie, T., Foster, D. J., Krishnamurthy, A., Rosset, C., Awadallah, A. H. and Rakhlin, A. (2025). Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF. In The Thirteenth International Conference on Learning Representations.
- [58] Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N. and Zhang, T. (2024). Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In Forty-first International Conference on Machine Learning.
- [59] Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L. and Zhang, T. (2023). Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game. In International Conference on Learning Representations (ICLR).
- [60] Yan, Y., Li, G., Chen, Y. and Fan, J. (2023). The efficacy of pessimism in asynchronous Q-learning. IEEE Transactions on Information Theory 69 7185–7219.
- [61] Yin, M., Duan, Y., Wang, M. and Wang, Y.-X. (2022). Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. In International Conference on Learning Representations.
- [62] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics. Springer, 423–435.
- [63] Zhan, W., Huang, B., Huang, A., Jiang, N. and Lee, J. (2022). Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory. PMLR.
- [64] Zhang, T. (2023). Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press.
- [65] Zhang, Y., Chen, C. and Jiang, N. (2026). Beyond pessimism: Offline learning in KL-regularized games. arXiv preprint arXiv:2604.06738.
- [66] Zhao, Q., Ji, K., Zhao, H., Zhang, T. and Gu, Q. (2026). Towards a sharp analysis of offline policy learning for f-divergence-regularized contextual bandits. In The Fourteenth International Conference on Learning Representations.
- [67] Zhu, Y. and Nowak, R. (2020). On regret with multiple best arms. Advances in Neural Information Processing Systems 33 9050–9060.