Recognition: no theorem link
Nearly-Optimal Algorithm for Adversarial Kernelized Bandits
Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3
The pith
The exponential-weight algorithm achieves Õ(√(T γ_T)) adversarial regret in kernelized bandits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The exponential-weight algorithm achieves Õ(√(T γ_T)) adversarial regret in the kernelized bandit problem, where γ_T denotes the maximum information gain. Algorithm-independent lower bounds for the squared exponential and ν-Matérn kernels establish that this performance is optimal up to polylogarithmic factors. A Nyström-based variant preserves the regret bound while remaining computationally tractable.
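Spelled out, the regret notion behind this claim is the standard adversarial regret against the best fixed action in hindsight (a standard definition, restated here for context rather than quoted from the paper):

```latex
% Adversarial regret over T rounds against the best fixed action in hindsight,
% where f_t is the adversary's reward function and x_t the algorithm's choice:
R_T \;=\; \max_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \;-\; \sum_{t=1}^{T} f_t(x_t),
\qquad \mathbb{E}[R_T] \;=\; \tilde{O}\!\left(\sqrt{T\,\gamma_T}\right).
```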
What carries the argument
The exponential-weight algorithm, which maintains a multiplicatively updated distribution over actions based on observed rewards, together with the maximum information gain γ_T, which quantifies the effective complexity of functions in the reproducing kernel Hilbert space.
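As a rough illustration of the first ingredient: the generic bandit exponential-weights (Exp3-style) template on a finite action set can be sketched as follows. This is a minimal sketch with illustrative names (`exp_weights_bandit`, `eta`), not the paper's actual algorithm, which runs over an RKHS-adapted action set with kernel-based machinery.

```python
import numpy as np

def exp_weights_bandit(reward_fns, n_actions, eta, rng):
    """Exp3-style exponential weights over a finite action set.

    reward_fns: sequence of length-n_actions reward vectors, entries in [0, 1].
    Only the chosen action's reward is observed, so the update uses an
    importance-weighted (unbiased) reward estimate.
    """
    log_w = np.zeros(n_actions)          # log-weights, for numerical stability
    total_reward = 0.0
    for r in reward_fns:
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                     # multiplicative-weights distribution
        i = rng.choice(n_actions, p=p)   # play a random action
        total_reward += r[i]
        log_w[i] += eta * r[i] / p[i]    # unbiased estimate r[i]/p[i] on arm i
    return total_reward
```

On a toy instance where one action always pays 1, the distribution concentrates on that action after a short exploration phase.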
Load-bearing premise
Reward functions chosen adversarially at each round belong to a fixed known reproducing kernel Hilbert space whose maximum information gain can be bounded.
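One standard consequence worth keeping in mind (a textbook RKHS fact, stated here for context rather than quoted from the paper): assuming the usual norm bound ||f_t||_H ≤ B, function values are uniformly controlled via the reproducing property and Cauchy–Schwarz:

```latex
% Reproducing property: f_t(x) = <f_t, k(x,·)>_H; then Cauchy–Schwarz gives
|f_t(x)| \;=\; \bigl|\langle f_t,\, k(x,\cdot)\rangle_{\mathcal{H}}\bigr|
\;\le\; \|f_t\|_{\mathcal{H}}\,\sqrt{k(x,x)}
\;\le\; B\,\sqrt{k(x,x)} .
```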
What would settle it
An explicit adversary and kernel for which every algorithm must incur regret exceeding √(T γ_T) by more than polylogarithmic factors would disprove the upper bound.
Original abstract
This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm achieves $\tilde{O}(\sqrt{T \gamma_T})$ adversarial regret, where $T$ and $\gamma_T$ denote the number of total rounds and the maximum information gain, respectively. For squared exponential (SE) and $\nu$-Mat\'ern kernels, we also show algorithm-independent lower bounds that guarantee the optimality of our algorithm up to polylogarithmic factors. Furthermore, we present a computationally efficient variant of our algorithm using Nystr\"om approximation while maintaining nearly optimal regret guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies adversarial kernelized bandits where reward functions lie in a known RKHS with fixed kernel. It shows that the exponential-weights algorithm attains Õ(√(T γ_T)) regret, where γ_T is the maximum information gain. Algorithm-independent lower bounds are given for squared-exponential and ν-Matérn kernels that match up to polylog factors, establishing near-optimality. A Nyström-approximated computationally efficient variant is also presented that preserves the nearly-optimal regret.
Significance. If the central regret bounds and lower-bound constructions hold, the work provides the first nearly-tight characterization of adversarial regret for kernelized bandits, directly extending the information-gain framework from the stochastic setting. The explicit dependence on γ_T yields kernel-specific rates (including polylog optimality for SE and Matérn), and the Nyström variant addresses computational tractability without sacrificing the leading term. These are standard strengths of the kernel-bandit literature and are cleanly realized here.
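For intuition about the Nyström variant mentioned above: the basic Nyström construction replaces the full kernel matrix with a low-rank surrogate built from a few sampled columns. The sketch below shows only this generic construction with illustrative names (`rbf_kernel`, `nystrom_approx`) and fixed landmarks; the paper's variant instead uses recursive RLS-Nyström sampling.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale**2))

def nystrom_approx(X, landmark_idx, lengthscale=1.0):
    """Basic Nystrom approximation K ~= C W^+ C^T from landmark columns."""
    C = rbf_kernel(X, X[landmark_idx], lengthscale)  # n x s kernel columns
    W = C[landmark_idx]                              # s x s landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))
K = rbf_kernel(X, X)
K_hat = nystrom_approx(X, np.arange(15))   # first 15 points as landmarks
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Two properties make this useful: the approximation is exact on the landmark block, and K − K_hat is positive semidefinite (a Schur-complement fact), so variances are never overestimated.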
minor comments (3)
- [Abstract] Abstract and introduction: the phrase 'nearly optimal' is used for both the main algorithm and the Nyström variant; a brief sentence clarifying that the polylog gap is the only sub-optimality would improve precision.
- [Problem formulation] The problem setup should explicitly restate the bounded RKHS-norm assumption (||f||_H ≤ B) alongside the known-kernel assumption, as this is load-bearing for the regret analysis even though it is standard.
- [Preliminaries] Notation: the definition of the maximum information gain γ_T should be recalled in the main text (not only in the abstract) before the regret statement, to aid readers unfamiliar with the kernel literature.
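For reference, the standard definition the last comment asks to be restated (as used across the kernel-bandit literature; λ denotes the noise/regularization parameter, a notational assumption rather than a quote from the paper):

```latex
% Maximum information gain of kernel k on domain X after T observations:
\gamma_T \;=\; \max_{A \subseteq \mathcal{X},\, |A| = T}
\tfrac{1}{2} \log\det\!\bigl(\mathbf{I}_T + \lambda^{-1}\mathbf{K}_A\bigr),
\qquad \mathbf{K}_A = \bigl[k(x, x')\bigr]_{x, x' \in A} .
```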
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation of minor revision. The report correctly summarizes our main results on the exponential-weights algorithm achieving Õ(√(T γ_T)) adversarial regret for kernelized bandits, the algorithm-independent lower bounds matching up to polylog factors for squared-exponential and ν-Matérn kernels, and the Nyström-based computationally efficient variant that preserves the leading regret term.
Circularity Check
No significant circularity identified
full rationale
The paper's central result applies the exponential-weights algorithm to an implicit expert set whose cardinality is controlled by the maximum information gain γ_T of a fixed kernel. This γ_T is an external, kernel-dependent quantity defined independently of the algorithm (standard in the RKHS bandit literature) and is not fitted or derived within the paper. The Õ(√(T γ_T)) regret bound follows from the standard analysis of exponential weights once the RKHS ball and bounded-norm assumptions are granted in the problem setup. The algorithm-independent lower bounds for SE and Matérn kernels rely on known, externally established rates for γ_T and do not reduce to any fitted parameter or self-referential definition. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results as new derivations appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: reward functions lie in a known reproducing kernel Hilbert space with bounded maximum information gain γ_T.
Reference graph
Works this paper leans on
- [1] J. D. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.
- [2] C. Allenberg, P. Auer, L. Györfi, and G. Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory, pages 229–243. Springer, 2006.
- [3] P. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008.
- [4] I. Bogunovic, J. Scarlett, S. Jegelka, and V. Cevher. Adversarially robust optimization with Gaussian processes. In Proc. Neural Information Processing Systems (NeurIPS), 2018.
- [5] I. Bogunovic, A. Krause, and J. Scarlett. Corruption-tolerant Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, 2020.
- [6] I. Bogunovic, Z. Li, A. Krause, and J. Scarlett. A robust phased elimination algorithm for corruption-tolerant Gaussian process bandits. In Proc. Neural Information Processing Systems (NeurIPS), 2022.
- [7] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.
- [8]
- [9] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 2011.
- [10]
- [11]
- [12] X. Cai, S. Gomes, and J. Scarlett. Lenient regret and good-action identification in Gaussian process bandits. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2021.
- [13] D. Calandriello and L. Rosasco. Statistical and computational trade-offs in kernel k-means. Advances in Neural Information Processing Systems, 31, 2018.
- [14] D. Calandriello, A. Lazaric, and M. Valko. Distributed adaptive sampling for kernel matrix approximation. In Artificial Intelligence and Statistics, 2017.
- [15] D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, 2019.
- [16] R. Camilleri, K. Jamieson, and J. Katz-Samuels. High-dimensional experimental design and kernel bandits. In Proc. International Conference on Machine Learning (ICML), 2021.
- [17] N. Chatterji, A. Pacchiano, and P. Bartlett. Online learning with kernel losses. In International Conference on Machine Learning, 2019.
- [18] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. International Conference on Machine Learning (ICML), 2017.
- [19] S. R. Chowdhury and A. Gopalan. No-regret algorithms for multi-task Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, 2021.
- [20] Y. Deng, X. Zhou, B. Kim, A. Tewari, A. Gupta, and N. Shroff. Weighted Gaussian process bandits for non-stationary environments. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
- [21] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.
- [22] V. V. Fedorov. Theory of Optimal Experiments. Elsevier, 2013.
- [23]
- [24] E. Hazan and S. Kale. A simple multi-armed bandit algorithm with optimal variation-bounded regret. In Proceedings of the 24th Annual Conference on Learning Theory, pages 817–820. JMLR Workshop and Conference Proceedings, 2011.
- [25] K. Hong, Y. Li, and A. Tewari. An optimization-based algorithm for non-stationary kernel bandits without prior knowledge. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
- [26] S. Iwazaki. Gaussian process upper confidence bound achieves nearly-optimal regret in noise-free Gaussian process bandits. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [27] S. Iwazaki. Improved regret bounds for Gaussian process upper confidence bound in Bayesian optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [28]
- [29] S. Iwazaki and S. Takeno. Improved regret analysis in Gaussian process bandits: Optimality for noiseless reward, RKHS norm, and non-stationary variance. In International Conference on Machine Learning, pages 26642–26672. PMLR, 2025.
- [30] S. Iwazaki and S. Takeno. Near-optimal algorithm for non-stationary kernelized bandits. In International Conference on Artificial Intelligence and Statistics, pages 406–414. PMLR, 2025.
- [31] S. Iwazaki, Y. Inatsu, and I. Takeuchi. Mean-variance analysis in Bayesian optimization under uncertainty. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
- [32] D. Janz. Sequential decision making with feature-linear models. PhD thesis, 2022.
- [33] M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
- [34] J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust Bayesian optimization. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
- [35] A. Krause and C. Ong. Contextual Gaussian process bandit optimization. In Proc. Neural Information Processing Systems (NeurIPS), 2011.
- [36] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- [37] C.-W. Lee, H. Luo, C.-Y. Wei, M. Zhang, and X. Zhang. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning, 2021.
- [38]
- [39] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2007.
- [40]
- [41] C. Musco and C. Musco. Recursive sampling for the Nyström method. Advances in Neural Information Processing Systems, 30, 2017.
- [42] G. Neu, J. Olkhovskaya, and S. Vakili. Adversarial contextual bandits go kernelized. In International Conference on Algorithmic Learning Theory, 2024.
- [43] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Value-at-risk optimization with Gaussian processes. In Proc. International Conference on Machine Learning (ICML), 2021.
- [44] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
- [45] G. Riutort-Mayol, P.-C. Bürkner, M. R. Andersen, A. Solin, and A. Vehtari. Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Statistics and Computing, 33(1):17, 2023.
- [46] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- [47] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
- [48]
- [49]
- [50]
- [51]
- [52] J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Proc. Conference on Learning Theory (COLT), 2017.
- [53] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause. No-regret learning in unknown games with correlated payoffs. Advances in Neural Information Processing Systems, 32, 2019.
- [54] P. G. Sessa, I. Bogunovic, M. Kamgarpour, and A. Krause. Learning to play sequential games versus unknown opponents. Advances in Neural Information Processing Systems, 33:8971–8981, 2020.
- [55] O. Shamir. On the complexity of bandit linear optimization. In Conference on Learning Theory, 2015.
- [56] S. Shekhar and T. Javidi. Multi-scale zero-order optimization of smooth functions in an RKHS. arXiv preprint arXiv:2005.04832, 2020.
- [57] S. Shekhar and T. Javidi. Instance dependent regret analysis of kernelized bandits. In International Conference on Machine Learning, 2022.
- [58]
- [59] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 2020.
- [60] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010.
- [61] S. Takemori and M. Sato. Approximation theory based methods for RKHS bandits. In International Conference on Machine Learning, 2021.
- [62] S. Takeno and S. Iwazaki. On regret bounds of Thompson sampling for Bayesian optimization. arXiv preprint arXiv:2603.09276, 2026.
- [63] S. Takeno, Y. Inatsu, and M. Karasuyama. Randomized Gaussian process upper confidence bound with tighter Bayesian regret bounds. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 33490–33515. PMLR, 2023.
- [64] S. Vakili. Open problem: Regret bounds for noise-free kernel-based bandits. In Proc. Conference on Learning Theory (COLT), 2022.
- [65]
- [66]
- [67]
- [68] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
- [69]
- [70] X. Zhou and N. Shroff. No-regret algorithms for time-varying Bayesian optimization. In 2021 55th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2021.
- [71] J. Zimmert and T. Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory, 2022.
- [72] M. Zuluaga, A. Krause, et al. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 2016.