pith. sign in

arxiv: 2606.23414 · v1 · pith:CSIULQZFnew · submitted 2026-06-22 · 💻 cs.LG

Leveraging Similarities in Multi-Armed Bandits

Pith reviewed 2026-06-26 08:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-armed banditssimilarity structuretree-compatible lossesregret boundsLipschitz banditsfeedback modelseffective action countbest-of-both-worlds
0
0 comments X

The pith

Tree-structured action similarities allow bandit algorithms to replace the full action count K by an effective count K_eff in regret bounds, but only when feedback is richer than one-point.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard one-point bandit feedback cannot exploit similarities between actions even when those similarities are encoded by a tree and losses are forced to be close for related actions. It then supplies a family of algorithms that handle a spectrum of feedback models from semi-bandit down to two-point bandit feedback and obtain regret bounds governed by an effective action count K_eff instead of the raw number of leaves K. These bounds hold in both stochastic and adversarial regimes. The same machinery yields square-root regret for Lipschitz bandits in dimension at most two when only two-point feedback is available.

Core claim

For loss sequences that are tree-compatible, meaning losses of actions whose lowest common ancestor is at a given tree level differ by at most a factor tied to that level, algorithms exist that achieve regret scaling with an effective number of actions K_eff induced by the tree rather than the total number of leaves K, provided the feedback model supplies at least two loss evaluations per round; the same algorithms are impossible to obtain under ordinary one-point feedback.

What carries the argument

A rooted tree on the action set whose internal nodes define similarity levels, together with the derived effective action count K_eff that counts how many distinct loss values are distinguishable under the tree compatibility constraint.

If this is right

  • Regret bounds replace the raw action count K by the smaller effective count K_eff across semi-bandit, multi-point, and two-point feedback models.
  • The algorithms simultaneously achieve the best known rates for both stochastic and adversarial environments.
  • Square-root regret becomes attainable for Lipschitz bandits on the unit ball when dimension is at most two and two-point feedback is supplied.
  • The same framework covers hierarchical action sets and tagged or latent-trait similarities encoded as trees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Two-point feedback may be sufficient to obtain near-optimal rates in other structured continuous problems whose similarity can be approximated by a tree.
  • The impossibility for one-point feedback suggests that any practical deployment of similarity-aware bandits will require at least occasional direct loss comparisons between nearby actions.
  • K_eff could serve as a design criterion for constructing or pruning action trees in applications where the number of leaves is large but the effective diversity is small.

Load-bearing premise

The loss sequence must be tree-compatible so that losses of similar actions stay close.

What would settle it

An explicit loss sequence that is tree-compatible yet forces any algorithm using only one-point feedback to suffer regret linear in the full number of actions K rather than in K_eff.

read the original abstract

In many online learning and bandit problems, the actions we consider possess inherent similarities--for instance because they share latent traits, tags, or hierarchical structure. We study online learning with a similarity-structured action set, encoded by a rooted tree whose leaves are the actions and whose levels quantify how closely two actions are related. The loss sequence is assumed tree-compatible: losses of similar actions are constrained to be close. We establish an impossibility result showing that usual one-point bandit feedback cannot, in general, leverage range or tree-induced similarity, even under very strong similarity constraints. We then provide a unified set of algorithms which adapt to a wide range of richer feedback models, from semi-bandit feedback down to multi-point bandit protocols, including the minimal two-point feedback setting. We show these algorithms exhibit best-of-both-worlds guarantees and provably exploit action similarities by replacing the number of actions $K$ by a similarity-aware effective number of actions $K_{\mathrm{eff}}$ in the regret bounds. As an application, we show that under two-point feedback, it is possible to achieve $\sqrt{T}$ regret in Lipschitz bandits when $d \leq 2$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper studies multi-armed bandits where actions form a similarity structure encoded by a rooted tree (leaves are actions, levels quantify relatedness). Losses are assumed tree-compatible (similar actions have close losses). It proves an impossibility result that standard one-point bandit feedback cannot leverage such similarities even under strong constraints. It then gives a unified family of algorithms for richer feedback models (semi-bandit down to minimal two-point feedback) that achieve best-of-both-worlds regret bounds by replacing the action count K with a similarity-aware effective count K_eff. As an application, two-point feedback yields √T regret for Lipschitz bandits when d ≤ 2.

Significance. If the proofs and definitions of K_eff hold, the work supplies a clean way to quantify and exploit tree-induced similarities across feedback regimes, with the two-point Lipschitz application providing a concrete improvement over standard one-point bounds in low dimensions. The impossibility result for one-point feedback and the best-of-both-worlds guarantees are useful complementary contributions.

minor comments (2)
  1. The abstract introduces K_eff as the quantity that replaces K in the regret bounds but does not indicate how K_eff is computed from the tree or the loss-compatibility assumption; a brief parenthetical definition or reference to the main theorem would improve readability.
  2. The tree-compatibility assumption on the loss sequence is central to both the impossibility and the positive results; a short illustrative example of a tree-compatible vs. incompatible loss sequence would help readers assess the modeling restriction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting the potential significance of the impossibility result, the unified algorithms, and the Lipschitz application. We address the points raised below. Since the report lists no enumerated major comments after 'MAJOR COMMENTS:', we respond to the overall assessment and note that we stand ready to supply additional proof details or clarifications if the editor or referee requests them.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The derivation relies on explicit modeling assumptions (rooted tree encoding action similarities, tree-compatible loss sequences) that are stated upfront and used to define K_eff as a similarity-aware effective action count; regret bounds are then derived by substituting this quantity into standard analyses for richer feedback models. The impossibility result for one-point feedback is shown separately as a complementary negative result, and the Lipschitz-bandit application for d ≤ 2 follows directly from instantiating the general two-point-feedback algorithm without reducing any bound to a fitted parameter or self-referential definition. No load-bearing step collapses to a self-citation chain, ansatz smuggled via prior work, or renaming of a known empirical pattern; the central claims remain independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the tree encoding of similarities and the tree-compatibility assumption on losses; K_eff is derived from these rather than fitted.

axioms (2)
  • domain assumption Action set encoded by a rooted tree whose leaves are actions and levels quantify closeness.
    Stated in the problem setup to define similarity structure.
  • domain assumption Loss sequence is tree-compatible: losses of similar actions are constrained to be close.
    Invoked to enable leveraging similarities and to define K_eff.

pith-pipeline@v0.9.1-grok · 5744 in / 1353 out tokens · 19421 ms · 2026-06-26T08:39:12.833593+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 2 linked inside Pith

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, D \'a vid P \'a l, and Csaba Szepesv \'a ri. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

  2. [2]

    Optimal algorithms for online convex optimization with multi-point bandit feedback

    Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt, pages 28--40, 2010

  3. [3]

    Thompson sampling for contextual bandits with linear payoffs

    Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127--135. PMLR, 2013

  4. [4]

    Mixed-effect thompson sampling

    Imad Aouali, Branislav Kveton, and Sumeet Katariya. Mixed-effect thompson sampling. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2087--2115. PMLR, 2023

  5. [5]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002

  6. [6]

    Convex optimization theory, volume 1

    Dimitri Bertsekas. Convex optimization theory, volume 1. Athena Scientific, 2009

  7. [7]

    Sébastien Bubeck, Nicoló Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 41.1--41.14. PMLR, 2012

  8. [8]

    Algorithmic chaining and the role of partial feedback in online nonparametric learning

    Nicol \`o Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and S \'e bastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, pages 465--481. PMLR, 2017

  9. [9]

    Combinatorial bandits revisited

    Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015

  10. [10]

    Minimal exploration in structured stochastic bandits

    Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  11. [11]

    A blackbox approach to best of both worlds in bandits and beyond

    Chris Dann, Chen - Yu Wei, and Julian Zimmert. A blackbox approach to best of both worlds in bandits and beyond. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5503--5570. PMLR, 12--15 Jul 2023

  12. [12]

    Bandits with side observations: Bounded vs

    R \'e my Degenne, Evrard Garcelon, and Vianney Perchet. Bandits with side observations: Bounded vs. logarithmic regret. In Conference on Uncertainty in Artificial Intelligence, 2018

  13. [13]

    Optimal rates for zero-order convex optimization: The power of two function evaluations

    John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015

  14. [14]

    Online convex optimization in the bandit setting: gradient descent without a gradient

    Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004

  15. [15]

    Refined lower bounds for adversarial bandits

    S \'e bastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. Advances in Neural Information Processing Systems, 29, 2016

  16. [16]

    Zeroth-order non-convex learning via hierarchical dual averaging

    Am \'e lie H \'e liou, Matthieu Martin, Panayotis Mertikopoulos, and Thibaud Rahier. Zeroth-order non-convex learning via hierarchical dual averaging. In International Conference on Machine Learning, pages 4192--4202. PMLR, 2021

  17. [17]

    Deep hierarchy in bandits

    Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, and Mohammad Ghavamzadeh. Deep hierarchy in bandits. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8833--8851. PMLR, 2022 a

  18. [18]

    Hierarchical bayesian bandits

    Joey Hong, Branislav Kveton, Manzil Zaheer, and Mohammad Ghavamzadeh. Hierarchical bayesian bandits. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 7724--7741. PMLR, 2022 b

  19. [19]

    Tight first-and second-order regret bounds for adversarial linear bandits

    Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33: 0 2028--2038, 2020

  20. [20]

    Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs

    Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems, volume 35, pages 28631--28643. Curran Associates, Inc., 2022

  21. [21]

    David S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9 0 (3): 0 256--278, 1974. ISSN 0022-0000

  22. [22]

    Robust lipschitz bandits to adversarial corruptions

    Yue Kang, Cho-Jui Hsieh, and Thomas Chun Man Lee. Robust lipschitz bandits to adversarial corruptions. Advances in Neural Information Processing Systems, 36: 0 10897--10908, 2023

  23. [23]

    Nearly tight bounds for the continuum-armed bandit problem

    Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, volume 17. MIT Press, 2004

  24. [24]

    Multi-armed bandits in metric spaces

    Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681--690, 2008

  25. [25]

    Bandits and experts in metric spaces

    Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66 0 (4): 0 1--77, 2019

  26. [26]

    Bandit algorithms

    Tor Lattimore and Csaba Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020

  27. [27]

    Adaptive regret for bandits made possible: Two queries suffice

    Zhou Lu, Qiuyi Zhang, Xinyi Chen, Fred Zhang, David Woodruff, and Elad Hazan. Adaptive regret for bandits made possible: Two queries suffice. arXiv preprint arXiv:2401.09278, 2024

  28. [28]

    Nested bandits

    Matthieu Martin, Panayotis Mertikopoulos, Thibaud Rahier, and Houssam Zenati. Nested bandits. In International Conference on Machine Learning, pages 15093--15121. PMLR, 2022

  29. [29]

    Prior-dependent allocations for bayesian fixed-budget best-arm identification in structured bandits

    Nicolas Nguyen, Imad Aouali, Andr \'a s Gy \"o rgy, and Claire Vernade. Prior-dependent allocations for bayesian fixed-budget best-arm identification in structured bandits. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 of Proceedings of Machine Learning Research, pages 379--387. PMLR, 2025

  30. [30]

    A modern introduction to online learning

    Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2025

  31. [31]

    Adaptive discretization for adversarial lipschitz bandits

    Chara Podimata and Alex Slivkins. Adaptive discretization for adversarial lipschitz bandits. In Conference on Learning Theory, pages 3788--3805. PMLR, 2021

  32. [32]

    Tyrrell Rockafellar

    R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, 1970

  33. [33]

    Exploiting easy data in online optimization

    Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014

  34. [34]

    An optimal algorithm for bandit and zero-order convex optimization with two-point feedback

    Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017

  35. [35]

    A tight analysis of the greedy algorithm for set cover

    Petr Slav\' k. A tight analysis of the greedy algorithm for set cover. Journal of Algorithms, 25 0 (2): 0 237--254, 1997. ISSN 0196-6774

  36. [36]

    Multi-armed bandits on implicit metric spaces

    Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. Advances in Neural Information Processing Systems, 24, 2011

  37. [37]

    Adaptation to easy data in prediction with limited advice

    Tobias Sommer Thune and Yevgeny Seldin. Adaptation to easy data in prediction with limited advice. Advances in Neural Information Processing Systems, 31, 2018

  38. [38]

    Efficient learning in large-scale combinatorial semi-bandits

    Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pages 1113--1122. PMLR, 2015

  39. [39]

    Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits

    Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22 0 (28): 0 1--49, 2021