Lexicographic Multiarmed Bandit

Alihan H\"uy\"uk; Cem Tekin

arxiv: 1907.11605 · v2 · pith:GRLKBHXKnew · submitted 2019-07-26 · 💻 cs.LG · stat.ML

Lexicographic Multiarmed Bandit

Alihan H\"uy\"uk , Cem Tekin This is my paper

Pith reviewed 2026-05-24 15:44 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords multiarmed banditlexicographic optimizationmultiobjective learningregret boundsbounded regretsatisficing objectives

0 comments

The pith

With accurate priors on lexicographic optimal rewards, multiarmed bandit learners achieve uniformly bounded expected regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies multiobjective multiarmed bandit problems in which objectives have a strict lexicographic ordering. The learner must pick arms to minimize a multidimensional regret that tracks how often it fails to select arms optimal under this ordering. When the learner is given the exact or near-exact expected rewards of the lexicographic optima for each objective, the authors show that simple algorithms keep total expected regret from growing with time. The same algorithms also deliver bounded regret for the related problem of satisficing objectives. In the absence of any such priors only sublinear regret is guaranteed.

Core claim

In the lexicographic multiarmed bandit problem, when the learner is provided with the lexicographic optimal expected reward for each objective, an algorithm exists that achieves expected regret uniformly bounded in time; the same holds when given near-optimal values, and the algorithm extends to bounded regret for satisficing objectives.

What carries the argument

The multidimensional regret that sums losses across the lexicographic order of objectives, together with selection rules that exploit the supplied prior values to avoid accumulating further loss.

If this is right

Expected regret stays bounded no matter how many times the learner plays.
The same algorithms attain bounded regret when objectives are satisficing rather than lexicographic.
Sublinear gap-free regret remains possible even when no prior information is supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bounded-regret result may allow stable long-horizon performance in sequential decisions that involve ranked criteria such as safety before efficiency.
Similar use of known target values could be tested in other ordered multiobjective sequential problems to see whether constant regret appears there as well.
Empirical checks on whether the boundedness survives modest misspecification of the supplied priors would clarify the practical margin of the result.

Load-bearing premise

The learner receives accurate prior information on the lexicographic optimal or near-optimal expected rewards for each objective.

What would settle it

If the observed cumulative regret grows without bound as the number of rounds increases, despite using the provided priors on optimal rewards.

Figures

Figures reproduced from arXiv: 1907.11605 by Alihan H\"uy\"uk, Cem Tekin.

**Figure 2.** Figure 2: Regret of PF-LEX 3 in Setting 1. report the average regret of the learner over 5 runs ( [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

read the original abstract

We consider a multiobjective multiarmed bandit problem with lexicographically ordered objectives. In this problem, the goal of the learner is to select arms that are lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multidimensional form of regret that measures the loss of the learner due to not selecting lexicographic optimal arms, and then, consider two settings where the learner has prior information on the expected arm rewards. In the first setting, the learner only knows for each objective the lexicographic optimal expected reward. In the second setting, it only knows for each objective near-lexicographic optimal expected rewards. For both settings we prove that the learner achieves expected regret uniformly bounded in time. The algorithm we propose for the second setting also attains bounded regret for the multiarmed bandit with satisficing objectives. In addition, we also consider the harder prior-free case, and show that the learner can still achieve sublinear in time gap-free regret. Finally, we experimentally evaluate performance of the proposed algorithms in a variety of multiobjective learning problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a lex-ordered multidimensional regret and proves uniform bounded regret when the learner is handed the optimal expected rewards per objective, but the priors are strong and the prior-free case is standard sublinear.

read the letter

The main contribution is a regret notion that respects strict priority ordering across objectives instead of scalarizing or using Pareto fronts. With accurate prior knowledge of the lex-optimal means (first setting) or near-optimal means (second setting), the algorithms achieve expected regret that stays bounded over time rather than growing. The same algorithm for the near-optimal case also works for satisficing objectives. In the no-prior case they fall back to sublinear gap-free regret, which is unsurprising but still stated cleanly. The formulation is distinct from ordinary multiobjective work and the bounded-regret claims are conditioned exactly on those priors, so there is no internal contradiction. The main limitation is that the priors are quite informative; knowing the optimal expected reward for each objective already tells you a lot about which arms matter. Without that information the result reverts to the usual sublinear bound. Experiments are mentioned across several multiobjective problems, but the abstract gives no numbers or baselines, so it is hard to judge how large the practical gain is. The work is narrow but internally consistent. It is worth sending to referees who already work on structured bandits or applications with hard priority constraints; a generalist reader will find the prior requirement limits the scope. I would send it for review rather than desk-reject.

Referee Report

0 major / 2 minor

Summary. The manuscript addresses the lexicographic multi-armed bandit problem with lexicographically ordered objectives. It defines a multidimensional regret measuring deviation from lexicographic optimal arms and studies two settings with prior information: one where the learner knows the exact lexicographic optimal expected reward per objective, and one with near-optimal values. For both, it proves uniformly bounded expected regret. The algorithm for the near-optimal case also attains bounded regret for satisficing objectives. In the prior-free case, sublinear gap-free regret is shown. Experimental evaluations on multiobjective problems are included.

Significance. If the proofs are correct, the result is significant: uniformly bounded (constant) regret is a stronger guarantee than typical sublinear bounds in bandits and is enabled by the accurate prior on per-objective optimal expectations. The byproduct result for satisficing objectives broadens applicability. The prior-free gap-free result is standard but relevant, and experiments provide supporting evidence.

minor comments (2)

[Abstract] The abstract states that proofs exist for the bounded-regret claims but does not list key assumptions or provide sketches; the full paper should ensure these are explicitly stated and cross-referenced in the main body (e.g., near the theorem statements).
[Experiments] In the experimental section, specify the number of independent runs, error bars or variance measures, and any statistical tests used to support performance claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on lexicographic multi-armed bandits and the recommendation for minor revision. No specific major comments appear in the report, so we have no point-by-point responses to provide at this time. We remain available to address any minor suggestions or clarifications during the revision process.

Circularity Check

0 steps flagged

No significant circularity; results conditioned on explicit external priors

full rationale

The paper states two main regret results explicitly conditioned on the learner receiving accurate prior information consisting of the (near-)lexicographic optimal expected rewards per objective. The uniform boundedness claim is distinguished from the prior-free case (sublinear gap-free regret only). No equations, derivations, or self-citations are presented that reduce a claimed prediction or uniqueness result to a fitted input or prior work by the same authors. The derivation chain is therefore self-contained against the stated assumptions rather than internally circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the problem setup presupposes the existence of lexicographic optima but introduces no explicit fitted constants or new entities.

axioms (1)

domain assumption There exist arms whose expected rewards define lexicographic optimality for each objective.
Required for the regret to be well-defined and for the learner's goal to be attainable.

pith-pipeline@v0.9.0 · 5713 in / 1208 out tokens · 26663 ms · 2026-05-24T15:44:49.981965+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Solving the apparent diversity-accuracy dilemma of recommender systems,

T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y .-C. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” in Proc. National Academy of Sciences, vol. 107, no. 10, 2010, pp. 4511–4515

work page 2010
[2]

Lessons on applying automated recommender systems to information-seeking tasks,

J. A. Konstan, S. M. McNee, C.-N. Ziegler, R. Torres, N. Kapoor, and J. Riedl, “Lessons on applying automated recommender systems to information-seeking tasks,” in Proc. AAAI Conference on Artiﬁcial Intelligence, vol. 6, 2006, pp. 1630–1633

work page 2006
[3]

Lexicographically optimal routing for wireless sensor networks with multiple sinks,

V . Shah-Mansouri, A.-H. Mohsenian-Rad, and V . W. Wong, “Lexicographically optimal routing for wireless sensor networks with multiple sinks,” IEEE Transactions on Vehicular Technology, vol. 58, no. 3, pp. 1490–1500, 2009

work page 2009
[4]

Intensity-modulated radiation therapy for the treatment of anal cancer,

A. Lin and E. Ben-Josef, “Intensity-modulated radiation therapy for the treatment of anal cancer,” Clinical Colorectal Cancer, vol. 6, no. 10, pp. 716–719, 2007

work page 2007
[5]

Asymptotically efﬁcient adaptive allocation rules,

T. L. Lai and H. Robbins, “Asymptotically efﬁcient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985

work page 1985
[6]

Ehrgott, Multicriteria Optimization

M. Ehrgott, Multicriteria Optimization. Springer Science & Business Media, 2005, vol. 491

work page 2005
[7]

Lattimore and C

T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge University Press (preprint), 2019

work page 2019
[8]

Bounded regret in stochastic multi-armed bandits,

S. Bubeck, V . Perchet, and P. Rigollet, “Bounded regret in stochastic multi-armed bandits,” in Proc. Conference on Learning Theory, 2013, pp. 122–134

work page 2013
[9]

Achieving complete learning in multi-armed bandit problems,

S. Vakili and Q. Zhao, “Achieving complete learning in multi-armed bandit problems,” in Proc. 2013 Asilomar Conference on Signals, Systems and Computers, 2013, pp. 1778–1782

work page 2013
[10]

Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,

Y . Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Transactions on Networking, vol. 20, no. 5, pp. 1466–1478, 2012

work page 2012
[11]

Optimal sequential sampling from two populations

T. Lai and H. Robbins, “Optimal sequential sampling from two populations.” Proceedings of the National Academy of Sciences of the United States of America, vol. 81, no. 4, pp. 1284–1286, 1984

work page 1984
[12]

Explore ﬁrst, exploit next: The true shape of regret in bandit problems,

A. Garivier, P. Ménard, and G. Stoltz, “Explore ﬁrst, exploit next: The true shape of regret in bandit problems,” Mathematics of Operations Research, 2018

work page 2018
[13]

Prior-free and prior-dependent regret bounds for Thompson sampling,

S. Bubeck and C.-Y . Liu, “Prior-free and prior-dependent regret bounds for Thompson sampling,” in Proc. Advances in Neural Information Processing Systems, 2013, pp. 638–646

work page 2013
[14]

A structured multiarmed bandit problem and the greedy policy,

A. J. Mersereau, P. Rusmevichientong, and J. N. Tsitsiklis, “A structured multiarmed bandit problem and the greedy policy,” IEEE Transactions on Automatic Control, vol. 54, no. 12, pp. 2787–2802, 2009

work page 2009
[15]

Bounded regret for ﬁnite-armed structured bandits,

T. Lattimore and R. Munos, “Bounded regret for ﬁnite-armed structured bandits,” in Proc. Advances in Neural Information Processing Systems, 2014, pp. 550–558

work page 2014
[16]

Designing multi-objective multi-armed bandits algorithms: A study,

M. M. Drugan and A. Nowe, “Designing multi-objective multi-armed bandits algorithms: A study,” in Proc. 2013 International Joint Conference on Neural Networks, 2013, pp. 1–8

work page 2013
[17]

Multi-objective contextual bandit problem with similarity information,

E. Turgay, D. Oner, and C. Tekin, “Multi-objective contextual bandit problem with similarity information,” in Proc. 21st International Conference on Artiﬁcial Intelligence and Statistics, 2018, pp. 1673–1681

work page 2018
[18]

Contextual bandits with similarity information,

A. Slivkins, “Contextual bandits with similarity information,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533–2568, 2014

work page 2014
[19]

Multi-objective contextual multi-armed bandit with a dominant objective,

C. Tekin and E. Turgay, “Multi-objective contextual multi-armed bandit with a dominant objective,” IEEE Transactions on Signal Processing, vol. 66, no. 14, pp. 3799–3813, 2018. 9

work page 2018
[20]

An optimal algorithm for the thresholding bandit problem,

A. Locatelli, M. Gutzeit, and A. Carpentier, “An optimal algorithm for the thresholding bandit problem,” in Proc. 33rd International Conference on Machine Learning, 2016, pp. 1690–1698

work page 2016
[21]

Satisﬁcing in multi-armed bandit problems,

P. Reverdy, V . Srivastava, and N. E. Leonard, “Satisﬁcing in multi-armed bandit problems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3788–3803, 2017

work page 2017
[22]

Fairness in learning: Classic and contextual bandits,

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learning: Classic and contextual bandits,” in Proc. Advances in Neural Information Processing Systems, 2016, pp. 325–333

work page 2016
[23]

Improved algorithms for linear stochastic bandits,

Y . Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Proc. Advances in Neural Information Processing Systems, 2011, pp. 2312–2320

work page 2011
[24]

Active learning in heteroscedastic noise,

A. Antos, V . Grover, and C. Szepesvári, “Active learning in heteroscedastic noise,”Theoretical Computer Science, vol. 411, no. 29-30, pp. 2712–2728, 2010. 10 A Appendix A.1 Additional Theorems Theorem 4. When OM-LEX is run for Case 1, ∀i ∈ D and ∀T ≥ 1, we have E[Regi pf(T )] ≤ ∑ a∈S i ((π2 3 D + 1 ) ∆i a + 36 ∇maxa log 17 ∇maxa ) . Proof. Note that E[Re...

work page 2010

[1] [1]

Solving the apparent diversity-accuracy dilemma of recommender systems,

T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y .-C. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” in Proc. National Academy of Sciences, vol. 107, no. 10, 2010, pp. 4511–4515

work page 2010

[2] [2]

Lessons on applying automated recommender systems to information-seeking tasks,

J. A. Konstan, S. M. McNee, C.-N. Ziegler, R. Torres, N. Kapoor, and J. Riedl, “Lessons on applying automated recommender systems to information-seeking tasks,” in Proc. AAAI Conference on Artiﬁcial Intelligence, vol. 6, 2006, pp. 1630–1633

work page 2006

[3] [3]

Lexicographically optimal routing for wireless sensor networks with multiple sinks,

V . Shah-Mansouri, A.-H. Mohsenian-Rad, and V . W. Wong, “Lexicographically optimal routing for wireless sensor networks with multiple sinks,” IEEE Transactions on Vehicular Technology, vol. 58, no. 3, pp. 1490–1500, 2009

work page 2009

[4] [4]

Intensity-modulated radiation therapy for the treatment of anal cancer,

A. Lin and E. Ben-Josef, “Intensity-modulated radiation therapy for the treatment of anal cancer,” Clinical Colorectal Cancer, vol. 6, no. 10, pp. 716–719, 2007

work page 2007

[5] [5]

Asymptotically efﬁcient adaptive allocation rules,

T. L. Lai and H. Robbins, “Asymptotically efﬁcient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985

work page 1985

[6] [6]

Ehrgott, Multicriteria Optimization

M. Ehrgott, Multicriteria Optimization. Springer Science & Business Media, 2005, vol. 491

work page 2005

[7] [7]

Lattimore and C

T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge University Press (preprint), 2019

work page 2019

[8] [8]

Bounded regret in stochastic multi-armed bandits,

S. Bubeck, V . Perchet, and P. Rigollet, “Bounded regret in stochastic multi-armed bandits,” in Proc. Conference on Learning Theory, 2013, pp. 122–134

work page 2013

[9] [9]

Achieving complete learning in multi-armed bandit problems,

S. Vakili and Q. Zhao, “Achieving complete learning in multi-armed bandit problems,” in Proc. 2013 Asilomar Conference on Signals, Systems and Computers, 2013, pp. 1778–1782

work page 2013

[10] [10]

Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,

Y . Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Transactions on Networking, vol. 20, no. 5, pp. 1466–1478, 2012

work page 2012

[11] [11]

Optimal sequential sampling from two populations

T. Lai and H. Robbins, “Optimal sequential sampling from two populations.” Proceedings of the National Academy of Sciences of the United States of America, vol. 81, no. 4, pp. 1284–1286, 1984

work page 1984

[12] [12]

Explore ﬁrst, exploit next: The true shape of regret in bandit problems,

A. Garivier, P. Ménard, and G. Stoltz, “Explore ﬁrst, exploit next: The true shape of regret in bandit problems,” Mathematics of Operations Research, 2018

work page 2018

[13] [13]

Prior-free and prior-dependent regret bounds for Thompson sampling,

S. Bubeck and C.-Y . Liu, “Prior-free and prior-dependent regret bounds for Thompson sampling,” in Proc. Advances in Neural Information Processing Systems, 2013, pp. 638–646

work page 2013

[14] [14]

A structured multiarmed bandit problem and the greedy policy,

A. J. Mersereau, P. Rusmevichientong, and J. N. Tsitsiklis, “A structured multiarmed bandit problem and the greedy policy,” IEEE Transactions on Automatic Control, vol. 54, no. 12, pp. 2787–2802, 2009

work page 2009

[15] [15]

Bounded regret for ﬁnite-armed structured bandits,

T. Lattimore and R. Munos, “Bounded regret for ﬁnite-armed structured bandits,” in Proc. Advances in Neural Information Processing Systems, 2014, pp. 550–558

work page 2014

[16] [16]

Designing multi-objective multi-armed bandits algorithms: A study,

M. M. Drugan and A. Nowe, “Designing multi-objective multi-armed bandits algorithms: A study,” in Proc. 2013 International Joint Conference on Neural Networks, 2013, pp. 1–8

work page 2013

[17] [17]

Multi-objective contextual bandit problem with similarity information,

E. Turgay, D. Oner, and C. Tekin, “Multi-objective contextual bandit problem with similarity information,” in Proc. 21st International Conference on Artiﬁcial Intelligence and Statistics, 2018, pp. 1673–1681

work page 2018

[18] [18]

Contextual bandits with similarity information,

A. Slivkins, “Contextual bandits with similarity information,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533–2568, 2014

work page 2014

[19] [19]

Multi-objective contextual multi-armed bandit with a dominant objective,

C. Tekin and E. Turgay, “Multi-objective contextual multi-armed bandit with a dominant objective,” IEEE Transactions on Signal Processing, vol. 66, no. 14, pp. 3799–3813, 2018. 9

work page 2018

[20] [20]

An optimal algorithm for the thresholding bandit problem,

A. Locatelli, M. Gutzeit, and A. Carpentier, “An optimal algorithm for the thresholding bandit problem,” in Proc. 33rd International Conference on Machine Learning, 2016, pp. 1690–1698

work page 2016

[21] [21]

Satisﬁcing in multi-armed bandit problems,

P. Reverdy, V . Srivastava, and N. E. Leonard, “Satisﬁcing in multi-armed bandit problems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3788–3803, 2017

work page 2017

[22] [22]

Fairness in learning: Classic and contextual bandits,

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learning: Classic and contextual bandits,” in Proc. Advances in Neural Information Processing Systems, 2016, pp. 325–333

work page 2016

[23] [23]

Improved algorithms for linear stochastic bandits,

Y . Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Proc. Advances in Neural Information Processing Systems, 2011, pp. 2312–2320

work page 2011

[24] [24]

Active learning in heteroscedastic noise,

A. Antos, V . Grover, and C. Szepesvári, “Active learning in heteroscedastic noise,”Theoretical Computer Science, vol. 411, no. 29-30, pp. 2712–2728, 2010. 10 A Appendix A.1 Additional Theorems Theorem 4. When OM-LEX is run for Case 1, ∀i ∈ D and ∀T ≥ 1, we have E[Regi pf(T )] ≤ ∑ a∈S i ((π2 3 D + 1 ) ∆i a + 36 ∇maxa log 17 ∇maxa ) . Proof. Note that E[Re...

work page 2010