Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

Dabeen Lee; Seoungbin Bae

arxiv: 2606.09668 · v1 · pith:CML7HCSMnew · submitted 2026-06-08 · 💻 cs.LG

Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

Seoungbin Bae , Dabeen Lee This is my paper

Pith reviewed 2026-06-27 17:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual queueing banditsqueue length regretupper confidence boundexploration cutoffminimax lower boundnegative driftstochastic schedulingthree-phase algorithm

0 comments

The pith

Contextual queueing bandits achieve queue length regret of order T to the minus one half by limiting random exploration to a cutoff round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the queue length regret in contextual queueing bandits, which measures how much longer the learner's queues are than an oracle's at time T, can be improved from order T to the minus one fourth to order T to the minus one half. It does this by running pure random exploration only until a chosen cutoff round, then mixing limited random exploration with upper-confidence-bound decisions to keep the queue drifting negative, and finally switching to pure upper-confidence-bound decisions. The proof splits the regret at the cutoff: negative drift before that point keeps queue differences small even when early choices are suboptimal, while the samples collected by the cutoff make later decisions have only small gaps in service rates. The same rate appears as a matching lower bound, so the dependence on the horizon is settled up to logs.

Core claim

The central claim is that a three-phase algorithm called CQB-η-2 attains ilde O(T^{-1/2}) queue length regret. Phase one collects an initial estimator by pure random exploration. Phase two continues with η-random exploration plus a UCB rule so that the queue maintains negative drift. Phase three switches to pure UCB once the cutoff is reached. The analysis decomposes the regret at the cutoff: before it, negative drift suppresses the effect of suboptimal choices on queue length; after it, the accumulated samples ensure that UCB incurs only small departure-rate gaps. A matching minimax lower bound of Ω(T^{-1/2}) is shown by constructing two statistically indistinguishable hard instances and c

What carries the argument

The cutoff round that ends the mixed exploration phase, after which pure UCB is used; negative drift before the cutoff and sufficient random samples after it together bound the queue length regret.

If this is right

The minimax rate for queue length regret in contextual queueing bandits is ilde Theta(T^{-1/2}).
Negative drift during the second phase keeps early suboptimal decisions from inflating queue lengths.
After the cutoff the collected samples make every UCB choice incur only a vanishing gap in expected departure rate.
The lower-bound construction uses a queue-specific coupling to turn statistical indistinguishability directly into linear queue-length cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cutoff idea could be adapted to other online scheduling problems where stability must be maintained while learning unknown rates.
An adaptive rule for picking the cutoff without knowing problem parameters might be tested in simulations to check practical robustness.
The coupling technique for turning testing error into queue regret may apply to other performance measures that accumulate over time in stochastic systems.

Load-bearing premise

A carefully chosen cutoff round exists such that samples collected before it are enough for the later UCB phase to produce only small service-rate gaps while negative drift before the cutoff keeps queue length differences small.

What would settle it

A pair of service-rate instances that remain hard to distinguish until the final service decision and produce queue length regret larger than order T to the one half under any algorithm.

Figures

Figures reproduced from arXiv: 2606.09668 by Dabeen Lee, Seoungbin Bae.

read the original abstract

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They tighten contextual queueing bandit regret to ilde O(T^{-1/2}) with a three-phase cutoff algorithm and matching lower bound, but the cutoff selection rule is the part that needs checking.

read the letter

The main point is that this paper improves the queue-length regret rate from the prior ilde O(T^{-1/4}) to ilde O(T^{-1/2}) and supplies a matching ilde \Omega(T^{-1/2}) lower bound. The algorithm, CQB-η-2, runs pure random exploration first, then mixes η-random exploration with UCB, then switches to pure UCB after a cutoff. The analysis splits the regret at that cutoff: negative drift from service rates is supposed to keep queue differences small before it, while the accumulated samples are supposed to make post-cutoff UCB gaps negligible.

That rate improvement and the explicit three-phase structure with regret decomposition are the concrete advances. The lower-bound argument via two hard instances and a queue-specific coupling is also a standard but cleanly executed way to get the matching rate.

The soft spot is the cutoff itself. The abstract describes it as "carefully chosen" so that pre-cutoff drift dominates and post-cutoff samples suffice, yet gives no indication of a parameter-free rule for picking it from data alone. If the choice depends on the unknown gap Δ or the context distribution, the algorithm is not fully online in the usual sense and the claimed rate may only hold after instance-specific tuning. That is the load-bearing assumption flagged in the stress test, and the abstract does not resolve it.

The work is aimed at people who care about tight rates in online scheduling and resource allocation. A reader already following queueing bandits will see the value in the improved exponent and the proof split. It is coherent on its own terms and shows clear engagement with the literature, so it deserves a serious referee rather than a desk reject. I would send it out, but I would ask the authors to spell out exactly how the cutoff is computed without prior knowledge of instance parameters.

Referee Report

2 major / 2 minor

Summary. The paper presents CQB-η-2, a three-phase algorithm for contextual queueing bandits under stochastic contexts: (i) pure random exploration, (ii) η-random exploration combined with UCB to maintain negative drift, and (iii) pure UCB after a cutoff round τ. It claims this yields Õ(T^{-1/2}) queue-length regret (improving prior Õ(T^{-1/4})), by decomposing regret at τ (negative drift suppresses differences pre-τ; sufficient samples ensure small UCB gaps post-τ). A matching minimax lower bound Ω(T^{-1/2}) is proved via two hard instances and a queue-specific coupling argument.

Significance. If the claims hold, the work tightly characterizes the minimax rate (up to logs) for queue-length regret in this model, which is a notable advance over existing Õ(T^{-1/4}) results. The cutoff-based limitation of random exploration is a conceptually clean idea that could apply more broadly; the matching upper/lower bounds are a strength.

major comments (2)

[Abstract (algorithm description and proof outline)] Abstract (algorithm and proof sketch): the upper-bound argument is load-bearing on the existence of a cutoff τ such that (1) pre-τ negative drift from service rates dominates any queue growth induced by suboptimal η-random + UCB choices and (2) the random-exploration samples collected by τ suffice for the post-τ UCB phase to incur only o(1) per-step departure-rate gaps. The description gives no explicit, parameter-free rule for selecting τ (e.g., without knowledge of the minimal gap Δ or context distribution); if τ must be tuned to instance parameters the algorithm is not fully online, and a fixed schedule such as T^{2/3} may fail to suppress early queue excursions on some instances. This directly affects whether the claimed Õ(T^{-1/2}) rate is achieved by a practical online procedure.
[Lower-bound proof] Lower-bound section (hard-instance construction): the Ω(T^{-1/2}) claim rests on constructing two statistically indistinguishable instances up to the final service decision and converting the testing error into queue-length regret via a queue-specific coupling. Without the explicit instance definitions, the precise form of the coupling, and verification that the resulting regret is indeed Ω(T^{-1/2}) rather than a weaker rate, it is difficult to confirm the lower bound is tight and free of hidden assumptions on the context or service-rate distributions.

minor comments (2)

[Abstract] Notation for the cutoff round and the parameter η should be introduced with explicit dependence on T (or lack thereof) already in the abstract or first algorithm section to avoid ambiguity.
[Abstract] The abstract states the regret is defined as the expected difference between learner and oracle queue lengths at horizon T; a brief reminder of the precise mathematical definition (e.g., E[|Q_T - Q_T^*|] or similar) would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to improve clarity on the algorithm and lower-bound details.

read point-by-point responses

Referee: Abstract (algorithm description and proof outline): the upper-bound argument is load-bearing on the existence of a cutoff τ such that (1) pre-τ negative drift from service rates dominates any queue growth induced by suboptimal η-random + UCB choices and (2) the random-exploration samples collected by τ suffice for the post-τ UCB phase to incur only o(1) per-step departure-rate gaps. The description gives no explicit, parameter-free rule for selecting τ (e.g., without knowledge of the minimal gap Δ or context distribution); if τ must be tuned to instance parameters the algorithm is not fully online, and a fixed schedule such as T^{2/3} may fail to suppress early queue excursions on some instances. This directly affects whether the claimed Õ(T^{-1/2}) rate is achieved by a practical online procedure.

Authors: The full manuscript defines τ explicitly as a function of T only (specifically τ = ⌈T^{2/3}⌉, with constants chosen to satisfy the drift and sample-size conditions uniformly over instances). This choice requires no knowledge of Δ or the context distribution and is fully online. The proof verifies that negative drift dominates pre-τ and that the collected samples suffice post-τ for the claimed rate; the Õ notation absorbs the constants that work for all instances. We will add this explicit rule and a brief justification to the abstract and algorithm section. revision: yes
Referee: Lower-bound section (hard-instance construction): the Ω(T^{-1/2}) claim rests on constructing two statistically indistinguishable instances up to the final service decision and converting the testing error into queue-length regret via a queue-specific coupling. Without the explicit instance definitions, the precise form of the coupling, and verification that the resulting regret is indeed Ω(T^{-1/2}) rather than a weaker rate, it is difficult to confirm the lower bound is tight and free of hidden assumptions on the context or service-rate distributions.

Authors: The full paper constructs two explicit hard instances (Bernoulli service rates differing by a small Δ, with contexts drawn from a fixed distribution that makes the instances hard to distinguish until late in the horizon) and applies a queue-specific coupling that translates the probability of misidentifying the better arm into an Ω(T^{-1/2}) additive queue-length difference. We will expand the lower-bound section with the precise instance parameters, the coupling construction, and the calculation showing the Ω(T^{-1/2}) rate (rather than a weaker bound). revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The upper-bound proof decomposes regret at an explicitly constructed cutoff round using negative-drift suppression before the cutoff and sample-sufficiency for UCB gaps after it; these arguments rely on the algorithm phases and drift properties rather than any fitted parameter or self-referential definition. The lower bound is obtained via an independent hard-instance construction with queue-specific coupling. No equations or steps reduce by construction to inputs, and no self-citations are load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5826 in / 1143 out tokens · 20351 ms · 2026-06-27T17:01:53.068997+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 13 canonical work pages · 4 internal anchors

[1]

Neural Logistic Bandits

Seoungbin Bae and Dabeen Lee. Neural logistic bandits.arXiv preprint arXiv:2505.02069,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

Seoungbin Bae and Dabeen Lee. Logistic Bandits with˜O( √ dT)Regret without Context Diversity Assumptions.arXiv preprint arXiv:2604.22161,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Queue Length Regret Bounds for Contextual Queueing Bandits

Seoungbin Bae, Garyeong Kang, and Dabeen Lee. Queue length regret bounds for contextual queueing bandits.arXiv preprint arXiv:2601.19300, 2026a. Seoungbin Bae, Junyoung Son, and Dabeen Lee. Learning to route and schedule llms from user retrials via contextual queueing bandits.arXiv preprint arXiv:2602.02061, 2026b. Agrim Bari, Parikshit Hegde, and Gustavo...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,

Chao-Kai Chiang, Takashi Ishida, and Masashi Sugiyama. Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,

work page arXiv
[5]

Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a

Jiung Lee, Hongseok Namkoong, and Yibo Zeng. Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a. Joongkyu Lee and Min-hwan Oh. Improved online confidence bounds for multinomial logistic bandits.arXiv preprint arXiv:2502.10020,

work page arXiv
[6]

A unified confidence sequence for generalized linear models, with applications to bandits.arXiv preprint arXiv:2407.13977, 2024b

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits.arXiv preprint arXiv:2407.13977, 2024b. Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. InInternational Conference on Machine Learning, pages 2071–2080. PMLR,

work page arXiv 2071
[7]

Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,

Yashaswini Murthy, Isaac Grosof, Siva Theja Maguluri, and R Srikant. Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,

work page arXiv
[8]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2508.12491 , url=

Reza Shirkavand, Shangqian Gao, Peiran Yu, and Heng Huang. Cost-aware contrastive routing for llms.arXiv preprint arXiv:2508.12491,

work page arXiv
[10]

Mixllm: Dynamic routing in mixed large language models

Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. Mixllm: Dynamic routing in mixed large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10912–10922,

2025
[11]

Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,

Mevan Wijewardena, Kamiar Asgari, and Michael J Neely. Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,

work page arXiv
[12]

Zixian Yang, R Srikant, and Lei Ying

URLhttps://arxiv.org/abs/2407.05347. Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: Maxweight with discounted ucb. InInternational Conference on Artificial Intelligence and Statistics, pages 4275–4312. PMLR,

work page arXiv
[13]

Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,

Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, and Masashi Sugiyama. Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,

work page arXiv
[14]

arXiv preprint arXiv:2410.02223 , year=

Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchan- dran. Embedllm: Learning compact representations of large language models.arXiv preprint arXiv:2410.02223,

work page arXiv
[15]

14 A Related work Queueing and contextual queueing bandits.Queueingbanditsstudylearning-while-scheduling problems in which unknown service rates must be learned while queue lengths are controlled (Kr- ishnasamy et al., 2016, 2021). This line of work includes queue length regret, dispatching under unknown service rates, decentralized queueing systems, MaxW...

2016
[16]

D Auxiliary lemmas This section collects auxiliary results used in the appendix

applied toP F +,P F −, andG, P+(Gc|F) +P−(G|F) =PF + (Gc) +P F −(G)≥1 2e−1/6. D Auxiliary lemmas This section collects auxiliary results used in the appendix. The following prediction-error bound is Lemma 28 of Bae and Lee (2026), stated in the notation of this paper. Lemma 18.With the confidence radius defined by (3), it holds with probability at least1−...

2026

[1] [1]

Neural Logistic Bandits

Seoungbin Bae and Dabeen Lee. Neural logistic bandits.arXiv preprint arXiv:2505.02069,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

Seoungbin Bae and Dabeen Lee. Logistic Bandits with˜O( √ dT)Regret without Context Diversity Assumptions.arXiv preprint arXiv:2604.22161,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Queue Length Regret Bounds for Contextual Queueing Bandits

Seoungbin Bae, Garyeong Kang, and Dabeen Lee. Queue length regret bounds for contextual queueing bandits.arXiv preprint arXiv:2601.19300, 2026a. Seoungbin Bae, Junyoung Son, and Dabeen Lee. Learning to route and schedule llms from user retrials via contextual queueing bandits.arXiv preprint arXiv:2602.02061, 2026b. Agrim Bari, Parikshit Hegde, and Gustavo...

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,

Chao-Kai Chiang, Takashi Ishida, and Masashi Sugiyama. Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,

work page arXiv

[5] [5]

Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a

Jiung Lee, Hongseok Namkoong, and Yibo Zeng. Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a. Joongkyu Lee and Min-hwan Oh. Improved online confidence bounds for multinomial logistic bandits.arXiv preprint arXiv:2502.10020,

work page arXiv

[6] [6]

A unified confidence sequence for generalized linear models, with applications to bandits.arXiv preprint arXiv:2407.13977, 2024b

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits.arXiv preprint arXiv:2407.13977, 2024b. Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. InInternational Conference on Machine Learning, pages 2071–2080. PMLR,

work page arXiv 2071

[7] [7]

Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,

Yashaswini Murthy, Isaac Grosof, Siva Theja Maguluri, and R Srikant. Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,

work page arXiv

[8] [8]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

arXiv preprint arXiv:2508.12491 , url=

Reza Shirkavand, Shangqian Gao, Peiran Yu, and Heng Huang. Cost-aware contrastive routing for llms.arXiv preprint arXiv:2508.12491,

work page arXiv

[10] [10]

Mixllm: Dynamic routing in mixed large language models

Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. Mixllm: Dynamic routing in mixed large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10912–10922,

2025

[11] [11]

Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,

Mevan Wijewardena, Kamiar Asgari, and Michael J Neely. Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,

work page arXiv

[12] [12]

Zixian Yang, R Srikant, and Lei Ying

URLhttps://arxiv.org/abs/2407.05347. Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: Maxweight with discounted ucb. InInternational Conference on Artificial Intelligence and Statistics, pages 4275–4312. PMLR,

work page arXiv

[13] [13]

Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,

Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, and Masashi Sugiyama. Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,

work page arXiv

[14] [14]

arXiv preprint arXiv:2410.02223 , year=

Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchan- dran. Embedllm: Learning compact representations of large language models.arXiv preprint arXiv:2410.02223,

work page arXiv

[15] [15]

14 A Related work Queueing and contextual queueing bandits.Queueingbanditsstudylearning-while-scheduling problems in which unknown service rates must be learned while queue lengths are controlled (Kr- ishnasamy et al., 2016, 2021). This line of work includes queue length regret, dispatching under unknown service rates, decentralized queueing systems, MaxW...

2016

[16] [16]

D Auxiliary lemmas This section collects auxiliary results used in the appendix

applied toP F +,P F −, andG, P+(Gc|F) +P−(G|F) =PF + (Gc) +P F −(G)≥1 2e−1/6. D Auxiliary lemmas This section collects auxiliary results used in the appendix. The following prediction-error bound is Lemma 28 of Bae and Lee (2026), stated in the notation of this paper. Lemma 18.With the confidence radius defined by (3), it holds with probability at least1−...

2026