Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret
Pith reviewed 2026-06-27 17:01 UTC · model grok-4.3
The pith
Contextual queueing bandits achieve queue length regret of order T to the minus one half by limiting random exploration to a cutoff round.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a three-phase algorithm called CQB-η-2 attains ilde O(T^{-1/2}) queue length regret. Phase one collects an initial estimator by pure random exploration. Phase two continues with η-random exploration plus a UCB rule so that the queue maintains negative drift. Phase three switches to pure UCB once the cutoff is reached. The analysis decomposes the regret at the cutoff: before it, negative drift suppresses the effect of suboptimal choices on queue length; after it, the accumulated samples ensure that UCB incurs only small departure-rate gaps. A matching minimax lower bound of Ω(T^{-1/2}) is shown by constructing two statistically indistinguishable hard instances and c
What carries the argument
The cutoff round that ends the mixed exploration phase, after which pure UCB is used; negative drift before the cutoff and sufficient random samples after it together bound the queue length regret.
If this is right
- The minimax rate for queue length regret in contextual queueing bandits is ilde Theta(T^{-1/2}).
- Negative drift during the second phase keeps early suboptimal decisions from inflating queue lengths.
- After the cutoff the collected samples make every UCB choice incur only a vanishing gap in expected departure rate.
- The lower-bound construction uses a queue-specific coupling to turn statistical indistinguishability directly into linear queue-length cost.
Where Pith is reading between the lines
- The cutoff idea could be adapted to other online scheduling problems where stability must be maintained while learning unknown rates.
- An adaptive rule for picking the cutoff without knowing problem parameters might be tested in simulations to check practical robustness.
- The coupling technique for turning testing error into queue regret may apply to other performance measures that accumulate over time in stochastic systems.
Load-bearing premise
A carefully chosen cutoff round exists such that samples collected before it are enough for the later UCB phase to produce only small service-rate gaps while negative drift before the cutoff keeps queue length differences small.
What would settle it
A pair of service-rate instances that remain hard to distinguish until the final service decision and produce queue length regret larger than order T to the one half under any algorithm.
Figures
read the original abstract
Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CQB-η-2, a three-phase algorithm for contextual queueing bandits under stochastic contexts: (i) pure random exploration, (ii) η-random exploration combined with UCB to maintain negative drift, and (iii) pure UCB after a cutoff round τ. It claims this yields Õ(T^{-1/2}) queue-length regret (improving prior Õ(T^{-1/4})), by decomposing regret at τ (negative drift suppresses differences pre-τ; sufficient samples ensure small UCB gaps post-τ). A matching minimax lower bound Ω(T^{-1/2}) is proved via two hard instances and a queue-specific coupling argument.
Significance. If the claims hold, the work tightly characterizes the minimax rate (up to logs) for queue-length regret in this model, which is a notable advance over existing Õ(T^{-1/4}) results. The cutoff-based limitation of random exploration is a conceptually clean idea that could apply more broadly; the matching upper/lower bounds are a strength.
major comments (2)
- [Abstract (algorithm description and proof outline)] Abstract (algorithm and proof sketch): the upper-bound argument is load-bearing on the existence of a cutoff τ such that (1) pre-τ negative drift from service rates dominates any queue growth induced by suboptimal η-random + UCB choices and (2) the random-exploration samples collected by τ suffice for the post-τ UCB phase to incur only o(1) per-step departure-rate gaps. The description gives no explicit, parameter-free rule for selecting τ (e.g., without knowledge of the minimal gap Δ or context distribution); if τ must be tuned to instance parameters the algorithm is not fully online, and a fixed schedule such as T^{2/3} may fail to suppress early queue excursions on some instances. This directly affects whether the claimed Õ(T^{-1/2}) rate is achieved by a practical online procedure.
- [Lower-bound proof] Lower-bound section (hard-instance construction): the Ω(T^{-1/2}) claim rests on constructing two statistically indistinguishable instances up to the final service decision and converting the testing error into queue-length regret via a queue-specific coupling. Without the explicit instance definitions, the precise form of the coupling, and verification that the resulting regret is indeed Ω(T^{-1/2}) rather than a weaker rate, it is difficult to confirm the lower bound is tight and free of hidden assumptions on the context or service-rate distributions.
minor comments (2)
- [Abstract] Notation for the cutoff round and the parameter η should be introduced with explicit dependence on T (or lack thereof) already in the abstract or first algorithm section to avoid ambiguity.
- [Abstract] The abstract states the regret is defined as the expected difference between learner and oracle queue lengths at horizon T; a brief reminder of the precise mathematical definition (e.g., E[|Q_T - Q_T^*|] or similar) would help readers.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to improve clarity on the algorithm and lower-bound details.
read point-by-point responses
-
Referee: Abstract (algorithm description and proof outline): the upper-bound argument is load-bearing on the existence of a cutoff τ such that (1) pre-τ negative drift from service rates dominates any queue growth induced by suboptimal η-random + UCB choices and (2) the random-exploration samples collected by τ suffice for the post-τ UCB phase to incur only o(1) per-step departure-rate gaps. The description gives no explicit, parameter-free rule for selecting τ (e.g., without knowledge of the minimal gap Δ or context distribution); if τ must be tuned to instance parameters the algorithm is not fully online, and a fixed schedule such as T^{2/3} may fail to suppress early queue excursions on some instances. This directly affects whether the claimed Õ(T^{-1/2}) rate is achieved by a practical online procedure.
Authors: The full manuscript defines τ explicitly as a function of T only (specifically τ = ⌈T^{2/3}⌉, with constants chosen to satisfy the drift and sample-size conditions uniformly over instances). This choice requires no knowledge of Δ or the context distribution and is fully online. The proof verifies that negative drift dominates pre-τ and that the collected samples suffice post-τ for the claimed rate; the Õ notation absorbs the constants that work for all instances. We will add this explicit rule and a brief justification to the abstract and algorithm section. revision: yes
-
Referee: Lower-bound section (hard-instance construction): the Ω(T^{-1/2}) claim rests on constructing two statistically indistinguishable instances up to the final service decision and converting the testing error into queue-length regret via a queue-specific coupling. Without the explicit instance definitions, the precise form of the coupling, and verification that the resulting regret is indeed Ω(T^{-1/2}) rather than a weaker rate, it is difficult to confirm the lower bound is tight and free of hidden assumptions on the context or service-rate distributions.
Authors: The full paper constructs two explicit hard instances (Bernoulli service rates differing by a small Δ, with contexts drawn from a fixed distribution that makes the instances hard to distinguish until late in the horizon) and applies a queue-specific coupling that translates the probability of misidentifying the better arm into an Ω(T^{-1/2}) additive queue-length difference. We will expand the lower-bound section with the precise instance parameters, the coupling construction, and the calculation showing the Ω(T^{-1/2}) rate (rather than a weaker bound). revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The upper-bound proof decomposes regret at an explicitly constructed cutoff round using negative-drift suppression before the cutoff and sample-sufficiency for UCB gaps after it; these arguments rely on the algorithm phases and drift properties rather than any fitted parameter or self-referential definition. The lower bound is obtained via an independent hard-instance construction with queue-specific coupling. No equations or steps reduce by construction to inputs, and no self-citations are load-bearing.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Seoungbin Bae and Dabeen Lee. Neural logistic bandits.arXiv preprint arXiv:2505.02069,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
Seoungbin Bae and Dabeen Lee. Logistic Bandits with˜O( √ dT)Regret without Context Diversity Assumptions.arXiv preprint arXiv:2604.22161,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Queue Length Regret Bounds for Contextual Queueing Bandits
Seoungbin Bae, Garyeong Kang, and Dabeen Lee. Queue length regret bounds for contextual queueing bandits.arXiv preprint arXiv:2601.19300, 2026a. Seoungbin Bae, Junyoung Son, and Dabeen Lee. Learning to route and schedule llms from user retrials via contextual queueing bandits.arXiv preprint arXiv:2602.02061, 2026b. Agrim Bari, Parikshit Hegde, and Gustavo...
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,
Chao-Kai Chiang, Takashi Ishida, and Masashi Sugiyama. Llm routing with dueling feedback.arXiv preprint arXiv:2510.00841,
-
[5]
Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a
Jiung Lee, Hongseok Namkoong, and Yibo Zeng. Design and scheduling of an ai-based queueing system.arXiv preprint arXiv:2406.06855, 2024a. Joongkyu Lee and Min-hwan Oh. Improved online confidence bounds for multinomial logistic bandits.arXiv preprint arXiv:2502.10020,
-
[6]
Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits.arXiv preprint arXiv:2407.13977, 2024b. Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. InInternational Conference on Machine Learning, pages 2071–2080. PMLR,
-
[7]
Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,
Yashaswini Murthy, Isaac Grosof, Siva Theja Maguluri, and R Srikant. Performance of npg in countable state-space average-cost rl.arXiv preprint arXiv:2405.20467,
-
[8]
RouteLLM: Learning to Route LLMs with Preference Data
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
arXiv preprint arXiv:2508.12491 , url=
Reza Shirkavand, Shangqian Gao, Peiran Yu, and Heng Huang. Cost-aware contrastive routing for llms.arXiv preprint arXiv:2508.12491,
-
[10]
Mixllm: Dynamic routing in mixed large language models
Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. Mixllm: Dynamic routing in mixed large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10912–10922,
2025
-
[11]
Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,
Mevan Wijewardena, Kamiar Asgari, and Michael J Neely. Bandit-based rate adaptation for a single-server queue.arXiv preprint arXiv:2512.12016,
-
[12]
Zixian Yang, R Srikant, and Lei Ying
URLhttps://arxiv.org/abs/2407.05347. Zixian Yang, R Srikant, and Lei Ying. Learning while scheduling in multi-server systems with unknown statistics: Maxweight with discounted ucb. InInternational Conference on Artificial Intelligence and Statistics, pages 4275–4312. PMLR,
-
[13]
Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, and Masashi Sugiyama. Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,
-
[14]
arXiv preprint arXiv:2410.02223 , year=
Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchan- dran. Embedllm: Learning compact representations of large language models.arXiv preprint arXiv:2410.02223,
-
[15]
14 A Related work Queueing and contextual queueing bandits.Queueingbanditsstudylearning-while-scheduling problems in which unknown service rates must be learned while queue lengths are controlled (Kr- ishnasamy et al., 2016, 2021). This line of work includes queue length regret, dispatching under unknown service rates, decentralized queueing systems, MaxW...
2016
-
[16]
D Auxiliary lemmas This section collects auxiliary results used in the appendix
applied toP F +,P F −, andG, P+(Gc|F) +P−(G|F) =PF + (Gc) +P F −(G)≥1 2e−1/6. D Auxiliary lemmas This section collects auxiliary results used in the appendix. The following prediction-error bound is Lemma 28 of Bae and Lee (2026), stated in the notation of this paper. Lemma 18.With the confidence radius defined by (3), it holds with probability at least1−...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.