Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Christopher G. Brinton; Md Kamran Chowdhury Shisher; Mung Chiang; Vishrant Tripathi

arxiv: 2506.18186 · v3 · submitted 2025-06-22 · 💻 cs.LG · stat.ML

Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher , Vishrant Tripathi , Mung Chiang , Christopher G. Brinton This is my paper

Pith reviewed 2026-05-19 07:51 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords restless banditsWhittle indexonline learningnon-stationary dynamicsdynamic regretsliding windowbandit-over-bandit

0 comments

The pith

A sliding-window Whittle policy adapts to unknown non-stationary dynamics in restless bandits while achieving sub-linear dynamic regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Sliding-Window Online Whittle policy to solve resource allocation in restless multi-armed bandits whose transition kernels drift over time without prior warning. It maintains estimates inside a sliding window, computes indices from upper-confidence bounds on those estimates, and solves the resulting bilinear program to keep decisions fast. The central guarantee is that the policy incurs regret that grows slower than linearly in the number of episodes. When the total amount of change is unknown ahead of time, an outer bandit-over-bandit layer selects the window length on the fly from observed variation. If the claim holds, systems that must allocate limited resources under drifting conditions can do so without full knowledge of future dynamics or expensive recomputation at every step.

Core claim

The authors develop the SW-Whittle policy that tracks time-varying kernels by restricting kernel estimates to a sliding window of tunable length, then obtains Whittle indices via upper-confidence bounds on those estimates together with a bilinear optimization routine. They prove that this construction yields sub-linear dynamic regret against the optimal policy for the prevailing dynamics at each episode. When the variation budget is unknown, the same framework is wrapped inside a Bandit-over-Bandit procedure that learns the window length online from the observed total variation, preserving the same regret order.

What carries the argument

The sliding-window estimator of transition kernels, paired with upper-confidence-bound index computation and bilinear optimization to obtain the indices themselves.

If this is right

The policy stays computationally efficient because index computation reuses the same bilinear routine on recent data only.
Sub-linear dynamic regret holds uniformly over episodes even though the optimal policy itself changes with the kernels.
The bandit-over-bandit wrapper removes the need to know the variation budget in advance while keeping the same regret scaling.
Numerical tests show lower total regret than stationary baselines across multiple drifting environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same window-plus-outer-bandit pattern could be tried on other index policies that currently assume stationary dynamics.
One could examine whether the bilinear optimization step can be replaced by faster approximations without losing the regret guarantee.
Deployed systems might use the online window-selection rule to set practical adaptation rates when exact variation budgets are unavailable.

Load-bearing premise

The non-stationary transition kernels admit a bounded total variation that some sliding window can track.

What would settle it

A sequence of episodes in which the kernels change faster than any online-tuned window can follow, producing cumulative regret that grows linearly rather than sub-linearly with episode count.

Figures

Figures reproduced from arXiv: 2506.18186 by Christopher G. Brinton, Md Kamran Chowdhury Shisher, Mung Chiang, Vishrant Tripathi.

**Figure 1.** Figure 1: One Dimensional Bandit: Cumulative regret versus the number of episodes with different values of time horizon (H = 20, 50, 100) with ϵn = 0.05. 0 10 20 30 40 50 episode 10 1 10 2 10 3 Cumulative Regret H=20 0 10 20 30 40 50 episode 10 2 10 3 H=50 0 10 20 30 40 50 episode 10 2 10 3 H=100 Random UCWhittle OurPolicy WIQL [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗

**Figure 2.** Figure 2: Wireless Scheduling: Cumulative regret versus the number of episodes with different values of time horizon (H = 20, 50, 200) with ϵn = 0.05. the time difference between current time h and the generation time of the most recently delivered signal. The AoI value of a source n increases by 1 if the source n is not scheduled. If the source n is scheduled, the AoI value drops to 1 with probability qn(t) (succes… view at source ↗

read the original abstract

The restless multi-armed bandit (RMAB) framework is a popular approach to solving resource allocation problems in networked systems. In this paper, we study optimal resource allocation in RMABs facing unknown and non-stationary dynamics. Solving RMABs optimally is known to be PSPACE-hard even with full knowledge of model parameters. While Whittle index policies offer asymptotic optimality with low computational cost, they require access to stationary transition kernels, an unrealistic assumption in many modern networking applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Through theoretical analysis, we show that our algorithm achieves sub-linear dynamic regret with respect to the number of episodes. We further address the important case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. In our scheme, window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper-confidence-bound of the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SW-Whittle plus bandit-over-bandit gives a practical way to handle unknown non-stationary RMAB kernels, but the meta-regret from window tuning likely prevents a clean sub-linear dynamic regret bound.

read the letter

The paper introduces a sliding-window Whittle index policy for restless bandits with time-varying transition kernels and layers a bandit-over-bandit procedure on top to pick the window length when the total variation budget is unknown. That specific combination is the main new element they bring to the table. It builds on standard sliding-window ideas and bandit-over-bandit tuning but applies them directly to Whittle index computation via UCB kernel estimates and bilinear optimization. The approach stays computationally light, which matters for the resource allocation settings they target. Experiments reportedly show lower cumulative regret than baselines in several non-stationary regimes, which is a useful data point even if the setups are synthetic. The problem itself is real: many networked systems have drifting dynamics, and assuming stationary kernels is often unrealistic. The authors correctly note that full optimal policies are PSPACE-hard, so keeping the Whittle structure is a reasonable design choice. The soft spot sits in the regret claim. The stress-test concern holds up on the given details. Inner sliding-window regret typically scales as O(sqrt(W) + V T / W), but the outer bandit-over-bandit for choosing among window lengths adds a term on the order of T^{2/3} K^{1/3} polylog T. The abstract states sub-linear dynamic regret without a sketch showing how these pieces sum to o(T) for unknown V, so the central guarantee looks incomplete. No proof outline or explicit bound decomposition is provided, which makes it hard to verify the analysis closes. This paper is for people working on online learning for restless bandits in networked control or resource allocation. A reader already familiar with Whittle indices and non-stationary bandit techniques would get the most out of it. It deserves a serious referee because the problem is practically relevant and the algorithmic idea is a direct, implementable extension of existing tools, even though the theoretical part needs tightening on the regret decomposition.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Sliding-Window Online Whittle (SW-Whittle) policy for restless multi-armed bandits with unknown non-stationary transition kernels. It estimates time-varying kernels over sliding windows, computes Whittle indices via UCB bounds and bilinear optimization, and achieves sub-linear dynamic regret w.r.t. the number of episodes. For the case of unknown total variation budget, a Bandit-over-Bandit meta-algorithm is used to tune window lengths online. Numerical experiments on synthetic non-stationary environments show the method outperforming several baselines in cumulative regret.

Significance. If the dynamic-regret bounds hold under the stated variation-budget assumption, the result would meaningfully extend the Whittle-index literature to non-stationary RMAB settings that arise in networking and resource allocation. The explicit handling of unknown variation via an outer bandit-over-bandit layer, together with the computational efficiency of the inner index computation, is a practical strength. The work supplies both a theoretical guarantee and reproducible numerical validation.

major comments (2)

[§5.3, Theorem 4] §5.3, Theorem 4 (dynamic regret bound): the decomposition into inner SW-Whittle regret O(√W + VT/W) and outer meta-regret O(T^{2/3}K^{1/3} polylog T) is stated, but the proof does not explicitly verify that the sum remains o(T) uniformly over all admissible variation budgets V when W is chosen by the outer algorithm; an additional cross term arising from the dependence between window selection and kernel estimation error appears to be omitted.
[§4.2, Algorithm 2] §4.2, Algorithm 2 (Bandit-over-Bandit): the outer UCB index for window lengths is defined using an empirical estimate of variation that itself depends on the inner UCB kernel estimates; the analysis does not bound the resulting bias in the meta-regret term, which is load-bearing for the claim that sub-linear regret holds when V is unknown.

minor comments (2)

[Abstract] The abstract states 'sub-linear dynamic regret with respect to the number of episodes' but does not specify the dependence on the variation budget V; this should be clarified in the introduction and abstract for precision.
[Figure 3] Figure 3 caption: the legend labels 'SW-Whittle (known V)' and 'SW-Whittle (unknown V)' but the x-axis scale is not labeled as number of episodes; this reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address the two major comments point by point below, indicating the revisions we will make to strengthen the theoretical analysis.

read point-by-point responses

Referee: [§5.3, Theorem 4] §5.3, Theorem 4 (dynamic regret bound): the decomposition into inner SW-Whittle regret O(√W + VT/W) and outer meta-regret O(T^{2/3}K^{1/3} polylog T) is stated, but the proof does not explicitly verify that the sum remains o(T) uniformly over all admissible variation budgets V when W is chosen by the outer algorithm; an additional cross term arising from the dependence between window selection and kernel estimation error appears to be omitted.

Authors: We agree that the main-text statement of Theorem 4 presents the decomposition without an explicit verification that the total remains o(T) uniformly over admissible V, and that a cross term arising from the dependence between the outer window choice and the inner kernel estimation error is not isolated in the provided sketch. The appendix proof controls this dependence implicitly via the concentration of the sliding-window estimators and the fact that the outer algorithm selects W from a discrete grid whose size is polynomial in T, but we acknowledge the presentation would benefit from an explicit step. We will add a supporting lemma that bounds the cross term by O(T^{2/3} polylog T) and verifies that the sum of all terms is o(T) for every V in the admissible range [0,T]. revision: yes
Referee: [§4.2, Algorithm 2] §4.2, Algorithm 2 (Bandit-over-Bandit): the outer UCB index for window lengths is defined using an empirical estimate of variation that itself depends on the inner UCB kernel estimates; the analysis does not bound the resulting bias in the meta-regret term, which is load-bearing for the claim that sub-linear regret holds when V is unknown.

Authors: The referee is correct that the outer UCB index is constructed from an empirical variation estimate that inherits error from the inner UCB kernel estimates, and that the current analysis does not explicitly quantify the resulting bias in the meta-regret. We will revise the proof of the Bandit-over-Bandit regret bound by adding a concentration argument showing that the variation estimator deviates from the true V by at most O(√(W log T)) with high probability; this bias contributes an additive term that is absorbed into the existing O(T^{2/3} K^{1/3} polylog T) meta-regret without changing the overall sub-linear rate. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives sub-linear dynamic regret for the SW-Whittle policy under non-stationary kernels via standard sliding-window UCB analysis and a Bandit-over-Bandit meta-algorithm for unknown variation budget. The regret decomposition follows established non-stationary bandit techniques (concentration bounds on kernel estimates, bilinear index computation, and window tuning), without any step reducing by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation. The bounded total variation assumption is used to control the bias-variance tradeoff in the regret bound independently of the algorithm's outputs. The central claim therefore rests on external mathematical tools rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the existence of a bounded variation budget for the kernels and on the ability of upper-confidence-bound estimates plus bilinear optimization to produce valid Whittle indices inside each window.

axioms (1)

domain assumption Transition kernels possess a bounded total variation budget over the horizon.
Required for the sliding-window estimator to track changes and deliver sub-linear regret.

pith-pipeline@v0.9.0 · 5764 in / 1165 out tokens · 32809 ms · 2026-05-19T07:51:31.423029+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Modularized Framework for Piecewise-Stationary Restless Bandits
cs.IT 2026-04 unverdicted novelty 7.0

A modular PS-RMAB framework integrates base algorithms with change detection and diminishing exploration to achieve tilde-O(sqrt(L M K T)) excess regret against a segment-oracle benchmark, with simulations showing nea...
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
cs.LG 2025-11 unverdicted novelty 7.0

Under the Markov-Averaged Indexability condition, synchronous Q-learning with Whittle indices converges almost surely to the optimal Q-function and indices in restless bandits whose arms are driven by an unobserved la...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 2 Pith papers

[1]

Restless bandits: Activity allocation in a changing world

Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988

work page 1988
[2]

Opportunistic scheduling as restless bandits

Vivek S Borkar, Gaurav S Kasbekar, Sarath Pattathil, and Priyesh Y Shetty. Opportunistic scheduling as restless bandits. IEEE Transactions on Control of Network Systems, 5(4):1952– 1961, 2017

work page 1952
[3]

Schedul- ing policies for minimizing age of information in broadcast wireless networks

Igor Kadota, Abhishek Sinha, Elif Uysal-Biyikoglu, Rahul Singh, and Eytan Modiano. Schedul- ing policies for minimizing age of information in broadcast wireless networks. IEEE/ACM Transactions on Networking, 26(6):2637–2650, 2018

work page 2018
[4]

A whittle index approach to minimizing functions of age of information

Vishrant Tripathi and Eytan Modiano. A whittle index approach to minimizing functions of age of information. IEEE/ACM Transactions on Networking, 2024

work page 2024
[5]

Timely communications for remote inference

Md Kamran Chowdhury Shisher, Yin Sun, and I-Hong Hou. Timely communications for remote inference. IEEE/ACM Transactions on Networking, 32(5):3824–3839, 2024

work page 2024
[6]

Scheduling algorithms for optimizing age of information in wireless networks with throughput constraints

Igor Kadota, Abhishek Sinha, and Eytan Modiano. Scheduling algorithms for optimizing age of information in wireless networks with throughput constraints. IEEE/ACM Transactions on Networking, 27(4):1359–1372, 2019

work page 2019
[7]

Indexability and whittle index for restless bandit problems involving reset processes

Keqin Liu, Richard Weber, and Qing Zhao. Indexability and whittle index for restless bandit problems involving reset processes. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 7690–7696. IEEE, 2011

work page 2011
[8]

Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach

Diego Ruiz-Hernández, Jesús M Pinar-Pérez, and David Delgado-Gómez. Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach. Computers & Operations Research, 119:104927, 2020

work page 2020
[9]

Scalable operator allocation for multirobot assistance: A restless bandit approach

Abhinav Dahiya, Nima Akbarzadeh, Aditya Mahajan, and Stephen L Smith. Scalable operator allocation for multirobot assistance: A restless bandit approach. IEEE Transactions on Control of Network Systems, 9(3):1397–1408, 2022

work page 2022
[10]

Continuous-time restless bandit and dynamic scheduling for make-to-stock production

Fabrice Dusonchet and M-O Hongler. Continuous-time restless bandit and dynamic scheduling for make-to-stock production. IEEE Transactions on Robotics and Automation, 19(6):977–990, 2003

work page 2003
[11]

A hidden markov restless multi-armed bandit model for playout recommendation systems

Rahul Meshram, Aditya Gopalan, and D Manjunath. A hidden markov restless multi-armed bandit model for playout recommendation systems. In Communication Systems and Networks: 9th International Conference, COMSNETS 2017, Bengaluru, India, January 4–8, 2017, Revised Selected Papers and Invited Papers 9, pages 335–362. Springer, 2017

work page 2017
[12]

On the whittle index for restless multiarmed hidden markov bandits

Rahul Meshram, D Manjunath, and Aditya Gopalan. On the whittle index for restless multiarmed hidden markov bandits. IEEE Transactions on Automatic Control, 63(9):3046–3053, 2018

work page 2018
[13]

Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015

work page 2015
[14]

Restless bandits visiting villages: A preliminary study on distributing public health services

Biswarup Bhattacharya. Restless bandits visiting villages: A preliminary study on distributing public health services. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies, pages 1–8, 2018

work page 2018
[15]

Optimal screening for hepatocellular carcinoma: A restless bandit model

Elliot Lee, Mariel S Lavieri, and Michael V olk. Optimal screening for hepatocellular carcinoma: A restless bandit model. Manufacturing & Service Operations Management, 21(1):198–212, 2019

work page 2019
[16]

Collapsing bandits and their application to public health intervention

Aditya Mate, Jackson Killian, Haifeng Xu, Andrew Perrault, and Milind Tambe. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639–15650, 2020. 11

work page 2020
[17]

A decision-language model (dlm) for dynamic restless multi-armed bandit tasks in public health

Nikhil Behari, Edwin Zhang, Yunfan Zhao, Aparna Taneja, Dheeraj Nagaraj, and Milind Tambe. A decision-language model (dlm) for dynamic restless multi-armed bandit tasks in public health. arXiv preprint arXiv:2402.14807, 2024

work page arXiv 2024
[18]

The complexity of optimal queueing network control

Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queueing network control. In Proceedings of IEEE 9th Annual Conference on Structure in Complexity Theory, pages 318–322, 1994

work page 1994
[19]

On an index policy for restless bandits

Richard R Weber and Gideon Weiss. On an index policy for restless bandits. Journal of applied probability, 27(3):637–648, 1990

work page 1990
[20]

Asymptotically optimal priority policies for indexable and nonindexable restless bandits

Ina Maria Verloop. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. The Annals of Applied Probability, 26(4):1947–1995, 2016

work page 1947
[21]

When are kalman-filter restless bandits indexable? In C

Christopher R Dance and Tomi Silander. When are kalman-filter restless bandits indexable? In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

work page 2015
[22]

Multi-uav dynamic routing with partial observations using restless bandit allocation indices

Jerome Le Ny, Munther Dahleh, and Eric Feron. Multi-uav dynamic routing with partial observations using restless bandit allocation indices. In 2008 American Control Conference, pages 4220–4225. IEEE, 2008

work page 2008
[23]

Optimistic whittle index policy: Online learning for restless bandits

Kai Wang, Lily Xu, Aparna Taneja, and Milind Tambe. Optimistic whittle index policy: Online learning for restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10131–10139, 2023

work page 2023
[24]

A reinforcement learning algorithm for restless bandits

Vivek S Borkar and Karan Chadha. A reinforcement learning algorithm for restless bandits. In 2018 Indian Control Conference (ICC), pages 89–94. IEEE, 2018

work page 2018
[25]

Real-time status: How often should one update? In IEEE INFOCOM, pages 2731–2735, 2012

Sanjit Kaul, Roy Yates, and Marco Gruteser. Real-time status: How often should one update? In IEEE INFOCOM, pages 2731–2735, 2012

work page 2012
[26]

Emre Koksal, and Ness B

Yin Sun, Elif Uysal-Biyikoglu, Roy Yates, C. Emre Koksal, and Ness B. Shroff. Update or wait: How to keep your data fresh. In IEEE INFOCOM, pages 1–9, 2016. doi: 10.1109/INFOCOM. 2016.7524524

work page doi:10.1109/infocom 2016
[27]

Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare

Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021

work page arXiv 2021
[28]

Restless bandits with controlled restarts: Indexability and computation of whittle index

Nima Akbarzadeh and Aditya Mahajan. Restless bandits with controlled restarts: Indexability and computation of whittle index. In 2019 IEEE 58th conference on decision and control (CDC), pages 7294–7300. IEEE, 2019

work page 2019
[29]

A Whittle index policy for the remote estimation of multiple continuous Gauss-Markov processes over parallel channels

Tasmeen Zaman Ornee and Yin Sun. A Whittle index policy for the remote estimation of multiple continuous Gauss-Markov processes over parallel channels. ACM MobiHoc, 2023

work page 2023
[30]

Whittle index based q-learning for restless bandits with average reward

Konstantin E Avrachenkov and Vivek S Borkar. Whittle index based q-learning for restless bandits with average reward. Automatica, 139:110186, 2022

work page 2022
[31]

Towards q-learning the whittle index for restless bandits

Jing Fu, Yoni Nazarathy, Sarat Moka, and Peter G Taylor. Towards q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pages 249–254. IEEE, 2019

work page 2019
[32]

Neurwin: Neural whittle index network for restless bandits via deep rl

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, Srinivas Shakkottai, et al. Neurwin: Neural whittle index network for restless bandits via deep rl. Advances in Neural Information Processing Systems, 34:828–839, 2021

work page 2021
[33]

Deeptop: Deep threshold-optimal policy for mdps and rmabs

Khaled Nakhleh, I Hou, et al. Deeptop: Deep threshold-optimal policy for mdps and rmabs. Advances in Neural Information Processing Systems, 35:28734–28746, 2022

work page 2022
[34]

An online learning approach to optimizing time-varying costs of aoi

Vishrant Tripathi and Eytan Modiano. An online learning approach to optimizing time-varying costs of aoi. In Proceedings of the Twenty-second International Symposium on Theory, Algo- rithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 241–250, 2021. 12

work page 2021
[35]

Lecture notes on information theory

Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. Lecture Notes for MIT (6.441), UIUC (ECE 563), Yale (STAT 664), (2012-2017), 2014

work page 2012
[36]

Non-stationary stochastic optimization

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Opera- tions research, 63(5):1227–1244, 2015

work page 2015
[37]

Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems, 9(4):319–337, 2019

work page 2019
[38]

Q-learning lagrange policies for multi-action restless bandits

Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe. Q-learning lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 871–881, 2021

work page 2021
[39]

A review of multistate modelling approaches in monitoring disease progression: Bayesian estimation using the kolmogorov-chapman forward equations

Zvifadzo Matsena Zingoni, Tobias F Chirwa, Jim Todd, and Eustasius Musenge. A review of multistate modelling approaches in monitoring disease progression: Bayesian estimation using the kolmogorov-chapman forward equations. Statistical methods in medical research, 30(5): 1373–1392, 2021

work page 2021
[40]

Monitored markov decision processes

Simone Parisi, Montaser Mohammedalamen, Alireza Kazemipour, Matthew E Taylor, and Michael Bowling. Monitored markov decision processes. arXiv preprint arXiv:2402.06819, 2024

work page arXiv 2024
[41]

Sampling for data freshness optimization: Non-linear age functions

Yin Sun and Benjamin Cyr. Sampling for data freshness optimization: Non-linear age functions. J. Commun. Netw., 21(3):204–219, 2019

work page 2019
[42]

Inequalities for the l1 deviation of the empirical distribution

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, page 125, 2003. 13 A Proof of Lemma 1 We can get Lemma 1 by combining Lemma 2 and Lemma 3, which are provided below: Lemma 2. Given η1 > 0 and t ≥ 1, for (s, a) ∈ Z (n)...

work page 2003

[1] [1]

Restless bandits: Activity allocation in a changing world

Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988

work page 1988

[2] [2]

Opportunistic scheduling as restless bandits

Vivek S Borkar, Gaurav S Kasbekar, Sarath Pattathil, and Priyesh Y Shetty. Opportunistic scheduling as restless bandits. IEEE Transactions on Control of Network Systems, 5(4):1952– 1961, 2017

work page 1952

[3] [3]

Schedul- ing policies for minimizing age of information in broadcast wireless networks

Igor Kadota, Abhishek Sinha, Elif Uysal-Biyikoglu, Rahul Singh, and Eytan Modiano. Schedul- ing policies for minimizing age of information in broadcast wireless networks. IEEE/ACM Transactions on Networking, 26(6):2637–2650, 2018

work page 2018

[4] [4]

A whittle index approach to minimizing functions of age of information

Vishrant Tripathi and Eytan Modiano. A whittle index approach to minimizing functions of age of information. IEEE/ACM Transactions on Networking, 2024

work page 2024

[5] [5]

Timely communications for remote inference

Md Kamran Chowdhury Shisher, Yin Sun, and I-Hong Hou. Timely communications for remote inference. IEEE/ACM Transactions on Networking, 32(5):3824–3839, 2024

work page 2024

[6] [6]

Scheduling algorithms for optimizing age of information in wireless networks with throughput constraints

Igor Kadota, Abhishek Sinha, and Eytan Modiano. Scheduling algorithms for optimizing age of information in wireless networks with throughput constraints. IEEE/ACM Transactions on Networking, 27(4):1359–1372, 2019

work page 2019

[7] [7]

Indexability and whittle index for restless bandit problems involving reset processes

Keqin Liu, Richard Weber, and Qing Zhao. Indexability and whittle index for restless bandit problems involving reset processes. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 7690–7696. IEEE, 2011

work page 2011

[8] [8]

Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach

Diego Ruiz-Hernández, Jesús M Pinar-Pérez, and David Delgado-Gómez. Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach. Computers & Operations Research, 119:104927, 2020

work page 2020

[9] [9]

Scalable operator allocation for multirobot assistance: A restless bandit approach

Abhinav Dahiya, Nima Akbarzadeh, Aditya Mahajan, and Stephen L Smith. Scalable operator allocation for multirobot assistance: A restless bandit approach. IEEE Transactions on Control of Network Systems, 9(3):1397–1408, 2022

work page 2022

[10] [10]

Continuous-time restless bandit and dynamic scheduling for make-to-stock production

Fabrice Dusonchet and M-O Hongler. Continuous-time restless bandit and dynamic scheduling for make-to-stock production. IEEE Transactions on Robotics and Automation, 19(6):977–990, 2003

work page 2003

[11] [11]

A hidden markov restless multi-armed bandit model for playout recommendation systems

Rahul Meshram, Aditya Gopalan, and D Manjunath. A hidden markov restless multi-armed bandit model for playout recommendation systems. In Communication Systems and Networks: 9th International Conference, COMSNETS 2017, Bengaluru, India, January 4–8, 2017, Revised Selected Papers and Invited Papers 9, pages 335–362. Springer, 2017

work page 2017

[12] [12]

On the whittle index for restless multiarmed hidden markov bandits

Rahul Meshram, D Manjunath, and Aditya Gopalan. On the whittle index for restless multiarmed hidden markov bandits. IEEE Transactions on Automatic Control, 63(9):3046–3053, 2018

work page 2018

[13] [13]

Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015

work page 2015

[14] [14]

Restless bandits visiting villages: A preliminary study on distributing public health services

Biswarup Bhattacharya. Restless bandits visiting villages: A preliminary study on distributing public health services. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies, pages 1–8, 2018

work page 2018

[15] [15]

Optimal screening for hepatocellular carcinoma: A restless bandit model

Elliot Lee, Mariel S Lavieri, and Michael V olk. Optimal screening for hepatocellular carcinoma: A restless bandit model. Manufacturing & Service Operations Management, 21(1):198–212, 2019

work page 2019

[16] [16]

Collapsing bandits and their application to public health intervention

Aditya Mate, Jackson Killian, Haifeng Xu, Andrew Perrault, and Milind Tambe. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639–15650, 2020. 11

work page 2020

[17] [17]

A decision-language model (dlm) for dynamic restless multi-armed bandit tasks in public health

Nikhil Behari, Edwin Zhang, Yunfan Zhao, Aparna Taneja, Dheeraj Nagaraj, and Milind Tambe. A decision-language model (dlm) for dynamic restless multi-armed bandit tasks in public health. arXiv preprint arXiv:2402.14807, 2024

work page arXiv 2024

[18] [18]

The complexity of optimal queueing network control

Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queueing network control. In Proceedings of IEEE 9th Annual Conference on Structure in Complexity Theory, pages 318–322, 1994

work page 1994

[19] [19]

On an index policy for restless bandits

Richard R Weber and Gideon Weiss. On an index policy for restless bandits. Journal of applied probability, 27(3):637–648, 1990

work page 1990

[20] [20]

Asymptotically optimal priority policies for indexable and nonindexable restless bandits

Ina Maria Verloop. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. The Annals of Applied Probability, 26(4):1947–1995, 2016

work page 1947

[21] [21]

When are kalman-filter restless bandits indexable? In C

Christopher R Dance and Tomi Silander. When are kalman-filter restless bandits indexable? In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

work page 2015

[22] [22]

Multi-uav dynamic routing with partial observations using restless bandit allocation indices

Jerome Le Ny, Munther Dahleh, and Eric Feron. Multi-uav dynamic routing with partial observations using restless bandit allocation indices. In 2008 American Control Conference, pages 4220–4225. IEEE, 2008

work page 2008

[23] [23]

Optimistic whittle index policy: Online learning for restless bandits

Kai Wang, Lily Xu, Aparna Taneja, and Milind Tambe. Optimistic whittle index policy: Online learning for restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10131–10139, 2023

work page 2023

[24] [24]

A reinforcement learning algorithm for restless bandits

Vivek S Borkar and Karan Chadha. A reinforcement learning algorithm for restless bandits. In 2018 Indian Control Conference (ICC), pages 89–94. IEEE, 2018

work page 2018

[25] [25]

Real-time status: How often should one update? In IEEE INFOCOM, pages 2731–2735, 2012

Sanjit Kaul, Roy Yates, and Marco Gruteser. Real-time status: How often should one update? In IEEE INFOCOM, pages 2731–2735, 2012

work page 2012

[26] [26]

Emre Koksal, and Ness B

Yin Sun, Elif Uysal-Biyikoglu, Roy Yates, C. Emre Koksal, and Ness B. Shroff. Update or wait: How to keep your data fresh. In IEEE INFOCOM, pages 1–9, 2016. doi: 10.1109/INFOCOM. 2016.7524524

work page doi:10.1109/infocom 2016

[27] [27]

Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare

Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021

work page arXiv 2021

[28] [28]

Restless bandits with controlled restarts: Indexability and computation of whittle index

Nima Akbarzadeh and Aditya Mahajan. Restless bandits with controlled restarts: Indexability and computation of whittle index. In 2019 IEEE 58th conference on decision and control (CDC), pages 7294–7300. IEEE, 2019

work page 2019

[29] [29]

A Whittle index policy for the remote estimation of multiple continuous Gauss-Markov processes over parallel channels

Tasmeen Zaman Ornee and Yin Sun. A Whittle index policy for the remote estimation of multiple continuous Gauss-Markov processes over parallel channels. ACM MobiHoc, 2023

work page 2023

[30] [30]

Whittle index based q-learning for restless bandits with average reward

Konstantin E Avrachenkov and Vivek S Borkar. Whittle index based q-learning for restless bandits with average reward. Automatica, 139:110186, 2022

work page 2022

[31] [31]

Towards q-learning the whittle index for restless bandits

Jing Fu, Yoni Nazarathy, Sarat Moka, and Peter G Taylor. Towards q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pages 249–254. IEEE, 2019

work page 2019

[32] [32]

Neurwin: Neural whittle index network for restless bandits via deep rl

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, Srinivas Shakkottai, et al. Neurwin: Neural whittle index network for restless bandits via deep rl. Advances in Neural Information Processing Systems, 34:828–839, 2021

work page 2021

[33] [33]

Deeptop: Deep threshold-optimal policy for mdps and rmabs

Khaled Nakhleh, I Hou, et al. Deeptop: Deep threshold-optimal policy for mdps and rmabs. Advances in Neural Information Processing Systems, 35:28734–28746, 2022

work page 2022

[34] [34]

An online learning approach to optimizing time-varying costs of aoi

Vishrant Tripathi and Eytan Modiano. An online learning approach to optimizing time-varying costs of aoi. In Proceedings of the Twenty-second International Symposium on Theory, Algo- rithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 241–250, 2021. 12

work page 2021

[35] [35]

Lecture notes on information theory

Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. Lecture Notes for MIT (6.441), UIUC (ECE 563), Yale (STAT 664), (2012-2017), 2014

work page 2012

[36] [36]

Non-stationary stochastic optimization

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Opera- tions research, 63(5):1227–1244, 2015

work page 2015

[37] [37]

Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems, 9(4):319–337, 2019

work page 2019

[38] [38]

Q-learning lagrange policies for multi-action restless bandits

Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe. Q-learning lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 871–881, 2021

work page 2021

[39] [39]

A review of multistate modelling approaches in monitoring disease progression: Bayesian estimation using the kolmogorov-chapman forward equations

Zvifadzo Matsena Zingoni, Tobias F Chirwa, Jim Todd, and Eustasius Musenge. A review of multistate modelling approaches in monitoring disease progression: Bayesian estimation using the kolmogorov-chapman forward equations. Statistical methods in medical research, 30(5): 1373–1392, 2021

work page 2021

[40] [40]

Monitored markov decision processes

Simone Parisi, Montaser Mohammedalamen, Alireza Kazemipour, Matthew E Taylor, and Michael Bowling. Monitored markov decision processes. arXiv preprint arXiv:2402.06819, 2024

work page arXiv 2024

[41] [41]

Sampling for data freshness optimization: Non-linear age functions

Yin Sun and Benjamin Cyr. Sampling for data freshness optimization: Non-linear age functions. J. Commun. Netw., 21(3):204–219, 2019

work page 2019

[42] [42]

Inequalities for the l1 deviation of the empirical distribution

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, page 125, 2003. 13 A Proof of Lemma 1 We can get Lemma 1 by combining Lemma 2 and Lemma 3, which are provided below: Lemma 2. Given η1 > 0 and t ≥ 1, for (s, a) ∈ Z (n)...

work page 2003