Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays
Pith reviewed 2026-05-25 05:25 UTC · model grok-4.3
The pith
Prudent-Banker achieves minimax-optimal regret in adversarial bandits with delays while keeping constant regret to a safe baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prudent-Banker is the first algorithm to achieve the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against the safe comparator, both with and without delays. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. The paper also proves matching lower bounds showing these guarantees are unimprovable up to logarithmic factors under the baseline-safety requirement.
What carries the argument
The delay-calibrated restart threshold inside the phased-aggression mechanism, which detects when the safe comparator is suboptimal while compensating for the worst-case effect of unobserved delayed feedback.
If this is right
- The same regret guarantees apply whether feedback is delayed or immediate.
- The bounds are tight up to logarithmic factors under the baseline-safety requirement.
- The algorithm balances safety and learning across diverse delay distributions where standard delay-robust baselines do not.
Where Pith is reading between the lines
- The same restart-calibration idea could be tested in other partial-feedback settings where timing of information affects safety constraints.
- If the threshold construction generalizes, safety constraints may be incorporable into many existing online algorithms without inflating the leading regret term.
Load-bearing premise
The delay-calibrated restart threshold accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality.
What would settle it
A sequence of delayed adversarial bandit instances in which Prudent-Banker incurs more than constant regret against the safe comparator would show the main guarantee does not hold.
Figures
read the original abstract
We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Prudent-Banker, an algorithm for adversarial multi-armed bandits (with and without delays) that combines a delay-adapted Online Mirror Descent with a modified phased-aggression mechanism. Its central claim is that a delay-calibrated restart threshold enables the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against a designated safe baseline comparator. The paper also derives new lower bounds showing these rates are unimprovable (up to log factors) under the baseline-safety constraint, and reports experiments across delay distributions.
Significance. If the central claims hold, the work would be significant for online learning: it is presented as the first algorithm to achieve the stated optimal trade-off simultaneously with and without delays, backed by matching lower bounds. The explicit handling of arbitrary delays while preserving the Õ(1) safety guarantee against the baseline addresses a practical gap in existing delay-robust methods. The experiments provide initial evidence of practical balance between safety and learning.
major comments (2)
- [Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.
- [Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.
minor comments (2)
- [Experiments] The experimental section should specify the exact delay distributions, number of runs, and baseline implementations to support reproducibility claims.
- [Preliminaries] Notation for the restart threshold and the safe comparator should be defined consistently in the main text before first use.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address the two major comments point by point below, clarifying the locations of the relevant proofs and constructions while indicating revisions to improve accessibility.
read point-by-point responses
-
Referee: [Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.
Authors: The derivation of the delay-calibrated restart threshold, including its bounding of worst-case distortion from unobserved feedback and preservation of the Õ(1) safety regret, appears in Section 4 (Theorems 4.1–4.2 and Lemmas 4.3–4.5). The complete proof is in Appendix B. To make this more immediately verifiable, we will insert an explicit pointer to these results in the abstract and the opening of Section 4 in the revised version. revision: yes
-
Referee: [Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.
Authors: Section 5 presents the lower-bound construction for the safety-constrained setting (two-arm adversarial instance with arbitrary delays). The safe comparator is defined as the best fixed arm under the baseline-safety constraint; the proof shows Ω(√T + √D) pseudo-regret and Ω(1) safety regret. Constants are compared to the upper bound immediately after Theorem 5.2, confirming matching up to logs. We will expand the construction details, restate the comparator assumptions explicitly, and add a side-by-side constant comparison table in the revision. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper derives regret bounds for Prudent-Banker via a delay-adapted Online Mirror Descent combined with a phased-aggression mechanism whose restart threshold is constructed from a worst-case analysis of unobserved feedback distortion. The claimed pseudo-regret Õ(√T + √D) and Õ(1) safety regret are obtained by standard potential-function arguments and a separate lower-bound construction; neither quantity is obtained by fitting parameters to the target rates nor by renaming an input quantity. No self-citations are invoked as load-bearing uniqueness theorems, and the derivation relies on externally verifiable martingale and convex-optimization tools rather than internal redefinitions. The analysis is therefore self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Competing in the dark: An efficient algorithm for bandit linear optimization
Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. InConference on Learning Theory, number 110, 2009
work page 2009
-
[2]
Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011
work page 2011
-
[3]
Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010
work page 2010
-
[4]
Gambling in a rigged casino: The adversarial multi-armed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995
work page 1995
-
[5]
The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002
Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002
work page 2002
-
[6]
Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018
Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018
work page 2018
-
[7]
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003
work page 2003
-
[8]
What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[9]
Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019
work page 2019
-
[10]
Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022
work page 2022
-
[11]
Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007
Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007
work page 2007
-
[12]
Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019
work page 2019
-
[13]
Online learning for qoe-based video streaming to mobile receivers
Nesrine Changuel, Bessem Sayadi, and Michel Kieffer. Online learning for qoe-based video streaming to mobile receivers. In2012 IEEE Globecom Workshops, pages 1319–1324. IEEE, 2012. 10
work page 2012
-
[14]
Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, and Ding Zhao. Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020
-
[15]
Black-box reductions for parameter-free online learning in banach spaces
Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018
work page 2018
-
[16]
Yan Dai, Haipeng Luo, and Liyu Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022
work page 2022
-
[17]
Safely using predictions in general-sum normal form games
Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. InProceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017
work page 2017
-
[18]
Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J
Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J. Mach. Learn. Res., 15(1):3873–3923, 2014
work page 2014
-
[19]
Efficient Optimal Learning for Contextual Bandits
Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369, 2011
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[20]
Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
-
[21]
Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008
work page 2008
-
[22]
Online learning with optimism and delay
Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363–3373. PMLR, 2021
work page 2021
-
[23]
Online convex optimization in the bandit setting: gradient descent without a gradient
Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex opti- mization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[24]
A second-order bound with excess losses
Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014
work page 2014
-
[25]
Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015
Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015
work page 2015
-
[26]
Conserva- tive exploration in reinforcement learning
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conserva- tive exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020
work page 2020
-
[27]
Regret bounds for prediction problems
Geoffrey J Gordon. Regret bounds for prediction problems. InProceedings of the twelfth annual conference on Computational learning theory, pages 29–40, 1999
work page 1999
-
[28]
Adapting to delays and data in adversarial multi-armed bandits
Andras Gyorgy and Pooria Joulani. Adapting to delays and data in adversarial multi-armed bandits. InInternational Conference on Machine Learning, pages 3988–3997. PMLR, 2021
work page 2021
-
[29]
Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021
Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021
-
[30]
Robust wireless scheduling under arbitrary channel dynamics and feedback delay
Jiatai Huang and Longbo Huang. Robust wireless scheduling under arbitrary channel dynamics and feedback delay. In2021 33th International Teletraffic Congress (ITC-33), pages 1–8. IEEE, 2021
work page 2021
-
[31]
Banker online mirror descent: A universal approach for delayed online bandit learning
Jiatai Huang, Yan Dai, and Longbo Huang. Banker online mirror descent: A universal approach for delayed online bandit learning. InInternational Conference on Machine Learning, pages 13814–13844. PMLR, 2023
work page 2023
-
[32]
Adaptive online prediction by following the perturbed leader
Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005. 11
work page 2005
-
[33]
Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020
work page 2020
-
[34]
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022
work page 2022
-
[35]
Online learning under delayed feedback
Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013
work page 2013
-
[36]
Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011
Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011
work page 2011
-
[37]
Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014
work page 2014
-
[38]
The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013
Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013
work page 2013
-
[39]
Learning adversarial markov deci- sion processes with delayed feedback
Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial markov deci- sion processes with delayed feedback. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022
work page 2022
-
[40]
The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015
Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015
work page 2015
-
[41]
Cambridge University Press, 2020
Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020
work page 2020
-
[42]
A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010
work page 2010
-
[43]
Mingyang Liu, Chengjie Wu, Qihan Liu, Yansen Jing, Jun Yang, Pingzhong Tang, and Chongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022
work page 2022
-
[44]
Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021
work page 2021
-
[45]
Setting up a reinforcement learning task with a real-world robot
A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018
work page 2018
-
[46]
Arnab Maiti, Kevin Jamieson, and Lillian J Ratliff. Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023
-
[47]
Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022
work page 2022
-
[48]
On-line learning with delayed label feedback
Chris Mesterharm. On-line learning with delayed label feedback. InInternational Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005
work page 2005
-
[49]
Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025
Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and V olkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025
-
[50]
Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015
work page 2015
-
[51]
A Modern Introduction to Online Learning
Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019. 12
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[52]
Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016
work page 2016
-
[53]
Just-in-time adaptive anomaly detection and personalized health feedback
Parvaneh Parvin. Just-in-time adaptive anomaly detection and personalized health feedback. 2020
work page 2020
-
[54]
Some aspects of the sequential design of experiments
Herbert Robbins. Some aspects of the sequential design of experiments. 1952
work page 1952
-
[55]
Exploiting easy data in online optimization
Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014
work page 2014
-
[56]
Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025
work page 2025
-
[57]
Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[58]
Comparator-adaptive convex bandits
Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits. Advances in Neural Information Processing Systems, 33:19795–19804, 2020
work page 2020
-
[59]
Linear bandits with stochastic delayed feedback
Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. InInternational Confer- ence on Machine Learning, pages 9712–9721. PMLR, 2020
work page 2020
-
[60]
Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence
Manfred K Warmuth. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. InElectronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, 1998, 1998
work page 1998
-
[61]
James MS Wason and Lorenzo Trippa. A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014
work page 2014
-
[62]
More adaptive algorithms for adversarial bandits
Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. InConference On Learning Theory, pages 1263–1291. PMLR, 2018
work page 2018
-
[63]
Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016
work page 2016
-
[64]
Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[65]
Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. InInternational Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020. 13 Contents 1 Introduction 1 1.1 How Delay Breaks Safety: Three Roadblocks . . . . . . . . . . . . . . . . . . . . 2 1.2 Our Contributions . . . . . . . . ...
work page 2020
-
[66]
Eliminating the safety certification gap.We derive adelay-calibrated restart conditionthat accounts for the information deficit induced by delayed feedback. In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test. We show that under delays, this test must be relaxed by an explicit slack term...
-
[67]
Cost-free adaptation to unknown delays.A central challenge is adapting to an unknown total delay D without compromising safety. Standard “doubling trick” approaches [3, 4] typically treat each restart as a fresh instance. While such restarts incur only mild overhead for conventional regret, they can be catastrophic for safety: naively resetting the mechan...
-
[68]
Generalizing Banker-OMD to Safe Learning.Finally, we answer the challenge of extending safety guarantees beyond standard immediate-feedback environments. While Müller et al. [49] established the foundations of the (Safe) online linear minimization framework, the feasibility of such guarantees under partial monitoring remained a question. We bridge this ga...
-
[69]
If all feedback arrives by end of phase k(s) (i.e., r+d r ≤end k(s) for all r), then min{dr,end k(s) −r}=d r, andD k(s) endk(s) =P dr =D k(s) endk(s)
-
[70]
arrived” feedback (controlled by the phase-gap definition since restart didn’t trigger) and “missing
If some feedback arrives after endk(s) (i.e., r+d r >end k(s) for some r), then min{dr,end k(s) −r}< d r for thoser, and thusD k(s) endk(s) < D k(s) endk(s) . In all cases,D k(s) endk(s) ≤D (s) endk(s) . 21 C.2 Proof of Lemma C.2 (Bound on Estimated Total Delay) Intuition. This lemma establishes that the updated budget bDs+1 provides a tight upper bound o...
-
[71]
For everym∈[M−1], L2 m ≥ X t∈Bm+1 dt. Proof. We borrow the proof idea from Masoudian et al. [47]. For every m, since bm ∈B m, we have bm +d bm > b m+1 −1. Sinceb m,d bm, andb m+1 are integers, it follows that bm +d bm ≥b m+1, and therefore Lm =b m+1 −b m ≤d bm. On the other hand, by construction of the greedy buckets, there existst m ∈B m such that tm +d ...
work page 2065
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.