Recognition: unknown
Constrained Contextual Bandits with Adversarial Contexts
Pith reviewed 2026-05-08 13:33 UTC · model grok-4.3
The pith
A modular reduction using online regression oracles converts budget-constrained contextual bandits with adversarial contexts into standard unconstrained problems via adaptively defined surrogate rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building an online regression oracle for the reward and cost functions and feeding it into a reduction that produces adaptive surrogate rewards, the constrained problem becomes equivalent to an unconstrained contextual bandit problem; solving the latter with any standard algorithm then yields an efficient solution for the original constrained task that achieves sublinear regret and sublinear budget violation under adversarial contexts.
What carries the argument
The reduction that maps the constrained problem to an unconstrained contextual bandit by replacing true rewards with adaptively defined surrogate reward functions constructed from online regression oracles.
If this is right
- Sublinear regret and sublinear budget violation hold simultaneously even when contexts are chosen adversarially rather than stochastically.
- The same algorithm continues to produce valid decisions after the cumulative cost budget is exhausted.
- The analysis remains compact because it inherits standard regret bounds from any black-box algorithm for unconstrained contextual bandits.
- The framework applies whenever online regression oracles exist for the reward and cost classes.
Where Pith is reading between the lines
- The same surrogate-reward construction may extend directly to other forms of knapsack-style constraints beyond a single budget.
- Because the reduction is modular, any future improvement in unconstrained contextual bandit algorithms immediately improves the constrained adversarial case without new analysis.
Load-bearing premise
Conditioned on the observed context, the expected reward and expected cost belong to known function classes and are realized independently.
What would settle it
Run the algorithm on a sequence of adversarially chosen contexts where the realized cumulative cost exceeds the budget by more than the claimed sublinear amount while regret remains sublinear; if no such sequence exists that violates the bounds, the claim holds.
Figures
read the original abstract
We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In this setting, the objective is to simultaneously control regret and the violation of the budget constraint. Building on the seminal $\mathsf{SquareCB}$ framework of Foster et al. [2018], we propose a simple and modular framework that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to prior works, which focus on stochastic contexts, our reduction yields improved guarantees for more general adversarial contexts, together with an efficient algorithm with a compact and transparent analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. Under the standard realizability assumption (rewards and costs drawn independently from fixed distributions with expectations in known function classes), it focuses on the continuing setting in which the algorithm runs over the full horizon even after the cumulative cost budget is exhausted. The central contribution is a modular reduction, built on the SquareCB framework of Foster et al. (2018), that uses online regression oracles to convert the constrained problem into an unconstrained contextual bandit problem via adaptively defined surrogate reward functions; this yields simultaneous control of regret and budget-violation while claiming improved guarantees relative to prior stochastic-context work, together with an efficient algorithm and compact analysis.
Significance. If the reduction and its analysis are correct, the work is significant because it extends a powerful, oracle-based method (SquareCB) to handle hard budget constraints under adversarial contexts, a setting that is more general and realistic than the stochastic-context focus of most prior constrained-bandit literature. The modularity of the surrogate construction, the transparent handling of the continuing setting via penalty terms inside the surrogate, and the emphasis on compact analysis are genuine strengths that could facilitate adoption and extensions. The explicit use of online regression oracles also aligns with the growing literature on oracle-efficient bandit algorithms.
minor comments (3)
- Abstract: the phrase 'improved guarantees for more general adversarial contexts' is stated without a brief quantitative comparison (e.g., a sentence contrasting the new regret/violation bounds with the best prior stochastic-context bounds). Adding one sentence here or in the introduction would make the contribution easier to assess at a glance.
- The realizability assumption is correctly identified as minimal for the regression oracles, but the manuscript should explicitly state the precise function-class assumptions (e.g., whether the classes are the same for rewards and costs, any boundedness or Lipschitz conditions) in the notation or preliminaries section to support reproducibility of the oracle calls.
- The surrogate-reward construction is described as 'adaptive' yet still amenable to online regression; a short remark clarifying that the adaptation does not invalidate the oracle's regret guarantee (e.g., by referencing the specific property of SquareCB that tolerates this) would strengthen the transparency claim.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work on budget-constrained contextual bandits with adversarial contexts and for highlighting the modularity of the SquareCB-based reduction via surrogate rewards. We appreciate the recognition of the improved guarantees for adversarial contexts and the compact analysis. As no specific major comments were raised, we will incorporate any minor editorial suggestions in the revised version.
Circularity Check
No significant circularity; reduction builds on external SquareCB framework
full rationale
The paper's derivation chain consists of a modular reduction from the budget-constrained problem to an unconstrained contextual bandit via adaptively defined surrogate reward functions, solved using online regression oracles. This explicitly invokes the external SquareCB framework of Foster et al. (2018) as the base, with the realizability assumption stated as the standard minimal condition for those oracles. No load-bearing step reduces by construction to the paper's own inputs, fitted parameters, or self-citations; the surrogate construction and continuing-setting penalty are presented as independent extensions that preserve the base guarantees for adversarial contexts. The analysis is self-contained against the cited external benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Realizability assumption: conditioned on context, reward and cost expectations lie in known function classes.
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , volume=
A unifying framework for online optimization with long-term constraints , author=. Advances in Neural Information Processing Systems , volume=
-
[2]
arXiv preprint arXiv:2312.16730 , year=
Foundations of reinforcement learning and interactive decision making , author=. arXiv preprint arXiv:2312.16730 , year=
-
[3]
Advances in neural information processing systems , volume=
The epoch-greedy algorithm for contextual multi-armed bandits , author=. Advances in neural information processing systems , volume=. 2007 , publisher=
2007
-
[4]
Learning Safety Constraints for Large Language Models , author=
-
[5]
Forty-second International Conference on Machine Learning , year=
Learning Safety Constraints for Large Language Models , author=. Forty-second International Conference on Machine Learning , year=
-
[6]
Certifying LLM Safety against Adversarial Prompting , author=
-
[7]
IEEE Access , year=
Constraints driven safe reinforcement learning for autonomous driving decision-making , author=. IEEE Access , year=
-
[8]
Optical compressive imaging , pages=
Phase retrieval: An overview of recent developments , author=. Optical compressive imaging , pages=. 2016 , publisher=
2016
-
[9]
Mathematics of Operations Research , volume=
The online saddle point problem and online convex optimization with knapsacks , author=. Mathematics of Operations Research , volume=. 2025 , publisher=
2025
-
[10]
Foundations and Trends
Learning with submodular functions: A convex optimization perspective , author=. Foundations and Trends. 2013 , publisher=
2013
-
[11]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Online dr-submodular maximization: Minimizing regret and constraint violation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[12]
2003 , publisher=
Convex analysis and optimization , author=. 2003 , publisher=
2003
-
[13]
Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms , pages=
Fast algorithms for online stochastic convex programming , author=. Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms , pages=. 2014 , organization=
2014
-
[14]
Foundations and Trends
Online matching and ad allocation , author=. Foundations and Trends. 2013 , publisher=
2013
-
[15]
Operations research , volume=
Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms , author=. Operations research , volume=. 2009 , publisher=
2009
-
[16]
International Conference on Machine Learning , pages=
Online learning with knapsacks: the best of both worlds , author=. International Conference on Machine Learning , pages=. 2022 , organization=
2022
-
[17]
arXiv preprint arXiv:1306.0155 , year=
Dynamic ad allocation: Bandits with budgets , author=. arXiv preprint arXiv:1306.0155 , year=
-
[18]
2002 , publisher=
Computers and intractability , author=. 2002 , publisher=
2002
-
[19]
International Conference on Artificial Intelligence and Statistics , pages=
Online continuous submodular maximization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=
2018
-
[20]
International Conference on Machine Learning , pages=
Stochastic continuous submodular maximization: Boosting via non-oblivious function , author=. International Conference on Machine Learning , pages=. 2022 , organization=
2022
-
[21]
Dynamic Regret Bounds for Constrained Online Nonconvex Optimization Based on Polyak–Lojasiewicz Regions , year=
Mulvaney-Kemp, Julie and Park, SangWoo and Jin, Ming and Lavaei, Javad , journal=. Dynamic Regret Bounds for Constrained Online Nonconvex Optimization Based on Polyak–Lojasiewicz Regions , year=
-
[22]
International Conference on Artificial Intelligence and Statistics , pages=
Online learning with non-convex losses and non-stationary regret , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=
2018
-
[23]
Conference on Learning Theory , pages=
Learning in non-convex games with an optimization oracle , author=. Conference on Learning Theory , pages=. 2019 , organization=
2019
-
[24]
IEEE Trans
Phase retrieval via reweighted amplitude flow , author=. IEEE Trans. Signal Process. , volume=. 2018 , publisher=
2018
-
[25]
On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , year=
Vaze, Rahul , booktitle=. On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , year=
-
[26]
Advances in Neural Information Processing Systems , volume=
Adaptive hedge , author=. Advances in Neural Information Processing Systems , volume=
-
[27]
A low complexity algorithm with
Yu, Hao and Neely, Michael J , journal=. A low complexity algorithm with
-
[28]
arXiv preprint arXiv:2501.16046 , year=
Revisiting Projection-Free Online Learning with Time-Varying Constraints , author=. arXiv preprint arXiv:2501.16046 , year=
-
[29]
2023 , eprint=
Playing in the Dark: No-regret Learning with Adversarial Constraints , author=. 2023 , eprint=
2023
-
[30]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Optimal Algorithms for Online Convex Optimization with Adversarial Constraints , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[31]
IEEE Transactions on automatic control , volume=
Online convex optimization with time-varying constraints and bandit feedback , author=. IEEE Transactions on automatic control , volume=. 2018 , publisher=
2018
-
[32]
Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=
Minimizing queue length regret under adversarial network models , author=. Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=. 2018 , publisher=
2018
-
[33]
IEEE Internet of Things Journal , volume=
Bandit convex optimization for scalable and dynamic IoT management , author=. IEEE Internet of Things Journal , volume=. 2018 , publisher=
2018
-
[34]
Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=
Nested convex bodies are chaseable , author=. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2018 , organization=
2018
-
[35]
Proceedings of the Symposium on the Mathematical Theory of Automata , volume=
On convergence proofs on perceptrons , author=. Proceedings of the Symposium on the Mathematical Theory of Automata , volume=. 1962 , organization=
1962
-
[36]
Journal of optimization theory and applications , volume=
Subgradient methods for saddle-point problems , author=. Journal of optimization theory and applications , volume=. 2009 , publisher=
2009
-
[37]
IEEE Transactions on Automatic Control , year=
Regret and cumulative constraint violation analysis for distributed online constrained convex optimization , author=. IEEE Transactions on Automatic Control , year=
-
[38]
Nature , volume=
Dense reinforcement learning for safety validation of autonomous vehicles , author=. Nature , volume=. 2023 , publisher=
2023
-
[39]
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=
A nearly-linear bound for chasing nested convex bodies , author=. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2019 , organization=
2019
-
[40]
Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=
Chasing nested convex bodies nearly optimally , author=. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2020 , organization=
2020
-
[41]
IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pages=
Learning to Cache With No Regrets , author=. IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pages=. 2019 , organization=
2019
-
[42]
2012 , publisher=
The Borel-Cantelli Lemma , author=. 2012 , publisher=
2012
-
[43]
Kingma and Jimmy Ba , editor =
Diederik P. Kingma and Jimmy Ba , editor =. Adam:. 3rd International Conference on Learning Representations,. 2015 , url =
2015
-
[44]
2008 , publisher=
Control techniques for complex networks , author=. 2008 , publisher=
2008
-
[45]
, author=
Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=
-
[46]
International Conference on Machine Learning , pages=
Regret and cumulative constraint violation analysis for online convex optimization with long term constraints , author=. International Conference on Machine Learning , pages=. 2021 , organization=
2021
-
[47]
An Overview of Multi-Task Learning in Deep Neural Networks
An overview of multi-task learning in deep neural networks , author=. arXiv preprint arXiv:1706.05098 , year=
work page internal anchor Pith review arXiv
-
[48]
2015 , publisher=
Random processes for engineers , author=. 2015 , publisher=
2015
-
[49]
Advances in Neural Information Processing Systems , volume=
Online convex optimization with hard constraints: Towards the best of two worlds and beyond , author=. Advances in Neural Information Processing Systems , volume=
-
[50]
Advances in Neural Information Processing Systems , volume=
Adaptive smoothed online multi-task learning , author=. Advances in Neural Information Processing Systems , volume=
-
[51]
International Conference on Computational Learning Theory , pages=
Online multitask learning , author=. International Conference on Computational Learning Theory , pages=. 2006 , organization=
2006
-
[52]
Rockafellar, R. Tyrrel. Convex Analysis
-
[53]
Electronic colloquium on computational complexity (ECCC) , volume=
Adaptive algorithms for online decision problems , author=. Electronic colloquium on computational complexity (ECCC) , volume=
-
[54]
Artificial Intelligence and Statistics , pages=
Improved strongly adaptive online learning using coin betting , author=. Artificial Intelligence and Statistics , pages=. 2017 , organization=
2017
-
[55]
Theoretical Computer Science , volume=
Scale-free online learning , author=. Theoretical Computer Science , volume=. 2018 , publisher=
2018
-
[56]
International Conference on Machine Learning , pages=
A simple yet universal strategy for online convex optimization , author=. International Conference on Machine Learning , pages=. 2022 , organization=
2022
-
[57]
Machine learning , volume=
Regret bounded by gradual variation for online convex optimization , author=. Machine learning , volume=. 2014 , publisher=
2014
-
[58]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[59]
International Conference on Machine Learning , pages=
Adaptive regret of convex and smooth functions , author=. International Conference on Machine Learning , pages=. 2019 , organization=
2019
-
[60]
Proceedings of the 36th International Conference on Machine Learning , pages =
Cautious Regret Minimization: Online Optimization with Long-Term Budget Constraints , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =
2019
-
[61]
2003 , publisher=
Applied probability and queues , author=. 2003 , publisher=
2003
-
[62]
The Journal of Machine Learning Research , volume=
Trading regret for efficiency: online convex optimization with long term constraints , author=. The Journal of Machine Learning Research , volume=. 2012 , publisher=
2012
-
[63]
Proceedings of the 34th International Conference on Machine Learning , pages =
Safety-Aware Algorithms for Adversarial Contextual Bandit , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =
2017
-
[64]
ACM SIGMETRICS Performance Evaluation Review , volume=
Simultaneously achieving sublinear regret and constraint violations for online convex optimization with time-varying constraints , author=. ACM SIGMETRICS Performance Evaluation Review , volume=. 2022 , publisher=
2022
-
[65]
arXiv preprint arXiv:2306.00149 , year=
Distributed Online Convex Optimization with Adversarial Constraints: Reduced Cumulative Constraint Violation Bounds under Slater's Condition , author=. arXiv preprint arXiv:2306.00149 , year=
-
[66]
A Low Complexity Algorithm with O (
Yu, Hao and Neely, Michael J , journal=. A Low Complexity Algorithm with O (
-
[67]
29th IEEE Conference on Decision and Control , pages=
Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks , author=. 29th IEEE Conference on Decision and Control , pages=. 1990 , organization=
1990
-
[68]
IEEE Transactions on Communications , volume=
Achieving 100\ author=. IEEE Transactions on Communications , volume=. 1999 , publisher=
1999
-
[69]
1992 , publisher=
Integral inequalities and applications , author=. 1992 , publisher=
1992
-
[70]
arXiv preprint arXiv:1702.04783 , year=
Online convex optimization with time-varying constraints , author=. arXiv preprint arXiv:1702.04783 , year=
-
[71]
, author=
Online Learning with Sample Path Constraints. , author=. Journal of Machine Learning Research , volume=
-
[72]
Advances in Neural Information Processing Systems , volume=
Online convex optimization with stochastic constraints , author=. Advances in Neural Information Processing Systems , volume=
-
[73]
International Conference on Machine Learning , pages=
Adaptive algorithms for online convex optimization with long-term constraints , author=. International Conference on Machine Learning , pages=. 2016 , organization=
2016
-
[74]
Advances in Neural Information Processing Systems , volume=
Online convex optimization for cumulative constraints , author=. Advances in Neural Information Processing Systems , volume=
-
[75]
2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=
On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , author=. 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=. 2022 , organization=
2022
-
[76]
Advances in neural information processing systems , volume=
Adaptive online gradient descent , author=. Advances in neural information processing systems , volume=
-
[77]
Advances in Neural Information Processing Systems , volume=
Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits , author=. Advances in Neural Information Processing Systems , volume=
-
[78]
Mathematical Proceedings of the Cambridge Philosophical Society , volume=
The theory of queues with a single server , author=. Mathematical Proceedings of the Cambridge Philosophical Society , volume=. 1952 , organization=
1952
-
[79]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Achieving Counterfactual Fairness for Causal Bandit , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[80]
Advances in Neural Information Processing Systems , volume=
Fair algorithms for multi-agent multi-armed bandits , author=. Advances in Neural Information Processing Systems , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.