arxiv: 2605.06190 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Constrained Contextual Bandits with Adversarial Contexts

Abhishek Sinha, Dhruv Sarkar

Authors on Pith no claims yet

Pith reviewed 2026-05-08 13:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual banditsconstrained banditsadversarial contextsbudget constraintsonline regression oraclesregret boundssurrogate rewardscontinuing setting

0 comments

The pith

A modular reduction using online regression oracles converts budget-constrained contextual bandits with adversarial contexts into standard unconstrained problems via adaptively defined surrogate rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a constrained contextual bandit problem, where actions produce rewards but consume a limited budget of costs under adversarial contexts, reduces to an ordinary unconstrained contextual bandit by replacing the original rewards with specially constructed surrogate rewards at each step. This reduction preserves the ability to control both cumulative regret and the amount by which the budget constraint is violated, even when the algorithm must continue operating after the budget is exhausted. A sympathetic reader would care because most existing methods assume contexts are drawn from a fixed distribution, while many practical settings feature worst-case context sequences chosen by an adversary.

Core claim

By building an online regression oracle for the reward and cost functions and feeding it into a reduction that produces adaptive surrogate rewards, the constrained problem becomes equivalent to an unconstrained contextual bandit problem; solving the latter with any standard algorithm then yields an efficient solution for the original constrained task that achieves sublinear regret and sublinear budget violation under adversarial contexts.

What carries the argument

The reduction that maps the constrained problem to an unconstrained contextual bandit by replacing true rewards with adaptively defined surrogate reward functions constructed from online regression oracles.

If this is right

Sublinear regret and sublinear budget violation hold simultaneously even when contexts are chosen adversarially rather than stochastically.
The same algorithm continues to produce valid decisions after the cumulative cost budget is exhausted.
The analysis remains compact because it inherits standard regret bounds from any black-box algorithm for unconstrained contextual bandits.
The framework applies whenever online regression oracles exist for the reward and cost classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same surrogate-reward construction may extend directly to other forms of knapsack-style constraints beyond a single budget.
Because the reduction is modular, any future improvement in unconstrained contextual bandit algorithms immediately improves the constrained adversarial case without new analysis.

Load-bearing premise

Conditioned on the observed context, the expected reward and expected cost belong to known function classes and are realized independently.

What would settle it

Run the algorithm on a sequence of adversarially chosen contexts where the realized cumulative cost exceeds the budget by more than the claimed sublinear amount while regret remains sublinear; if no such sequence exists that violates the bounds, the claim holds.

Figures

Figures reproduced from arXiv: 2605.06190 by Abhishek Sinha, Dhruv Sarkar.

**Figure 1.** Figure 1: A schematic of the proposed algorithmic scheme for the constrained contextual bandit (CCB) problem. The numbers within the circles show the sequence of operations performed at any round t ⩾ 1. The variable Q(t) denotes the CCV after round t and IGW(.) denotes inverse gap weighting. straightforward proof. On the algorithmic side, while the exploration parameter γt of the IGW policy in Algorithm 2 depends ad… view at source ↗

read the original abstract

We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In this setting, the objective is to simultaneously control regret and the violation of the budget constraint. Building on the seminal $\mathsf{SquareCB}$ framework of Foster et al. [2018], we propose a simple and modular framework that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to prior works, which focus on stochastic contexts, our reduction yields improved guarantees for more general adversarial contexts, together with an efficient algorithm with a compact and transparent analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A modular surrogate reduction that extends constrained contextual bandits to adversarial contexts on top of SquareCB, with the main value in the continuing-horizon handling.

read the letter

The punchline is that this paper gives a clean way to handle budget constraints when contexts can be adversarial by reducing the problem to an ordinary unconstrained contextual bandit via adaptively defined surrogate rewards. It keeps the algorithm efficient and the analysis short by leaning on existing online regression oracles from the SquareCB line of work, and it explicitly targets the continuing setting where the learner does not stop after the budget is exhausted.

Referee Report

0 major / 3 minor

Summary. The paper studies budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. Under the standard realizability assumption (rewards and costs drawn independently from fixed distributions with expectations in known function classes), it focuses on the continuing setting in which the algorithm runs over the full horizon even after the cumulative cost budget is exhausted. The central contribution is a modular reduction, built on the SquareCB framework of Foster et al. (2018), that uses online regression oracles to convert the constrained problem into an unconstrained contextual bandit problem via adaptively defined surrogate reward functions; this yields simultaneous control of regret and budget-violation while claiming improved guarantees relative to prior stochastic-context work, together with an efficient algorithm and compact analysis.

Significance. If the reduction and its analysis are correct, the work is significant because it extends a powerful, oracle-based method (SquareCB) to handle hard budget constraints under adversarial contexts, a setting that is more general and realistic than the stochastic-context focus of most prior constrained-bandit literature. The modularity of the surrogate construction, the transparent handling of the continuing setting via penalty terms inside the surrogate, and the emphasis on compact analysis are genuine strengths that could facilitate adoption and extensions. The explicit use of online regression oracles also aligns with the growing literature on oracle-efficient bandit algorithms.

minor comments (3)

Abstract: the phrase 'improved guarantees for more general adversarial contexts' is stated without a brief quantitative comparison (e.g., a sentence contrasting the new regret/violation bounds with the best prior stochastic-context bounds). Adding one sentence here or in the introduction would make the contribution easier to assess at a glance.
The realizability assumption is correctly identified as minimal for the regression oracles, but the manuscript should explicitly state the precise function-class assumptions (e.g., whether the classes are the same for rewards and costs, any boundedness or Lipschitz conditions) in the notation or preliminaries section to support reproducibility of the oracle calls.
The surrogate-reward construction is described as 'adaptive' yet still amenable to online regression; a short remark clarifying that the adaptation does not invalidate the oracle's regret guarantee (e.g., by referencing the specific property of SquareCB that tolerates this) would strengthen the transparency claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on budget-constrained contextual bandits with adversarial contexts and for highlighting the modularity of the SquareCB-based reduction via surrogate rewards. We appreciate the recognition of the improved guarantees for adversarial contexts and the compact analysis. As no specific major comments were raised, we will incorporate any minor editorial suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity; reduction builds on external SquareCB framework

full rationale

The paper's derivation chain consists of a modular reduction from the budget-constrained problem to an unconstrained contextual bandit via adaptively defined surrogate reward functions, solved using online regression oracles. This explicitly invokes the external SquareCB framework of Foster et al. (2018) as the base, with the realizability assumption stated as the standard minimal condition for those oracles. No load-bearing step reduces by construction to the paper's own inputs, fitted parameters, or self-citations; the surrogate construction and continuing-setting penalty are presented as independent extensions that preserve the base guarantees for adversarial contexts. The analysis is self-contained against the cited external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the realizability assumption and the availability of online regression oracles for the function classes.

axioms (1)

domain assumption Realizability assumption: conditioned on context, reward and cost expectations lie in known function classes.
Explicitly stated in the abstract as the standard assumption enabling the regression oracles.

pith-pipeline@v0.9.0 · 5455 in / 1152 out tokens · 85549 ms · 2026-05-08T13:33:45.913870+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

300 extracted references · 33 canonical work pages · 2 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

A unifying framework for online optimization with long-term constraints , author=. Advances in Neural Information Processing Systems , volume=
[2]

arXiv preprint arXiv:2312.16730 , year=

Foundations of reinforcement learning and interactive decision making , author=. arXiv preprint arXiv:2312.16730 , year=

work page arXiv
[3]

Advances in neural information processing systems , volume=

The epoch-greedy algorithm for contextual multi-armed bandits , author=. Advances in neural information processing systems , volume=. 2007 , publisher=

2007
[4]

Learning Safety Constraints for Large Language Models , author=
[5]

Forty-second International Conference on Machine Learning , year=

Learning Safety Constraints for Large Language Models , author=. Forty-second International Conference on Machine Learning , year=
[6]

Certifying LLM Safety against Adversarial Prompting , author=
[7]

IEEE Access , year=

Constraints driven safe reinforcement learning for autonomous driving decision-making , author=. IEEE Access , year=
[8]

Optical compressive imaging , pages=

Phase retrieval: An overview of recent developments , author=. Optical compressive imaging , pages=. 2016 , publisher=

2016
[9]

Mathematics of Operations Research , volume=

The online saddle point problem and online convex optimization with knapsacks , author=. Mathematics of Operations Research , volume=. 2025 , publisher=

2025
[10]

Foundations and Trends

Learning with submodular functions: A convex optimization perspective , author=. Foundations and Trends. 2013 , publisher=

2013
[11]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Online dr-submodular maximization: Minimizing regret and constraint violation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[12]

2003 , publisher=

Convex analysis and optimization , author=. 2003 , publisher=

2003
[13]

Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms , pages=

Fast algorithms for online stochastic convex programming , author=. Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms , pages=. 2014 , organization=

2014
[14]

Foundations and Trends

Online matching and ad allocation , author=. Foundations and Trends. 2013 , publisher=

2013
[15]

Operations research , volume=

Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms , author=. Operations research , volume=. 2009 , publisher=

2009
[16]

International Conference on Machine Learning , pages=

Online learning with knapsacks: the best of both worlds , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[17]

arXiv preprint arXiv:1306.0155 , year=

Dynamic ad allocation: Bandits with budgets , author=. arXiv preprint arXiv:1306.0155 , year=

work page arXiv
[18]

2002 , publisher=

Computers and intractability , author=. 2002 , publisher=

2002
[19]

International Conference on Artificial Intelligence and Statistics , pages=

Online continuous submodular maximization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018
[20]

International Conference on Machine Learning , pages=

Stochastic continuous submodular maximization: Boosting via non-oblivious function , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[21]

Dynamic Regret Bounds for Constrained Online Nonconvex Optimization Based on Polyak–Lojasiewicz Regions , year=

Mulvaney-Kemp, Julie and Park, SangWoo and Jin, Ming and Lavaei, Javad , journal=. Dynamic Regret Bounds for Constrained Online Nonconvex Optimization Based on Polyak–Lojasiewicz Regions , year=
[22]

International Conference on Artificial Intelligence and Statistics , pages=

Online learning with non-convex losses and non-stationary regret , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018
[23]

Conference on Learning Theory , pages=

Learning in non-convex games with an optimization oracle , author=. Conference on Learning Theory , pages=. 2019 , organization=

2019
[24]

IEEE Trans

Phase retrieval via reweighted amplitude flow , author=. IEEE Trans. Signal Process. , volume=. 2018 , publisher=

2018
[25]

On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , year=

Vaze, Rahul , booktitle=. On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , year=
[26]

Advances in Neural Information Processing Systems , volume=

Adaptive hedge , author=. Advances in Neural Information Processing Systems , volume=
[27]

A low complexity algorithm with

Yu, Hao and Neely, Michael J , journal=. A low complexity algorithm with
[28]

arXiv preprint arXiv:2501.16046 , year=

Revisiting Projection-Free Online Learning with Time-Varying Constraints , author=. arXiv preprint arXiv:2501.16046 , year=

work page arXiv
[29]

2023 , eprint=

Playing in the Dark: No-regret Learning with Adversarial Constraints , author=. 2023 , eprint=

2023
[30]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Optimal Algorithms for Online Convex Optimization with Adversarial Constraints , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[31]

IEEE Transactions on automatic control , volume=

Online convex optimization with time-varying constraints and bandit feedback , author=. IEEE Transactions on automatic control , volume=. 2018 , publisher=

2018
[32]

Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=

Minimizing queue length regret under adversarial network models , author=. Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=. 2018 , publisher=

2018
[33]

IEEE Internet of Things Journal , volume=

Bandit convex optimization for scalable and dynamic IoT management , author=. IEEE Internet of Things Journal , volume=. 2018 , publisher=

2018
[34]

Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=

Nested convex bodies are chaseable , author=. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2018 , organization=

2018
[35]

Proceedings of the Symposium on the Mathematical Theory of Automata , volume=

On convergence proofs on perceptrons , author=. Proceedings of the Symposium on the Mathematical Theory of Automata , volume=. 1962 , organization=

1962
[36]

Journal of optimization theory and applications , volume=

Subgradient methods for saddle-point problems , author=. Journal of optimization theory and applications , volume=. 2009 , publisher=

2009
[37]

IEEE Transactions on Automatic Control , year=

Regret and cumulative constraint violation analysis for distributed online constrained convex optimization , author=. IEEE Transactions on Automatic Control , year=
[38]

Nature , volume=

Dense reinforcement learning for safety validation of autonomous vehicles , author=. Nature , volume=. 2023 , publisher=

2023
[39]

Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=

A nearly-linear bound for chasing nested convex bodies , author=. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2019 , organization=

2019
[40]

Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=

Chasing nested convex bodies nearly optimally , author=. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2020 , organization=

2020
[41]

IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pages=

Learning to Cache With No Regrets , author=. IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pages=. 2019 , organization=

2019
[42]

2012 , publisher=

The Borel-Cantelli Lemma , author=. 2012 , publisher=

2012
[43]

Kingma and Jimmy Ba , editor =

Diederik P. Kingma and Jimmy Ba , editor =. Adam:. 3rd International Conference on Learning Representations,. 2015 , url =

2015
[44]

2008 , publisher=

Control techniques for complex networks , author=. 2008 , publisher=

2008
[45]

, author=

Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=
[46]

International Conference on Machine Learning , pages=

Regret and cumulative constraint violation analysis for online convex optimization with long term constraints , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[47]

An Overview of Multi-Task Learning in Deep Neural Networks

An overview of multi-task learning in deep neural networks , author=. arXiv preprint arXiv:1706.05098 , year=

work page internal anchor Pith review arXiv
[48]

2015 , publisher=

Random processes for engineers , author=. 2015 , publisher=

2015
[49]

Advances in Neural Information Processing Systems , volume=

Online convex optimization with hard constraints: Towards the best of two worlds and beyond , author=. Advances in Neural Information Processing Systems , volume=
[50]

Advances in Neural Information Processing Systems , volume=

Adaptive smoothed online multi-task learning , author=. Advances in Neural Information Processing Systems , volume=
[51]

International Conference on Computational Learning Theory , pages=

Online multitask learning , author=. International Conference on Computational Learning Theory , pages=. 2006 , organization=

2006
[52]

Rockafellar, R. Tyrrel. Convex Analysis
[53]

Electronic colloquium on computational complexity (ECCC) , volume=

Adaptive algorithms for online decision problems , author=. Electronic colloquium on computational complexity (ECCC) , volume=
[54]

Artificial Intelligence and Statistics , pages=

Improved strongly adaptive online learning using coin betting , author=. Artificial Intelligence and Statistics , pages=. 2017 , organization=

2017
[55]

Theoretical Computer Science , volume=

Scale-free online learning , author=. Theoretical Computer Science , volume=. 2018 , publisher=

2018
[56]

International Conference on Machine Learning , pages=

A simple yet universal strategy for online convex optimization , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[57]

Machine learning , volume=

Regret bounded by gradual variation for online convex optimization , author=. Machine learning , volume=. 2014 , publisher=

2014
[58]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[59]

International Conference on Machine Learning , pages=

Adaptive regret of convex and smooth functions , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[60]

Proceedings of the 36th International Conference on Machine Learning , pages =

Cautious Regret Minimization: Online Optimization with Long-Term Budget Constraints , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

2019
[61]

2003 , publisher=

Applied probability and queues , author=. 2003 , publisher=

2003
[62]

The Journal of Machine Learning Research , volume=

Trading regret for efficiency: online convex optimization with long term constraints , author=. The Journal of Machine Learning Research , volume=. 2012 , publisher=

2012
[63]

Proceedings of the 34th International Conference on Machine Learning , pages =

Safety-Aware Algorithms for Adversarial Contextual Bandit , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

2017
[64]

ACM SIGMETRICS Performance Evaluation Review , volume=

Simultaneously achieving sublinear regret and constraint violations for online convex optimization with time-varying constraints , author=. ACM SIGMETRICS Performance Evaluation Review , volume=. 2022 , publisher=

2022
[65]

arXiv preprint arXiv:2306.00149 , year=

Distributed Online Convex Optimization with Adversarial Constraints: Reduced Cumulative Constraint Violation Bounds under Slater's Condition , author=. arXiv preprint arXiv:2306.00149 , year=

work page arXiv
[66]

A Low Complexity Algorithm with O (

Yu, Hao and Neely, Michael J , journal=. A Low Complexity Algorithm with O (
[67]

29th IEEE Conference on Decision and Control , pages=

Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks , author=. 29th IEEE Conference on Decision and Control , pages=. 1990 , organization=

1990
[68]

IEEE Transactions on Communications , volume=

Achieving 100\ author=. IEEE Transactions on Communications , volume=. 1999 , publisher=

1999
[69]

1992 , publisher=

Integral inequalities and applications , author=. 1992 , publisher=

1992
[70]

arXiv preprint arXiv:1702.04783 , year=

Online convex optimization with time-varying constraints , author=. arXiv preprint arXiv:1702.04783 , year=

work page arXiv
[71]

, author=

Online Learning with Sample Path Constraints. , author=. Journal of Machine Learning Research , volume=
[72]

Advances in Neural Information Processing Systems , volume=

Online convex optimization with stochastic constraints , author=. Advances in Neural Information Processing Systems , volume=
[73]

International Conference on Machine Learning , pages=

Adaptive algorithms for online convex optimization with long-term constraints , author=. International Conference on Machine Learning , pages=. 2016 , organization=

2016
[74]

Advances in Neural Information Processing Systems , volume=

Online convex optimization for cumulative constraints , author=. Advances in Neural Information Processing Systems , volume=
[75]

2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=

On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization , author=. 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=. 2022 , organization=

2022
[76]

Advances in neural information processing systems , volume=

Adaptive online gradient descent , author=. Advances in neural information processing systems , volume=
[77]

Advances in Neural Information Processing Systems , volume=

Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits , author=. Advances in Neural Information Processing Systems , volume=
[78]

Mathematical Proceedings of the Cambridge Philosophical Society , volume=

The theory of queues with a single server , author=. Mathematical Proceedings of the Cambridge Philosophical Society , volume=. 1952 , organization=

1952
[79]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Achieving Counterfactual Fairness for Causal Bandit , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[80]

Advances in Neural Information Processing Systems , volume=

Fair algorithms for multi-agent multi-armed bandits , author=. Advances in Neural Information Processing Systems , volume=

Showing first 80 references.