Capacity-Constrained Online Convex Optimization with Delayed Feedback
Pith reviewed 2026-06-27 10:23 UTC · model grok-4.3
The pith
Capacity of order log T suffices to recover standard regret rates in delayed online convex optimization with first-order feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a hard capacity constraint of C tracked pending rounds, a randomized scheduler that selects rounds to track and importance-weights the resulting observations reduces capacity-constrained delayed OCO to a delayed-and-weighted OCO problem; Delayed-Weighted FTRL applied to this base problem yields regret bounds that recover the standard unconstrained delayed OCO rates up to logarithmic factors whenever C equals Omega(log T) for first-order feedback and that are modulated by powers of (1 + sigma_max / C) for bandit feedback, where sigma_max denotes the maximum number of simultaneously pending observations.
What carries the argument
The randomized scheduler that decides which pending rounds to track at each step and applies importance weights to the returned observations, thereby reducing the capacity-constrained problem to delayed and weighted OCO.
If this is right
- C = Omega(log T) recovers standard delayed OCO regret up to log factors for first-order feedback on convex and strongly convex losses.
- Bandit regret scales explicitly with powers of (1 + sigma_max / C) and degrades gracefully when C is smaller than sigma_max.
- The same scheduler-plus-Delayed-Weighted-FTRL construction supplies the first sublinear regret guarantees under explicit capacity limits.
- The interaction between time-varying importance weights and delayed feedback is characterized explicitly in the base regret bounds.
Where Pith is reading between the lines
- The same reduction technique may extend to other online problems that combine delays with hard resource limits on what can be observed.
- Practical implementations could replace the randomized scheduler with deterministic heuristics while preserving the logarithmic capacity threshold.
- The dependence on sigma_max suggests that capacity requirements are driven by the worst-case burst of simultaneous delays rather than average delay length.
Load-bearing premise
The learner observes delay expirations online and can randomize its tracking decisions together with importance weighting at no extra cost.
What would settle it
A concrete run or calculation showing that for C equal to o(log T) the realized regret exceeds the standard delayed OCO bound by more than a logarithmic factor would falsify the recovery claim for first-order feedback.
Figures
read the original abstract
Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel ``delayed and weighted'' OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = \Omega(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + \sigma_{\text{max}}/C)$, where $\sigma_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < \sigma_{\text{max}}$, while remaining sublinear.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to provide the first regret guarantees for capacity-constrained delayed online convex optimization (OCO) with convex and strongly convex losses under both first-order and bandit feedback. It introduces a semi-clairvoyant delay model (observing expirations online), reduces the problem via a randomized capacity scheduler with importance weighting to a novel 'delayed and weighted' OCO setting, and analyzes Delayed-Weighted FTRL (and its bandit variant) to show that C = Ω(log T) recovers standard delayed OCO rates up to logarithmic factors for first-order feedback, while bandit regret is modulated by powers of (1 + σ_max/C).
Significance. If the central reduction and variance bounds hold, the work addresses a practical gap between theoretical delayed OCO (unlimited tracking) and resource-constrained settings, providing explicit dependence on capacity C and graceful degradation. The explicit characterization of weight-delay interactions in the base FTRL analysis is a strength, as is the near-optimal capacity threshold for first-order feedback.
major comments (3)
- [§3] §3 (Scheduler and reduction): The claim that the randomized tracking scheduler produces unbiased importance-weighted feedback with second-moment bounded by O(σ_max/C) under the hard cardinality constraint C, while observing expirations online, is load-bearing for the C=Ω(log T) result. The analysis must explicitly control correlations induced by the online expiration observations and the per-step cardinality enforcement; without this, the variance term may introduce factors beyond logarithmic overhead when many rounds compete for slots.
- [§5] §5 (Delayed-Weighted FTRL analysis): The regret bound for the weighted delayed setting must absorb the random importance weights' variance without additional assumptions on the delay sequence. The standard FTRL analysis is invoked, but the interaction between time-varying random weights (with second moment scaling as σ_max/C) and delays requires a concrete variance term in the bound; this is not sketched in sufficient detail to verify the 'up to logarithmic factors' claim for first-order feedback.
- [Bandit regret theorem] Bandit case (Theorem on bandit regret): The modulation by powers of (1 + σ_max/C) relies on the importance-weighted estimator having controlled variance under the scheduler. When C << σ_max this is plausible, but the proof must show that the hard constraint does not inflate the effective variance beyond the stated power; the current high-level claim leaves this unverified.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short proof sketch or high-level derivation outline for the key reduction and variance bound to improve readability.
- [§2] Notation for σ_max and the pending set should be defined at first use with a clear reference to the delay model.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We appreciate the acknowledgment of the practical gap addressed and the strengths of the reduction and FTRL analysis. We address each major comment below and will incorporate expanded proof details and clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Scheduler and reduction): The claim that the randomized tracking scheduler produces unbiased importance-weighted feedback with second-moment bounded by O(σ_max/C) under the hard cardinality constraint C, while observing expirations online, is load-bearing for the C=Ω(log T) result. The analysis must explicitly control correlations induced by the online expiration observations and the per-step cardinality enforcement; without this, the variance term may introduce factors beyond logarithmic overhead when many rounds compete for slots.
Authors: We agree that explicit control of correlations under the online semi-clairvoyant observations and hard cardinality enforcement is essential. Appendix B derives the second-moment bound O(σ_max/C) via a martingale analysis of the randomized per-step sampling (without replacement under capacity C) that accounts for adaptive expiration revelations; the resulting variance does not introduce super-logarithmic overhead. We will add a dedicated lemma and expanded discussion of this correlation control to Section 3 in the revision. revision: yes
-
Referee: [§5] §5 (Delayed-Weighted FTRL analysis): The regret bound for the weighted delayed setting must absorb the random importance weights' variance without additional assumptions on the delay sequence. The standard FTRL analysis is invoked, but the interaction between time-varying random weights (with second moment scaling as σ_max/C) and delays requires a concrete variance term in the bound; this is not sketched in sufficient detail to verify the 'up to logarithmic factors' claim for first-order feedback.
Authors: Section 5 and Appendix C extend the FTRL analysis to time-varying random weights by incorporating their second-moment term directly into the stability bound of the potential function, yielding an additive O((σ_max/C) log T) term that is absorbed into the logarithmic factors when C = Ω(log T). The derivation holds for arbitrary delay sequences under the semi-clairvoyant model. We will include a more detailed sketch of the variance absorption step in the main text of the revision. revision: yes
-
Referee: [Bandit regret theorem] Bandit case (Theorem on bandit regret): The modulation by powers of (1 + σ_max/C) relies on the importance-weighted estimator having controlled variance under the scheduler. When C << σ_max this is plausible, but the proof must show that the hard constraint does not inflate the effective variance beyond the stated power; the current high-level claim leaves this unverified.
Authors: The proof of the bandit regret theorem combines the scheduler variance bound (Appendix B) with the standard importance-weighted bandit estimator. The hard cardinality constraint is enforced by the same randomized scheduler, and the analysis establishes that the effective variance scales as O((σ_max/C)^k) without further inflation. We will add an explicit supporting lemma verifying this interaction in the revision. revision: yes
Circularity Check
No significant circularity; derivation relies on independent reduction and standard FTRL analysis
full rationale
The paper defines a semi-clairvoyant model, reduces capacity-constrained tracking to a delayed-and-weighted OCO instance via randomized scheduling plus importance weighting, and analyzes the base problem with Delayed-Weighted FTRL. All regret bounds are stated to follow from explicit dependence on the realized weights, delays, and capacity C; no equation is defined in terms of its own output, no fitted parameter is relabeled as a prediction, and no load-bearing step rests on a self-citation whose content is itself unverified. The claimed rates for C = Ω(log T) are therefore derived rather than tautological.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Loss functions are convex or strongly convex
- domain assumption Maximum number of pending observations sigma_max is finite
invented entities (2)
-
semi-clairvoyant delay model
no independent evidence
-
randomized capacity scheduler with importance weighting
no independent evidence
Reference graph
Works this paper leans on
-
[1]
A proof for the queuing formula:
Little, John DC , journal=. A proof for the queuing formula:. 1961 , publisher=
1961
-
[2]
2005 , publisher=
Online computation and competitive analysis , author=. 2005 , publisher=
2005
-
[3]
Theoretical Computer Science , volume=
Scale-free online learning , author=. Theoretical Computer Science , volume=. 2018 , publisher=
2018
-
[4]
2018 , publisher=
High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=
2018
-
[5]
arXiv preprint arXiv:2602.02634 , year=
A Reduction from Delayed to Immediate Feedback for Online Convex Optimization with Improved Guarantees , author=. arXiv preprint arXiv:2602.02634 , year=
-
[6]
A best-of-both-worlds algorithm for bandits with delayed feedback with robustness to excessive delays , author=
-
[7]
Reinforcement learning with random delays , author=
-
[8]
Regret bounds for adversarial contextual bandits with general function approximation and delayed feedback , author=
-
[9]
Proceedings of the 20th
Online convex programming and generalized infinitesimal gradient ascent , author=. Proceedings of the 20th
-
[10]
Stochastic convex optimization with bandit feedback , author=
-
[11]
2021 , organization=
Online learning with optimism and delay , author=. 2021 , organization=
2021
-
[12]
Delay-tolerant algorithms for asynchronous distributed online learning , author=
-
[13]
arXiv preprint arXiv:2204.04964 , year=
Projection-free online learning with arbitrary delays , author=. arXiv preprint arXiv:2204.04964 , year=
-
[14]
2013 , organization=
On the complexity of bandit and derivative-free stochastic convex optimization , author=. 2013 , organization=
2013
-
[15]
2016 , organization=
Multi-scale exploration of convex functions and bandit convex optimization , author=. 2016 , organization=
2016
-
[16]
arXiv preprint arXiv:1603.04350 , year=
An optimal algorithm for bandit convex optimization , author=. arXiv preprint arXiv:1603.04350 , year=
-
[17]
Mathematical Statistics and Learning , volume=
Improved regret for zeroth-order adversarial bandit convex optimisation , author=. Mathematical Statistics and Learning , volume=
-
[18]
Near-optimal regret for adversarial
Jin, Tiancheng and Lancewicki, Tal and Luo, Haipeng and Mansour, Yishay and Rosenberg, Aviv , journal=. Near-optimal regret for adversarial
-
[19]
Learning adversarial
Lancewicki, Tal and Rosenberg, Aviv and Mansour, Yishay , booktitle=. Learning adversarial
-
[20]
Proceedings of the 40th
Delayed Bandits: When Do Intermediate Observations Help? , author =. Proceedings of the 40th. 2023 , volume =
2023
-
[21]
A best-of-both-worlds algorithm for bandits with delayed feedback , author=
-
[22]
Optimal strategies and minimax lower bounds for online convex games , author=
-
[23]
International Conference on Algorithmic Learning Theory , pages=
On-line learning with delayed label feedback , author=. International Conference on Algorithmic Learning Theory , pages=. 2005 , organization=
2005
-
[24]
2016 , organization=
Delay and cooperation in nonstochastic bandits , author=. 2016 , organization=
2016
-
[25]
A unified analysis of nonstochastic delayed feedback for combinatorial semi-bandits, linear bandits, and
Van Der Hoeven, Dirk and Zierahn, Lukas and Lancewicki, Tal and Rosenberg, Aviv and Cesa-Bianchi, Nicol. A unified analysis of nonstochastic delayed feedback for combinatorial semi-bandits, linear bandits, and. The Thirty Sixth Annual. 2023 , organization=
2023
-
[26]
Machine Learning , volume=
Online strongly convex optimization with unknown delays , author=. Machine Learning , volume=. 2022 , publisher=
2022
-
[27]
Proceedings of the ACM Web Conference 2024 , pages=
Online sequential decision-making with unknown delays , author=. Proceedings of the ACM Web Conference 2024 , pages=
2024
-
[28]
Slow learners are fast , author=
-
[29]
IEEE Transactions on Information Theory , volume=
On delayed prediction of individual sequences , author=. IEEE Transactions on Information Theory , volume=. 2002 , publisher=
2002
-
[30]
2020 , organization=
Gradient-free online learning in continuous games with delayed rewards , author=. 2020 , organization=
2020
-
[31]
Nonstochastic multiarmed bandits with unrestricted delays , author=
-
[32]
International Conference on Artificial Intelligence and Statistics , pages=
An optimal algorithm for adversarial bandits with arbitrary delays , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=
2020
-
[33]
Foundations and Trends
Online learning and online convex optimization , author=. Foundations and Trends. 2012 , publisher=
2012
-
[34]
2013 , organization=
Online learning under delayed feedback , author=. 2013 , organization=
2013
-
[35]
arXiv preprint arXiv:2402.06535 , year=
Bandit convex optimisation , author=. arXiv preprint arXiv:2402.06535 , year=
-
[36]
Foundations and Trends in Optimization , volume=
Introduction to online convex optimization , author=. Foundations and Trends in Optimization , volume=. 2016 , publisher=
2016
-
[37]
Machine Learning , volume=
Logarithmic regret algorithms for online convex optimization , author=. Machine Learning , volume=. 2007 , publisher=
2007
-
[38]
Online learning with adversarial delays , author=
-
[39]
, author=
Optimal algorithms for online convex optimization with multi-point bandit feedback. , author=. COLT , pages=
-
[40]
IEEE Transactions on Information Theory , volume=
Optimal rates for zero-order convex optimization: The power of two function evaluations , author=. IEEE Transactions on Information Theory , volume=. 2015 , publisher=
2015
-
[41]
Journal of Machine Learning Research , volume=
An optimal algorithm for bandit and zero-order convex optimization with two-point feedback , author=. Journal of Machine Learning Research , volume=
-
[42]
Journal of Machine Learning Research , volume=
No weighted-regret learning in adversarial bandits with delays , author=. Journal of Machine Learning Research , volume=
-
[43]
Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages =
Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient , author =. Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms , pages =
-
[44]
arXiv preprint arXiv:1912.13213 , year=
A modern introduction to online learning , author=. arXiv preprint arXiv:1912.13213 , year=
Pith/arXiv arXiv 1912
-
[45]
Improved regret for bandit convex optimization with delayed feedback , author=
-
[46]
Proceedings of the 42nd
Exploiting Curvature in Online Convex Optimization with Delayed Feedback , author =. Proceedings of the 42nd. 2025 , volume =
2025
-
[47]
Proceedings of Thirty Eighth
Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs , author =. Proceedings of Thirty Eighth. 2025 , volume =
2025
-
[48]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[49]
2006 , publisher=
Prediction, learning, and games , author=. 2006 , publisher=
2006
-
[50]
2022 , publisher =
Scheduling: Theory, Algorithms, and Systems , author =. 2022 , publisher =
2022
-
[51]
SODA , volume=
Online Interval Scheduling , author=. SODA , volume=
-
[52]
Theoretical Computer Science , volume=
On-line scheduling of jobs with fixed start and end times , author=. Theoretical Computer Science , volume=. 1994 , publisher=
1994
-
[53]
Proceedings of the twenty-eighth annual
The space complexity of approximating the frequency moments , author=. Proceedings of the twenty-eighth annual
-
[54]
2005 , publisher=
Data streams: Algorithms and Applications , author=. 2005 , publisher=
2005
-
[55]
An improved data stream summary: the
Cormode, Graham and Muthukrishnan, Shan , journal=. An improved data stream summary: the. 2005 , publisher=
2005
-
[56]
2007 , publisher=
Discrete Mathematics & Theoretical Computer Science , author=. 2007 , publisher=
2007
-
[57]
Forty-first
Non-stationary online convex optimization with arbitrary delays , author=. Forty-first
-
[58]
Forty-third International Conference on Machine Learning , year=
Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory , author=. Forty-third International Conference on Machine Learning , year=
-
[59]
arXiv preprint arXiv:2601.07901 , year=
Decentralized online convex optimization with unknown feedback delays , author=. arXiv preprint arXiv:2601.07901 , year=
-
[60]
Forty-third International Conference on Machine Learning , year =
Reinforcement Learning with Action-Triggered Observations , author =. Forty-third International Conference on Machine Learning , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.