With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
SIAM journal on computing , volume=
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6verdicts
UNVERDICTED 6representative citing papers
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
Presents first online L2D algorithm for multiclass classification with bandit feedback and varying experts, achieving O((n+n_e)T^{2/3}) regret generally and O((n+n_e)√T) under low noise.
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
F is learnable in adversarial noisy bandits iff the generalized maximin volume on co(F) is positive at every scale, equivalently for oblivious and adaptive adversaries.
A restarting-based nonparametric online learning method for dynamic pricing with one-point revenue feedback that achieves regret bounds scaling with time horizon and total market variation.
citing papers explorer
-
Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
-
Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
-
Online Learning-to-Defer with Varying Experts
Presents first online L2D algorithm for multiclass classification with bandit feedback and varying experts, achieving O((n+n_e)T^{2/3}) regret generally and O((n+n_e)√T) under low noise.
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
-
A Complete Characterization of Learnability for Adversarial Noisy Bandits
F is learnable in adversarial noisy bandits iff the generalized maximin volume on co(F) is positive at every scale, equivalently for oblivious and adaptive adversaries.
-
Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity
A restarting-based nonparametric online learning method for dynamic pricing with one-point revenue feedback that achieves regret bounds scaling with time horizon and total market variation.