Recognition: unknown
Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier
Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3
The pith
Log-barrier regularization achieves the optimal last-iterate convergence rate of tilde-O of t to the minus one quarter with high probability in uncoupled zero-sum matrix games under bandit feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors prove that online mirror descent equipped with log-barrier regularization, when analyzed from the dual side, produces last-iterate strategies whose exploitability gap shrinks as O-tilde of t to the power of negative one quarter with high probability in the uncoupled bandit-feedback setting for zero-sum matrix games. They further show that the identical rate holds when the same regularization and analysis are carried over to extensive-form games.
What carries the argument
The log-barrier regularizer inside the mirror-descent update, together with the dual-focused analysis that tracks convergence through dual variables rather than directly through the played mixed strategies.
If this is right
- The last-iterate exploitability gap reaches the information-theoretically optimal rate.
- The algorithm requires no extra communication or full-information sharing between players.
- The same convergence rate applies to extensive-form games.
- High-probability bounds hold under stochastic payoff realizations.
Where Pith is reading between the lines
- Barrier-style regularizers paired with dual analyses may close optimality gaps in other limited-feedback online learning problems for games.
- The dual perspective could simplify rate proofs for additional regularized dynamics in bandit game settings.
- Extensions to non-zero-sum or multi-player interactions become plausible if the dual tracking can be preserved.
Load-bearing premise
The dual-focused analysis of the log-barrier regularizer continues to work when each player receives only its own bandit observations and the two players never coordinate or exchange payoff matrices.
What would settle it
On a fixed small zero-sum matrix, run the log-barrier algorithm for increasing horizons t and observe that the last-iterate exploitability gap fails to decrease at a rate comparable to t to the minus one quarter.
read the original abstract
We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound on the exploitability gap of Omega(t^{-1/4}). Some online mirror descent algorithms were proposed in the literature for this problem, but none have truly attained this rate yet. We show that the use of a log-barrier regularization, along with a dual-focused analysis, allows this O-tilde(t^{-1/4}) convergence with high-probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies last-iterate convergence to minimax equilibria in zero-sum matrix games under bandit feedback. It proposes an online mirror descent algorithm regularized by the log-barrier, analyzed via a dual-focused potential, and claims to obtain high-probability exploitability of order Õ(t^{-1/4}), matching the lower bound of Fiegel et al. (2025). The same rate is shown to extend to extensive-form games.
Significance. If the high-probability last-iterate bound is correct, the result is significant: it supplies the first algorithm attaining the optimal rate in the fully uncoupled bandit setting, where prior methods fell short. The dual-focused analysis of the log-barrier is a technical contribution that may generalize to other limited-feedback settings. The extension to extensive-form games broadens applicability. The paper provides a high-probability guarantee rather than an in-expectation one, which strengthens the claim.
major comments (2)
- [§4] §4 (Dual Analysis) and the proof of the main theorem: the contraction argument for the dual variables must be shown to rely exclusively on each player's local noisy payoff observation and its own strategy update. It is not yet clear whether the potential decrease avoids any identity that holds only in expectation over the product distribution or requires simultaneous access to both players' dual variables; this step is load-bearing for the high-probability uncoupled bandit claim.
- [Theorem 1] Theorem 1 (high-probability bound): the variance control for the log-barrier under bandit feedback is only sketched; an explicit concentration argument that closes the gap between the observed payoff and the true expected payoff without additional coordination must be supplied, as any hidden joint expectation would invalidate the uncoupled setting.
minor comments (2)
- Notation for the dual variables and the log-barrier parameter should be introduced with a single consistent definition before the analysis begins.
- The extension to extensive-form games is stated at the same rate; a brief remark on how the sequence-form representation interacts with the bandit sampling would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful report. The two major comments highlight areas where the presentation of the uncoupled, high-probability analysis can be strengthened. We will revise the manuscript to make the local-only nature of the dual contraction and the explicit concentration arguments fully transparent, without altering the core claims or results.
read point-by-point responses
-
Referee: [§4] §4 (Dual Analysis) and the proof of the main theorem: the contraction argument for the dual variables must be shown to rely exclusively on each player's local noisy payoff observation and its own strategy update. It is not yet clear whether the potential decrease avoids any identity that holds only in expectation over the product distribution or requires simultaneous access to both players' dual variables; this step is load-bearing for the high-probability uncoupled bandit claim.
Authors: We agree that the current write-up leaves room for clarification on this load-bearing step. The dual-focused potential is defined separately for each player and the contraction is derived from the Bregman divergence generated by the log-barrier using only the player's own noisy payoff vector (obtained via bandit feedback) and its own dual-variable update. No identity that holds solely in expectation over the joint product distribution is invoked; the decrease is shown pathwise for each player independently. In the revision we will add an explicit paragraph that isolates the one-player view, states the information available at each step, and confirms that the argument never references the opponent's dual variables or requires simultaneous access. revision: partial
-
Referee: [Theorem 1] Theorem 1 (high-probability bound): the variance control for the log-barrier under bandit feedback is only sketched; an explicit concentration argument that closes the gap between the observed payoff and the true expected payoff without additional coordination must be supplied, as any hidden joint expectation would invalidate the uncoupled setting.
Authors: We accept that the variance-control step is currently only sketched and requires an explicit, self-contained argument. In the revised manuscript we will insert a dedicated lemma that applies a martingale concentration inequality (Freedman's inequality) directly to the sequence of bandit observations available to a single player. The bound relates the observed payoff to the true expected payoff under the player's current mixed strategy, using only that player's local samples and the known variance proxy induced by the log-barrier. The argument is applied independently to each player; the overall high-probability statement follows from a union bound over the two independent noise processes. No joint expectation or coordination between players is used. revision: yes
Circularity Check
No significant circularity in convergence proof
full rationale
The paper derives an algorithmic upper bound on last-iterate exploitability using log-barrier regularization and a dual-focused analysis, matching a previously established lower bound of Omega(t^{-1/4}). The derivation chain consists of standard online mirror descent updates, potential-function arguments, and high-probability concentration bounds that operate on each player's local bandit observations. No equation or step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the cited lower bound (Fiegel et al. 2025) serves only as an optimality benchmark and is not invoked to justify the upper-bound analysis itself. The central claim therefore remains an independent constructive proof rather than a renaming or tautological restatement of its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Zero-sum matrix games admit a minimax theorem and the strategy space is the probability simplex.
- domain assumption Bandit feedback provides unbiased estimates of payoffs with bounded variance.
Forward citations
Cited by 1 Pith paper
-
Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
Reference graph
Works this paper leans on
-
[1]
Minimax policies for adversarial and stochastic bandits
Jean-Yves Audibert and S \'e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, 2009
2009
-
[2]
Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback
Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback. In Neural Information Processing Systems, 2023
2023
-
[3]
The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback https://proceedings.mlr.press/v267/fiegel25a.html
C\^ o me Fiegel, Pierre Menard, Tadashi Kozuno, Michal Valko, and Vianney Perchet. The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback https://proceedings.mlr.press/v267/fiegel25a.html. In International Conference on Machine Learning, 2025
2025
-
[4]
Fundamentals of Convex Analysis, pages 163--208
Jean-Baptiste Hiriart-Urruty and Claude Lemar \'e chal. Fundamentals of Convex Analysis, pages 163--208. Springer, 2001
2001
-
[5]
Let's be honest: An optimal no-regret framework for zero-sum games https://proceedings.mlr.press/v80/kangarshahi18a.html
Ehsan Asadi Kangarshahi, Ya-Ping Hsieh, Mehmet Fatih Sahin, and Volkan Cevher. Let's be honest: An optimal no-regret framework for zero-sum games https://proceedings.mlr.press/v80/kangarshahi18a.html. In International Conference on Machine Learning, 2018
2018
-
[6]
Faster first-order methods for extensive-form game solving
Christian Kroer, Kevin Waugh, Fatma Kilin c -Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In ACM Conference on Economics and Computation, 2015
2015
-
[7]
Martinet
B. Martinet. https://www.esaim-m2an.org/articles/m2an/abs/1970/03/m2an197004R301541/m2an197004R301541.html Br\`eve communication. R \'e gularisation d'in \'e quations variationnelles par approximations successives . Revue fran c aise d'informatique et de recherche op \'e rationnelle. S \'e rie rouge , 4(R3):154--158, 1970
1970
-
[8]
L. D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points https://doi.org/10.1007/BF01141092. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845--848, 1980
-
[9]
Optimization, learning, and games with predictable sequences
Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Neural Information Processing Systems, 2013
2013
-
[10]
Formes bilin \'e aires coercitives sur les ensembles convexes
Guido Stampacchia. Formes bilin \'e aires coercitives sur les ensembles convexes. In Comptes Rendus de l'Acad \'e mie des Sciences, Paris , 1964
1964
-
[11]
Monotone operators and the proximal point algorithm https://epubs.siam.org/doi/10.1137/0314056
Rockafellar Ralph Tyrrell. Monotone operators and the proximal point algorithm https://epubs.siam.org/doi/10.1137/0314056. SIAM Journal on Control and Optimization, 1976
-
[12]
Mathematische Annalen , year =
J. von Neumann. Zur Theorie der Gesellschaftsspiele https://doi.org/10.1007/BF01448847. Mathematische Annalen, 100:295--320, 1928
-
[13]
Efficient computation of behavior strategies
Bernhard von Stengel . Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220--246, 1996
1996
-
[14]
Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games http://proceedings.mlr.press/v134/wei21a.html
Chen - Yu Wei, Chung - Wei Lee, Mengxiao Zhang, and Haipeng Luo. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games http://proceedings.mlr.press/v134/wei21a.html. In Conference on Learning Theory, 2021
2021
-
[15]
An optimal algorithm for stochastic and adversarial bandits http://proceedings.mlr.press/v89/zimmert19a.html
Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits http://proceedings.mlr.press/v89/zimmert19a.html. In International Conference on Artificial Intelligence and Statistics, 2019
2019
-
[16]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.