pith. machine review for the scientific record.

arxiv: 2604.15242 · v1 · submitted 2026-04-16 · 💻 cs.LG

Recognition: unknown

Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords last-iterate convergence · bandit feedback · matrix games · log-barrier regularization · online mirror descent · zero-sum games · extensive-form games · minimax policies

The pith

Log-barrier regularization achieves the optimal last-iterate convergence rate of Õ(t^{-1/4}) with high probability in uncoupled zero-sum matrix games under bandit feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous results established that uncoupled players in zero-sum matrix games with only bandit feedback face an Ω(t^{-1/4}) lower bound on the exploitability gap, so no last-iterate convergence faster than t^{-1/4} is possible. The paper shows that replacing standard regularizers with a log-barrier term and switching to a dual-focused analysis yields a matching Õ(t^{-1/4}) high-probability upper bound. The same technique extends the rate to extensive-form games. A reader cares because the bound is now tight and applies directly to the practical case where each player sees only its own payoff samples and must play a near-equilibrium strategy at the final iterate.

Core claim

The authors prove that online mirror descent equipped with log-barrier regularization, when analyzed from the dual side, produces last-iterate strategies whose exploitability gap shrinks as Õ(t^{-1/4}) with high probability in the uncoupled bandit-feedback setting for zero-sum matrix games. They further show that the identical rate holds when the same regularization and analysis are carried over to extensive-form games.

What carries the argument

The log-barrier regularizer inside the mirror-descent update, together with the dual-focused analysis that tracks convergence through dual variables rather than directly through the played mixed strategies.
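As a concrete illustration of that machinery (a sketch in our own notation, not the paper's algorithm): with the log-barrier regularizer ψ(x) = −Σᵢ log xᵢ, a single online-mirror-descent step on the probability simplex has no closed form, but the Bregman projection reduces to finding one scalar normalizer by bisection.

```python
def log_barrier_omd_step(x, g, eta, tol=1e-12):
    """One OMD step on the probability simplex with the log-barrier
    regularizer psi(x) = -sum_i log(x_i), given a loss vector g.
    Matching gradients of psi across the update gives coordinates
    x_new_i = 1 / (1/x_i + eta*g_i + lam), where the scalar lam is
    chosen by bisection so the coordinates sum to one."""
    a = [1.0 / xi + eta * gi for xi, gi in zip(x, g)]
    total = lambda lam: sum(1.0 / (ai + lam) for ai in a)
    lo = -min(a) + tol            # keeps every denominator positive
    hi = lo + 1.0
    while total(hi) > 1.0:        # grow bracket until the sum dips below 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(200):          # bisection on the normalizer lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return [1.0 / (ai + lam) for ai in a]
```

The log-barrier keeps every coordinate strictly positive, which is the property the bandit-feedback variance control leans on.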

If this is right

  • The last-iterate exploitability gap reaches the information-theoretically optimal rate.
  • The algorithm requires no extra communication or full-information sharing between players.
  • The same convergence rate applies to extensive-form games.
  • High-probability bounds hold under stochastic payoff realizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Barrier-style regularizers paired with dual analyses may close optimality gaps in other limited-feedback online learning problems for games.
  • The dual perspective could simplify rate proofs for additional regularized dynamics in bandit game settings.
  • Extensions to non-zero-sum or multi-player interactions become plausible if the dual tracking can be preserved.

Load-bearing premise

The dual-focused analysis of the log-barrier regularizer continues to work when each player receives only its own bandit observations and the two players never coordinate or exchange payoff matrices.

What would settle it

On a fixed small zero-sum matrix, run the log-barrier algorithm for increasing horizons t; if the last-iterate exploitability gap fails to shrink at a rate comparable to t^{-1/4}, the claim is refuted.
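The quantity such a run would track is the exploitability (duality) gap of the last iterate. A minimal sketch of the measurement, with the row player maximizing (matching pennies is our illustrative test game, not one taken from the paper):

```python
def exploitability_gap(A, x, y):
    """Duality gap of mixed strategies (x, y) in the zero-sum game
    with payoff matrix A to the row (maximizing) player:
        gap = max_i (A y)_i - min_j (x^T A)_j.
    The gap is nonnegative and zero exactly at a minimax equilibrium."""
    Ay = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A))]
    xA = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]
    return max(Ay) - min(xA)

# Matching pennies: the uniform strategy pair is the unique equilibrium.
MP = [[1.0, -1.0], [-1.0, 1.0]]
print(exploitability_gap(MP, [0.5, 0.5], [0.5, 0.5]))  # 0.0
print(exploitability_gap(MP, [1.0, 0.0], [0.5, 0.5]))  # 1.0
```

Plotting this gap for the final iterate against the horizon t on a log-log scale would make a slope near −1/4 (or its absence) visible.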

read the original abstract

We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound on the exploitability gap of Omega(t^{-1/4}). Some online mirror descent algorithms were proposed in the literature for this problem, but none have truly attained this rate yet. We show that the use of a log-barrier regularization, along with a dual-focused analysis, allows this O-tilde(t^{-1/4}) convergence with high-probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies last-iterate convergence to minimax equilibria in zero-sum matrix games under bandit feedback. It proposes an online mirror descent algorithm regularized by the log-barrier, analyzed via a dual-focused potential, and claims to obtain high-probability exploitability of order Õ(t^{-1/4}), matching the lower bound of Fiegel et al. (2025). The same rate is shown to extend to extensive-form games.

Significance. If the high-probability last-iterate bound is correct, the result is significant: it supplies the first algorithm attaining the optimal rate in the fully uncoupled bandit setting, where prior methods fell short. The dual-focused analysis of the log-barrier is a technical contribution that may generalize to other limited-feedback settings. The extension to extensive-form games broadens applicability. The paper provides a high-probability guarantee rather than an in-expectation one, which strengthens the claim.

major comments (2)
  1. [§4] §4 (Dual Analysis) and the proof of the main theorem: the contraction argument for the dual variables must be shown to rely exclusively on each player's local noisy payoff observation and its own strategy update. It is not yet clear whether the potential decrease avoids any identity that holds only in expectation over the product distribution or requires simultaneous access to both players' dual variables; this step is load-bearing for the high-probability uncoupled bandit claim.
  2. [Theorem 1] Theorem 1 (high-probability bound): the variance control for the log-barrier under bandit feedback is only sketched; an explicit concentration argument that closes the gap between the observed payoff and the true expected payoff without additional coordination must be supplied, as any hidden joint expectation would invalidate the uncoupled setting.
minor comments (2)
  1. Notation for the dual variables and the log-barrier parameter should be introduced with a single consistent definition before the analysis begins.
  2. The extension to extensive-form games is stated at the same rate; a brief remark on how the sequence-form representation interacts with the bandit sampling would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful report. The two major comments highlight areas where the presentation of the uncoupled, high-probability analysis can be strengthened. We will revise the manuscript to make the local-only nature of the dual contraction and the explicit concentration arguments fully transparent, without altering the core claims or results.

read point-by-point responses
  1. Referee: [§4] §4 (Dual Analysis) and the proof of the main theorem: the contraction argument for the dual variables must be shown to rely exclusively on each player's local noisy payoff observation and its own strategy update. It is not yet clear whether the potential decrease avoids any identity that holds only in expectation over the product distribution or requires simultaneous access to both players' dual variables; this step is load-bearing for the high-probability uncoupled bandit claim.

    Authors: We agree that the current write-up leaves room for clarification on this load-bearing step. The dual-focused potential is defined separately for each player and the contraction is derived from the Bregman divergence generated by the log-barrier using only the player's own noisy payoff vector (obtained via bandit feedback) and its own dual-variable update. No identity that holds solely in expectation over the joint product distribution is invoked; the decrease is shown pathwise for each player independently. In the revision we will add an explicit paragraph that isolates the one-player view, states the information available at each step, and confirms that the argument never references the opponent's dual variables or requires simultaneous access. revision: partial

  2. Referee: [Theorem 1] Theorem 1 (high-probability bound): the variance control for the log-barrier under bandit feedback is only sketched; an explicit concentration argument that closes the gap between the observed payoff and the true expected payoff without additional coordination must be supplied, as any hidden joint expectation would invalidate the uncoupled setting.

    Authors: We accept that the variance-control step is currently only sketched and requires an explicit, self-contained argument. In the revised manuscript we will insert a dedicated lemma that applies a martingale concentration inequality (Freedman's inequality) directly to the sequence of bandit observations available to a single player. The bound relates the observed payoff to the true expected payoff under the player's current mixed strategy, using only that player's local samples and the known variance proxy induced by the log-barrier. The argument is applied independently to each player; the overall high-probability statement follows from a union bound over the two independent noise processes. No joint expectation or coordination between players is used. revision: yes
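For context on the single-player estimator this response leans on (an illustrative reconstruction, not the paper's notation): the standard importance-weighted construction turns one bandit observation into an unbiased estimate of the full payoff vector, which is the kind of martingale-difference sequence a Freedman-type bound then controls.

```python
def iw_estimate(x, a, payoff):
    """Importance-weighted estimator built from one bandit sample:
    only the played action a and its realized payoff are observed,
    and the estimate for action i is payoff / x[i] if i == a else 0."""
    return [payoff / x[i] if i == a else 0.0 for i in range(len(x))]

# Unbiasedness, checked by exact expectation over the play distribution:
x = [0.2, 0.3, 0.5]        # the player's current mixed strategy
g = [0.4, -0.1, 0.7]       # true expected payoff of each action
mean = [0.0, 0.0, 0.0]
for a in range(3):         # E[estimate] = sum_a x[a] * iw_estimate(x, a, g[a])
    est = iw_estimate(x, a, g[a])
    mean = [m + x[a] * v for m, v in zip(mean, est)]
# mean recovers g up to floating-point error
```

The price of unbiasedness is variance of order 1/x[i]; the log-barrier's role, on this reading, is to keep each x[i] bounded away from zero so that this variance stays controllable with high probability.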

Circularity Check

0 steps flagged

No significant circularity in convergence proof

full rationale

The paper derives an algorithmic upper bound on last-iterate exploitability using log-barrier regularization and a dual-focused analysis, matching a previously established lower bound of Omega(t^{-1/4}). The derivation chain consists of standard online mirror descent updates, potential-function arguments, and high-probability concentration bounds that operate on each player's local bandit observations. No equation or step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the cited lower bound (Fiegel et al. 2025) serves only as an optimality benchmark and is not invoked to justify the upper-bound analysis itself. The central claim therefore remains an independent constructive proof rather than a renaming or tautological restatement of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on standard convexity and martingale concentration properties of bandit feedback plus the specific choice of log-barrier; no new entities are introduced.

axioms (2)
  • domain assumption Zero-sum matrix games admit a minimax theorem and the strategy space is the probability simplex.
    Standard setup for the problem; invoked implicitly when defining exploitability gap.
  • domain assumption Bandit feedback provides unbiased estimates of payoffs with bounded variance.
    Required for high-probability bounds in online learning.

pith-pipeline@v0.9.0 · 5429 in / 1293 out tokens · 30553 ms · 2026-05-10T10:52:47.349193+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

    cs.LG 2026-05 unverdicted novelty 8.0

    With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.

Reference graph

Works this paper leans on

16 extracted references · 3 canonical work pages · cited by 1 Pith paper

  1. [1]

    Minimax policies for adversarial and stochastic bandits

    Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, 2009

  2. [2]

    Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback

    Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback. In Neural Information Processing Systems, 2023

  3. [3]

    The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback https://proceedings.mlr.press/v267/fiegel25a.html

    Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Michal Valko, and Vianney Perchet. The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback https://proceedings.mlr.press/v267/fiegel25a.html. In International Conference on Machine Learning, 2025

  4. [4]

    Fundamentals of Convex Analysis, pages 163--208

    Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis, pages 163--208. Springer, 2001

  5. [5]

    Let's be honest: An optimal no-regret framework for zero-sum games https://proceedings.mlr.press/v80/kangarshahi18a.html

    Ehsan Asadi Kangarshahi, Ya-Ping Hsieh, Mehmet Fatih Sahin, and Volkan Cevher. Let's be honest: An optimal no-regret framework for zero-sum games https://proceedings.mlr.press/v80/kangarshahi18a.html. In International Conference on Machine Learning, 2018

  6. [6]

    Faster first-order methods for extensive-form game solving

    Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In ACM Conference on Economics and Computation, 2015

  7. [7]

    Brève communication. Régularisation d'inéquations variationnelles par approximations successives

    B. Martinet. Brève communication. Régularisation d'inéquations variationnelles par approximations successives https://www.esaim-m2an.org/articles/m2an/abs/1970/03/m2an197004R301541/m2an197004R301541.html. Revue française d'informatique et de recherche opérationnelle. Série rouge, 4(R3):154--158, 1970

  8. [8]

    A modification of the Arrow-Hurwicz method for search of saddle points

    L. D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points https://doi.org/10.1007/BF01141092. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845--848, 1980

  9. [9]

    Optimization, learning, and games with predictable sequences

    Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Neural Information Processing Systems, 2013

  10. [10]

    Formes bilinéaires coercitives sur les ensembles convexes

    Guido Stampacchia. Formes bilinéaires coercitives sur les ensembles convexes. In Comptes Rendus de l'Académie des Sciences, Paris, 1964

  11. [11]

    Monotone operators and the proximal point algorithm https://epubs.siam.org/doi/10.1137/0314056

    Ralph Tyrrell Rockafellar. Monotone operators and the proximal point algorithm https://epubs.siam.org/doi/10.1137/0314056. SIAM Journal on Control and Optimization, 1976

  12. [12]

    Zur Theorie der Gesellschaftsspiele

    J. von Neumann. Zur Theorie der Gesellschaftsspiele https://doi.org/10.1007/BF01448847. Mathematische Annalen, 100:295--320, 1928

  13. [13]

    Efficient computation of behavior strategies

    Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220--246, 1996

  14. [14]

    Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games http://proceedings.mlr.press/v134/wei21a.html

    Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games http://proceedings.mlr.press/v134/wei21a.html. In Conference on Learning Theory, 2021

  15. [15]

    An optimal algorithm for stochastic and adversarial bandits http://proceedings.mlr.press/v89/zimmert19a.html

    Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits http://proceedings.mlr.press/v89/zimmert19a.html. In International Conference on Artificial Intelligence and Statistics, 2019
