Recognition: 2 theorem links
· Lean Theorem · Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
Pith reviewed 2026-05-12 03:39 UTC · model grok-4.3
The pith
In two-player zero-sum games where each player observes both its loss and the opponent's action, the last-iterate duality gap converges at the optimal t^{-1/2} rate with high probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In two-player zero-sum games with bandit feedback plus full observation of the opponent's action, there exists an efficient algorithm achieving last-iterate convergence of the duality gap at rate t^{-1/2} with high probability; the algorithm updates its strategy only infrequently by solving an estimated log-barrier-regularized game.
What carries the argument
Infrequent strategy updates obtained by solving an estimated log-barrier-regularized game, which circumvents obstacles that block direct application of single-player bandit analyses to the two-player setting.
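The mechanism can be illustrated with a minimal sketch. This is not the paper's algorithm: the game matrix, regularization weight `gamma`, step size, and the generic projected extragradient solver below are all illustrative stand-ins for whatever solver the paper uses on its estimated game.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_barrier_game(A, gamma=0.1, eta=0.02, iters=10000):
    """Approximate the saddle point of the log-barrier-regularized payoff
    Phi(x, y) = x^T A y + gamma*sum_i log(1/x_i) - gamma*sum_j log(1/y_j),
    with x minimizing and y maximizing, via projected extragradient steps."""
    m, n = A.shape
    x = np.full(m, 1.0 / m)
    y = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Guard against exact zeros produced by the projection.
        gx = A @ y - gamma / np.maximum(x, 1e-12)   # grad_x Phi
        gy = A.T @ x + gamma / np.maximum(y, 1e-12) # grad_y Phi
        xh = project_simplex(x - eta * gx)          # exploratory half-step
        yh = project_simplex(y + eta * gy)
        gxh = A @ yh - gamma / np.maximum(xh, 1e-12)  # gradients at half-point
        gyh = A.T @ xh + gamma / np.maximum(yh, 1e-12)
        x = project_simplex(x - eta * gxh)
        y = project_simplex(y + eta * gyh)
    return x, y

# A small asymmetric game whose unregularized equilibrium is
# x* = (3/4, 1/4), y* = (1/2, 1/2); the barrier shifts it slightly.
A = np.array([[0.0, 1.0], [2.0, -1.0]])
x, y = solve_barrier_game(A)
```

Because the barrier keeps both iterates strictly interior, the regularized saddle point is a well-conditioned target; the paper's point is that solving such a problem only at sparse update times suffices for the t^{-1/2} last-iterate rate.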
If this is right
- Last-iterate convergence matches the optimal average-iterate rate whenever opponent actions are observed.
- The same algorithm yields improved rates for dueling bandits as a direct special case.
- The method remains computationally efficient because updates occur only at sparse times.
- Practical preference-learning settings that supply opponent actions now admit provably faster last-iterate guarantees.
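The dueling-bandits point above rests on a two-line construction: a preference matrix induces a skew-symmetric zero-sum game. The numbers below are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical preference matrix: P[i, j] = Pr(arm i beats arm j).
P = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])

# The induced game matrix A = P - P^T is skew-symmetric, so a dueling-bandit
# instance is a zero-sum game a learner effectively plays against a copy of
# itself, and observing the comparison arm is exactly the opponent-action
# feedback the paper assumes.
A = P - P.T
```

Skew-symmetry forces the game's value to zero and lets both players share one strategy, which is why the paper's bounds specialize directly.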
Where Pith is reading between the lines
- Similar infrequent-update log-barrier techniques may extend to other partial-information game settings where one extra observation per round is available.
- The separation between average- and last-iterate rates disappears once opponent actions are visible, suggesting that feedback richness can eliminate certain information-theoretic barriers.
- The approach could be tested in multi-agent reinforcement learning environments that naturally provide opponent-action signals.
Load-bearing premise
The novel analysis truly overcomes the fundamental obstacles that stop standard multi-armed bandit arguments from generalizing to games and produces the claimed high-probability bound without hidden dependence on game parameters.
What would settle it
Run the proposed algorithm on a small zero-sum game for increasing horizon t and measure the realized duality gap after the last update; if the gap fails to scale as O(t^{-1/2}) on a positive fraction of trials, the high-probability claim is false.
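The measurement itself is straightforward to set up. The sketch below does not implement the paper's algorithm; it uses multiplicative-weights self-play with full feedback as a stand-in dynamic, only to show how one would compute the duality gap and check its scaling in t.

```python
import numpy as np

def duality_gap(A, x, y):
    """Gap(x, y) = max_j (x^T A)_j - min_i (A y)_i for a zero-sum game
    in which the row player minimizes x^T A y; always nonnegative."""
    return float(np.max(x @ A) - np.min(A @ y))

def mwu_selfplay(A, T, eta):
    """Multiplicative-weights self-play (a baseline, not the paper's
    algorithm); returns the average-iterate strategy pair."""
    m, n = A.shape
    x = np.full(m, 1.0 / m); y = np.full(n, 1.0 / n)
    xbar = np.zeros(m); ybar = np.zeros(n)
    for _ in range(T):
        xbar += x; ybar += y
        lx = A @ y            # row player's loss vector (minimizing)
        gy = A.T @ x          # column player's gain vector (maximizing)
        x = x * np.exp(-eta * lx); x /= x.sum()
        y = y * np.exp(eta * gy);  y /= y.sum()
    return xbar / T, ybar / T

A = np.array([[0.0, 1.0], [2.0, -1.0]])
gaps = []
for T in (100, 1000, 10000):
    xb, yb = mwu_selfplay(A, T, eta=1.0 / np.sqrt(T))
    gaps.append(duality_gap(A, xb, yb))
```

For the paper's claim one would instead record the gap of the last iterate of the proposed algorithm and fit the exponent of `gaps` against `T` across repeated trials.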
Original abstract
Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal t^(-1/2) rate after t rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than t^(-1/3) in expectation or t^(-1/4) with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent's action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that t^(-1/2) last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that when players in two-player zero-sum games receive bandit feedback on their loss together with the opponent's chosen action, last-iterate convergence in duality gap can be improved from the t^{-1/3} (expectation) / t^{-1/4} (high probability) rates that hold under loss-only feedback to the optimal t^{-1/2} rate with high probability. The algorithm achieves this by updating its strategy only infrequently, each time by solving an estimated log-barrier-regularized saddle-point problem; a novel analysis is developed to overcome obstacles that prevent direct generalization of standard multi-armed-bandit last-iterate arguments to the game setting. Experiments on synthetic and dueling-bandit instances confirm faster convergence than baselines that ignore opponent-action feedback, and the results are noted to improve prior bounds for the skew-symmetric special case of dueling bandits.
Significance. If the high-probability t^{-1/2} last-iterate bound holds without hidden dependence on game parameters, the result is significant: it shows that the additional opponent-action observation suffices to close the average- versus last-iterate gap that Fiegel et al. (2025) established under loss-only feedback. The infrequent-update log-barrier approach and the accompanying analysis constitute a concrete algorithmic and technical contribution. Reproducible experiments and the explicit improvement for dueling bandits are further strengths.
major comments (1)
- [Main theorem and analysis section] The central high-probability bound (presumably stated in the main theorem) relies on the claim that solving the estimated log-barrier-regularized game at infrequent intervals controls both the bias induced by the update schedule and the variance of the payoff estimates simultaneously. The provided analysis sketch must be checked to confirm that the concentration and stability arguments for the regularized equilibrium do not re-introduce game-dependent factors that would degrade the rate back to t^{-1/3} or t^{-1/4}.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a one-sentence statement of the precise high-probability bound (including the dependence on the number of actions and the failure probability) rather than only the exponent.
- [Algorithm description] Notation for the estimated regularized game and the infrequent-update schedule should be introduced with a clear table or diagram in the algorithm section to aid readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our contribution and for the detailed feedback. We are grateful for the opportunity to clarify the analysis supporting the high-probability last-iterate bound. We address the single major comment below.
Point-by-point responses
-
Referee: [Main theorem and analysis section] The central high-probability bound (presumably stated in the main theorem) relies on the claim that solving the estimated log-barrier-regularized game at infrequent intervals controls both the bias induced by the update schedule and the variance of the payoff estimates simultaneously. The provided analysis sketch must be checked to confirm that the concentration and stability arguments for the regularized equilibrium do not re-introduce game-dependent factors that would degrade the rate back to t^{-1/3} or t^{-1/4}.
Authors: We thank the referee for raising this key point about the analysis. Our infrequent-update schedule is chosen so that the number of samples between updates grows sufficiently fast to ensure that the variance of the payoff-matrix estimator is controlled at the desired rate; the log-barrier term simultaneously guarantees that the solved regularized equilibrium remains stable under small perturbations, bounding the bias from delayed updates. The concentration arguments rely on matrix concentration inequalities whose deviation terms depend only on the number of samples and the regularization strength, not on game-specific quantities such as the payoff range or the number of actions in a manner that would alter the final t^{-1/2} exponent. We have re-examined the proofs and confirm that no such hidden factors appear in the final high-probability bound. To make the separation between bias control, variance control, and game-parameter independence fully explicit, we will expand the relevant lemmas and add a short remark in the revised manuscript.
Revision: partial
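The variance-control claim in the rebuttal can be made concrete with a simplified stand-in (not the paper's estimator or update schedule): when both actions are observed, each round yields a direct sample of one payoff entry, so per-entry sample counts alone determine an entrywise Hoeffding confidence band. The matrix, sample budget, and failure probability below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.2, 0.9], [0.7, 0.4]])  # true (unknown) payoff matrix in [0, 1]
m, n = A.shape

counts = np.zeros((m, n))
sums = np.zeros((m, n))
for _ in range(20000):
    i, j = rng.integers(m), rng.integers(n)  # both chosen actions are visible
    loss = rng.binomial(1, A[i, j])          # bandit feedback on the pair only
    counts[i, j] += 1
    sums[i, j] += loss

# Per-entry sample mean and Hoeffding width at confidence 1 - delta,
# union-bounded over the m*n entries.
A_hat = sums / np.maximum(counts, 1)
delta = 0.05
width = np.sqrt(np.log(2 * m * n / delta) / (2 * np.maximum(counts, 1)))
```

The width depends only on `counts` and `delta`, mirroring the rebuttal's point that the deviation terms need not carry extra game-dependent factors; the paper's actual estimator and schedule are of course more involved.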
Circularity Check
No circularity: new algorithm and analysis derive t^{-1/2} rate independently
Full rationale
The paper introduces a novel algorithm that updates strategies infrequently by solving an estimated log-barrier-regularized game and provides a new analysis to overcome obstacles that limit standard bandit methods to slower rates. The central t^{-1/2} high-probability last-iterate claim follows from this construction and analysis rather than reducing to any fitted parameter, self-defined quantity, or self-citation chain. No load-bearing step equates the result to its inputs by construction, and the derivation remains self-contained against external benchmarks such as the cited separation results from Fiegel et al.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard assumptions of bounded payoffs and convex strategy sets in zero-sum games.
- domain assumption: Existence of efficient solvers for log-barrier-regularized games.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
Matched passage: "updates its strategy infrequently by solving an estimated log-barrier-regularized game... Φ_s(x, y) ≜ x^⊤ Â_s y + γ_s Σ_i log(1/x_i) − γ_s Σ_j log(1/y_j)"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
Matched passage: "multiplicative stability between two consecutive strategies"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [2] Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 1928.
- [3] Matrix games with bandit feedback. Uncertainty in Artificial Intelligence, 2021.
- [4] No-regret learning in unknown games with correlated payoffs. Advances in Neural Information Processing Systems.
- [6] On the Limitations and Possibilities of Nash Regret Minimization in Zero-Sum Matrix Games Under Noisy Feedback. International Symposium on Algorithmic Game Theory, 2025.
- [7] The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback. Proceedings of the 42nd International Conference on Machine Learning.
- [8] Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2022.
- [9] From average-iterate to last-iterate convergence in games: A reduction and its applications. Neural Information Processing Systems.
- [10] Distributed algorithms via gradient descent for Fisher markets. Proceedings of the 12th ACM Conference on Electronic Commerce.
- [11] Proportional response dynamics leads to market equilibrium. Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing.
- [12] Proportional Response Dynamics in Gross Substitutes Markets. Proceedings of the 26th ACM Conference on Economics and Computation.
- [13] Anytime online-to-batch, optimism and acceleration. International Conference on Machine Learning, 2019.
- [14] Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems.
- [15] Learning zero-sum linear quadratic games with improved sample complexity and last-iterate convergence. SIAM Journal on Control and Optimization, 2025.
- [16] Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games. International Conference on Artificial Intelligence and Statistics, 2023.
- [17] Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games. Uncertainty in Artificial Intelligence, 2022.
- [18] Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback. arXiv preprint arXiv:2408.08395.
- [19] The Power of Regularization in Solving Extensive-Form Games. The Eleventh International Conference on Learning Representations.
- [20] Multi-player zero-sum Markov games with networked separable interactions. Advances in Neural Information Processing Systems.
- [21] Independent learning in stochastic games. International Congress of Mathematicians.
- [22] Boosting Perturbed Gradient Ascent for Last-Iterate Convergence in Games. The Thirteenth International Conference on Learning Representations.
- [23] Adaptively Perturbed Mirror Descent for Learning in Games. International Conference on Machine Learning, 2024.
- [24] On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. Advances in Neural Information Processing Systems.
- [25] No-regret learning in games with noisy feedback: Faster rates and adaptivity via learning rate separation. Advances in Neural Information Processing Systems.
- [26] A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. Advances in Neural Information Processing Systems.
- [28] Accelerated algorithms for constrained nonconvex-nonconcave min-max optimization and comonotone inclusion. Proceedings of the 41st International Conference on Machine Learning.
- [29] Adaptive, doubly optimal no-regret learning in strongly monotone and exp-concave games with gradient feedback. Operations Research.
- [30] Doubly optimal no-regret online learning in strongly monotone games with bandit feedback. Operations Research.
- [31] Uncoupled and convergent learning in two-player zero-sum Markov games with bandit feedback. Advances in Neural Information Processing Systems.
- [32] Adaptive game playing using multiplicative weights. Games and Economic Behavior, 1999.
- [33] Multiplicative weights update in zero-sum games. Proceedings of the 2018 ACM Conference on Economics and Computation.
- [34] COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences. NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability.
- [35] Statistical impossibility and possibility of aligning LLMs with human preferences: From Condorcet paradox to Nash equilibrium. arXiv preprint arXiv:2503.10990.
- [36] A Minimaximalist Approach to Reinforcement Learning from Human Feedback. Forty-first International Conference on Machine Learning.
- [37] Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675.
- [38] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning. arXiv preprint arXiv:2407.00617.
- [39] Nash Learning from Human Feedback. Forty-first International Conference on Machine Learning.
- [40] A theoretical analysis of Nash learning from human feedback under general KL-regularized preference. arXiv preprint arXiv:2402.07314.
- [41] Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 2022.
- [42] Heads-up limit hold'em poker is solved. Science, 2015.
- [43] Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 2018.
- [44] Superhuman AI for multiplayer poker. Science, 2019.
- [45] Cycles in adversarial regularized learning. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 2018.
- [46] Mao, Weichao and Qiu, Haoran and Wang, Chen and Franke, Hubertus and Kalbarczyk, Zbigniew and Ba. 6th Annual Learning for Dynamics & Control Conference, 2024.
- [47] Near-optimal policy optimization for correlated equilibrium in general-sum Markov games. International Conference on Artificial Intelligence and Statistics, 2024.
- [48] Fast policy extragradient methods for competitive games with entropy regularization. Advances in Neural Information Processing Systems.
- [49] Fast policy extragradient methods for competitive games with entropy regularization. Journal of Machine Learning Research.
- [50] Doubly optimal no-regret learning in monotone games. International Conference on Machine Learning, 2023.
- [51] Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
- [52] Equilibria of polymatrix games. Management Science, 1972.
- [53] Equilibrium points in polymatrix games. Lithuanian Mathematical Journal.
- [54] Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research, 2016.
- [55] Fast swap regret minimization and applications to approximate correlated equilibria. Proceedings of the 56th Annual ACM Symposium on Theory of Computing.
- [56] Faster Rates for No-Regret Learning in General Games via Cautious Optimism. Proceedings of the 56th Annual ACM Symposium on Theory of Computing.
- [57] From External to Swap Regret 2.0: An Efficient Reduction for Large Action Spaces. Proceedings of the 56th Annual ACM Symposium on Theory of Computing.
- [58] A Lower Bound on Swap Regret in Extensive-Form Games. arXiv preprint arXiv:2406.13116.
- [59] On the complexity of finding a local minimizer of a quadratic function over a polytope. Mathematical Programming, 2022.
- [60] Linear degree extractors and the inapproximability of max clique and chromatic number. Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing.
- [61] The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 2012.
- [62] The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. International Conference on Machine Learning, 2021.
- [63] No Internal Regret with Non-convex Loss Functions. Proceedings of the AAAI Conference on Artificial Intelligence.
- [64] Brian Hu Zhang and Ioannis Anagnostides and Gabriele Farina and Tuomas Sandholm. Advances in Neural Information Processing Systems.
- [65] Matching algorithmic bounds for finding a Brouwer fixed point. Journal of the ACM, 2008.
- [66] Mathematical methods and theory in games, programming, and economics: Volume II: the theory of infinite games. 1959.
- [67] Online learning: Beyond regret. Proceedings of the 24th Annual Conference on Learning Theory, 2011.
- [68] Evolutionary dynamics and phi-regret minimization in games. Journal of Artificial Intelligence Research.
- [69] Hindsight and sequential rationality of correlated play. Proceedings of the AAAI Conference on Artificial Intelligence.
- [70] Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research, 2008.
- [71] Sample-efficient learning of correlated equilibria in extensive-form games. Advances in Neural Information Processing Systems.
- [72] Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent. Advances in Neural Information Processing Systems.
- [73] Efficient deviation types and learning for hindsight rationality in extensive-form games. International Conference on Machine Learning, 2021.
- [74] Near-Optimal Phi-Regret Learning in Extensive-Form Games. International Conference on Machine Learning (ICML).
- [75] Simple Uncoupled No-regret Learning Dynamics for Extensive-form Correlated Equilibrium. Journal of the ACM.
- [76] Settling the complexity of computing two-player Nash equilibria. Journal of the ACM, 2009.
- [77] Mathematical Methods and Theory in Games, Programming, and Economics: Volume 2: The Theory of Infinite Games. 2014.
- [78] The complexity of computing a Nash equilibrium. Communications of the ACM, 2009.
- [79] On solving minimax optimization locally: A follow-the-ridge approach. arXiv preprint arXiv:1910.07512.
- [80] On the characterization of local Nash equilibria in continuous games. IEEE Transactions on Automatic Control, 2016.
- [81] The limit points of (optimistic) gradient descent in min-max optimization. 32nd Annual Conference on Neural Information Processing Systems (NeurIPS).
- [82] A further generalization of the Kakutani fixed point theorem, with application to Nash equilibrium points. Proceedings of the American Mathematical Society.
- [83] Constrained Phi-Equilibria. International Conference on Machine Learning.