Recognition: unknown
The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback
Pith reviewed 2026-05-10 08:06 UTC · model grok-4.3
The pith
Uncoupled algorithms in zero-sum games with only bandit feedback cannot make their current policies converge to Nash equilibrium faster than order T to the minus one fourth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance for uncoupled algorithms under bandit feedback, with the best attainable rate being Omega of T to the minus one fourth. This lower bound is matched up to constants and logarithmic factors by two explicit algorithms: a straightforward exploration-exploitation tradeoff and a regularization technique based on two-step mirror descent.
What carries the argument
An exploration-exploitation tradeoff schedule together with a two-step mirror descent regularization that together control the trajectory of the current policy under private payoff observations.
If this is right
- Average iterates can still converge at the faster T to the minus one half rate even while the last iterate lags.
- Any practical algorithm must incorporate explicit exploration phases or regularization to reach the optimal last-iterate rate.
- The gap between last-iterate and average rates quantifies the cost of full decentralization under bandit feedback.
- The same limitation applies to the exploitability of the played strategy rather than its historical average.
Where Pith is reading between the lines
- Adding even limited communication between players could potentially restore faster last-iterate rates, suggesting a sharp threshold effect.
- The T to the minus one fourth barrier may persist in continuous-action or function-approximation settings and could be tested there directly.
- In multi-agent reinforcement learning, designers may need to optimize explicitly for average performance when instantaneous guarantees are costly.
Load-bearing premise
Players act independently, receive only their individual bandit payoff each round, and share no communication or coordination.
What would settle it
A concrete zero-sum matrix game and uncoupled algorithm where the last-iterate exploitability gap decays faster than order T to the minus one fourth would refute the lower bound.
read the original abstract
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $\Omega(T^{-1/4})$ in contrast to the usual $\Omega(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies uncoupled learning in zero-sum matrix games under bandit feedback, where each player observes only their own payoff. It establishes that last-iterate convergence of policy profiles to Nash equilibrium cannot exceed Ω(T^{-1/4}) exploitability gap for uncoupled algorithms (in contrast to the Ω(T^{-1/2}) rate achievable for average iterates), and proposes two algorithms achieving this rate up to constants and logs: one via direct exploration-exploitation tradeoff and one via two-step regularized mirror descent.
Significance. If the lower bound and matching upper bounds hold, the work clarifies a fundamental rate tradeoff between last-iterate and average-iterate guarantees in the bandit uncoupled setting, extending prior non-bandit results. The constructive algorithms and first-principles lower bound are strengths that make the contribution concrete and falsifiable.
major comments (2)
- [Lower bound theorem (likely §3 or §4)] The lower-bound argument establishing Ω(T^{-1/4}) as the best attainable rate for last-iterate exploitability must be verified in full; it is load-bearing for the central claim that last-iterate convergence is detrimental relative to average-iterate rates.
- [Algorithm sections and regret analysis] For the two-step regularized mirror descent algorithm, confirm that the regularization schedule and step-size choices achieve the stated rate without hidden dependence on game parameters or prior knowledge of the equilibrium; the exploration-exploitation algorithm should similarly be checked for its explicit rate derivation.
minor comments (2)
- Notation for exploitability gap and last-iterate policy profiles should be introduced with a dedicated preliminary section or table for clarity.
- [Abstract and main theorems] The abstract states the rates but omits the precise dependence on the number of actions; this should be made explicit in the theorem statements.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our paper. We address the major comments point by point below, providing clarifications on the lower bound and algorithm analyses. We believe our results are correct as stated, but we are prepared to make minor revisions for added clarity.
read point-by-point responses
-
Referee: [Lower bound theorem (likely §3 or §4)] The lower-bound argument establishing Ω(T^{-1/4}) as the best attainable rate for last-iterate exploitability must be verified in full; it is load-bearing for the central claim that last-iterate convergence is detrimental relative to average-iterate rates.
Authors: We appreciate the referee's emphasis on the lower bound, which is indeed central to our contribution. The complete proof of the Ω(T^{-1/4}) lower bound for the last-iterate exploitability gap under uncoupled bandit feedback is detailed in Section 4 of the manuscript (with supporting lemmas in the appendix). The argument proceeds by reducing the problem to a carefully constructed two-action zero-sum game where the payoff matrix is chosen adversarially based on the players' past actions, showing that any algorithm without coupling must incur at least Ω(T^{-1/4}) exploitability in the last iterate. This is derived using a combination of regret lower bounds and a direct analysis of the policy convergence. The proof does not rely on any unverified assumptions and is fully rigorous. To address the referee's concern, we can include a brief intuitive overview of the lower bound construction in the main text in the revised version. revision: partial
-
Referee: [Algorithm sections and regret analysis] For the two-step regularized mirror descent algorithm, confirm that the regularization schedule and step-size choices achieve the stated rate without hidden dependence on game parameters or prior knowledge of the equilibrium; the exploration-exploitation algorithm should similarly be checked for its explicit rate derivation.
Authors: We confirm that both proposed algorithms achieve the O(T^{-1/4} polylog T) rate without any hidden dependence on the game parameters or knowledge of the Nash equilibrium. For the two-step regularized mirror descent algorithm (Section 5), the regularization strength is scheduled as λ_t = Θ(t^{-1/2}), and the mirror descent step sizes are η_t = Θ(t^{-1/2}), chosen independently of the payoff values; the analysis in the appendix derives the bound explicitly using standard mirror descent regret bounds adapted to the two-step structure, resulting in the desired rate uniformly over all zero-sum games. Similarly, the exploration-exploitation algorithm (Section 6) uses an exploration probability of Θ(T^{-1/2}) in a phased manner or decreasing schedule, with explicit regret decomposition showing the last-iterate exploitability bound without requiring prior knowledge. All constants are absolute and the rates hold for any finite action sets. If the referee finds the derivations insufficiently explicit, we can expand the main-text presentation of the step-size choices. revision: no
Circularity Check
No significant circularity detected
full rationale
The paper establishes a lower bound of Ω(T^{-1/4}) for last-iterate convergence in uncoupled zero-sum games with bandit feedback via standard minimax and information-theoretic arguments that do not depend on the proposed algorithms or any fitted parameters from the result. The two algorithms (exploration-exploitation tradeoff and two-step regularized mirror descent) are explicitly constructed to achieve this rate up to logs and constants, with no reduction of the bound or algorithm performance to self-citations, ansatzes, or renamings of known results. The derivation chain remains self-contained against external benchmarks in online learning, with the prior O(T^{-1/8}) bound cited only as motivation rather than as a load-bearing premise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Players have access only to bandit feedback in repeated zero-sum matrix games
Reference graph
Works this paper leans on
-
[1]
Last-iterate convergence with full and noisy feedback in two-player zero-sum games, 2023
Abe, K., Ariu, K., Sakamoto, M., Toyoshima, K., and Iwasaki, A. Last-iterate convergence with full and noisy feedback in two-player zero-sum games, 2023
2023
-
[2]
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The Nonstochastic Multiarmed Bandit Problem . SIAM Journal on Computing, 32 0 (1): 0 48--77, January 2002. ISSN 0097-5397. doi:10.1137/S0097539701398375. URL https://epubs.siam.org/doi/10.1137/S0097539701398375. Publisher: Society for Industrial and Applied Mathematics
-
[3]
The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities, 2021
Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities, 2021
2021
-
[4]
Doubly Optimal No - Regret Online Learning in Strongly Monotone Games with Bandit Feedback
Ba, W., Lin, T., Zhang, J., and Zhou, Z. Doubly Optimal No - Regret Online Learning in Strongly Monotone Games with Bandit Feedback . Operations Research, January 2025. ISSN 0030-364X. doi:10.1287/opre.2021.0445. URL https://pubsonline.informs.org/doi/abs/10.1287/opre.2021.0445. Publisher: INFORMS
-
[5]
Théorie des opérations linéaires
Banach, S. Théorie des opérations linéaires. 1932. URL https://eudml.org/doc/271931
1932
-
[6]
Bandit Learning in Concave N - Person Games
Bravo, M., Leslie, D., and Mertikopoulos, P. Bandit Learning in Concave N - Person Games . In Advances in Neural Information Processing Systems , volume 31. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper_files/paper/2018/hash/47fd3c87f42f55d4b233417d49c34783-Abstract.html
2018
-
[7]
Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback
Cai, Y., Luo, H., Wei, C.-Y., and Zheng, W. Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc
2023
-
[8]
Fast Last - Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Cai, Y., Farina, G., Grand-Clément, J., Kroer, C., Lee, C.-W., Luo, H., and Zheng, W. Fast Last - Iterate Convergence of Learning in Games Requires Forgetful Algorithms . November 2024. URL https://openreview.net/forum?id=hK7XTpCtBi&referrer=
2024
-
[9]
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning , and Games . Cambridge University Press, Cambridge, 2006. ISBN 978-0-521-84108-5. doi:10.1017/CBO9780511546921. URL https://www.cambridge.org/core/books/prediction-learning-and-games/A05C9F6ABC752FAB8954C885D0065C8F
-
[10]
Online Optimization with Gradual Variations
Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online Optimization with Gradual Variations . In Proceedings of the 25th Annual Conference on Learning Theory , pp.\ 6.1--6.20. JMLR Workshop and Conference Proceedings, June 2012. URL https://proceedings.mlr.press/v23/chiang12.html. ISSN: 1938-7228
2012
-
[11]
arXiv preprint arXiv:2408.08395 , year=
Dong, J., Wang, B., and Yu, Y. Uncoupled and convergent learning in monotone games under bandit feedback. arXiv preprint arXiv:2408.08395, 2024
-
[12]
Drusvyatskiy, D., Fazel, M., and Ratliff, L. J. Improved Rates for Derivative Free Gradient Play in Strongly Monotone Games . In 2022 IEEE 61st Conference on Decision and Control ( CDC ) , pp.\ 3403--3408, December 2022. doi:10.1109/CDC51059.2022.9992950. URL https://ieeexplore.ieee.org/document/9992950. ISSN: 2576-2370
-
[13]
Facchinei, F. and Pang, J.-S. (eds.). Finite- Dimensional Variational Inequalities and Complementarity Problems . Springer Series in Operations Research and Financial Engineering . Springer, New York, NY, 2004. ISBN 978-0-387-95580-3. doi:10.1007/b97543. URL http://link.springer.com/10.1007/b97543
-
[14]
Fictitious self-play in extensive-form games
Heinrich, J., Lanctot, M., and Silver, D. Fictitious self-play in extensive-form games. In International conference on machine learning, 2015
2015
-
[15]
Hiriart-Urruty, J.-B. and Lemaréchal, C. Fundamentals of Convex Analysis, pp.\ 163--208. 01 2001. ISBN 978-3-540-42205-1. doi:10.1007/978-3-642-56468-0_5
-
[16]
On the Convergence of Single - Call Stochastic Extra - Gradient Methods
Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the Convergence of Single - Call Stochastic Extra - Gradient Methods . In Advances in Neural Information Processing Systems , Vancouver, Canada, December 2019. URL https://inria.hal.science/hal-02403555
2019
-
[17]
Zeroth-order learning in continuous games via residual pseudogradient estimates,
Huang, Y. and Hu, J. Zeroth- Order Learning in Continuous Games Via Residual Pseudogradient Estimates . IEEE Transactions on Automatic Control, pp.\ 1--16, 2024. ISSN 1558-2523. doi:10.1109/TAC.2024.3479874. URL https://ieeexplore.ieee.org/document/10715648?denied=. Conference Name: IEEE Transactions on Automatic Control
-
[18]
Juditsky, A. B., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with Stochastic Mirror - Prox algorithm. Stochastic Systems, 1 0 (1): 0 17, 2011. doi:10.1214/10-SSY011. URL https://hal.science/hal-00318043
-
[19]
A., Hsieh, Y.-P., Sahin, M
Kangarshahi, E. A., Hsieh, Y.-P., Sahin, M. F., and Cevher, V. Let’s be Honest : An Optimal No - Regret Framework for Zero - Sum Games . In Proceedings of the 35th International Conference on Machine Learning , pp.\ 2488--2496. PMLR, July 2018. URL https://proceedings.mlr.press/v80/kangarshahi18a.html. ISSN: 2640-3498
2018
-
[20]
Efficient learning by implicit exploration in bandit problems with side observations
Koc\' a k, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations . In Advances in Neural Information Processing Systems 27, 2014
2014
-
[21]
Korpelevich, G. M. The extragradient method for finding saddle points and other problems. 1976
1976
-
[22]
Model-free learning for two-player zero-sum partially observable Markov games with perfect recall
Kozuno, T., Ménard, P., Munos, R., and Valko, M. Model-free learning for two-player zero-sum partially observable Markov games with perfect recall. In Proceedings of the 35th International Conference on Neural Information Processing Systems , NIPS '21, pp.\ 11987--11998, Red Hook, NY, USA, December 2021. Curran Associates Inc. ISBN 978-1-71384-539-3
2021
-
[23]
Faster First - Order Methods for Extensive - Form Game Solving
Kroer, C., Waugh, K., Kilinç-Karzan, F., and Sandholm, T. Faster First - Order Methods for Extensive - Form Game Solving . In Proceedings of the Sixteenth ACM Conference on Economics and Computation , EC '15, pp.\ 817--834, New York, NY, USA, June 2015. Association for Computing Machinery. ISBN 978-1-4503-3410-5. doi:10.1145/2764468.2764476. URL https://d...
-
[24]
Kuhn, H. W. Extensive Games and the Problem of Information . Annals of Mathematics Studies, 28: 0 193--216, 1953
1953
-
[25]
Mancino, O. G. and Stampacchia, G. Convex programming and variational inequalities. Journal of Optimization Theory and Applications, 9 0 (1): 0 3--23, January 1972. ISSN 1573-2878. doi:10.1007/BF00932801. URL https://doi.org/10.1007/BF00932801
-
[26]
Martinet, B. Brève communication. Régularisation d'inéquations variationnelles par approximations successives. Revue française d'informatique et de recherche opérationnelle. Série rouge, 4 0 (R3): 0 154--158, 1970. ISSN 0373-8000, 2777-3515. doi:10.1051/m2an/197004R301541. URL https://www.esaim-m2an.org/articles/m2an/abs/1970/03/m2an197004R301541/m2an1970...
-
[27]
McAleer, S., Farina, G., Lanctot, M., and Sandholm, T. ESCHER: E schewing importance sampling in games by computing a history value function to estimate regret. CoRR, abs/2206.04122, 2022. doi:10.48550/arXiv.2206.04122. URL https://doi.org/10.48550/arXiv.2206.04122
-
[28]
G., Rowland, M., Guo, Z
Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., Selvi, M., Girgin, S., Momchev, N., Bachem, O., Mankowitz, D. J., Precup, D., and Piot, B. Nash learning from human feedback, 2023
2023
-
[29]
On the Impossibility of Convergence of Mixed Strategies with Optimal No - Regret Learning
Muthukumar, V., Phade, S., and Sahai, A. On the Impossibility of Convergence of Mixed Strategies with Optimal No - Regret Learning . Mathematics of Operations Research, November 2020. ISSN 0364-765X. doi:10.1287/moor.2022.0016. URL https://pubsonline.informs.org/doi/10.1287/moor.2022.0016. Publisher: INFORMS
-
[30]
Nash Jr, J. F. Equilibrium points in N-person games . Proceedings of the National Academy of Sciences of the United States of America, 36 0 (1): 0 48--49, 1950
1950
-
[31]
Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems
Nemirovski, A. Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems . SIAM Journal on Optimization, 15 0 (1): 0 229--251, 2004
2004
-
[32]
Robust stochastic approximation approach to stochastic programming,
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust Stochastic Approximation Approach to Stochastic Programming . SIAM Journal on Optimization, 19 0 (4): 0 1574--1609, January 2009. ISSN 1052-6234. doi:10.1137/070704277. URL https://epubs.siam.org/doi/10.1137/070704277. Publisher: Society for Industrial and Applied Mathematics
-
[33]
Nemirovskiĭ, A. S. and IUdin, D. B. Problem Complexity and Method Efficiency in Optimization . Wiley, 1983. ISBN 978-0-471-10345-5. Google-Books-ID: 6ULvAAAAMAAJ
1983
-
[34]
Explore no more: improved high-probability regret bounds for non-stochastic bandits
Neu, G. Explore no more: improved high-probability regret bounds for non-stochastic bandits. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2 , volume 2 of NIPS '15 , pp.\ 3168--3176, Cambridge, MA, USA, December 2015. MIT Press
2015
-
[35]
Information and information stability of random variables and processes
Pinsker, M. Information and information stability of random variables and processes. 1964
1964
-
[36]
Popov, L. D. A modification of the Arrow - Hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 28 0 (5): 0 845--848, November 1980. ISSN 1573-8876. doi:10.1007/BF01141092. URL https://doi.org/10.1007/BF01141092
-
[37]
and Sridharan, K
Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 , volume 2 of NIPS '13 , pp.\ 3066--3074, Red Hook, NY, USA, December 2013. Curran Associates Inc
2013
-
[38]
Real and Complex Analysis
Rudin, W. Real and Complex Analysis . Tata McGraw-Hill, 2006. ISBN 978-0-07-061987-6. Google-Books-ID: 3d8umJoRd08C
2006
-
[39]
Tatarenko, T. and Kamgarpour, M. On the Rate of Convergence of Payoff -based Algorithms to Nash Equilibrium in Strongly Monotone Games , February 2022. URL http://arxiv.org/abs/2202.11147. arXiv:2202.11147 [math]
-
[40]
On linear convergence of iterative methods for the variational inequality problem
Tseng, P. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60 0 (1): 0 237--252, 1995. ISSN 0377-0427. doi:https://doi.org/10.1016/0377-0427(94)00094-H. Proceedings of the International Meeting on Linear/Nonlinear Iterative Methods and Verification of Solution
-
[41]
Tyrrell, R. R. Monotone Operators and the Proximal Point Algorithm SIAM Journal on Control and Optimization , 1976. URL https://epubs.siam.org/doi/10.1137/0314056
-
[42]
Mathematische Annalen , year =
v. Neumann, J. Zur Theorie der Gesellschaftsspiele . Mathematische Annalen, 100 0 (1): 0 295--320, December 1928. ISSN 1432-1807. doi:10.1007/BF01448847. URL https://doi.org/10.1007/BF01448847
-
[43]
Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games
Wei, C., Lee, C., Zhang, M., and Luo, H. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games. In Belkin, M. and Kpotufe, S. (eds.), Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA , volume 134 of Proceedings of Machine Learning Research, pp.\ 4259--42...
2021
-
[44]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[45]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.