pith. sign in

arxiv: 2606.06486 · v1 · pith:XVNA6HDFnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI· cs.GT

Regret Minimization with Adaptive Opponents in Repeated Games

Pith reviewed 2026-06-28 01:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.GT
keywords regret minimizationrepeated gamesadaptive opponentsRP-Regretsubgame perfect equilibriumonline learningnon-convex optimization
0
0 comments X

The pith

Minimizing Repeated Policy Regret in games with adaptive opponents enables learning certain subgame perfect equilibria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Repeated Policy Regret to measure performance against history-responsive best-in-hindsight strategies, addressing the failure of standard external regret when opponents adapt. It derives necessary conditions on comparator variation and memory lengths that permit sublinear RP-Regret bounds. Three algorithms are developed to minimize the non-convex RP-Regret or its convex linearization, including an oracle-based method and one for slowly changing opponents. When every player applies such an algorithm, the resulting play converges to certain subgame perfect equilibria of the repeated game. Experiments in games such as Stag-Hunt illustrate that this produces higher joint utilities than standard regret minimization.

Core claim

RP-Regret is defined as the difference between a player's realized accumulated payoff and the payoff of the best policy that could have responded to the full history of play. Necessary conditions require bounded variation in the comparator sequence and finite memory for both the comparator and the opponents' strategies. Under these conditions, an optimization-oracle algorithm, a linearized convex surrogate, and a direct method for slow opponent drift each achieve sublinear RP-Regret. When all players simultaneously minimize RP-Regret or its linearization, the joint play converges to subgame perfect equilibria of the repeated game.

What carries the argument

Repeated Policy Regret (RP-Regret), the gap between realized cumulative utility and the utility of the best history-responsive policy in hindsight.

If this is right

  • Sublinear RP-Regret is attainable only when comparator variation and memory lengths satisfy explicit bounds.
  • A convex linearization of RP-Regret can be minimized by standard online convex optimization routines.
  • Collective minimization of RP-Regret or its linearization produces subgame perfect equilibria.
  • In matrix games such as Stag-Hunt, RP-Regret minimization yields strictly higher average payoffs than external-regret minimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regret notion could be applied to multi-agent reinforcement learning with recurrent policies.
  • Equilibrium selection properties might extend to stochastic games with imperfect monitoring.
  • Computational cost of the optimization oracle version may limit use to small action spaces.

Load-bearing premise

The comparator strategies must vary slowly and both comparators and opponents must have finite memory for sublinear RP-Regret to be possible.

What would settle it

A repeated game in which all players run an RP-Regret minimizer yet the observed play fails to satisfy the one-shot deviation principle for any subgame perfect equilibrium.

Figures

Figures reproduced from arXiv: 2606.06486 by Asuman Ozdaglar, Kaiqing Zhang, Mingyang Liu, Tiancheng Yu.

Figure 1
Figure 1. Figure 1: Examples of the utility matrices for Prisoner’s Dilemma (upper left) and Stag Hunt [PITH_FULL_IMAGE:figures/full_fig_p022_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustrative proof of Lemma M.4. In (i), the maximum tree depth remains 3. In (ii), the maximum tree depth increases from 3 to 4. h0 → h1 → h2 → h0 and h0 → h2 → h0 with length depth(h2) +2 and depth(h2) +1 where depth(h) is the depth of h in the rooted tree. By Frobenius number (Sylvester, 1882), C0 = depth(h2) · (depth(h2) + 1) ≤ (log2 |HM| + 1)(log2 |HM| + 2) = log2 2 |HM| + 3log2 |HM| + 2 by Lemma M… view at source ↗
read the original abstract

In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents who can respond based on histories of play. The standard metric of \emph{external regret} in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the \emph{realized} and the \emph{best-in-hindsight} accumulated utility when all players can \emph{respond} to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize {\tt RP-Regret}, which is by definition \emph{non-convex} in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and \emph{linearized} surrogate of {\tt RP-Regret} at each iteration; (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Repeated Policy Regret (RP-Regret), a regret metric for repeated games that accounts for adaptive, history-dependent opponents by comparing realized utility to the best-in-hindsight utility under counterfactual responses. It states necessary conditions on bounded variation of comparator strategies and bounded memory of comparators and opponents for sublinear RP-Regret. Three algorithms are proposed to minimize RP-Regret (or a linearized convex surrogate): an optimization-oracle method, a surrogate minimization method, and a method for slowly changing opponents. The central claim is that when all players minimize RP-Regret or its surrogate, certain subgame perfect equilibria of the repeated game become learnable. Experiments on games such as Stag-Hunt are reported to show higher cooperative utilities.

Significance. If the conditions for sublinear regret are compatible with the claimed equilibria and the algorithms achieve the stated guarantees, the work would meaningfully extend online learning to adaptive repeated-game settings by enabling stronger comparators than prior regret notions while still supporting equilibrium learning. The native formulation for repeated play and the empirical demonstration of cooperative outcomes are positive aspects.

major comments (2)
  1. [Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.
  2. [Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.
minor comments (1)
  1. [Abstract] Abstract: the experimental section is mentioned but no details on game instances, baselines, or quantitative results are provided, making it impossible to assess the strength of the empirical support.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We respond to each major comment below and indicate where revisions will be made to clarify the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.

    Authors: The manuscript derives necessary conditions (bounded variation of comparator strategies and bounded memory of comparators/opponents) for sublinear RP-Regret in the main text. The abstract's use of 'certain' SPE already restricts the claim to those equilibria satisfying these conditions; any history-dependent SPE inducing unbounded variation would violate the necessary conditions and is therefore excluded from the claim. We will revise the abstract to explicitly note this restriction to SPE satisfying the stated bounds. revision: yes

  2. Referee: [Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.

    Authors: The abstract is a concise summary of results whose proofs and derivations appear in the body of the manuscript, including the analysis showing that the optimization-oracle method and the linearized surrogate yield sublinear RP-Regret under the variation/memory conditions. The connection from surrogate minimization to the original RP-Regret bound is established via the regret decomposition and bounding arguments in the main text. No revision to the abstract is needed, as proof sketches are not standard in abstracts. revision: no

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper introduces the novel RP-Regret metric to capture adaptive opponents and counterfactual reasoning in repeated games. It states necessary conditions on comparator variation and memory bounds for sublinear regret, then develops three independent algorithms (optimization oracle, linearized convex surrogate, and direct minimization under slow opponent changes) to minimize the non-convex RP-Regret. Finally it shows that mutual minimization yields certain subgame-perfect equilibria. None of these steps reduce by construction to self-definitional equivalences, renamed fitted inputs, or load-bearing self-citations; the central claims rest on direct analysis of the new metric and its algorithmic properties without circular reduction to prior fitted quantities or author-specific ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the work relies on standard background results from online learning and repeated game theory with no explicit free parameters, new entities, or ad-hoc axioms listed.

axioms (1)
  • standard math Standard assumptions from online learning and repeated game theory (e.g., finite action spaces, utility functions)
    Invoked implicitly to define regret and equilibria.

pith-pipeline@v0.9.1-grok · 5887 in / 1182 out tokens · 23557 ms · 2026-06-28T01:59:51.963730+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 1 linked inside Pith

  1. [1]

    Kakade, and Karan Singh

    Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019

  2. [2]

    Online learning for adversaries with memory: price of past mistakes

    Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Neural Information Processing Systems (NeurIPS), 2015

  3. [3]

    Online bandit learning against an adaptive adversary: from regret to policy regret

    Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pages 1747--1754, 2012

  4. [4]

    Policy regret in repeated games

    Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games. In Neural Information Processing Systems (NeurIPS), 2018

  5. [5]

    Long term competition: A game theoretic analysis', mimeograph

    Robert Aumann and Lloyd Shapley. Long term competition: A game theoretic analysis', mimeograph. Hebrew University, 1976

  6. [6]

    The evolution of cooperation

    Robert Axelrod and William D Hamilton. The evolution of cooperation. science, 211 0 (4489): 0 1390--1396, 1981

  7. [7]

    Multiplicative weights update in zero-sum games

    James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation (EC), 2018

  8. [8]

    Finitely repeated games

    Jean-Pierre Benoit and Vijay Krishna. Finitely repeated games. Econometrica, 53 0 (4): 0 905--22, 1985. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:53:y:1985:i:4:p:905-22

  9. [9]

    Adaptive regret minimization in bounded-memory games

    Jeremiah Blocki, Nicolas Christin, Anupam Datta, and Arunesh Sinha. Adaptive regret minimization in bounded-memory games. In Decision and Game Theory for Security (GameSec), 2013

  10. [10]

    The myth of the folk theorem

    Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In ACM Symposium on Theory of Computing (STOC), 2008

  11. [11]

    Superhuman ai for heads-up no-limit poker: Libratus beats top professionals

    Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359 0 (6374): 0 418--424, 2018

  12. [12]

    Superhuman ai for multiplayer poker

    Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365 0 (6456): 0 885--890, 2019

  13. [13]

    Single sample path-based optimization of markov chains

    Xi-Ren Cao. Single sample path-based optimization of markov chains. Journal of optimization theory and applications, 100: 0 527--548, 1999

  14. [14]

    Online convex optimization with time-varying constraints and bandit feedback

    Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64 0 (7): 0 2665--2680, 2018

  15. [15]

    Fast policy extragradient methods for competitive games with entropy regularization

    Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021

  16. [16]

    Prediction, Learning, and Games

    Nicolo Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

  17. [17]

    Multiagent learning in the presence of memory-bounded agents

    Doran Chakraborty and Peter Stone. Multiagent learning in the presence of memory-bounded agents. Autonomous agents and multi-agent systems, 28 0 (2): 0 182--213, 2014

  18. [18]

    An online convex optimization approach to proactive network resource allocation

    Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65 0 (24): 0 6350--6364, 2017

  19. [19]

    Computing nash equilibria: Approximation and smoothed complexity

    Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In Symposium on Foundations of Computer Science (FOCS), 2006

  20. [20]

    The approximation complexity of win-lose games

    Xi Chen, Shang-Hua Teng, and Paul Valiant. The approximation complexity of win-lose games. 2007

  21. [21]

    Strongly adaptive online learning

    Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning (ICML), 2015

  22. [22]

    The complexity of computing a nash equilibrium

    Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. Communications of the ACM, 52 0 (2): 0 89--97, 2009

  23. [23]

    The complexity of M arkov equilibrium in stochastic games

    Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of M arkov equilibrium in stochastic games. In Conference on Learning Theory (COLT), 2023

  24. [24]

    How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

    Daniela de Farias and Nimrod Megiddo. How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

  25. [25]

    Competitive M arkov decision processes

    Jerzy Filar and Koos Vrieze. Competitive M arkov decision processes . Springer Science & Business Media, 2012

  26. [26]

    Learning with opponent-learning awareness

    Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017

  27. [27]

    The folk theorem in repeated games with discounting or with incomplete information

    Drew Fudenberg and Eric Maskin. The folk theorem in repeated games with discounting or with incomplete information. In A long-run collaboration on long-run games, pages 209--230. World Scientific, 2009

  28. [28]

    Stochastic games with zero stop probabilities

    Dean Gillette. Stochastic games with zero stop probabilities. Contributions to the Theory of Games, 3 0 (39): 0 179--187, 1957

  29. [29]

    Approximation to bayes risk in repeated play

    James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3: 0 97--139, 1957

  30. [30]

    A simple adaptive procedure leading to correlated equilibrium

    Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68 0 (5): 0 1127--1150, 2000

  31. [31]

    Logarithmic regret algorithms for online convex optimization

    Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69 0 (2): 0 169--192, 2007

  32. [32]

    Introduction to online convex optimization

    Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

  33. [33]

    Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

    Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

  34. [34]

    V-learning -- A simple, efficient, decentralized algorithm for multiagent RL

    Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning -- A simple, efficient, decentralized algorithm for multiagent RL . arXiv preprint arXiv:2110.14555, 2021

  35. [35]

    Finite rationality and interpersonal complexity in repeated games

    Ehud Kalai and William Stanford. Finite rationality and interpersonal complexity in repeated games. Econometrica: Journal of the Econometric Society, pages 397--410, 1988

  36. [36]

    A policy gradient algorithm for learning to learn in multiagent reinforcement learning

    Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning (ICML), 2021

  37. [37]

    Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson

    Alistair Letcher, Jakob N. Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations (ICLR), 2019

  38. [38]

    A polynomial-time nash equilibrium algorithm for repeated games

    Michael L Littman and Peter Stone. A polynomial-time nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, 2003

  39. [39]

    The power of regularization in solving extensive-form games

    Mingyang Liu, Asuman E Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. In International Conference on Learning Representations (ICLR), 2022

  40. [40]

    On the impossibility of learning to cooperate with adaptive partner strategies in repeated games

    Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning (ICML), pages 14197--14209. PMLR, 2022

  41. [41]

    Model-free opponent shaping

    Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning (ICML), pages 14398--14411. PMLR, 2022

  42. [42]

    Provably efficient reinforcement learning in decentralized general-sum markov games

    Weichao Mao and Tamer Ba s ar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, pages 1--22, 2022

  43. [43]

    Near-optimal model-free reinforcement learning in non-stationary episodic MDPs

    Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs . In International Conference on Machine Learning (ICML), 2021

  44. [44]

    On sequential strategies for loss functions with memory

    Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48 0 (7): 0 1947--1958, 2002

  45. [45]

    Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma

    Abraham Neyman. Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma. Economics Letters, 19 0 (3): 0 227--229, 1985

  46. [46]

    Markov chains

    James R Norris. Markov chains. Number 2. Cambridge university press, 1998

  47. [47]

    Online learning of feasible strategies in unknown environments

    Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62 0 (6): 0 2807--2822, 2016

  48. [48]

    On the interpretation of decision problems with imperfect recall

    Michele Piccione and Ariel Rubinstein. On the interpretation of decision problems with imperfect recall. Games and Economic Behavior, 20 0 (1): 0 3--24, 1997

  49. [49]

    Markov decision processes: D iscrete stochastic dynamic programming

    Martin L Puterman. Markov decision processes: D iscrete stochastic dynamic programming . John Wiley & Sons, 2014

  50. [50]

    Can bounded rationality resolve the Prisoner's dilemma

    Roy Radner. Can bounded rationality resolve the Prisoner's dilemma. Essays in honor of Gerard Debreu, pages 387--399, 1986

  51. [51]

    Algorithmic game theory

    Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53 0 (7): 0 78--86, 2010

  52. [52]

    On the impossibility of achieving no regrets in repeated games

    Karl Schlag and Andriy Zapechelnyuk. On the impossibility of achieving no regrets in repeated games. Journal of Economic Behavior & Organization, 81 0 (1): 0 153--158, 2012

  53. [53]

    Stochastic games

    Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39 0 (10): 0 1095--1100, 1953

  54. [54]

    Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer

    Samuel Sokota, Ryan D'Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR), 2023

  55. [55]

    When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

    Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

  56. [56]

    Online non-convex learning: Following the perturbed leader is optimal

    Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In International Conference on Algorithmic Learning Theory (ALT), 2020

  57. [57]

    On subvariants, ie semi-invariants to binary quantics of an unlimited order

    James J Sylvester. On subvariants, ie semi-invariants to binary quantics of an unlimited order. American Journal of Mathematics, 5 0 (1): 0 79--136, 1882

  58. [58]

    Stability and perfection of Nash equilibria, volume 339

    Eric Van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991

  59. [59]

    Strategy: an introduction to game theory, volume 139

    Joel Watson. Strategy: an introduction to game theory, volume 139. WW Norton New York, 2002

  60. [60]

    A practical use of imperfect recall

    Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael H Bowling. A practical use of imperfect recall. In SARA, 2009

  61. [61]

    Cola: consistent learning with opponent-learning awareness

    Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. Cola: consistent learning with opponent-learning awareness. In International Conference on Machine Learning (ICML), pages 23804--23831. PMLR, 2022

  62. [62]

    Adaptive online learning in dynamic environments

    Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Neural Information Processing Systems (NeurIPS), 2018 a

  63. [63]

    Dynamic regret of strongly adaptive methods

    Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), pages 5882--5891. PMLR, 2018 b

  64. [64]

    Dynamic regret of strongly adaptive methods

    Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), 2018 c

  65. [65]

    Non-stationary online learning with memory and non-stochastic control

    Peng Zhao, Yu-Xiang Wang, and Zhi-Hua Zhou. Non-stationary online learning with memory and non-stochastic control. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022

  66. [66]

    Online convex programming and generalized infinitesimal gradient ascent

    Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003

  67. [67]

    Response regret

    Martin Zinkevich. Response regret. In AAAI Fall Symposium: Coevolutionary and Coadaptive Systems, page 41, 2005