Regret Minimization with Adaptive Opponents in Repeated Games

Asuman Ozdaglar; Kaiqing Zhang; Mingyang Liu; Tiancheng Yu

arxiv: 2606.06486 · v1 · pith:XVNA6HDFnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI· cs.GT

Regret Minimization with Adaptive Opponents in Repeated Games

Mingyang Liu , Asuman Ozdaglar , Tiancheng Yu , Kaiqing Zhang This is my paper

Pith reviewed 2026-06-28 01:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.GT

keywords regret minimizationrepeated gamesadaptive opponentsRP-Regretsubgame perfect equilibriumonline learningnon-convex optimization

0 comments

The pith

Minimizing Repeated Policy Regret in games with adaptive opponents enables learning certain subgame perfect equilibria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Repeated Policy Regret to measure performance against history-responsive best-in-hindsight strategies, addressing the failure of standard external regret when opponents adapt. It derives necessary conditions on comparator variation and memory lengths that permit sublinear RP-Regret bounds. Three algorithms are developed to minimize the non-convex RP-Regret or its convex linearization, including an oracle-based method and one for slowly changing opponents. When every player applies such an algorithm, the resulting play converges to certain subgame perfect equilibria of the repeated game. Experiments in games such as Stag-Hunt illustrate that this produces higher joint utilities than standard regret minimization.

Core claim

RP-Regret is defined as the difference between a player's realized accumulated payoff and the payoff of the best policy that could have responded to the full history of play. Necessary conditions require bounded variation in the comparator sequence and finite memory for both the comparator and the opponents' strategies. Under these conditions, an optimization-oracle algorithm, a linearized convex surrogate, and a direct method for slow opponent drift each achieve sublinear RP-Regret. When all players simultaneously minimize RP-Regret or its linearization, the joint play converges to subgame perfect equilibria of the repeated game.

What carries the argument

Repeated Policy Regret (RP-Regret), the gap between realized cumulative utility and the utility of the best history-responsive policy in hindsight.

If this is right

Sublinear RP-Regret is attainable only when comparator variation and memory lengths satisfy explicit bounds.
A convex linearization of RP-Regret can be minimized by standard online convex optimization routines.
Collective minimization of RP-Regret or its linearization produces subgame perfect equilibria.
In matrix games such as Stag-Hunt, RP-Regret minimization yields strictly higher average payoffs than external-regret minimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regret notion could be applied to multi-agent reinforcement learning with recurrent policies.
Equilibrium selection properties might extend to stochastic games with imperfect monitoring.
Computational cost of the optimization oracle version may limit use to small action spaces.

Load-bearing premise

The comparator strategies must vary slowly and both comparators and opponents must have finite memory for sublinear RP-Regret to be possible.

What would settle it

A repeated game in which all players run an RP-Regret minimizer yet the observed play fails to satisfy the one-shot deviation principle for any subgame perfect equilibrium.

Figures

Figures reproduced from arXiv: 2606.06486 by Asuman Ozdaglar, Kaiqing Zhang, Mingyang Liu, Tiancheng Yu.

**Figure 2.** Figure 2: An illustrative proof of Lemma M.4. In (i), the maximum tree depth remains 3. In (ii), the maximum tree depth increases from 3 to 4. h0 → h1 → h2 → h0 and h0 → h2 → h0 with length depth(h2) +2 and depth(h2) +1 where depth(h) is the depth of h in the rooted tree. By Frobenius number (Sylvester, 1882), C0 = depth(h2) · (depth(h2) + 1) ≤ (log2 |HM| + 1)(log2 |HM| + 2) = log2 2 |HM| + 3log2 |HM| + 2 by Lemma M… view at source ↗

read the original abstract

In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents who can respond based on histories of play. The standard metric of \emph{external regret} in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the \emph{realized} and the \emph{best-in-hindsight} accumulated utility when all players can \emph{respond} to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize {\tt RP-Regret}, which is by definition \emph{non-convex} in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and \emph{linearized} surrogate of {\tt RP-Regret} at each iteration; (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RP-Regret is a reasonable new metric for adaptive opponents, but the variation and memory conditions look likely to restrict the SPE learning claim to a narrow set of cases.

read the letter

The paper's core move is to replace external regret with RP-Regret, which lets the comparator and the opponents both respond to the full history. That is a natural step once you accept that opponents adapt. They also give three concrete algorithms: an oracle-based one, a linearized convex surrogate, and a direct method that works when opponents move slowly. The experiments on Stag-Hunt are at least directionally consistent with the claim that minimizing the new regret can produce more cooperative play.

The necessary conditions on bounded variation of the comparator and bounded memory of both sides are stated up front, which is honest. The stress-test concern is on point: standard subgame-perfect equilibria are history-dependent, and nothing in the abstract shows that those strategies satisfy the variation bound. If the bound is violated, the sublinear-regret guarantee does not apply, so the equilibrium-learning result only covers a restricted class of SPE. Without the proofs it is impossible to tell how tight the conditions really are or whether the algorithms still give useful guarantees when the conditions are relaxed.

The work is aimed at researchers who already work on regret in repeated games and want a metric that respects counterfactual reasoning. It is worth sending to referees because the gap it targets is real and the algorithmic proposals are concrete, even if the equilibrium claim needs tighter qualification. A serious review would focus on whether the variation condition can be met by interesting equilibria and whether the surrogate algorithm preserves the equilibrium property.

Referee Report

2 major / 1 minor

Summary. The paper introduces Repeated Policy Regret (RP-Regret), a regret metric for repeated games that accounts for adaptive, history-dependent opponents by comparing realized utility to the best-in-hindsight utility under counterfactual responses. It states necessary conditions on bounded variation of comparator strategies and bounded memory of comparators and opponents for sublinear RP-Regret. Three algorithms are proposed to minimize RP-Regret (or a linearized convex surrogate): an optimization-oracle method, a surrogate minimization method, and a method for slowly changing opponents. The central claim is that when all players minimize RP-Regret or its surrogate, certain subgame perfect equilibria of the repeated game become learnable. Experiments on games such as Stag-Hunt are reported to show higher cooperative utilities.

Significance. If the conditions for sublinear regret are compatible with the claimed equilibria and the algorithms achieve the stated guarantees, the work would meaningfully extend online learning to adaptive repeated-game settings by enabling stronger comparators than prior regret notions while still supporting equilibrium learning. The native formulation for repeated play and the empirical demonstration of cooperative outcomes are positive aspects.

major comments (2)

[Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.
[Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.

minor comments (1)

[Abstract] Abstract: the experimental section is mentioned but no details on game instances, baselines, or quantitative results are provided, making it impossible to assess the strength of the empirical support.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We respond to each major comment below and indicate where revisions will be made to clarify the claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.

Authors: The manuscript derives necessary conditions (bounded variation of comparator strategies and bounded memory of comparators/opponents) for sublinear RP-Regret in the main text. The abstract's use of 'certain' SPE already restricts the claim to those equilibria satisfying these conditions; any history-dependent SPE inducing unbounded variation would violate the necessary conditions and is therefore excluded from the claim. We will revise the abstract to explicitly note this restriction to SPE satisfying the stated bounds. revision: yes
Referee: [Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.

Authors: The abstract is a concise summary of results whose proofs and derivations appear in the body of the manuscript, including the analysis showing that the optimization-oracle method and the linearized surrogate yield sublinear RP-Regret under the variation/memory conditions. The connection from surrogate minimization to the original RP-Regret bound is established via the regret decomposition and bounding arguments in the main text. No revision to the abstract is needed, as proof sketches are not standard in abstracts. revision: no

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper introduces the novel RP-Regret metric to capture adaptive opponents and counterfactual reasoning in repeated games. It states necessary conditions on comparator variation and memory bounds for sublinear regret, then develops three independent algorithms (optimization oracle, linearized convex surrogate, and direct minimization under slow opponent changes) to minimize the non-convex RP-Regret. Finally it shows that mutual minimization yields certain subgame-perfect equilibria. None of these steps reduce by construction to self-definitional equivalences, renamed fitted inputs, or load-bearing self-citations; the central claims rest on direct analysis of the new metric and its algorithmic properties without circular reduction to prior fitted quantities or author-specific ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the work relies on standard background results from online learning and repeated game theory with no explicit free parameters, new entities, or ad-hoc axioms listed.

axioms (1)

standard math Standard assumptions from online learning and repeated game theory (e.g., finite action spaces, utility functions)
Invoked implicitly to define regret and equilibria.

pith-pipeline@v0.9.1-grok · 5887 in / 1182 out tokens · 23557 ms · 2026-06-28T01:59:51.963730+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 1 linked inside Pith

[1]

Kakade, and Karan Singh

Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019

2019
[2]

Online learning for adversaries with memory: price of past mistakes

Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Neural Information Processing Systems (NeurIPS), 2015

2015
[3]

Online bandit learning against an adaptive adversary: from regret to policy regret

Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pages 1747--1754, 2012

2012
[4]

Policy regret in repeated games

Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games. In Neural Information Processing Systems (NeurIPS), 2018

2018
[5]

Long term competition: A game theoretic analysis', mimeograph

Robert Aumann and Lloyd Shapley. Long term competition: A game theoretic analysis', mimeograph. Hebrew University, 1976

1976
[6]

The evolution of cooperation

Robert Axelrod and William D Hamilton. The evolution of cooperation. science, 211 0 (4489): 0 1390--1396, 1981

1981
[7]

Multiplicative weights update in zero-sum games

James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation (EC), 2018

2018
[8]

Finitely repeated games

Jean-Pierre Benoit and Vijay Krishna. Finitely repeated games. Econometrica, 53 0 (4): 0 905--22, 1985. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:53:y:1985:i:4:p:905-22

1985
[9]

Adaptive regret minimization in bounded-memory games

Jeremiah Blocki, Nicolas Christin, Anupam Datta, and Arunesh Sinha. Adaptive regret minimization in bounded-memory games. In Decision and Game Theory for Security (GameSec), 2013

2013
[10]

The myth of the folk theorem

Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In ACM Symposium on Theory of Computing (STOC), 2008

2008
[11]

Superhuman ai for heads-up no-limit poker: Libratus beats top professionals

Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359 0 (6374): 0 418--424, 2018

2018
[12]

Superhuman ai for multiplayer poker

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365 0 (6456): 0 885--890, 2019

2019
[13]

Single sample path-based optimization of markov chains

Xi-Ren Cao. Single sample path-based optimization of markov chains. Journal of optimization theory and applications, 100: 0 527--548, 1999

1999
[14]

Online convex optimization with time-varying constraints and bandit feedback

Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64 0 (7): 0 2665--2680, 2018

2018
[15]

Fast policy extragradient methods for competitive games with entropy regularization

Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021

2021
[16]

Prediction, Learning, and Games

Nicolo Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

2006
[17]

Multiagent learning in the presence of memory-bounded agents

Doran Chakraborty and Peter Stone. Multiagent learning in the presence of memory-bounded agents. Autonomous agents and multi-agent systems, 28 0 (2): 0 182--213, 2014

2014
[18]

An online convex optimization approach to proactive network resource allocation

Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65 0 (24): 0 6350--6364, 2017

2017
[19]

Computing nash equilibria: Approximation and smoothed complexity

Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In Symposium on Foundations of Computer Science (FOCS), 2006

2006
[20]

The approximation complexity of win-lose games

Xi Chen, Shang-Hua Teng, and Paul Valiant. The approximation complexity of win-lose games. 2007

2007
[21]

Strongly adaptive online learning

Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning (ICML), 2015

2015
[22]

The complexity of computing a nash equilibrium

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. Communications of the ACM, 52 0 (2): 0 89--97, 2009

2009
[23]

The complexity of M arkov equilibrium in stochastic games

Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of M arkov equilibrium in stochastic games. In Conference on Learning Theory (COLT), 2023

2023
[24]

How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

Daniela de Farias and Nimrod Megiddo. How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

2003
[25]

Competitive M arkov decision processes

Jerzy Filar and Koos Vrieze. Competitive M arkov decision processes . Springer Science & Business Media, 2012

2012
[26]

Learning with opponent-learning awareness

Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017

Pith/arXiv arXiv 2017
[27]

The folk theorem in repeated games with discounting or with incomplete information

Drew Fudenberg and Eric Maskin. The folk theorem in repeated games with discounting or with incomplete information. In A long-run collaboration on long-run games, pages 209--230. World Scientific, 2009

2009
[28]

Stochastic games with zero stop probabilities

Dean Gillette. Stochastic games with zero stop probabilities. Contributions to the Theory of Games, 3 0 (39): 0 179--187, 1957

1957
[29]

Approximation to bayes risk in repeated play

James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3: 0 97--139, 1957

1957
[30]

A simple adaptive procedure leading to correlated equilibrium

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68 0 (5): 0 1127--1150, 2000

2000
[31]

Logarithmic regret algorithms for online convex optimization

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69 0 (2): 0 169--192, 2007

2007
[32]

Introduction to online convex optimization

Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

2016
[33]

Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

2018
[34]

V-learning -- A simple, efficient, decentralized algorithm for multiagent RL

Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning -- A simple, efficient, decentralized algorithm for multiagent RL . arXiv preprint arXiv:2110.14555, 2021

arXiv 2021
[35]

Finite rationality and interpersonal complexity in repeated games

Ehud Kalai and William Stanford. Finite rationality and interpersonal complexity in repeated games. Econometrica: Journal of the Econometric Society, pages 397--410, 1988

1988
[36]

A policy gradient algorithm for learning to learn in multiagent reinforcement learning

Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning (ICML), 2021

2021
[37]

Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson

Alistair Letcher, Jakob N. Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations (ICLR), 2019

2019
[38]

A polynomial-time nash equilibrium algorithm for repeated games

Michael L Littman and Peter Stone. A polynomial-time nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, 2003

2003
[39]

The power of regularization in solving extensive-form games

Mingyang Liu, Asuman E Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. In International Conference on Learning Representations (ICLR), 2022

2022
[40]

On the impossibility of learning to cooperate with adaptive partner strategies in repeated games

Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning (ICML), pages 14197--14209. PMLR, 2022

2022
[41]

Model-free opponent shaping

Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning (ICML), pages 14398--14411. PMLR, 2022

2022
[42]

Provably efficient reinforcement learning in decentralized general-sum markov games

Weichao Mao and Tamer Ba s ar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, pages 1--22, 2022

2022
[43]

Near-optimal model-free reinforcement learning in non-stationary episodic MDPs

Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs . In International Conference on Machine Learning (ICML), 2021

2021
[44]

On sequential strategies for loss functions with memory

Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48 0 (7): 0 1947--1958, 2002

1947
[45]

Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma

Abraham Neyman. Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma. Economics Letters, 19 0 (3): 0 227--229, 1985

1985
[46]

Markov chains

James R Norris. Markov chains. Number 2. Cambridge university press, 1998

1998
[47]

Online learning of feasible strategies in unknown environments

Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62 0 (6): 0 2807--2822, 2016

2016
[48]

On the interpretation of decision problems with imperfect recall

Michele Piccione and Ariel Rubinstein. On the interpretation of decision problems with imperfect recall. Games and Economic Behavior, 20 0 (1): 0 3--24, 1997

1997
[49]

Markov decision processes: D iscrete stochastic dynamic programming

Martin L Puterman. Markov decision processes: D iscrete stochastic dynamic programming . John Wiley & Sons, 2014

2014
[50]

Can bounded rationality resolve the Prisoner's dilemma

Roy Radner. Can bounded rationality resolve the Prisoner's dilemma. Essays in honor of Gerard Debreu, pages 387--399, 1986

1986
[51]

Algorithmic game theory

Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53 0 (7): 0 78--86, 2010

2010
[52]

On the impossibility of achieving no regrets in repeated games

Karl Schlag and Andriy Zapechelnyuk. On the impossibility of achieving no regrets in repeated games. Journal of Economic Behavior & Organization, 81 0 (1): 0 153--158, 2012

2012
[53]

Stochastic games

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39 0 (10): 0 1095--1100, 1953

1953
[54]

Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer

Samuel Sokota, Ryan D'Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR), 2023

2023
[55]

When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

2022
[56]

Online non-convex learning: Following the perturbed leader is optimal

Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In International Conference on Algorithmic Learning Theory (ALT), 2020

2020
[57]

On subvariants, ie semi-invariants to binary quantics of an unlimited order

James J Sylvester. On subvariants, ie semi-invariants to binary quantics of an unlimited order. American Journal of Mathematics, 5 0 (1): 0 79--136, 1882
[58]

Stability and perfection of Nash equilibria, volume 339

Eric Van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991

1991
[59]

Strategy: an introduction to game theory, volume 139

Joel Watson. Strategy: an introduction to game theory, volume 139. WW Norton New York, 2002

2002
[60]

A practical use of imperfect recall

Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael H Bowling. A practical use of imperfect recall. In SARA, 2009

2009
[61]

Cola: consistent learning with opponent-learning awareness

Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. Cola: consistent learning with opponent-learning awareness. In International Conference on Machine Learning (ICML), pages 23804--23831. PMLR, 2022

2022
[62]

Adaptive online learning in dynamic environments

Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Neural Information Processing Systems (NeurIPS), 2018 a

2018
[63]

Dynamic regret of strongly adaptive methods

Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), pages 5882--5891. PMLR, 2018 b

2018
[64]

Dynamic regret of strongly adaptive methods

Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), 2018 c

2018
[65]

Non-stationary online learning with memory and non-stochastic control

Peng Zhao, Yu-Xiang Wang, and Zhi-Hua Zhou. Non-stationary online learning with memory and non-stochastic control. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022

2022
[66]

Online convex programming and generalized infinitesimal gradient ascent

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003

2003
[67]

Response regret

Martin Zinkevich. Response regret. In AAAI Fall Symposium: Coevolutionary and Coadaptive Systems, page 41, 2005

2005

[1] [1]

Kakade, and Karan Singh

Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019

2019

[2] [2]

Online learning for adversaries with memory: price of past mistakes

Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Neural Information Processing Systems (NeurIPS), 2015

2015

[3] [3]

Online bandit learning against an adaptive adversary: from regret to policy regret

Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pages 1747--1754, 2012

2012

[4] [4]

Policy regret in repeated games

Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games. In Neural Information Processing Systems (NeurIPS), 2018

2018

[5] [5]

Long term competition: A game theoretic analysis', mimeograph

Robert Aumann and Lloyd Shapley. Long term competition: A game theoretic analysis', mimeograph. Hebrew University, 1976

1976

[6] [6]

The evolution of cooperation

Robert Axelrod and William D Hamilton. The evolution of cooperation. science, 211 0 (4489): 0 1390--1396, 1981

1981

[7] [7]

Multiplicative weights update in zero-sum games

James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation (EC), 2018

2018

[8] [8]

Finitely repeated games

Jean-Pierre Benoit and Vijay Krishna. Finitely repeated games. Econometrica, 53 0 (4): 0 905--22, 1985. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:53:y:1985:i:4:p:905-22

1985

[9] [9]

Adaptive regret minimization in bounded-memory games

Jeremiah Blocki, Nicolas Christin, Anupam Datta, and Arunesh Sinha. Adaptive regret minimization in bounded-memory games. In Decision and Game Theory for Security (GameSec), 2013

2013

[10] [10]

The myth of the folk theorem

Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In ACM Symposium on Theory of Computing (STOC), 2008

2008

[11] [11]

Superhuman ai for heads-up no-limit poker: Libratus beats top professionals

Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359 0 (6374): 0 418--424, 2018

2018

[12] [12]

Superhuman ai for multiplayer poker

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365 0 (6456): 0 885--890, 2019

2019

[13] [13]

Single sample path-based optimization of markov chains

Xi-Ren Cao. Single sample path-based optimization of markov chains. Journal of optimization theory and applications, 100: 0 527--548, 1999

1999

[14] [14]

Online convex optimization with time-varying constraints and bandit feedback

Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64 0 (7): 0 2665--2680, 2018

2018

[15] [15]

Fast policy extragradient methods for competitive games with entropy regularization

Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021

2021

[16] [16]

Prediction, Learning, and Games

Nicolo Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

2006

[17] [17]

Multiagent learning in the presence of memory-bounded agents

Doran Chakraborty and Peter Stone. Multiagent learning in the presence of memory-bounded agents. Autonomous agents and multi-agent systems, 28 0 (2): 0 182--213, 2014

2014

[18] [18]

An online convex optimization approach to proactive network resource allocation

Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65 0 (24): 0 6350--6364, 2017

2017

[19] [19]

Computing nash equilibria: Approximation and smoothed complexity

Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In Symposium on Foundations of Computer Science (FOCS), 2006

2006

[20] [20]

The approximation complexity of win-lose games

Xi Chen, Shang-Hua Teng, and Paul Valiant. The approximation complexity of win-lose games. 2007

2007

[21] [21]

Strongly adaptive online learning

Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning (ICML), 2015

2015

[22] [22]

The complexity of computing a nash equilibrium

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. Communications of the ACM, 52 0 (2): 0 89--97, 2009

2009

[23] [23]

The complexity of M arkov equilibrium in stochastic games

Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of M arkov equilibrium in stochastic games. In Conference on Learning Theory (COLT), 2023

2023

[24] [24]

How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

Daniela de Farias and Nimrod Megiddo. How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003

2003

[25] [25]

Competitive M arkov decision processes

Jerzy Filar and Koos Vrieze. Competitive M arkov decision processes . Springer Science & Business Media, 2012

2012

[26] [26]

Learning with opponent-learning awareness

Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017

Pith/arXiv arXiv 2017

[27] [27]

The folk theorem in repeated games with discounting or with incomplete information

Drew Fudenberg and Eric Maskin. The folk theorem in repeated games with discounting or with incomplete information. In A long-run collaboration on long-run games, pages 209--230. World Scientific, 2009

2009

[28] [28]

Stochastic games with zero stop probabilities

Dean Gillette. Stochastic games with zero stop probabilities. Contributions to the Theory of Games, 3 0 (39): 0 179--187, 1957

1957

[29] [29]

Approximation to bayes risk in repeated play

James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3: 0 97--139, 1957

1957

[30] [30]

A simple adaptive procedure leading to correlated equilibrium

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68 0 (5): 0 1127--1150, 2000

2000

[31] [31]

Logarithmic regret algorithms for online convex optimization

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69 0 (2): 0 169--192, 2007

2007

[32] [32]

Introduction to online convex optimization

Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016

2016

[33] [33]

Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018

2018

[34] [34]

V-learning -- A simple, efficient, decentralized algorithm for multiagent RL

Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning -- A simple, efficient, decentralized algorithm for multiagent RL . arXiv preprint arXiv:2110.14555, 2021

arXiv 2021

[35] [35]

Finite rationality and interpersonal complexity in repeated games

Ehud Kalai and William Stanford. Finite rationality and interpersonal complexity in repeated games. Econometrica: Journal of the Econometric Society, pages 397--410, 1988

1988

[36] [36]

A policy gradient algorithm for learning to learn in multiagent reinforcement learning

Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning (ICML), 2021

2021

[37] [37]

Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson

Alistair Letcher, Jakob N. Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations (ICLR), 2019

2019

[38] [38]

A polynomial-time nash equilibrium algorithm for repeated games

Michael L Littman and Peter Stone. A polynomial-time nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, 2003

2003

[39] [39]

The power of regularization in solving extensive-form games

Mingyang Liu, Asuman E Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. In International Conference on Learning Representations (ICLR), 2022

2022

[40] [40]

On the impossibility of learning to cooperate with adaptive partner strategies in repeated games

Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning (ICML), pages 14197--14209. PMLR, 2022

2022

[41] [41]

Model-free opponent shaping

Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning (ICML), pages 14398--14411. PMLR, 2022

2022

[42] [42]

Provably efficient reinforcement learning in decentralized general-sum markov games

Weichao Mao and Tamer Ba s ar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, pages 1--22, 2022

2022

[43] [43]

Near-optimal model-free reinforcement learning in non-stationary episodic MDPs

Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs . In International Conference on Machine Learning (ICML), 2021

2021

[44] [44]

On sequential strategies for loss functions with memory

Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48 0 (7): 0 1947--1958, 2002

1947

[45] [45]

Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma

Abraham Neyman. Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma. Economics Letters, 19 0 (3): 0 227--229, 1985

1985

[46] [46]

Markov chains

James R Norris. Markov chains. Number 2. Cambridge university press, 1998

1998

[47] [47]

Online learning of feasible strategies in unknown environments

Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62 0 (6): 0 2807--2822, 2016

2016

[48] [48]

On the interpretation of decision problems with imperfect recall

Michele Piccione and Ariel Rubinstein. On the interpretation of decision problems with imperfect recall. Games and Economic Behavior, 20 0 (1): 0 3--24, 1997

1997

[49] [49]

Markov decision processes: D iscrete stochastic dynamic programming

Martin L Puterman. Markov decision processes: D iscrete stochastic dynamic programming . John Wiley & Sons, 2014

2014

[50] [50]

Can bounded rationality resolve the Prisoner's dilemma

Roy Radner. Can bounded rationality resolve the Prisoner's dilemma. Essays in honor of Gerard Debreu, pages 387--399, 1986

1986

[51] [51]

Algorithmic game theory

Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53 0 (7): 0 78--86, 2010

2010

[52] [52]

On the impossibility of achieving no regrets in repeated games

Karl Schlag and Andriy Zapechelnyuk. On the impossibility of achieving no regrets in repeated games. Journal of Economic Behavior & Organization, 81 0 (1): 0 153--158, 2012

2012

[53] [53]

Stochastic games

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39 0 (10): 0 1095--1100, 1953

1953

[54] [54]

Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer

Samuel Sokota, Ryan D'Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR), 2023

2023

[55] [55]

When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022

2022

[56] [56]

Online non-convex learning: Following the perturbed leader is optimal

Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In International Conference on Algorithmic Learning Theory (ALT), 2020

2020

[57] [57]

On subvariants, ie semi-invariants to binary quantics of an unlimited order

James J Sylvester. On subvariants, ie semi-invariants to binary quantics of an unlimited order. American Journal of Mathematics, 5 0 (1): 0 79--136, 1882

[58] [58]

Stability and perfection of Nash equilibria, volume 339

Eric Van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991

1991

[59] [59]

Strategy: an introduction to game theory, volume 139

Joel Watson. Strategy: an introduction to game theory, volume 139. WW Norton New York, 2002

2002

[60] [60]

A practical use of imperfect recall

Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael H Bowling. A practical use of imperfect recall. In SARA, 2009

2009

[61] [61]

Cola: consistent learning with opponent-learning awareness

Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. Cola: consistent learning with opponent-learning awareness. In International Conference on Machine Learning (ICML), pages 23804--23831. PMLR, 2022

2022

[62] [62]

Adaptive online learning in dynamic environments

Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Neural Information Processing Systems (NeurIPS), 2018 a

2018

[63] [63]

Dynamic regret of strongly adaptive methods

Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), pages 5882--5891. PMLR, 2018 b

2018

[64] [64]

Dynamic regret of strongly adaptive methods

Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), 2018 c

2018

[65] [65]

Non-stationary online learning with memory and non-stochastic control

Peng Zhao, Yu-Xiang Wang, and Zhi-Hua Zhou. Non-stationary online learning with memory and non-stochastic control. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022

2022

[66] [66]

Online convex programming and generalized infinitesimal gradient ascent

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003

2003

[67] [67]

Response regret

Martin Zinkevich. Response regret. In AAAI Fall Symposium: Coevolutionary and Coadaptive Systems, page 41, 2005

2005