Regret Minimization with Adaptive Opponents in Repeated Games
Pith reviewed 2026-06-28 01:59 UTC · model grok-4.3
The pith
Minimizing Repeated Policy Regret in games with adaptive opponents enables learning certain subgame perfect equilibria.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RP-Regret is defined as the difference between a player's realized accumulated payoff and the payoff of the best policy that could have responded to the full history of play. Necessary conditions require bounded variation in the comparator sequence and finite memory for both the comparator and the opponents' strategies. Under these conditions, an optimization-oracle algorithm, a linearized convex surrogate, and a direct method for slow opponent drift each achieve sublinear RP-Regret. When all players simultaneously minimize RP-Regret or its linearization, the joint play converges to subgame perfect equilibria of the repeated game.
What carries the argument
Repeated Policy Regret (RP-Regret), the gap between realized cumulative utility and the utility of the best history-responsive policy in hindsight.
If this is right
- Sublinear RP-Regret is attainable only when comparator variation and memory lengths satisfy explicit bounds.
- A convex linearization of RP-Regret can be minimized by standard online convex optimization routines.
- Collective minimization of RP-Regret or its linearization produces subgame perfect equilibria.
- In matrix games such as Stag-Hunt, RP-Regret minimization yields strictly higher average payoffs than external-regret minimization.
Where Pith is reading between the lines
- The same regret notion could be applied to multi-agent reinforcement learning with recurrent policies.
- Equilibrium selection properties might extend to stochastic games with imperfect monitoring.
- Computational cost of the optimization oracle version may limit use to small action spaces.
Load-bearing premise
The comparator strategies must vary slowly and both comparators and opponents must have finite memory for sublinear RP-Regret to be possible.
What would settle it
A repeated game in which all players run an RP-Regret minimizer yet the observed play fails to satisfy the one-shot deviation principle for any subgame perfect equilibrium.
Figures
read the original abstract
In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents who can respond based on histories of play. The standard metric of \emph{external regret} in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the \emph{realized} and the \emph{best-in-hindsight} accumulated utility when all players can \emph{respond} to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize {\tt RP-Regret}, which is by definition \emph{non-convex} in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and \emph{linearized} surrogate of {\tt RP-Regret} at each iteration; (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Repeated Policy Regret (RP-Regret), a regret metric for repeated games that accounts for adaptive, history-dependent opponents by comparing realized utility to the best-in-hindsight utility under counterfactual responses. It states necessary conditions on bounded variation of comparator strategies and bounded memory of comparators and opponents for sublinear RP-Regret. Three algorithms are proposed to minimize RP-Regret (or a linearized convex surrogate): an optimization-oracle method, a surrogate minimization method, and a method for slowly changing opponents. The central claim is that when all players minimize RP-Regret or its surrogate, certain subgame perfect equilibria of the repeated game become learnable. Experiments on games such as Stag-Hunt are reported to show higher cooperative utilities.
Significance. If the conditions for sublinear regret are compatible with the claimed equilibria and the algorithms achieve the stated guarantees, the work would meaningfully extend online learning to adaptive repeated-game settings by enabling stronger comparators than prior regret notions while still supporting equilibrium learning. The native formulation for repeated play and the empirical demonstration of cooperative outcomes are positive aspects.
major comments (2)
- [Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.
- [Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.
minor comments (1)
- [Abstract] Abstract: the experimental section is mentioned but no details on game instances, baselines, or quantitative results are provided, making it impossible to assess the strength of the empirical support.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We respond to each major comment below and indicate where revisions will be made to clarify the claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'certain subgame perfect equilibria' can be learned when all players minimize RP-Regret (or its linearized variant) requires that those equilibria satisfy the necessary conditions on bounded variation of comparator strategies and bounded memory. History-dependent SPE strategies can induce unbounded variation in the best-in-hindsight comparator; the manuscript must explicitly identify which SPE satisfy the variation/memory bounds or show that the claimed equilibria are restricted to those that do.
Authors: The manuscript derives necessary conditions (bounded variation of comparator strategies and bounded memory of comparators/opponents) for sublinear RP-Regret in the main text. The abstract's use of 'certain' SPE already restricts the claim to those equilibria satisfying these conditions; any history-dependent SPE inducing unbounded variation would violate the necessary conditions and is therefore excluded from the claim. We will revise the abstract to explicitly note this restriction to SPE satisfying the stated bounds. revision: yes
-
Referee: [Abstract] Abstract: the three proposed algorithms address non-convexity of RP-Regret, but the abstract provides no derivation or proof sketch that the optimization-oracle algorithm or the linearized surrogate yields sublinear RP-Regret under the stated variation conditions. A load-bearing step is missing: how the surrogate minimization translates back to sublinear RP-Regret bounds.
Authors: The abstract is a concise summary of results whose proofs and derivations appear in the body of the manuscript, including the analysis showing that the optimization-oracle method and the linearized surrogate yield sublinear RP-Regret under the variation/memory conditions. The connection from surrogate minimization to the original RP-Regret bound is established via the regret decomposition and bounding arguments in the main text. No revision to the abstract is needed, as proof sketches are not standard in abstracts. revision: no
Circularity Check
Derivation chain is self-contained with no circular reductions
full rationale
The paper introduces the novel RP-Regret metric to capture adaptive opponents and counterfactual reasoning in repeated games. It states necessary conditions on comparator variation and memory bounds for sublinear regret, then develops three independent algorithms (optimization oracle, linearized convex surrogate, and direct minimization under slow opponent changes) to minimize the non-convex RP-Regret. Finally it shows that mutual minimization yields certain subgame-perfect equilibria. None of these steps reduce by construction to self-definitional equivalences, renamed fitted inputs, or load-bearing self-citations; the central claims rest on direct analysis of the new metric and its algorithmic properties without circular reduction to prior fitted quantities or author-specific ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard assumptions from online learning and repeated game theory (e.g., finite action spaces, utility functions)
Reference graph
Works this paper leans on
-
[1]
Kakade, and Karan Singh
Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning (ICML), 2019
2019
-
[2]
Online learning for adversaries with memory: price of past mistakes
Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Neural Information Processing Systems (NeurIPS), 2015
2015
-
[3]
Online bandit learning against an adaptive adversary: from regret to policy regret
Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pages 1747--1754, 2012
2012
-
[4]
Policy regret in repeated games
Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games. In Neural Information Processing Systems (NeurIPS), 2018
2018
-
[5]
Long term competition: A game theoretic analysis', mimeograph
Robert Aumann and Lloyd Shapley. Long term competition: A game theoretic analysis', mimeograph. Hebrew University, 1976
1976
-
[6]
The evolution of cooperation
Robert Axelrod and William D Hamilton. The evolution of cooperation. science, 211 0 (4489): 0 1390--1396, 1981
1981
-
[7]
Multiplicative weights update in zero-sum games
James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation (EC), 2018
2018
-
[8]
Finitely repeated games
Jean-Pierre Benoit and Vijay Krishna. Finitely repeated games. Econometrica, 53 0 (4): 0 905--22, 1985. URL https://EconPapers.repec.org/RePEc:ecm:emetrp:v:53:y:1985:i:4:p:905-22
1985
-
[9]
Adaptive regret minimization in bounded-memory games
Jeremiah Blocki, Nicolas Christin, Anupam Datta, and Arunesh Sinha. Adaptive regret minimization in bounded-memory games. In Decision and Game Theory for Security (GameSec), 2013
2013
-
[10]
The myth of the folk theorem
Christian Borgs, Jennifer Chayes, Nicole Immorlica, Adam Tauman Kalai, Vahab Mirrokni, and Christos Papadimitriou. The myth of the folk theorem. In ACM Symposium on Theory of Computing (STOC), 2008
2008
-
[11]
Superhuman ai for heads-up no-limit poker: Libratus beats top professionals
Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359 0 (6374): 0 418--424, 2018
2018
-
[12]
Superhuman ai for multiplayer poker
Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365 0 (6456): 0 885--890, 2019
2019
-
[13]
Single sample path-based optimization of markov chains
Xi-Ren Cao. Single sample path-based optimization of markov chains. Journal of optimization theory and applications, 100: 0 527--548, 1999
1999
-
[14]
Online convex optimization with time-varying constraints and bandit feedback
Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64 0 (7): 0 2665--2680, 2018
2018
-
[15]
Fast policy extragradient methods for competitive games with entropy regularization
Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021
2021
-
[16]
Prediction, Learning, and Games
Nicolo Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006
2006
-
[17]
Multiagent learning in the presence of memory-bounded agents
Doran Chakraborty and Peter Stone. Multiagent learning in the presence of memory-bounded agents. Autonomous agents and multi-agent systems, 28 0 (2): 0 182--213, 2014
2014
-
[18]
An online convex optimization approach to proactive network resource allocation
Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65 0 (24): 0 6350--6364, 2017
2017
-
[19]
Computing nash equilibria: Approximation and smoothed complexity
Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Computing nash equilibria: Approximation and smoothed complexity. In Symposium on Foundations of Computer Science (FOCS), 2006
2006
-
[20]
The approximation complexity of win-lose games
Xi Chen, Shang-Hua Teng, and Paul Valiant. The approximation complexity of win-lose games. 2007
2007
-
[21]
Strongly adaptive online learning
Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning (ICML), 2015
2015
-
[22]
The complexity of computing a nash equilibrium
Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. Communications of the ACM, 52 0 (2): 0 89--97, 2009
2009
-
[23]
The complexity of M arkov equilibrium in stochastic games
Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of M arkov equilibrium in stochastic games. In Conference on Learning Theory (COLT), 2023
2023
-
[24]
How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003
Daniela de Farias and Nimrod Megiddo. How to combine expert (and novice) advice when actions impact the environment? In Neural Information Processing Systems (NeurIPS), volume 16, 2003
2003
-
[25]
Competitive M arkov decision processes
Jerzy Filar and Koos Vrieze. Competitive M arkov decision processes . Springer Science & Business Media, 2012
2012
-
[26]
Learning with opponent-learning awareness
Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017
Pith/arXiv arXiv 2017
-
[27]
The folk theorem in repeated games with discounting or with incomplete information
Drew Fudenberg and Eric Maskin. The folk theorem in repeated games with discounting or with incomplete information. In A long-run collaboration on long-run games, pages 209--230. World Scientific, 2009
2009
-
[28]
Stochastic games with zero stop probabilities
Dean Gillette. Stochastic games with zero stop probabilities. Contributions to the Theory of Games, 3 0 (39): 0 179--187, 1957
1957
-
[29]
Approximation to bayes risk in repeated play
James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3: 0 97--139, 1957
1957
-
[30]
A simple adaptive procedure leading to correlated equilibrium
Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68 0 (5): 0 1127--1150, 2000
2000
-
[31]
Logarithmic regret algorithms for online convex optimization
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69 0 (2): 0 169--192, 2007
2007
-
[32]
Introduction to online convex optimization
Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization , 2 0 (3-4): 0 157--325, 2016
2016
-
[33]
Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Neural Information Processing Systems (NeurIPS), 2018
2018
-
[34]
V-learning -- A simple, efficient, decentralized algorithm for multiagent RL
Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning -- A simple, efficient, decentralized algorithm for multiagent RL . arXiv preprint arXiv:2110.14555, 2021
arXiv 2021
-
[35]
Finite rationality and interpersonal complexity in repeated games
Ehud Kalai and William Stanford. Finite rationality and interpersonal complexity in repeated games. Econometrica: Journal of the Econometric Society, pages 397--410, 1988
1988
-
[36]
A policy gradient algorithm for learning to learn in multiagent reinforcement learning
Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning (ICML), 2021
2021
-
[37]
Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson
Alistair Letcher, Jakob N. Foerster, David Balduzzi, Tim Rockt \" a schel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations (ICLR), 2019
2019
-
[38]
A polynomial-time nash equilibrium algorithm for repeated games
Michael L Littman and Peter Stone. A polynomial-time nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, 2003
2003
-
[39]
The power of regularization in solving extensive-form games
Mingyang Liu, Asuman E Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. In International Conference on Learning Representations (ICLR), 2022
2022
-
[40]
On the impossibility of learning to cooperate with adaptive partner strategies in repeated games
Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning (ICML), pages 14197--14209. PMLR, 2022
2022
-
[41]
Model-free opponent shaping
Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning (ICML), pages 14398--14411. PMLR, 2022
2022
-
[42]
Provably efficient reinforcement learning in decentralized general-sum markov games
Weichao Mao and Tamer Ba s ar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, pages 1--22, 2022
2022
-
[43]
Near-optimal model-free reinforcement learning in non-stationary episodic MDPs
Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs . In International Conference on Machine Learning (ICML), 2021
2021
-
[44]
On sequential strategies for loss functions with memory
Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48 0 (7): 0 1947--1958, 2002
1947
-
[45]
Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma
Abraham Neyman. Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma. Economics Letters, 19 0 (3): 0 227--229, 1985
1985
-
[46]
Markov chains
James R Norris. Markov chains. Number 2. Cambridge university press, 1998
1998
-
[47]
Online learning of feasible strategies in unknown environments
Santiago Paternain and Alejandro Ribeiro. Online learning of feasible strategies in unknown environments. IEEE Transactions on Automatic Control, 62 0 (6): 0 2807--2822, 2016
2016
-
[48]
On the interpretation of decision problems with imperfect recall
Michele Piccione and Ariel Rubinstein. On the interpretation of decision problems with imperfect recall. Games and Economic Behavior, 20 0 (1): 0 3--24, 1997
1997
-
[49]
Markov decision processes: D iscrete stochastic dynamic programming
Martin L Puterman. Markov decision processes: D iscrete stochastic dynamic programming . John Wiley & Sons, 2014
2014
-
[50]
Can bounded rationality resolve the Prisoner's dilemma
Roy Radner. Can bounded rationality resolve the Prisoner's dilemma. Essays in honor of Gerard Debreu, pages 387--399, 1986
1986
-
[51]
Algorithmic game theory
Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53 0 (7): 0 78--86, 2010
2010
-
[52]
On the impossibility of achieving no regrets in repeated games
Karl Schlag and Andriy Zapechelnyuk. On the impossibility of achieving no regrets in repeated games. Journal of Economic Behavior & Organization, 81 0 (1): 0 153--158, 2012
2012
-
[53]
Stochastic games
Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39 0 (10): 0 1095--1100, 1953
1953
-
[54]
Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer
Samuel Sokota, Ryan D'Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR), 2023
2023
-
[55]
When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022
Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations (ICLR), 2022
2022
-
[56]
Online non-convex learning: Following the perturbed leader is optimal
Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In International Conference on Algorithmic Learning Theory (ALT), 2020
2020
-
[57]
On subvariants, ie semi-invariants to binary quantics of an unlimited order
James J Sylvester. On subvariants, ie semi-invariants to binary quantics of an unlimited order. American Journal of Mathematics, 5 0 (1): 0 79--136, 1882
-
[58]
Stability and perfection of Nash equilibria, volume 339
Eric Van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991
1991
-
[59]
Strategy: an introduction to game theory, volume 139
Joel Watson. Strategy: an introduction to game theory, volume 139. WW Norton New York, 2002
2002
-
[60]
A practical use of imperfect recall
Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael H Bowling. A practical use of imperfect recall. In SARA, 2009
2009
-
[61]
Cola: consistent learning with opponent-learning awareness
Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. Cola: consistent learning with opponent-learning awareness. In International Conference on Machine Learning (ICML), pages 23804--23831. PMLR, 2022
2022
-
[62]
Adaptive online learning in dynamic environments
Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Neural Information Processing Systems (NeurIPS), 2018 a
2018
-
[63]
Dynamic regret of strongly adaptive methods
Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), pages 5882--5891. PMLR, 2018 b
2018
-
[64]
Dynamic regret of strongly adaptive methods
Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), 2018 c
2018
-
[65]
Non-stationary online learning with memory and non-stochastic control
Peng Zhao, Yu-Xiang Wang, and Zhi-Hua Zhou. Non-stationary online learning with memory and non-stochastic control. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022
2022
-
[66]
Online convex programming and generalized infinitesimal gradient ascent
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003
2003
-
[67]
Response regret
Martin Zinkevich. Response regret. In AAAI Fall Symposium: Coevolutionary and Coadaptive Systems, page 41, 2005
2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.