EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

Albert Wang; Chase McDonald; Eugene Vinitsky; JB Lanier; Nathan Tsang; Roy Fox; Tristan Maidment; Wesley N. Kerr

arXiv: 2606.23995 · v1 · pith:URTNIFOBnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · cs.GT · cs.MA

Tristan Maidment, JB Lanier, Chase McDonald, Nathan Tsang, Eugene Vinitsky, Roy Fox, Albert Wang, Wesley N. Kerr

Pith reviewed 2026-06-26 08:30 UTC · model grok-4.3

keywords: self-play, policy gradient, regularization, exponential moving average, exploitability, zero-sum games, PPO, imperfect information

The pith

Replacing uniform regularization with an EMA of policy parameters reduces exploitability in policy-gradient self-play.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EMAgnet to address a limitation of uniform regularization in self-play methods for two-player zero-sum games: a uniform target pulls equally toward all actions, viable or not. EMAgnet instead uses an exponential moving average of the agent's own recent policy parameters as the target. In the reported experiments, this adaptive target achieves lower exploitability than standard uniform-magnet PPO self-play in most tested settings, and the gains are most reliable in games that include many strictly dominated strategies. This matters because it suggests a straightforward way to improve existing self-play algorithms without changing the annealing schedules and with no new hyperparameter beyond the EMA decay rate.

Core claim

EMAgnet regularizes toward an exponential moving average of the last-iterate policy parameters rather than a uniform distribution, providing an adaptive target that changes as the strategy improves. When evaluated against PPO self-play with uniform-magnet regularization under linear and power-law annealing, it produces lower exploitability in the majority of environments, including those with exploration challenges, and delivers consistent gains whenever strictly dominated strategies are present.
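A small numerical sketch can make the contrast between the two targets concrete. It compares the KL penalty paid by a policy that has nearly stopped playing a dominated action, first against a uniform target and then against a target that has itself drifted away from that action. The four-action game, all probabilities, the name `kl_divergence`, and the use of KL(policy || target) as the penalty are illustrative assumptions; the paper regularizes in parameter space and its exact loss is not given in this text.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical 4-action game: rock, paper, scissors, forfeit (strictly dominated).
policy = [0.33, 0.33, 0.33, 0.01]            # current policy, almost never forfeits
uniform_magnet = [0.25, 0.25, 0.25, 0.25]    # static uniform target
adaptive_magnet = [0.32, 0.32, 0.32, 0.04]   # assumed stand-in for the policy induced by EMA parameters

penalty_uniform = kl_divergence(policy, uniform_magnet)
penalty_adaptive = kl_divergence(policy, adaptive_magnet)

print(f"KL to uniform magnet:  {penalty_uniform:.4f}")
print(f"KL to adaptive magnet: {penalty_adaptive:.4f}")
```

Under these made-up numbers the uniform target charges the larger penalty, which is the pull toward dominated actions that the core claim describes.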

What carries the argument

Exponential moving average of policy parameters as the regularization target in parameter space.
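As a minimal sketch of what a parameter-space EMA looks like, the snippet below keeps a slow-moving copy of a parameter vector. The function name `ema_update`, the flat-list representation of parameters, and the decay value 0.9 are assumptions for illustration; the paper's actual update rule and decay are not stated in this text.

```python
def ema_update(magnet_params, policy_params, decay):
    """Return decay * magnet + (1 - decay) * policy, elementwise."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * m + (1.0 - decay) * p
            for m, p in zip(magnet_params, policy_params)]

# Toy run: the policy parameters sit at a new value and the magnet follows slowly.
magnet = [0.0, 0.0]
policy = [1.0, -2.0]   # hypothetical last-iterate parameters, held fixed here
for step in range(3):
    magnet = ema_update(magnet, policy, decay=0.9)
    print(step, magnet)
```

With a fixed policy the magnet closes a fraction `1 - decay` of the remaining gap at every step, so a decay near 1 gives a target that lags the last iterate by many updates.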

If this is right

Self-play algorithms can benefit from adaptive rather than fixed regularization targets.
Games with strictly dominated strategies see reliable improvements from this change.
The method integrates directly into PPO without altering the core self-play loop or annealing.
Exploitability reductions hold across both standard benchmarks and modified ones with added exploration challenges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other policy optimization methods beyond PPO.
Further gains could come from combining EMA targets with learned or game-specific regularization.
Testing on three-player or non-zero-sum settings would reveal if the benefit is specific to two-player zero-sum dynamics.
Parameter-space EMA might interact differently with function approximation in very large state spaces.

Load-bearing premise

An exponential moving average of policy parameters supplies a meaningfully better adaptive regularization target than a static uniform distribution when self-play dynamics and annealing schedules remain unchanged.

What would settle it

A replication study on the same benchmarks that finds uniform regularization achieving equal or lower exploitability in most environments would falsify the performance advantage.
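For readers who want to see what the metric in such a replication measures, here is a sketch of exploitability in a small two-player zero-sum matrix game. The function name `exploitability`, the convention of averaging the two players' best-response gains, and the rock-paper-scissors payoff matrix are assumptions chosen for illustration; the paper's benchmarks are larger imperfect-information games and may use a different normalization.

```python
def exploitability(payoff, row_policy, col_policy):
    """Average best-response gain in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives its negative.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Row player's expected payoff for each pure action against col_policy.
    row_values = [sum(payoff[i][j] * col_policy[j] for j in range(n_cols))
                  for i in range(n_rows)]
    # Column player's expected payoff for each pure action against row_policy.
    col_values = [-sum(payoff[i][j] * row_policy[i] for i in range(n_rows))
                  for j in range(n_cols)]
    # The players' current values cancel in a zero-sum game, so the sum of
    # best-response values is the total gain available from deviating.
    return (max(row_values) + max(col_values)) / 2.0

rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(exploitability(rps, uniform, uniform))      # equilibrium profile
print(exploitability(rps, always_rock, uniform))  # row player is exploitable
```

A profile is a Nash equilibrium exactly when this quantity is zero, which is why lower values are read as closer to equilibrium.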

Figures

Figures reproduced from arXiv: 2606.23995 by Albert Wang, Chase McDonald, Eugene Vinitsky, JB Lanier, Nathan Tsang, Roy Fox, Tristan Maidment, Wesley N. Kerr.

**Figure 1.** Self-play policy trajectories in Control Biased RPS [12], where agents must solve gridworld navigation tasks to execute each RPS action or else forfeit. (a,b) Regularizing toward uniform forces the policy to use strictly dominated strategies that fail navigation and forfeit. By the time annealed regularization is weak enough to avoid forfeiting, the policy fails to explore and find the Nash equilibrium (g…

**Figure 2.** Exploitability over environment steps for each game variant. Top row (a–c): standard games. Middle row (d–f): FF variants with a strictly dominated forfeit action added. Bottom row (g–i): control variants where most strategies are dominated. Best hyperparameter configuration per method (selected via Bayesian sweep), mean across 24 seeds with standard error bands. PPO-EMAg’s last-iterate and magnet policie…
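The Figure 2 caption reports a mean across seeds with standard error bands. A minimal sketch of that summary statistic is shown below; the function name `mean_and_standard_error` and the four exploitability values are made up for illustration and are not taken from the paper.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (sample stdev / sqrt(n))."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Made-up final exploitability for four seeds (illustrative only).
final_exploitability = [0.10, 0.12, 0.08, 0.10]
mean, sem = mean_and_standard_error(final_exploitability)
print(f"mean = {mean:.4f}, standard error = {sem:.4f}")
```

A band drawn at mean ± standard error narrows as the number of seeds grows, so the seed count matters when reading how separated two curves are.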

Original abstract

Recent work has established that regularized policy gradient methods such as PPO, when used in self-play, can match or exceed specialized game-theoretic algorithms for solving two-player zero-sum imperfect-information games. The uniform distribution has emerged as a strong policy regularization target for this purpose, but it regularizes equally toward all actions regardless of their viability. We introduce EMAgnet, which instead regularizes toward an exponential moving average (EMA) of the last-iterate policy's parameters, providing an adaptive regularization target that evolves with the agent's improving strategy. We evaluate EMAgnet on both standard two-player zero-sum benchmarks and modified benchmarks with exploration challenges and large numbers of strictly dominated strategies. Relative to PPO self-play with uniform-magnet regularization under both linear and power-law annealing schedules, EMAgnet achieves lower exploitability in the majority of tested environments, with consistent performance gains across games containing strictly dominated strategies.
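The abstract names linear and power-law annealing of the regularization but this text does not give their functional forms. The sketch below shows one plausible shape for each, purely as an assumption: `linear_anneal` interpolates between a start and end coefficient over a fixed horizon, and `power_law_anneal` decays as `start / (1 + step) ** exponent`. All names, parameters, and values are hypothetical.

```python
def linear_anneal(step, total_steps, start, end):
    """Linearly interpolate the coefficient from start to end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def power_law_anneal(step, start, exponent):
    """Decay the coefficient as start / (1 + step) ** exponent (assumed form)."""
    return start / (1.0 + step) ** exponent

# Compare the two assumed schedules at a few steps.
for step in (0, 3, 50, 100):
    lin = linear_anneal(step, total_steps=100, start=1.0, end=0.0)
    pw = power_law_anneal(step, start=1.0, exponent=0.5)
    print(step, round(lin, 4), round(pw, 4))
```

Under these assumed forms the linear schedule reaches its end value at a fixed step, while the power-law schedule keeps a small nonzero coefficient indefinitely.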

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EMAgnet replaces the uniform regularization target with a parameter-space EMA in self-play PPO and reports lower exploitability on most tested games, especially those with dominated strategies.

The main takeaway is that this paper swaps the uniform distribution for an exponential moving average of the policy parameters as the regularization target inside PPO self-play. The rest of the pipeline stays the same, including the linear and power-law annealing schedules, so the comparison stays direct.

What is new is the specific choice of an adaptive, parameter-space EMA rather than a fixed uniform magnet. The evaluation adds modified benchmarks that contain large numbers of strictly dominated strategies and other exploration challenges. The reported outcome is lower exploitability for EMAgnet in the majority of environments, with the gains appearing consistent in the dominated-strategy cases.

The work is incremental within the existing regularized self-play literature. It does not change the overall algorithm or add new theory, but it isolates one design choice and tests it on relevant variants of the standard benchmarks. That focused comparison is the useful part.

The soft spots are the usual ones for an empirical RL paper at this stage: the abstract gives no numbers, error bars, or run counts, so the size and reliability of the improvement are still unclear. If the full paper supplies those details and the gains hold under the same hyperparameter budgets, the result is a modest but practical engineering note. If the advantage shrinks once variance is shown or requires extra tuning of the EMA decay, the contribution narrows further.

This is aimed at people already running self-play PPO on large imperfect-information games who want to try a small change to the regularization target. A reader in that niche can get value from the idea and the benchmark variants. It is not broad enough for a general RL audience.

It deserves peer review. The claim is narrow and testable, the setting matches current work, and referees can check whether the adaptive target actually delivers measurable gains once the full tables and methods are available.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EMAgnet, a regularization technique for PPO self-play in two-player zero-sum imperfect-information games that replaces the uniform distribution with an exponential moving average of the policy parameters as the regularization target. It reports that EMAgnet attains lower exploitability than uniform-magnet PPO under both linear and power-law annealing schedules in the majority of tested environments, with consistent gains on modified benchmarks containing strictly dominated strategies.

Significance. If the empirical comparisons are robust, the work demonstrates that a simple adaptive parameter-space target can outperform a static uniform regularizer in self-play without introducing new hyperparameters beyond the EMA decay rate, offering a practical enhancement to existing policy-gradient methods for large games.

major comments (2)

[Experiments] Experiments section: the central claim that EMAgnet wins in the majority of environments (including those with strictly dominated strategies) is load-bearing, yet the supplied text provides no quantitative exploitability tables, error bars, number of independent runs, or statistical tests; without these the majority claim cannot be verified.
[Methods] Methods section: the construction of the modified benchmarks with exploration challenges and strictly dominated strategies is not described in sufficient detail to allow reproduction or to assess whether the performance gains are attributable to the EMA target rather than benchmark-specific artifacts.

minor comments (2)

[Abstract] Abstract: the phrase 'majority of tested environments' would be strengthened by stating the total number of environments and the exact count on which EMAgnet improved.
[Method] Notation: the EMA update rule for policy parameters should be written explicitly with the decay factor to distinguish it from the uniform-magnet baseline.
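One way the rule requested in the notation comment could be written is sketched here. The symbols are assumptions, not the authors' notation: $\theta_t$ denotes the last-iterate policy parameters after update $t$, $\bar{\theta}_t$ the magnet parameters, and $\beta \in [0, 1)$ the decay factor.

```latex
% EMA magnet in parameter space (assumed notation)
\bar{\theta}_{t+1} = \beta\,\bar{\theta}_t + (1-\beta)\,\theta_{t+1},
\qquad \bar{\theta}_0 = \theta_0 .

% Unrolled form: exponentially decaying weights on past iterates
\bar{\theta}_t = \beta^{t}\,\theta_0 + (1-\beta)\sum_{k=1}^{t} \beta^{\,t-k}\,\theta_k .

% Regularization targets being compared
\rho_{\mathrm{uniform}}(a \mid s) = \frac{1}{|\mathcal{A}(s)|},
\qquad
\rho_{\mathrm{EMA},t}(a \mid s) = \pi_{\bar{\theta}_t}(a \mid s).
```

Written this way, the uniform baseline is a target that never depends on $t$, while the EMA target changes whenever the policy parameters do.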

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the experimental claims require stronger quantitative support and that the benchmark modifications need clearer documentation for reproducibility. We will revise the manuscript accordingly.

Referee: [Experiments] Experiments section: the central claim that EMAgnet wins in the majority of environments (including those with strictly dominated strategies) is load-bearing, yet the supplied text provides no quantitative exploitability tables, error bars, number of independent runs, or statistical tests; without these the majority claim cannot be verified.

Authors: We agree that the current presentation lacks the necessary quantitative details. In the revised version we will add full exploitability tables reporting mean values and standard deviations across independent runs (minimum 5 seeds per environment), error bars on all plots, and statistical tests (e.g., paired t-tests or Wilcoxon tests) comparing EMAgnet against uniform-magnet PPO under both annealing schedules. These additions will directly substantiate the majority-win claim. revision: yes

Referee: [Methods] Methods section: the construction of the modified benchmarks with exploration challenges and strictly dominated strategies is not described in sufficient detail to allow reproduction or to assess whether the performance gains are attributable to the EMA target rather than benchmark-specific artifacts.

Authors: We acknowledge the need for greater detail. The revised Methods section will explicitly describe the benchmark modifications: for each game we will list the added dominated actions, their payoff structure, and the precise mechanism used to create exploration challenges (e.g., reward scaling or action masking). This will enable reproduction and allow readers to evaluate whether observed gains stem from the adaptive EMA target. revision: yes
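The paired comparison promised in the first response can be illustrated with a short sketch. It uses an exact sign-flip permutation test on paired per-seed differences, a standard-library stand-in for the paired t-test or Wilcoxon test named in the response, not the authors' procedure. The function name `paired_sign_flip_p_value` and the per-seed values are hypothetical.

```python
from itertools import product

def paired_sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation test on paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total

# Made-up per-seed final exploitability (illustrative only, five seeds).
uniform_runs = [0.30, 0.28, 0.35, 0.31, 0.29]
ema_runs = [0.20, 0.22, 0.25, 0.19, 0.21]

p_value = paired_sign_flip_p_value(ema_runs, uniform_runs)
print(f"two-sided p-value: {p_value:.4f}")
```

With five seeds this test enumerates 32 sign patterns, so the smallest two-sided p-value it can return is 2/32. That is one reason the number of seeds per environment bears on how strong a significance claim can be.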

Circularity Check

0 steps flagged

No significant circularity

The paper introduces EMAgnet as an empirical regularization technique in PPO self-play and reports comparative exploitability results across environments. No derivation chain, equations, or first-principles claims are present; the method is defined by a design choice (an EMA of policy parameters as the regularization target) whose performance is tested directly against uniform regularization baselines under matched annealing schedules. The central claim is therefore an experimental outcome rather than a reduction of any prediction to fitted inputs or self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5714 in / 1023 out tokens · 13959 ms · 2026-06-26T08:30:44.266327+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 2 linked inside Pith

[1] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[2] George W. Brown. Iterative solution of games by fictitious play. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation. Wiley, New York, 1951.
[3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
[4] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. In Advances in Neural Information Processing Systems, volume 33, pages 17057–17069, 2020.
[5] Huang Chen and MingJun Dai. Enhancing robustness in multi-agent reinforcement learning via temporal consistency regularization: A self-distillation framework. Knowledge-Based Systems, page 115940, 2026.
[6] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 807–834. PMLR, 2019.
[7] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games, 2016.
[8] Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Rémi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duéñez-Guzmán, et al. Neural replicator dynamics: Multiagent learning via hedging policy gradients. In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Syste...
[9] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
[10] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
[11] URL http://arxiv.org/abs/1908.09453
[12] JB Lanier, Nathan Monette, Pierre Baldi, and Roy Fox. Data-augmented game starts for accelerating self-play exploration in imperfect information games. preprint, 2026.
[13] Hojoon Lee, Hyeonseo Cho, Hyunseung Kim, Donghu Kim, Dugki Min, Jaegul Choo, and Clare Lyle. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks. arXiv preprint arXiv:2406.02596, 2024.
[14] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez, Yuval Tassa, David Silver, and Daniel Pieter Wierstra. Continuous control with deep reinforcement learning, September 15 2020. US Patent 10,776,692.
[15] Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, and Thore Graepel. NeuPL: Neural population learning. In International Conference on Learning Representations, 2022.
[16] Stephen McAleer, John Banister Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In Advances in Neural Information Processing Systems, volume 33, pages 20238–20248, 2020.
[17] Stephen McAleer, John Banister Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, and Roy Fox. Toward optimal policy population growth in two-player zero-sum games. In International Conference on Learning Representations, 2024.
[18] Stephen Marcus McAleer, John Banister Lanier, Kevin A. Wang, Pierre Baldi, and Roy Fox. XDO: A double oracle algorithm for extensive-form games. In Advances in Neural Information Processing Systems, 2021.
[19] Stephen Marcus McAleer, Gabriele Farina, Marc Lanctot, and Tuomas Sandholm. Escher: Eschewing importance sampling in games by computing a history value function to estimate regret. In The Eleventh International Conference on Learning Representations, 2023.
[20] Daniel Morales-Brotons, Thijs Vogels, and Hadrien Hendrikx. Exponential moving average of weights in deep learning: Dynamics and benefits. arXiv preprint arXiv:2411.18704, 2024.
[21] Julien Pérolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Tobias Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-...
[22] Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, and Olivier Bachem. Warp: On the benefits of weight averaged rewarded policies. arXiv preprint arXiv:2406.16768, 2024.
[23] Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games. In International Conference on Learning Representations (ICLR), 2026.
[24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[25] Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In The Eleventh International Conference on Learning Representations, 2023.
[26] Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J. Zico Kolter, and Gabriele Farina. Superhuman AI for Stratego using self-play reinforcement learning and test-time search. arXiv preprint arXiv:2511.07312, 2025.
[27] Eric Steinberger, Adam Lerer, and Noam Brown. DREAM: Deep regret minimization with advantage baselines and model-free learning, 2020.
[28] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019.
[29] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–2399...
[30] Lunjun Zhang and Jimmy Ba. EMA policy gradient: Taming reinforcement learning for LLMs with EMA anchor and top-k KL. arXiv preprint arXiv:2602.04417, 2026.

Appendix A. Related Work

A.1. Two-Player Zero-Sum Game Solving

A central challenge in two-player zero-sum imperfect-information games is that naive self-play with policy gradient methods can cycle or d...

and ESCHER [19] adapt counterfactual regret minimization to function approximation. A third family of regularized policy-gradient methods, including NeuRD [8], R-NaD [21], and magnetic mirror descent [MMD, 25], stabilizes last-iterate convergence through explicit regularization terms in the policy objective. Our work builds on this last family, proposing ...

model soups

takes a different approach, regularizing via reward shaping toward a periodically updated reference policy. At scale, DeepNash gradually transitions between regularization targets using linear interpolation and uses an EMA of the policy parameters to approximate fixed points. However, the regularization targets themselves remain discrete snapshots set a...
