DREAM: deep regret minimization with advantage baselines and model-free learning. CoRR, abs/2006.10410
4 Pith papers cite this work. Polarity classification is still indexing.
Citing papers
- NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
  NashPG is a policy-gradient method with iteratively refined regularization that guarantees monotonic convergence to Nash equilibria in two-player zero-sum extensive-form games and scales to large benchmarks.
- Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria
  Phi-Actor-Critic is a new method that steers multi-agent reinforcement learning toward Pareto-efficient correlated equilibria using regret minimization and Lagrangian selection.
- Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
  DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.
- AlphaExploitem: Going Beyond the Nash Equilibrium in Poker by Learning to Exploit Suboptimal Play
  AlphaExploitem adds a hierarchical transformer encoder and a diverse pool of exploitable opponents to AlphaHoldem, enabling exploitation of suboptimal poker play while preserving performance against Nash-equilibrium opponents.