Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes

Luis Leal

arxiv: 2606.28308 · v1 · pith:Z73OLTT7new · submitted 2026-06-26 · 💻 cs.GT · cs.AI· cs.LG· cs.MA

Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes

Luis Leal This is my paper

Pith reviewed 2026-06-29 01:43 UTC · model grok-4.3

classification 💻 cs.GT cs.AIcs.LGcs.MA

keywords Nash equilibrium selectionzero-sum gamesmaximum entropyregret minimizationNash polytopelast-iterate methodssolver dependenceKuhn poker

0 comments

The pith

In zero-sum games with multiple Nash equilibria, solver algorithm determines which member of the set is selected rather than random seed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that different families of solvers systematically converge to different members of the Nash equilibrium set in two-player zero-sum games that admit a convex polytope of equilibria. Regularized last-iterate methods reach the maximum-entropy member, which is the information projection of a uniform reference measure onto the Nash set, while regret-averaging methods reach a lower-entropy face of the same polytope. This pattern is shown exactly on six analytically solvable games, including a two-dimensional Nash polytope and Kuhn poker, and confirmed on a randomized ensemble of 180 games. The selected equilibrium produces measurable differences in performance against suboptimal opponents that grow with sequential and hidden-information structure but remain bounded.

Core claim

Regularized last-iterate methods such as R-NaD and magnetic mirror descent select the maximum-entropy member of the Nash set—the information projection of their uniform reference onto the Nash polytope—exactly on the two-dimensional polytope and at 99.7 percent of maximum entropy in Kuhn poker, whereas regret-averaging methods such as CFR, CFR+, and fictitious play drift to a lower-entropy face; the pattern holds throughout a randomized 180-game ensemble where the regularized method attains the maximum-entropy member in 100 percent of converged games while the regret method sits strictly below it in 94 percent.

What carries the argument

The maximum-entropy member of the Nash set, defined as the information projection of the uniform reference onto the Nash polytope, which regularized last-iterate methods follow through their anchor mechanism.

If this is right

The selected equilibrium affects performance against suboptimal opponents, with the maximum-entropy member acting as a strictly better hedge in sequential games such as Kuhn poker.
Differences between equilibria remain bounded on matrix games and do not produce dominance of one member over another.
Removing the positive-orthant projection from CFR does not eliminate boundary drift to lower-entropy equilibria.
R-NaD selection is anchor-following and therefore not initialization-independent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users of equilibrium solvers may need to select the algorithm according to the entropy or robustness properties desired in the output equilibrium.
The observed pattern suggests testing whether the maximum-entropy selection persists in games with continuous action spaces or more than two players.
The distinction between last-iterate and average-regret families could inform design of hybrid solvers that target specific locations inside Nash polytopes.
Performance gaps against suboptimal opponents may become practically relevant in domains with sequential structure and imperfect information.

Load-bearing premise

The six analytically solvable testbed games together with the randomized 180-game ensemble are representative of the broader class of zero-sum games that possess non-unique Nash equilibria.

What would settle it

A zero-sum game with an analytically known Nash polytope in which a regularized last-iterate method converges to a non-maximum-entropy equilibrium or a regret-averaging method converges to the maximum-entropy equilibrium would falsify the reported selection pattern.

Figures

Figures reproduced from arXiv: 2606.28308 by Luis Leal.

**Figure 2.** Figure 2: Where each converged solver lands on the actual Nash set. Left: Kuhn’s 1-D family (bluff [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Bias versus convergence on Kuhn. Left: as the [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Generalization across a 180-game random ensemble of asymmetric safe-action games. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Exploitability (capped at 1) by solver and game; dashed line is the convergence threshold. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Selection versus iteration budget on asym safe (log x-axis). R-NaD is pinned at the analytic max-entropy coordinate 0.218 at every budget; CFR+ drifts away from it as the budget grows. The family split is not under-convergence. (§4.9). (iv) Regret-averaging methods drift to a lower-entropy face, and this is not caused by the max(R, 0) projection (§4.10). (v) Downstream consequences are bounded; the max-ent… view at source ↗

**Figure 7.** Figure 7: R-NaD on Kuhn is anchor-following. As the initial reference is biased ( [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Robustness of the selected Kuhn equilibrium. The max-entropy (R-NaD) member’s value [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Maximum robustness gap |V (R-NaD) − V (CFR+)| across opponent deviations. The gap is ∼5.6× larger in the hidden-information EFG (Kuhn) than in the matrix games, but bounded (< 0.02) in absolute terms. absence of function-approximation and sampling (e.g. Monte-Carlo CFR or deep) experiments, and the restriction of the random ensemble to matrix games with a single extensive-form instance, reflect this comput… view at source ↗

read the original abstract

Many two-player zero-sum games admit not a unique Nash equilibrium but a convex set of them: a polytope of profiles that all share the minimax value V* yet prescribe different behaviour. Standard solvers each converge to some equilibrium and are treated as interchangeable. We ask whether they instead select different members of the Nash set, systematically as a function of the algorithm rather than the seed. Using a tabular, exactly solvable testbed of six games with analytically known Nash sets -- including a two-dimensional Nash polytope and Kuhn poker -- we find that (i) selection is determined by the algorithm, not the seed, but families differ only on asymmetric Nash sets; (ii) regularized last-iterate methods (R-NaD, magnetic mirror descent) select the maximum-entropy member, the information projection of their uniform reference onto the Nash set -- exactly on the 2-D polytope and at 99.7% of maximum entropy in Kuhn -- while regret-averaging methods (CFR, CFR+, fictitious play) drift to a lower-entropy face; we confirm this on a randomized 180-game ensemble, where R-NaD attains the maximum-entropy member in 100% of converged games while CFR+ sits strictly below it in 94% (paired Wilcoxon p < 10^-27); (iii) the selected member has downstream consequences against sub-optimal opponents that scale with sequential/hidden-information structure but stay bounded -- in Kuhn the max-entropy member is a strictly better hedge, whereas on the matrix games the members differ without either dominating. We also report two negative results correcting common intuitions: removing CFR's positive-orthant (max(R,0)) projection does not eliminate boundary drift; and R-NaD's selection is anchor-following, not initialization-independent. We state the maximum-entropy / I-projection characterization as a strongly data-supported conjecture, checked throughout against analytic ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Solver families pick different Nash points on non-unique sets, with regularized last-iterate methods hitting the max-entropy member and regret averagers drifting lower.

read the letter

The key point is that standard Nash solvers are not interchangeable when the equilibrium set is a polytope rather than a singleton; the choice of algorithm family determines which member you land on, and this shows up consistently rather than as seed noise.

The paper grounds this in six exactly solvable games with known analytic Nash sets, including a 2-D polytope and Kuhn poker, then checks the pattern on a 180-game randomized ensemble. Regularized methods (R-NaD, magnetic mirror descent) reach the maximum-entropy equilibrium in all converged cases on the ensemble and hit 99.7% of max entropy on Kuhn, matching the information projection of the uniform reference; CFR and CFR+ land strictly below in 94% of cases with strong Wilcoxon support. They also measure downstream exploitability against suboptimal opponents and report two negative results that correct prior intuitions. The analytic verification plus the statistical tests on the ensemble give the claims real weight.

The main soft spot is the 180-game ensemble itself. No details appear on the randomization procedure or on metrics that would show the sampled polytopes are diverse in dimension and symmetry, so it remains possible the observed split is stronger on this particular draw than it would be on a broader class of zero-sum games. The hand-chosen testbeds are convincing on their own, but the generalization step rests on an unexamined sampling assumption.

This is useful for anyone building or choosing solvers in multi-agent RL or algorithmic game theory who needs to know what equilibrium they are actually computing. It deserves a serious referee because the analytic cases and the statistical results are concrete enough to evaluate directly.

Referee Report

1 major / 1 minor

Summary. The paper claims that Nash equilibrium solvers are not interchangeable in zero-sum games whose Nash set is a non-singleton polytope: regularized last-iterate methods (R-NaD, magnetic mirror descent) select the maximum-entropy member (the information projection of a uniform reference onto the Nash set), while regret-averaging methods (CFR, CFR+, fictitious play) select lower-entropy faces. This is shown exactly on six analytically solvable testbed games (including a 2-D polytope and Kuhn poker at 99.7% of max entropy) and statistically on a randomized 180-game ensemble (R-NaD attains the max-entropy member in 100% of converged games; CFR+ is strictly below in 94%, paired Wilcoxon p < 10^{-27}). The selected member has downstream consequences against suboptimal opponents that scale with sequential structure but remain bounded; two negative results correct common intuitions about CFR's projection and R-NaD's initialization dependence. The max-entropy/I-projection characterization is stated as a strongly data-supported conjecture.

Significance. If the results hold, the work shows that algorithm choice induces systematic, non-arbitrary selection within the Nash set, with measurable effects on entropy and robustness, particularly in sequential/hidden-information games. Credit is due for the verification against exact analytic Nash sets on the small testbeds, the inclusion of negative results that test intuitions, and the statistical confirmation (with p-values) on the ensemble; these elements make the empirical component unusually well-grounded for an algorithmic study in game theory.

major comments (1)

[Methods (randomized 180-game ensemble)] Methods (description of the randomized 180-game ensemble): the randomization procedure is not explicitly characterized (distribution over matrix sizes, payoff ranges, or constraints on Nash polytope dimension or symmetry). This is load-bearing for the generalization of the 100%/94% selection split and the I-projection conjecture, because the ensemble could over-sample low-dimensional or symmetric polytopes similar to the six hand-chosen testbeds.

minor comments (1)

[Abstract] Abstract: the phrase 'the information projection of their uniform reference onto the Nash set' is used without a one-sentence gloss or forward reference, which may hinder readers who encounter the term for the first time.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for noting the positive aspects of the empirical design. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Methods (randomized 180-game ensemble)] Methods (description of the randomized 180-game ensemble): the randomization procedure is not explicitly characterized (distribution over matrix sizes, payoff ranges, or constraints on Nash polytope dimension or symmetry). This is load-bearing for the generalization of the 100%/94% selection split and the I-projection conjecture, because the ensemble could over-sample low-dimensional or symmetric polytopes similar to the six hand-chosen testbeds.

Authors: We agree that the randomization procedure must be fully specified for the statistical results and conjecture to be interpretable. In the revised manuscript we will add an explicit description of the ensemble generation process, including the distributions over matrix dimensions, payoff ranges, and any constraints or sampling rules used to control Nash polytope dimension and symmetry. This addition will allow independent verification that the ensemble is not inadvertently biased toward the low-dimensional or symmetric cases already examined in the analytic testbeds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external analytic ground truth and randomized ensemble

full rationale

The paper presents an empirical study comparing solver behavior on six analytically solvable games (with known Nash sets) and a 180-game randomized ensemble. The maximum-entropy / I-projection characterization is explicitly labeled a 'strongly data-supported conjecture' checked against analytic ground truth rather than derived from the paper's own equations. No self-citations, fitted parameters renamed as predictions, or self-definitional steps appear in the provided text. The central observations (algorithm-dependent selection, 100%/94% split, downstream consequences) are direct measurements on external benchmarks, satisfying the criteria for a self-contained result against independent verification. Representativeness of the ensemble is an assumption about scope, not a circular reduction in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard domain fact that Nash equilibrium sets in finite zero-sum games are convex polytopes together with the assumption that the chosen testbed games have correctly computed analytic equilibria.

axioms (1)

domain assumption Nash equilibrium sets in two-player zero-sum games are convex polytopes
Invoked to define the testbed and the notion of selecting different members of the set.

pith-pipeline@v0.9.1-grok · 5890 in / 1261 out tokens · 44580 ms · 2026-06-29T01:43:23.666721+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 12 canonical work pages · 1 internal anchor

[1]

J. P. Bailey and G. Piliouras. Multiplicative weights update in zero-sum games. ACM Conference on Economics and Computation (EC), 2018. DOI: 10.1145/3219166.3219235

work page doi:10.1145/3219166.3219235 2018
[2]

Balduzzi, K

D. Balduzzi, K. Tuyls, J. P\'erolat, and T. Graepel. Re-evaluating evaluation. NeurIPS, 2018. arXiv:1806.02643

Pith/arXiv arXiv 2018
[3]

Bowling, N

M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145--149, 2015. DOI: 10.1126/science.1259433

work page doi:10.1126/science.1259433 2015
[4]

G. W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 1951

1951
[5]

Csisz\'ar

I. Csisz\'ar. I -divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146--158, 1975. DOI: 10.1214/aop/1176996454

work page doi:10.1214/aop/1176996454 1975
[6]

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013

2013
[7]

Flokas, E.-V

L. Flokas, E.-V. Vlatakis-Gkaragkounis, T. Lianeas, P. Mertikopoulos, and G. Piliouras. No-regret learning and mixed Nash equilibria: They do not mix. NeurIPS, 2020. DOI: 10.48550/arxiv.2010.09514

work page doi:10.48550/arxiv.2010.09514 2020
[8]

Gemp et al

I. Gemp et al. Sample-based approximation of Nash in large many-player games via gradient descent. AAMAS, 2022. arXiv:2106.01285

arXiv 2022
[9]

J. C. Harsanyi and R. Selten. A General Theory of Equilibrium Selection in Games. MIT Press, 1988

1988
[10]

Hennes et al

D. Hennes et al. Neural replicator dynamics: Multiagent learning via hedging policy gradients. AAMAS, 2020. DOI: 10.65109/gjmw6851

work page doi:10.65109/gjmw6851 2020
[11]

E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620--630, 1957. DOI: 10.1103/physrev.106.620

work page doi:10.1103/physrev.106.620 1957
[12]

R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6--38, 1995. DOI: 10.1006/game.1995.1023

work page doi:10.1006/game.1995.1023 1995
[13]

Mertikopoulos, C

P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018. DOI: 10.1137/1.9781611975031.172

work page doi:10.1137/1.9781611975031.172 2018
[14]

P\'erolat et al

J. P\'erolat et al. From Poincar\'e recurrence to convergence in imperfect-information games: Finding equilibrium via regularization. ICML, 2021

2021
[15]

P\'erolat et al

J. P\'erolat et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990--996, 2022. DOI: 10.1126/science.add4679

work page doi:10.1126/science.add4679 2022
[16]

Robinson

J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54(2):296--301, 1951. DOI: 10.2307/1969530

work page doi:10.2307/1969530 1951
[17]

Sokota et al

S. Sokota et al. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. ICLR, 2023. arXiv:2206.05825

arXiv 2023
[18]

Tammelin

O. Tammelin. Solving large imperfect information games using CFR+. arXiv:1407.5042, 2014

Pith/arXiv arXiv 2014
[19]

T. L. Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243--263, 2005. DOI: 10.1016/j.geb.2004.04.003

work page doi:10.1016/j.geb.2004.04.003 2005
[20]

J. Weed. An explicit analysis of the entropic penalty in linear programming. Conference on Learning Theory (COLT), PMLR 75, 2018. DOI: 10.48550/arxiv.1806.01879

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1806.01879 2018
[21]

Zinkevich, M

M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. NeurIPS, 2007

2007

[1] [1]

J. P. Bailey and G. Piliouras. Multiplicative weights update in zero-sum games. ACM Conference on Economics and Computation (EC), 2018. DOI: 10.1145/3219166.3219235

work page doi:10.1145/3219166.3219235 2018

[2] [2]

Balduzzi, K

D. Balduzzi, K. Tuyls, J. P\'erolat, and T. Graepel. Re-evaluating evaluation. NeurIPS, 2018. arXiv:1806.02643

Pith/arXiv arXiv 2018

[3] [3]

Bowling, N

M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145--149, 2015. DOI: 10.1126/science.1259433

work page doi:10.1126/science.1259433 2015

[4] [4]

G. W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 1951

1951

[5] [5]

Csisz\'ar

I. Csisz\'ar. I -divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146--158, 1975. DOI: 10.1214/aop/1176996454

work page doi:10.1214/aop/1176996454 1975

[6] [6]

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013

2013

[7] [7]

Flokas, E.-V

L. Flokas, E.-V. Vlatakis-Gkaragkounis, T. Lianeas, P. Mertikopoulos, and G. Piliouras. No-regret learning and mixed Nash equilibria: They do not mix. NeurIPS, 2020. DOI: 10.48550/arxiv.2010.09514

work page doi:10.48550/arxiv.2010.09514 2020

[8] [8]

Gemp et al

I. Gemp et al. Sample-based approximation of Nash in large many-player games via gradient descent. AAMAS, 2022. arXiv:2106.01285

arXiv 2022

[9] [9]

J. C. Harsanyi and R. Selten. A General Theory of Equilibrium Selection in Games. MIT Press, 1988

1988

[10] [10]

Hennes et al

D. Hennes et al. Neural replicator dynamics: Multiagent learning via hedging policy gradients. AAMAS, 2020. DOI: 10.65109/gjmw6851

work page doi:10.65109/gjmw6851 2020

[11] [11]

E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620--630, 1957. DOI: 10.1103/physrev.106.620

work page doi:10.1103/physrev.106.620 1957

[12] [12]

R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6--38, 1995. DOI: 10.1006/game.1995.1023

work page doi:10.1006/game.1995.1023 1995

[13] [13]

Mertikopoulos, C

P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018. DOI: 10.1137/1.9781611975031.172

work page doi:10.1137/1.9781611975031.172 2018

[14] [14]

P\'erolat et al

J. P\'erolat et al. From Poincar\'e recurrence to convergence in imperfect-information games: Finding equilibrium via regularization. ICML, 2021

2021

[15] [15]

P\'erolat et al

J. P\'erolat et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990--996, 2022. DOI: 10.1126/science.add4679

work page doi:10.1126/science.add4679 2022

[16] [16]

Robinson

J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54(2):296--301, 1951. DOI: 10.2307/1969530

work page doi:10.2307/1969530 1951

[17] [17]

Sokota et al

S. Sokota et al. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. ICLR, 2023. arXiv:2206.05825

arXiv 2023

[18] [18]

Tammelin

O. Tammelin. Solving large imperfect information games using CFR+. arXiv:1407.5042, 2014

Pith/arXiv arXiv 2014

[19] [19]

T. L. Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243--263, 2005. DOI: 10.1016/j.geb.2004.04.003

work page doi:10.1016/j.geb.2004.04.003 2005

[20] [20]

J. Weed. An explicit analysis of the entropic penalty in linear programming. Conference on Learning Theory (COLT), PMLR 75, 2018. DOI: 10.48550/arxiv.1806.01879

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1806.01879 2018

[21] [21]

Zinkevich, M

M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. NeurIPS, 2007

2007