pith. sign in

arxiv: 2606.28308 · v1 · pith:Z73OLTT7new · submitted 2026-06-26 · 💻 cs.GT · cs.AI· cs.LG· cs.MA

Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes

Pith reviewed 2026-06-29 01:43 UTC · model grok-4.3

classification 💻 cs.GT cs.AIcs.LGcs.MA
keywords Nash equilibrium selectionzero-sum gamesmaximum entropyregret minimizationNash polytopelast-iterate methodssolver dependenceKuhn poker
0
0 comments X

The pith

In zero-sum games with multiple Nash equilibria, solver algorithm determines which member of the set is selected rather than random seed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that different families of solvers systematically converge to different members of the Nash equilibrium set in two-player zero-sum games that admit a convex polytope of equilibria. Regularized last-iterate methods reach the maximum-entropy member, which is the information projection of a uniform reference measure onto the Nash set, while regret-averaging methods reach a lower-entropy face of the same polytope. This pattern is shown exactly on six analytically solvable games, including a two-dimensional Nash polytope and Kuhn poker, and confirmed on a randomized ensemble of 180 games. The selected equilibrium produces measurable differences in performance against suboptimal opponents that grow with sequential and hidden-information structure but remain bounded.

Core claim

Regularized last-iterate methods such as R-NaD and magnetic mirror descent select the maximum-entropy member of the Nash set—the information projection of their uniform reference onto the Nash polytope—exactly on the two-dimensional polytope and at 99.7 percent of maximum entropy in Kuhn poker, whereas regret-averaging methods such as CFR, CFR+, and fictitious play drift to a lower-entropy face; the pattern holds throughout a randomized 180-game ensemble where the regularized method attains the maximum-entropy member in 100 percent of converged games while the regret method sits strictly below it in 94 percent.

What carries the argument

The maximum-entropy member of the Nash set, defined as the information projection of the uniform reference onto the Nash polytope, which regularized last-iterate methods follow through their anchor mechanism.

If this is right

  • The selected equilibrium affects performance against suboptimal opponents, with the maximum-entropy member acting as a strictly better hedge in sequential games such as Kuhn poker.
  • Differences between equilibria remain bounded on matrix games and do not produce dominance of one member over another.
  • Removing the positive-orthant projection from CFR does not eliminate boundary drift to lower-entropy equilibria.
  • R-NaD selection is anchor-following and therefore not initialization-independent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users of equilibrium solvers may need to select the algorithm according to the entropy or robustness properties desired in the output equilibrium.
  • The observed pattern suggests testing whether the maximum-entropy selection persists in games with continuous action spaces or more than two players.
  • The distinction between last-iterate and average-regret families could inform design of hybrid solvers that target specific locations inside Nash polytopes.
  • Performance gaps against suboptimal opponents may become practically relevant in domains with sequential structure and imperfect information.

Load-bearing premise

The six analytically solvable testbed games together with the randomized 180-game ensemble are representative of the broader class of zero-sum games that possess non-unique Nash equilibria.

What would settle it

A zero-sum game with an analytically known Nash polytope in which a regularized last-iterate method converges to a non-maximum-entropy equilibrium or a regret-averaging method converges to the maximum-entropy equilibrium would falsify the reported selection pattern.

Figures

Figures reproduced from arXiv: 2606.28308 by Luis Leal.

Figure 1
Figure 1. Figure 1: Mean policy entropy of the selected profile, by solver and game. On symmetric games [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Where each converged solver lands on the actual Nash set. Left: Kuhn’s 1-D family (bluff [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bias versus convergence on Kuhn. Left: as the [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generalization across a 180-game random ensemble of asymmetric safe-action games. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Exploitability (capped at 1) by solver and game; dashed line is the convergence threshold. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Selection versus iteration budget on asym safe (log x-axis). R-NaD is pinned at the analytic max-entropy coordinate 0.218 at every budget; CFR+ drifts away from it as the budget grows. The family split is not under-convergence. (§4.9). (iv) Regret-averaging methods drift to a lower-entropy face, and this is not caused by the max(R, 0) projection (§4.10). (v) Downstream consequences are bounded; the max-ent… view at source ↗
Figure 7
Figure 7. Figure 7: R-NaD on Kuhn is anchor-following. As the initial reference is biased ( [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Robustness of the selected Kuhn equilibrium. The max-entropy (R-NaD) member’s value [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Maximum robustness gap |V (R-NaD) − V (CFR+)| across opponent deviations. The gap is ∼5.6× larger in the hidden-information EFG (Kuhn) than in the matrix games, but bounded (< 0.02) in absolute terms. absence of function-approximation and sampling (e.g. Monte-Carlo CFR or deep) experiments, and the restriction of the random ensemble to matrix games with a single extensive-form instance, reflect this comput… view at source ↗
read the original abstract

Many two-player zero-sum games admit not a unique Nash equilibrium but a convex set of them: a polytope of profiles that all share the minimax value V* yet prescribe different behaviour. Standard solvers each converge to some equilibrium and are treated as interchangeable. We ask whether they instead select different members of the Nash set, systematically as a function of the algorithm rather than the seed. Using a tabular, exactly solvable testbed of six games with analytically known Nash sets -- including a two-dimensional Nash polytope and Kuhn poker -- we find that (i) selection is determined by the algorithm, not the seed, but families differ only on asymmetric Nash sets; (ii) regularized last-iterate methods (R-NaD, magnetic mirror descent) select the maximum-entropy member, the information projection of their uniform reference onto the Nash set -- exactly on the 2-D polytope and at 99.7% of maximum entropy in Kuhn -- while regret-averaging methods (CFR, CFR+, fictitious play) drift to a lower-entropy face; we confirm this on a randomized 180-game ensemble, where R-NaD attains the maximum-entropy member in 100% of converged games while CFR+ sits strictly below it in 94% (paired Wilcoxon p < 10^-27); (iii) the selected member has downstream consequences against sub-optimal opponents that scale with sequential/hidden-information structure but stay bounded -- in Kuhn the max-entropy member is a strictly better hedge, whereas on the matrix games the members differ without either dominating. We also report two negative results correcting common intuitions: removing CFR's positive-orthant (max(R,0)) projection does not eliminate boundary drift; and R-NaD's selection is anchor-following, not initialization-independent. We state the maximum-entropy / I-projection characterization as a strongly data-supported conjecture, checked throughout against analytic ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that Nash equilibrium solvers are not interchangeable in zero-sum games whose Nash set is a non-singleton polytope: regularized last-iterate methods (R-NaD, magnetic mirror descent) select the maximum-entropy member (the information projection of a uniform reference onto the Nash set), while regret-averaging methods (CFR, CFR+, fictitious play) select lower-entropy faces. This is shown exactly on six analytically solvable testbed games (including a 2-D polytope and Kuhn poker at 99.7% of max entropy) and statistically on a randomized 180-game ensemble (R-NaD attains the max-entropy member in 100% of converged games; CFR+ is strictly below in 94%, paired Wilcoxon p < 10^{-27}). The selected member has downstream consequences against suboptimal opponents that scale with sequential structure but remain bounded; two negative results correct common intuitions about CFR's projection and R-NaD's initialization dependence. The max-entropy/I-projection characterization is stated as a strongly data-supported conjecture.

Significance. If the results hold, the work shows that algorithm choice induces systematic, non-arbitrary selection within the Nash set, with measurable effects on entropy and robustness, particularly in sequential/hidden-information games. Credit is due for the verification against exact analytic Nash sets on the small testbeds, the inclusion of negative results that test intuitions, and the statistical confirmation (with p-values) on the ensemble; these elements make the empirical component unusually well-grounded for an algorithmic study in game theory.

major comments (1)
  1. [Methods (randomized 180-game ensemble)] Methods (description of the randomized 180-game ensemble): the randomization procedure is not explicitly characterized (distribution over matrix sizes, payoff ranges, or constraints on Nash polytope dimension or symmetry). This is load-bearing for the generalization of the 100%/94% selection split and the I-projection conjecture, because the ensemble could over-sample low-dimensional or symmetric polytopes similar to the six hand-chosen testbeds.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'the information projection of their uniform reference onto the Nash set' is used without a one-sentence gloss or forward reference, which may hinder readers who encounter the term for the first time.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for noting the positive aspects of the empirical design. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods (randomized 180-game ensemble)] Methods (description of the randomized 180-game ensemble): the randomization procedure is not explicitly characterized (distribution over matrix sizes, payoff ranges, or constraints on Nash polytope dimension or symmetry). This is load-bearing for the generalization of the 100%/94% selection split and the I-projection conjecture, because the ensemble could over-sample low-dimensional or symmetric polytopes similar to the six hand-chosen testbeds.

    Authors: We agree that the randomization procedure must be fully specified for the statistical results and conjecture to be interpretable. In the revised manuscript we will add an explicit description of the ensemble generation process, including the distributions over matrix dimensions, payoff ranges, and any constraints or sampling rules used to control Nash polytope dimension and symmetry. This addition will allow independent verification that the ensemble is not inadvertently biased toward the low-dimensional or symmetric cases already examined in the analytic testbeds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external analytic ground truth and randomized ensemble

full rationale

The paper presents an empirical study comparing solver behavior on six analytically solvable games (with known Nash sets) and a 180-game randomized ensemble. The maximum-entropy / I-projection characterization is explicitly labeled a 'strongly data-supported conjecture' checked against analytic ground truth rather than derived from the paper's own equations. No self-citations, fitted parameters renamed as predictions, or self-definitional steps appear in the provided text. The central observations (algorithm-dependent selection, 100%/94% split, downstream consequences) are direct measurements on external benchmarks, satisfying the criteria for a self-contained result against independent verification. Representativeness of the ensemble is an assumption about scope, not a circular reduction in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard domain fact that Nash equilibrium sets in finite zero-sum games are convex polytopes together with the assumption that the chosen testbed games have correctly computed analytic equilibria.

axioms (1)
  • domain assumption Nash equilibrium sets in two-player zero-sum games are convex polytopes
    Invoked to define the testbed and the notion of selecting different members of the set.

pith-pipeline@v0.9.1-grok · 5890 in / 1261 out tokens · 44580 ms · 2026-06-29T01:43:23.666721+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    J. P. Bailey and G. Piliouras. Multiplicative weights update in zero-sum games. ACM Conference on Economics and Computation (EC), 2018. DOI: 10.1145/3219166.3219235

  2. [2]

    Balduzzi, K

    D. Balduzzi, K. Tuyls, J. P\'erolat, and T. Graepel. Re-evaluating evaluation. NeurIPS, 2018. arXiv:1806.02643

  3. [3]

    Bowling, N

    M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145--149, 2015. DOI: 10.1126/science.1259433

  4. [4]

    G. W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 1951

  5. [5]

    Csisz\'ar

    I. Csisz\'ar. I -divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146--158, 1975. DOI: 10.1214/aop/1176996454

  6. [6]

    M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013

  7. [7]

    Flokas, E.-V

    L. Flokas, E.-V. Vlatakis-Gkaragkounis, T. Lianeas, P. Mertikopoulos, and G. Piliouras. No-regret learning and mixed Nash equilibria: They do not mix. NeurIPS, 2020. DOI: 10.48550/arxiv.2010.09514

  8. [8]

    Gemp et al

    I. Gemp et al. Sample-based approximation of Nash in large many-player games via gradient descent. AAMAS, 2022. arXiv:2106.01285

  9. [9]

    J. C. Harsanyi and R. Selten. A General Theory of Equilibrium Selection in Games. MIT Press, 1988

  10. [10]

    Hennes et al

    D. Hennes et al. Neural replicator dynamics: Multiagent learning via hedging policy gradients. AAMAS, 2020. DOI: 10.65109/gjmw6851

  11. [11]

    E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620--630, 1957. DOI: 10.1103/physrev.106.620

  12. [12]

    R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6--38, 1995. DOI: 10.1006/game.1995.1023

  13. [13]

    Mertikopoulos, C

    P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018. DOI: 10.1137/1.9781611975031.172

  14. [14]

    P\'erolat et al

    J. P\'erolat et al. From Poincar\'e recurrence to convergence in imperfect-information games: Finding equilibrium via regularization. ICML, 2021

  15. [15]

    P\'erolat et al

    J. P\'erolat et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990--996, 2022. DOI: 10.1126/science.add4679

  16. [16]

    Robinson

    J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54(2):296--301, 1951. DOI: 10.2307/1969530

  17. [17]

    Sokota et al

    S. Sokota et al. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. ICLR, 2023. arXiv:2206.05825

  18. [18]

    Tammelin

    O. Tammelin. Solving large imperfect information games using CFR+. arXiv:1407.5042, 2014

  19. [19]

    T. L. Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243--263, 2005. DOI: 10.1016/j.geb.2004.04.003

  20. [20]

    J. Weed. An explicit analysis of the entropic penalty in linear programming. Conference on Learning Theory (COLT), PMLR 75, 2018. DOI: 10.48550/arxiv.1806.01879

  21. [21]

    Zinkevich, M

    M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. NeurIPS, 2007