pith. sign in

arxiv: 2606.01159 · v1 · pith:DQR6FFTWnew · submitted 2026-05-31 · 💻 cs.LG · cs.GT

Fairness in two-player zero-sum games with bandit feedback

Pith reviewed 2026-06-28 17:11 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords zero-sum gamesbandit feedbackfairness constraintsregret minimizationexplore-then-commitmixed equilibriaprice of fairness
0
0 comments X

The pith

Fair strategies in zero-sum games reduce to standard strategies on a modified payoff matrix, enabling an explore-then-commit algorithm with order T to the two-thirds regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how fairness constraints requiring each action to be played with probability at least alpha over m can be incorporated into two-player zero-sum games observed through bandit feedback. Any fair strategy is rewritten as a uniform component plus a scaled standard strategy, which produces an equivalent payoff expression on a new matrix whose entries blend the original payoffs with column averages. This equivalence transfers classical equilibrium existence and stability results to the fair setting and supports derivation of the fair minimax value along with a dual bound on the price of fairness. The resulting algorithm attains sublinear regret of order T to the two-thirds against general mixed fair equilibria.

Core claim

By decomposing any fair strategy p as p equals (alpha over m) times the all-ones vector plus (1 minus alpha) times a standard simplex vector p tilde, and defining the fair payoff matrix A tilde equals (1 minus alpha) A plus alpha times the outer product of the all-ones vector with the column-mean vector c, the expected payoff p transpose A q equals p tilde transpose A tilde q. The fair game on A is therefore identical to an ordinary zero-sum game on A tilde. Equilibrium existence, KKT conditions, and LP basis stability therefore carry over directly from the unconstrained case, and an explore-then-commit procedure achieves order T to the two-thirds regret for mixed fair equilibria while sharp

What carries the argument

The reparametrization that expresses every fair strategy as a mixture of uniform play and a free strategy, together with the associated fair payoff matrix A tilde that converts the constrained game into an equivalent unconstrained zero-sum game.

If this is right

  • The price of fairness is at most alpha times (1 minus 1 over m) and is zero whenever the unconstrained equilibrium already assigns positive probability to every action.
  • The explore-then-commit procedure attains sublinear regret for any mixed fair equilibrium.
  • When the fair equilibrium reduces to a single dominant action the regret bound becomes instance-dependent and scales as the inverse square of the modified gap.
  • Naive action elimination does not improve the general T to the two-thirds bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition may let similar regret arguments apply to other linear constraints on mixed strategies in games.
  • Tightness of the T to the two-thirds rate could be checked by building explicit game instances whose fair equilibria require mixing over multiple actions.
  • The dual representation of the price of fairness may connect to constrained optimization problems arising in other multi-agent learning settings.

Load-bearing premise

Substituting the decomposed fair strategy into the bilinear payoff always produces an equivalent standard game on the adjusted matrix A tilde, so that all classical zero-sum results apply unchanged.

What would settle it

Construct a game whose fair equilibrium is mixed, run the Fair-ETC-TPZSG algorithm, and check whether cumulative regret grows faster than order T to the two-thirds; a consistent violation would falsify the main bound.

read the original abstract

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $\alpha/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (\alpha/m)\mathbf{1} + (1-\alpha)\widetilde{p}$ with $\widetilde{p} \in \Delta_m$, and substituting into the payoff form yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-\alpha)A + \alpha\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most $\alpha(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $\Delta_m$, the bound sharpens to instance-dependent $\widetilde{O}(1/\widetilde{\Delta}(\alpha)^{2})$, where $\widetilde{\Delta}(\alpha)$ is the LP-margin gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper studies two-player zero-sum games with bandit feedback under fairness constraints (each action played with probability at least α/m). It introduces a reparametrization decomposing any fair strategy p as p = (α/m)1 + (1-α)p̃ with p̃ in the simplex, yielding the identity p^T A q = p̃^T à q for the transformed fair payoff matrix à := (1-α)A + α 1 c^T (c the column means). This reduces the fair game to a standard zero-sum game on Ã, so that equilibrium existence, KKT conditions, and LP stability carry over. The main result is an Õ(T^{2/3}) regret bound for the Explore-Then-Commit algorithm Fair-ETC-TPZSG that applies to general mixed fair equilibria; the bound improves to instance-dependent Õ(1/Δ̃(α)^2) when the fair equilibrium corresponds to a vertex in the transformed game. The price of fairness is shown to be at most α(1-1/m) and vanishes when the unconstrained equilibrium already has full support.

Significance. If the reparametrization and regret analysis are fully verified, the work supplies the first regret guarantees for fairness-constrained mixed equilibria in bandit zero-sum games. The reduction to classical results on à is technically clean and the instance-dependent sharpening when p̃* is a vertex is a useful feature. The explicit bound on the price of fairness is a positive addition. The discussion of why naive action elimination fails to improve the rate is also valuable.

major comments (1)
  1. [Abstract] Abstract and reparametrization paragraph: the central identity p^T A q = p̃^T à q is asserted without an explicit algebraic expansion or verification that à preserves the simplex structure, equilibrium existence, and KKT/LP properties needed for the subsequent regret analysis. This equivalence is load-bearing for all claims that follow from the reduction to classical zero-sum results.
minor comments (2)
  1. The T^{2/3} exponent for Fair-ETC-TPZSG follows the standard explore-then-commit tradeoff, but the manuscript should include the explicit sample-allocation argument and concentration bounds used to obtain the Õ(T^{2/3}) rate.
  2. Notation for the transformed gap Δ̃(α) and the fair minimax value should be introduced with a dedicated definition early in the text rather than only in the abstract.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive comment. The single major comment is addressed below; we will incorporate the requested clarification.

read point-by-point responses
  1. Referee: [Abstract] Abstract and reparametrization paragraph: the central identity p^T A q = p̃^T à q is asserted without an explicit algebraic expansion or verification that à preserves the simplex structure, equilibrium existence, and KKT/LP properties needed for the subsequent regret analysis. This equivalence is load-bearing for all claims that follow from the reduction to classical zero-sum results.

    Authors: We agree that an explicit algebraic expansion of the identity was omitted from the abstract and introductory reparametrization paragraph. The verification is as follows. Let p = (α/m)1 + (1-α)p̃ with p̃ ∈ Δ_m. Then p^T A q = (α/m)1^T A q + (1-α)p̃^T A q. Since c_j = (1/m)∑_i A(i,j), we have 1^T A q = m c^T q, so the first term equals α c^T q. For the transformed matrix à := (1-α)A + α 1 c^T we obtain p̃^T à q = (1-α)p̃^T A q + α (p̃^T 1) c^T q = (1-α)p̃^T A q + α c^T q, which matches p^T A q. Because p̃ ∈ Δ_m implies p ∈ Δ_m^α (the fair simplex) and à is an affine transformation of A, the image of Δ_m under the map remains inside the probability simplex; standard minimax, KKT, and LP-basis arguments therefore carry over verbatim to the game on Ã. We will insert this short derivation (and the accompanying sentence confirming preservation of the required structural properties) into the revised abstract and Section 2. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's derivation begins with an explicit algebraic reparametrization p = (α/m)1 + (1-α)p̃ that produces the identity p^T A q = p̃^T Ã q by direct expansion, reducing the constrained fair game to an unconstrained zero-sum game on Ã. Equilibrium existence, KKT conditions, and the subsequent ETC regret bound then follow from standard results on the transformed matrix game, with the T^{2/3} rate arising from the usual explore-then-commit allocation tradeoff rather than any self-referential definition or fitted parameter. No self-citations are invoked as load-bearing premises, no quantity is defined in terms of a later prediction, and the instance-dependent sharpening when p̃* is a vertex likewise follows from classical gap-based stopping rules. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the algebraic validity of the reparametrization and the applicability of classical zero-sum equilibrium and regret results to the linearly transformed matrix; no new free parameters are introduced beyond the given fairness level α.

axioms (2)
  • domain assumption The payoff matrix A is known to both players and the interaction is strictly zero-sum.
    Standard setup invoked when the reparametrization is substituted into p^T A q.
  • domain assumption Bandit feedback reveals only the payoff of the chosen action pair at each round.
    Required for the regret analysis of the Explore-Then-Commit procedure.

pith-pipeline@v0.9.1-grok · 5896 in / 1556 out tokens · 30100 ms · 2026-06-28T17:11:56.037666+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Constrained M arkov Decision Processes

    Eitan Altman. Constrained M arkov Decision Processes . Stochastic Modeling Series. Chapman & Hall/CRC, Boca Raton, FL, 1999

  2. [2]

    Tsitsiklis

    Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, Belmont, MA, 1997

  3. [4]

    Instance-dependent regret bounds for learning two-player zero-sum games with bandit feedback

    Shinji Ito, Haipeng Luo, Taira Tsuchiya, and Yue Wu. Instance-dependent regret bounds for learning two-player zero-sum games with bandit feedback. In The Thirty Eighth Annual Conference on Learning Theory, pages 2858--2892. PMLR, 2025

  4. [5]

    Fairness in reinforcement learning

    Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 1617--1626. PMLR, 2017. URL https://proceedings.mlr.press/v70/jabbari17a.html

  5. [6]

    Morgenstern, and Aaron Roth

    Matthew Joseph, Michael Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems 29 (NeurIPS), pages 325--333, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/eb163727917cbba1eea208541a643e74-Abstract.html

  6. [7]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020

  7. [8]

    Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C. Parkes. Calibrated fairness in bandits. In Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), 2017. URL https://arxiv.org/abs/1707.01875. arXiv:1707.01875

  8. [9]

    Trading regret for efficiency: Online convex optimization with long term constraints

    Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13 0 (81): 0 2503--2528, 2012. URL https://jmlr.org/papers/v13/mahdavi12a.html

  9. [10]

    Arnab Maiti, Kevin Jamieson, and Lillian J. Ratliff. Instance-dependent sample complexity bounds for zero-sum matrix games. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pages 9429--9469. PMLR, 2023. URL https://proceedings.mlr.press/v206/mai...

  10. [11]

    John F. Nash. Non-cooperative games. Annals of Mathematics, 54 0 (2): 0 286--295, 1951

  11. [12]

    Narahari

    Vishakha Patil, Ganesh Ghalme, Vineet Nair, and Y. Narahari. Achieving fairness in the stochastic multi-armed bandit problem. Journal of Machine Learning Research, 22 0 (174): 0 1--31, 2021. URL https://jmlr.org/papers/v22/20-704.html

  12. [13]

    Tyrrell Rockafellar

    R. Tyrrell Rockafellar. Convex Analysis. Number 28 in Princeton Mathematical Series. Princeton University Press, 1970

  13. [14]

    Theory of Linear and Integer Programming

    Alexander Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, Chichester, 1986

  14. [15]

    Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards

    Umer Siddique, Paul Weng, and Matthieu Zimmer. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 8905--8915. PMLR, 2020. URL https://proceedings.mlr.press/v...

  15. [16]

    Zur T heorie der G esellschaftsspiele

    John von Neumann. Zur T heorie der G esellschaftsspiele. Mathematische Annalen, 100: 0 295--320, 1928

  16. [17]

    Linear last-iterate convergence in constrained saddle-point optimization

    Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=dx11_7vm5_r

  17. [18]

    Fairness in reinforcement learning

    Paul Weng. Fairness in reinforcement learning. In AI for Social Good Workshop at the International Joint Conference on Artificial Intelligence (IJCAI), 2019

  18. [19]

    John A. Weymark. Generalized G ini inequality indices. Mathematical Social Sciences, 1 0 (4): 0 409--430, 1981

  19. [20]

    Two-player zero-sum games with bandit feedback

    Elif Y lmaz and Christos Dimitrakakis. Two-player zero-sum games with bandit feedback. In Eighteenth European Workshop on Reinforcement Learning, 2025. URL https://openreview.net/forum?id=UkxBgJk3li

  20. [21]

    T sallis- INF : An optimal algorithm for stochastic and adversarial bandits

    Julian Zimmert and Yevgeny Seldin. T sallis- INF : An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22 0 (28): 0 1--49, 2021. URL https://jmlr.org/papers/v22/19-753.html

  21. [22]

    The Thirty Eighth Annual Conference on Learning Theory , pages=

    Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback , author=. The Thirty Eighth Annual Conference on Learning Theory , pages=. 2025 , organization=

  22. [23]

    Eighteenth European Workshop on Reinforcement Learning , year=

    Two-Player Zero-Sum Games with Bandit Feedback , author=. Eighteenth European Workshop on Reinforcement Learning , year=

  23. [24]

    Ratliff , title =

    Arnab Maiti and Kevin Jamieson and Lillian J. Ratliff , title =. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2023 , publisher =

  24. [25]

    Journal of Machine Learning Research , volume =

    Julian Zimmert and Yevgeny Seldin , title =. Journal of Machine Learning Research , volume =. 2021 , url =

  25. [26]

    Bandit Algorithms , publisher =

    Tor Lattimore and Csaba Szepesv. Bandit Algorithms , publisher =

  26. [27]

    Mathematische Annalen , volume =

    John von Neumann , title =. Mathematische Annalen , volume =

  27. [28]

    Nash , title =

    John F. Nash , title =. Annals of Mathematics , volume =

  28. [29]

    Pacific Journal of Mathematics , volume =

    Maurice Sion , title =. Pacific Journal of Mathematics , volume =

  29. [30]

    Tsitsiklis , title =

    Dimitris Bertsimas and John N. Tsitsiklis , title =

  30. [31]

    Alexander Schrijver , title =

  31. [32]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson , title =

  32. [33]

    Tyrrell Rockafellar , title =

    R. Tyrrell Rockafellar , title =

  33. [34]

    Morgenstern and Aaron Roth , title =

    Matthew Joseph and Michael Kearns and Jamie H. Morgenstern and Aaron Roth , title =. Advances in Neural Information Processing Systems 29 (NeurIPS) , pages =. 2016 , url =

  34. [35]

    Narahari , title =

    Vishakha Patil and Ganesh Ghalme and Vineet Nair and Y. Narahari , title =. Journal of Machine Learning Research , volume =. 2021 , url =

  35. [36]

    Parkes , title =

    Yang Liu and Goran Radanovic and Christos Dimitrakakis and Debmalya Mandal and David C. Parkes , title =. Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) , year =

  36. [37]

    Proceedings of the 34th International Conference on Machine Learning (ICML) , series =

    Shahin Jabbari and Matthew Joseph and Michael Kearns and Jamie Morgenstern and Aaron Roth , title =. Proceedings of the 34th International Conference on Machine Learning (ICML) , series =. 2017 , publisher =

  37. [38]

    Proceedings of the 37th International Conference on Machine Learning (ICML) , series =

    Umer Siddique and Paul Weng and Matthieu Zimmer , title =. Proceedings of the 37th International Conference on Machine Learning (ICML) , series =. 2020 , publisher =

  38. [39]

    AI for Social Good Workshop at the International Joint Conference on Artificial Intelligence (IJCAI) , year =

    Paul Weng , title =. AI for Social Good Workshop at the International Joint Conference on Artificial Intelligence (IJCAI) , year =

  39. [40]

    Weymark , title =

    John A. Weymark , title =. Mathematical Social Sciences , volume =

  40. [41]

    International Conference on Learning Representations (ICLR) , year =

    Chen-Yu Wei and Chung-Wei Lee and Mengxiao Zhang and Haipeng Luo , title =. International Conference on Learning Representations (ICLR) , year =

  41. [42]

    Eitan Altman , title =

  42. [43]

    arXiv preprint arXiv:2003.02189 , year =

    Yonathan Efroni and Shie Mannor and Matteo Pirotta , title =. arXiv preprint arXiv:2003.02189 , year =

  43. [44]

    Journal of Machine Learning Research , volume =

    Mehrdad Mahdavi and Rong Jin and Tianbao Yang , title =. Journal of Machine Learning Research , volume =. 2012 , url =