pith. machine review for the scientific record.

arxiv: 2605.02063 · v1 · submitted 2026-05-03 · 💻 cs.MA · cs.AI · cs.LG

Recognition: 4 Lean theorem links

Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:45 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.LG
keywords: mixed-motive reinforcement learning · coopetition · multi-agent systems · benchmark platform · interdependence matrices · reward ablation · game-theoretic oracles · strategic alliances

The pith

Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Coopetition-Gym v1 as a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. It organizes twenty environments into four mechanism classes drawn from prior reports on interdependence and complementarity, trust and reputation, collective action and loyalty, and sequential interaction and reciprocity. Each environment includes a closed-form payoff structure and a calibrated interdependence matrix, while exposing a reward layer that switches between private, integrated, and cooperative modes. This separation enables reward-type ablation studies, and four environments are validated against historical coopetitive relationships such as the Renault-Nissan Alliance and the Apple iOS App Store. The platform also supplies standard Gymnasium and PettingZoo interfaces, 126 reference algorithms, and large released training and audit corpora from systematic experiments.
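Since the platform advertises standard interfaces, the intended interaction loop can be sketched. Below is a minimal usage sketch against the PettingZoo Parallel API that the paper says the platform exposes; the package name coopetition_gym, the make constructor, and the reward_mode keyword are assumptions for illustration (the environment ID SLCD-v0 is quoted from the paper's figures), not documented API.

```python
# Hypothetical usage sketch. The import path, make() signature, and
# reward_mode keyword are assumed; the reset/step loop follows the
# standard PettingZoo Parallel API the platform is said to expose.
import coopetition_gym  # assumed package name

env = coopetition_gym.make("SLCD-v0", reward_mode="integrated")  # private | integrated | cooperative

observations, infos = env.reset(seed=0)
while env.agents:
    # Continuous mixed-motive actions, sampled randomly for illustration.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```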

Core claim

Coopetition-Gym v1 comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports. Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric. The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies.

What carries the argument

The parameterized reward layer, configurable across private, integrated, and cooperative modes, which separates the underlying payoff structure from the observed reward signal and thereby enables reward-type ablation.
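A minimal sketch of what that separation could look like, assuming (as Figure 3 suggests) an interdependence matrix D with unit diagonal and asymmetric off-diagonal weights. The formulas for the integrated and cooperative modes below are one plausible reading of the three mode names, not the paper's definitions; only the 0.86 Sony-on-Samsung weight is documented, and the 0.70 entry is invented.

```python
import numpy as np

# Two-agent interdependence matrix shaped like Figure 3: unit diagonal,
# asymmetric off-diagonals. 0.86 is the documented Sony-on-Samsung
# weight; 0.70 is an invented illustrative value.
D = np.array([[1.00, 0.70],
              [0.86, 1.00]])

def rewards(payoffs: np.ndarray, D: np.ndarray, mode: str) -> np.ndarray:
    """Map one fixed payoff vector to per-agent rewards.

    The payoff structure never changes; only the observed reward signal
    does. These formulas are assumed readings of the three mode names,
    not the paper's equations.
    """
    if mode == "private":      # each agent sees only its own payoff
        return payoffs.copy()
    if mode == "integrated":   # own payoff plus D-weighted partner payoffs
        return D @ payoffs
    if mode == "cooperative":  # every agent sees the shared mean payoff
        return np.full_like(payoffs, payoffs.mean())
    raise ValueError(f"unknown reward mode: {mode}")

u = np.array([2.0, -0.5])      # illustrative per-agent payoffs
for mode in ("private", "integrated", "cooperative"):
    print(mode, rewards(u, D, mode))
```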

If this is right

  • Reward-type ablation studies become feasible while holding payoff structures fixed across all environments.
  • Learning algorithm performance can be compared directly with game-theoretic oracles in each of the twenty settings (a hedged sketch of this loop follows the list).
  • Simulated outcomes in four environments can be checked against historical data from documented alliances and platforms.
  • The released 25,708-run training corpus and 1,116-run behavioral audit corpus support reproducible analysis of mixed-motive dynamics.
  • The four mechanism classes supply a modular foundation for adding further classes of strategic interaction.
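The first two bullets describe a single experimental loop: hold the environment fixed, sweep the reward mode, and compare learned returns against the oracle. A hedged sketch, reusing the hypothetical constructor from above; oracle_return is likewise an assumed helper name, not documented API.

```python
# Reward-type ablation sketch: identical payoff structure, three reward
# modes, learned return compared to a game-theoretic oracle.
import coopetition_gym  # assumed package name

def mean_episodic_return(env, policy, episodes: int = 10) -> float:
    """Average total (all-agent) reward per episode under a given policy."""
    total = 0.0
    for ep in range(episodes):
        observations, infos = env.reset(seed=ep)
        while env.agents:
            actions = {a: policy(a, observations[a]) for a in env.agents}
            observations, rewards, terms, truncs, infos = env.step(actions)
            total += sum(rewards.values())
    return total / episodes

for mode in ("private", "integrated", "cooperative"):
    env = coopetition_gym.make("AppleAppStore-v0", reward_mode=mode)
    # Random placeholder policy; a real study would plug in a trained learner.
    policy = lambda agent, obs: env.action_space(agent).sample()
    learned = mean_episodic_return(env, policy)
    oracle = coopetition_gym.oracle_return("AppleAppStore-v0", mode)  # assumed helper
    print(f"{mode}: return {learned:.2f} vs oracle {oracle:.2f}")
```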

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The platform design may support identification of general conditions under which cooperative equilibria arise in continuous-action spaces without explicit communication.
  • The calibrated matrices could be applied to forecast behaviors in untested but structurally similar competitive partnerships outside the four historical cases.
  • The payoff-reward separation might be adopted in other multi-agent benchmarks to isolate effects of incentive alignment.
  • Dynamic variation of interdependence coefficients during episodes could be tested as a natural extension to study adaptation in changing alliances.

Load-bearing premise

The twenty environments and four mechanism classes, together with the calibrated interdependence matrices derived from the four prior technical reports, sufficiently represent the space of strategic coopetition and allow meaningful generalization beyond the specific cases.

What would settle it

Re-running the sixteen learning algorithms on the four validated environments and obtaining validation percentages below the lowest reported value of 81.7 percent, or constructing a new environment from an unrepresented mechanism class and finding that learned behaviors deviate systematically from predictions based on the interdependence matrices.

Figures

Figures reproduced from arXiv:2605.02063 by Eric Yu and Vik Pant.

Figure 1: Coverage of the twenty environments along the agent-count and episode-horizon axes.
Figure 2: Weakest-link property of the geometric-mean synergy. Holding all other agents at …
Figure 3: Interdependence matrix D for the Samsung-Sony LCD joint venture (SLCD-v0), calibrated from documented strategic dependencies in Pant and Yu [34]. Row i indexes the agent placing the weight; column j indexes the agent whose payoff is weighted. The diagonal entries Dii = 1 reflect that each agent fully values its own payoff. The off-diagonal entries are asymmetric: Sony places weight 0.86 on Samsung's payoff…
Figure 4: Trust dynamics under the asymmetric update of Eq. …
Figure 5: Bounded reciprocity response φ(x) = tanh(κx) for three values of the response-sensitivity parameter κ. All three curves saturate at ±1 (dashed horizontal asymptotes), so no extreme cooperation deviation can produce an unbounded reciprocity reaction. Larger κ produces sharper, step-like responses that discretize the reciprocal reaction; smaller κ produces a more gradual response that approaches a linear reg…
Figure 6: Two-layer separation of payoff and reward in …
Figure 7: Empirical distribution of the per-cell finite-fraction …
Figure 8: Mean episodic return by mechanism-class tier under integrated reward, contrasting the …
Figure 9: Paradigm-boundary crossover on AppleAppStore-v0. The gap (independent best − CTDE best, in percent) changes sign between the private reward configuration and the integrated and cooperative configurations: independent learning leads when each agent receives only its own payoff (+8.1%), and centralized training leads when partner payoffs enter the reward signal (−1.7% integrated, −3.5% cooperative). The hori…
Figure 10: Algorithm rank by mechanism class under integrated reward. Rows are algorithms …
Figure 11: Dij contribution to return by mechanism-class tier (Eq. 27). Boxes show interquartile range; whiskers extend to the 5th and 95th percentiles. TR-3 (collective action) is the most Dij-dependent tier (median 59.7%); TR-2 (trust) is the least dependent (median 24.2%). TR-1 and TR-4 occupy intermediate positions. Five algorithm-environment pairs on TR-2 (out of 287 total with defined returns) show negative c…
Figure 12: Per-algorithm median Dij contribution across the environments each algorithm was evaluated on. Twelve algorithms cluster between 34% and 53%, indicating that their learned policies depend substantially on reward mutuality. Four algorithms (MAPPO, IA2C, IPPO, SelfPlay_PPO) fall below 20%, converging to policies that are largely insensitive to whether the reward function incorporates partner payoffs. The wi…
Figure 13: Stylized illustration of the oracle-exceedance dynamics on …
Figure 14: Per-algorithm NaN-divergence rate on ApacheProject-v0 (n = 13 seeds per cell). The deterministic-policy-gradient family (MADDPG, MATD3, M3DDPG) diverges on every seed under integrated and cooperative reward, and converges predominantly under private reward. The other learners (MASAC, ISAC, LOLA) converge under all three reward modes. The contrast localizes the failure to a specific intersection of algorit…
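Of the captions above, Figure 5's is self-contained enough to reproduce: the reciprocity response is φ(x) = tanh(κx), so its saturation at ±1 can be checked in a few lines.

```python
import numpy as np

# Bounded reciprocity response from Figure 5: phi(x) = tanh(kappa * x).
# Saturation at +/-1 means no cooperation deviation, however extreme,
# can trigger an unbounded reciprocal reaction.
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for kappa in (0.5, 2.0, 10.0):  # gradual -> step-like, as in the caption
    print(f"kappa={kappa}:", np.round(np.tanh(kappa * x), 3))
```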
Original abstract

We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. It comprises 20 environments across four mechanism classes drawn from prior technical reports on interdependence/complementarity, trust/reputation, collective action/loyalty, and sequential interaction/reciprocity. Each environment has closed-form payoffs and calibrated interdependence matrices, with a parameterized reward layer offering private, integrated, and cooperative modes. The platform provides Gymnasium and PettingZoo interfaces, 126 reference algorithms (including 7 game-theoretic oracles), and releases a 25,708-run training corpus plus behavioral audit data, with four historical case validations reproducing outcomes at 98.3%, 81.7%, 86.7%, and 87.3% fidelity. It claims to be the first platform combining continuous-action mixed-motive environments, parameterized reward mutuality, calibrated coefficients, oracle baselines, and validated case studies.

Significance. If the platform construction and historical validations hold, this provides a valuable standardized benchmark for mixed-motive MARL research, enabling controlled ablation of reward structures and reproducible comparisons via oracles and baselines. Credit is due for the release of the full training corpus under CC-BY-4.0 with Croissant metadata, support for standard Gymnasium/PettingZoo interfaces, and the scale of the 25,708-run experimental study, which together facilitate community adoption and systematic study of coopetition beyond purely competitive or cooperative settings.

major comments (1)
  1. [Abstract] The reported reproduction accuracies (98.3%, 81.7%, 86.7%, 87.3%) for the four historical cases (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store) are presented without specifying the validation rubric, exact metrics, or error analysis. This detail is load-bearing for substantiating the 'validated case studies' component of the central novelty claim.
minor comments (2)
  1. A summary table enumerating all 20 environments by mechanism class, action space dimensionality, and source report would improve clarity and allow readers to quickly assess coverage of the coopetition space.
  2. The interdependence matrices are imported from the four prior reports; adding a brief appendix with their numerical values or a sensitivity check would enhance standalone reproducibility without altering the platform's integration focus.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of Coopetition-Gym v1's potential as a benchmark and for the constructive comment on the abstract. We address the single major comment below and will incorporate the requested clarification.

read point-by-point responses
  1. Referee: [Abstract] The reported reproduction accuracies (98.3%, 81.7%, 86.7%, 87.3%) for the four historical cases (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store) are presented without specifying the validation rubric, exact metrics, or error analysis. This detail is load-bearing for substantiating the 'validated case studies' component of the central novelty claim.

    Authors: We agree that the abstract would benefit from an explicit statement of the validation rubric to better support the novelty claim. The full manuscript details the rubric in the validation section as a fidelity metric that quantifies the percentage match between simulated agent behaviors (strategic choices, cooperation levels, and payoff outcomes) and the documented historical records for each case, with supporting error analysis via parameter sensitivity checks. In the revised version we will update the abstract sentence to read: 'Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3%, 81.7%, 86.7%, and 87.3% fidelity on a validation rubric that compares simulated strategic interactions and payoffs against historical records (detailed metrics and error analysis appear in the main text).' This addition directly addresses the concern while preserving the reported accuracies. revision: yes
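For readers trying to picture what such a rubric computes, here is one conceivable percentage-match construction. It is emphatically not the paper's metric; every item name and value below is an invented placeholder, and the sketch only illustrates the general shape of a "percentage match between simulated behaviors and historical records".

```python
import numpy as np

# One conceivable fidelity rubric (NOT the paper's): score each item as
# relative agreement between simulated and historical values, clip at
# zero, and average. All item names and values are invented placeholders.
def fidelity(simulated: dict, historical: dict) -> float:
    scores = []
    for key, hist in historical.items():
        sim = simulated[key]
        denom = max(abs(hist), 1e-9)  # guard against division by zero
        scores.append(max(0.0, 1.0 - abs(sim - hist) / denom))
    return 100.0 * float(np.mean(scores))

historical = {"cooperation_level": 0.72, "market_share": 0.50, "duration_years": 8.0}
simulated = {"cooperation_level": 0.70, "market_share": 0.53, "duration_years": 7.5}
print(f"fidelity: {fidelity(simulated, historical):.1f}%")
```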

Circularity Check

0 steps flagged

No significant circularity; the platform is an integration of externally grounded components.

full rationale

The manuscript presents a benchmark platform whose environments and matrices are drawn from four prior technical reports by the same authors, but the central contribution is the new Gymnasium/PettingZoo implementation, the three reward modes, the 126 reference algorithms, and the released 25,708-run corpus. No derivation chain, prediction, or first-principles result is claimed that reduces by construction to fitted parameters or self-citations inside the paper. The four historical validations (81.7–98.3% fidelity) are performed against external documented relationships, supplying independent grounding. The 'first to combine' statement follows directly from the enumerated feature list rather than from any self-referential equation or ansatz. Self-citations supply the foundational environments but do not bear the load of proving a new mathematical result; the platform itself is the novel artifact.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The platform rests on closed-form payoff structures and interdependence matrices taken from four prior technical reports; no new mathematical entities are introduced. The main free parameters are the calibrated matrix values for each environment.

free parameters (1)
  • interdependence matrices = derived per environment from prior reports
    Each of the twenty environments uses a calibrated interdependence matrix derived from the corresponding prior technical report.
axioms (1)
  • domain assumption: The four mechanism classes (interdependence, trust, collective action, reciprocity) adequately span strategic coopetition.
    Environments are organized into these four classes corresponding to the four referenced reports.

pith-pipeline@v0.9.0 · 5652 in / 1487 out tokens · 83435 ms · 2026-05-08T18:45:30.537704+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 7 canonical work pages · 1 internal anchor

  [1] J. P. Agapiou, A. S. Vezhnevets, E. A. Duéñez-Guzmán, J. Matyas, Y. Mao, P. Sunehag, R. Koster, U. Madhushani, K. Kopparapu, R. Comanescu, D. J. Strouse, M. B. Johanson, S. Singh, J. Haas, I. Mordatch, D. Mobbs, and J. Z. Leibo. Melting Pot 2.0. arXiv preprint arXiv:2211.13746, 2022.
  [2] R. Axelrod. The Evolution of Cooperation. Basic Books, 1984.
  [3] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.
  [4] M. Bengtsson and S. Kock. "Coopetition" in business networks — to cooperate and compete simultaneously. Industrial Marketing Management, 29(5):411–426, 2000.
  [5] R. B. Bouncken, J. Gast, S. Kraus, and M. Bogers. Coopetition: a systematic review, synthesis, and future research directions. Review of Managerial Science, 9(3):577–601, 2015.
  [6] A. M. Brandenburger and B. J. Nalebuff. Co-opetition. Currency Doubleday, 1996.
  [7] M. Carroll, R. Shah, M. K. Ho, T. L. Griffiths, S. A. Seshia, P. Abbeel, and A. Dragan. On the utility of learning about humans for human-AI coordination. In Advances in Neural Information Processing Systems 32, 2019.
  [8] B. Ellis, J. Cook, S. Moalla, M. Samvelyan, M. Sun, A. Mahajan, J. N. Foerster, and S. Whiteson. SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track, 2023.
  [9] E. Fehr and S. Gächter. Cooperation and punishment in public goods experiments. American Economic Review, 90(4):980–994, 2000.
  [10] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Proc. AAAI, 2018.
  [11] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. In Proc. AAMAS, 2018.
  [12] D. R. Gnyawali and B.-J. R. Park. Co-opetition between giants: Collaboration with competitors for technological innovation. Research Policy, 40(5):650–663, 2011.
  [13] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. ICML, 2018.
  [14] J. C. Harsanyi. Games with incomplete information played by "Bayesian" players, parts I–III. Management Science, 14(3,5,7):159–182, 320–334, 486–502, 1967.
  [15] J. A. Hartigan and P. M. Hartigan. The dip test of unimodality. The Annals of Statistics, 13(1):70–84, 1985.
  [16] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. AAAI, 2018.
  [17] G. Padula and G. B. Dagnino. Untangling the rise of coopetition: the intrusion of competition in a cooperative game structure. International Studies of Management & Organization, 37(2):32–52, 2007.
  [18] M. Lanctot, E. Lockhart, J.-B. Lespiau, V. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, D. Hennes, D. Morrill, P. Muller, T. Ewalds, R. Faulkner, J. Kramar, B. De Vylder, B. Saeta, J. Bradbury, D. Ding, S. Borgeaud, M. Lai, J. Schrittwieser, T. Anthony, E. Hughes, I. Danihelka, and J. Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
  [19] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proc. AAMAS, 2017.
  [20] E. Hughes, J. Z. Leibo, M. Phillips, K. Tuyls, E. Dueñez-Guzmán, A. García Castañeda, I. Dunning, T. Zhu, K. McKee, R. Koster, H. Roff, and T. Graepel. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems 31, 2018.
  [21] Y. Luo. A coopetition perspective of global competition. Journal of World Business, 42(2):129–144, 2007.
  [22] M. Bengtsson and S. Kock. Coopetition—Quo vadis? Past accomplishments and future challenges. Industrial Marketing Management, 43(2):180–188, 2014.
  [23] J. Dahl. Conceptualizing coopetition as a process: An outline of change in cooperative and competitive interactions. Industrial Marketing Management, 43(2):272–279, 2014.
  [24] P. Ritala, A. Golnam, and A. Wegmann. Coopetition-based business models: The case of Amazon.com. Industrial Marketing Management, 43(2):236–249, 2014.
  [25] M. A. Nowak and K. Sigmund. Evolution of indirect reciprocity. Nature, 437(7063):1291–1298, 2005.
  [26] W. Czakon and K. Czernek. The role of trust-building mechanisms in entering into network coopetition: The case of tourism networks in Poland. Industrial Marketing Management, 57:64–74, 2016.
  [27] A. A. Lado, N. G. Boyd, and S. C. Hanlon. Competition, cooperation, and the search for economic rents: A syncretic model. Academy of Management Review, 22(1):110–141, 1997.
  [28] J. Z. Leibo, E. A. Dueñez-Guzmán, A. Vezhnevets, J. P. Agapiou, P. Sunehag, R. Koster, J. Matyas, C. Beattie, I. Mordatch, and T. Graepel. Scalable evaluation of multi-agent reinforcement learning with Melting Pot. In Proc. ICML, 2021.
  [29] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30, 2017.
  [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  [31] J. F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.
  [32] M. A. Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.
  [33] E. Ostrom. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 1990.
  [34] V. Pant and E. Yu. Computational foundations for strategic coopetition: Formalizing interdependence and complementarity. arXiv preprint arXiv:2510.18802, 2025.
  [35] V. Pant and E. Yu. Computational foundations for strategic coopetition: Formalizing trust and reputation dynamics. arXiv preprint arXiv:2510.24909, 2025.
  [36] V. Pant and E. Yu. Computational foundations for strategic coopetition: Formalizing collective action and loyalty. arXiv preprint arXiv:2601.16237, 2026.
  [37] V. Pant and E. Yu. Computational foundations for strategic coopetition: Formalizing sequential interaction and reciprocity. arXiv preprint arXiv:2604.01240, 2026.
  [38] V. Pant and E. Yu. Coopetition-Gym v1: reproducibility package for the Coopetition-Gym benchmark. Software, version 1.0.0 (git tag v1.0.0), released under MIT license, 2026. Source: https://github.com/vikpant/strategic-coopetition. Archival deposit: persistent identifier to be minted via Zenodo–GitHub integration at the v1.0.0 release.
  [39] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Advances in Neural Information Processing Systems 34, Datasets and Benchmarks Track, 2021.
  [40] D. G. Rand and M. A. Nowak. Human cooperation. Trends in Cognitive Sciences, 17(8):413–425, 2013.
  [41] T. Rashid, M. Samvelyan, C. Schroeder de Witt, G. Farquhar, J. Foerster, and S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proc. ICML, 2018.
  [42] M. Samvelyan, T. Rashid, C. Schroeder de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson. The StarCraft multi-agent challenge. In Proc. AAMAS, 2019.
  [43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  [44] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
  [45] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.
  [46] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  [47] D. J. Strouse, K. McKee, M. Botvinick, E. Hughes, and R. Everett. Collaborating with humans without human data. In Advances in Neural Information Processing Systems 34, 2021.
  [48] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proc. AAMAS, 2018.
  [49] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4):e0172395, 2017.
  [50] J. K. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. Santos, R. Perez-Vicente, C. Horsch, C. Dieffendahl, N. L. Williams, Y. Lokesh, and P. Ravi. PettingZoo: Gym for multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 34, 2021.
  [51] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems 35, Datasets and Benchmarks Track, 2022.