arxiv: 2606.04359 · v1 · pith:6TJMW46Inew · submitted 2026-06-03 · 💻 cs.GT

Learning to cooperate with emergent reputation via multi-agent reinforcement learning

Pith reviewed 2026-06-28 04:20 UTC · model grok-4.3

classification 💻 cs.GT
keywords multi-agent reinforcement learning · reputation systems · cooperation · social dilemmas · emergent norms · donation game · coin game

The pith

COOPER jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COOPER, a distributed multi-agent reinforcement learning method that trains both how agents evaluate reputations and how they act on those evaluations using only the rewards from the shared environment. It addresses the challenge of delayed and noisy feedback that arises when reputation and behavior are tightly coupled by structuring specific learning modules and the information passed between them. Experiments on donation and coin games in grid worlds show the approach adapts to different pre-existing reputation systems and other agents. In self-play, both reputation norms and cooperative behavior arise together, and the results remain stable across varied social network structures.
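For readers unfamiliar with the donation game, the following is a minimal sketch of its standard one-shot form: a donor may pay a cost c so that the recipient gains a benefit b. The default values b = 0.5 and c = 0.3 are the ones quoted in the Figure 11 caption; the class and function names are hypothetical, and the paper's grid-world version may differ in its details.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DonationGame:
    """Textbook donation game (illustrative, not the paper's environment)."""
    b: float = 0.5  # benefit received by the recipient
    c: float = 0.3  # cost paid by a cooperating donor

    def payoffs(self, donor_cooperates: bool) -> Tuple[float, float]:
        """Return (donor_payoff, recipient_payoff) for a single donation decision."""
        if donor_cooperates:
            return (-self.c, self.b)
        return (0.0, 0.0)


def mutual_round(game: DonationGame, a_coop: bool, b_coop: bool) -> Tuple[float, float]:
    """Each player acts once as donor toward the other; return net payoffs (a, b)."""
    a_as_donor, b_as_recipient = game.payoffs(a_coop)
    b_as_donor, a_as_recipient = game.payoffs(b_coop)
    return (a_as_donor + a_as_recipient, b_as_donor + b_as_recipient)
```

With b > c > 0, defecting against a cooperator pays more than mutual cooperation, which in turn pays more than mutual defection. That ordering is what makes the game a social dilemma.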

Core claim

COOPER jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards. Drawing on the underlying mechanisms of reputation, the authors deliberately design COOPER's constituent modules and the data flows among them to overcome the latency and noise in the feedback signal caused by the deep entanglement between reputation and policy. Experiments demonstrate effective adaptation to various existing reputation systems and co-players, and reputation norms and cooperation co-emerge in self-play settings across diverse social network topologies.

What carries the argument

COOPER, a distributed multi-agent reinforcement learning method whose modules and data flows are structured to jointly optimize reputation assessment rules and policies from environmental rewards.
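The Figure 1 caption states that during rollouts the modules execute in the order ψ → π → ϕ, while optimization proceeds in the order ψ → ϕ → π. The sketch below illustrates only the rollout ordering. The three `toy_*` functions are hand-written placeholders invented for illustration; in the paper these modules are learned, and their real inputs and outputs are not specified in the material reviewed here.

```python
from typing import Callable, List, Tuple

# Module orders as stated in the Figure 1 caption.
ROLLOUT_ORDER = ("psi", "pi", "phi")
OPTIMIZATION_ORDER = ("psi", "phi", "pi")


def toy_psi(view: float, gossip: List[float]) -> float:
    """Placeholder gossip update: blend own view with the mean of received views."""
    if not gossip:
        return view
    return 0.5 * view + 0.5 * (sum(gossip) / len(gossip))


def toy_pi(view: float) -> bool:
    """Placeholder policy: cooperate when the co-player's reputation is at least 0.5."""
    return view >= 0.5


def toy_phi(view: float, coplayer_cooperated: bool) -> float:
    """Placeholder interaction update: move halfway toward 1 (cooperated) or 0 (defected)."""
    target = 1.0 if coplayer_cooperated else 0.0
    return 0.5 * view + 0.5 * target


def rollout_step(
    view: float,
    gossip: List[float],
    coplayer_cooperated: bool,
    psi: Callable[[float, List[float]], float] = toy_psi,
    pi: Callable[[float], bool] = toy_pi,
    phi: Callable[[float, bool], float] = toy_phi,
) -> Tuple[bool, float, Tuple[str, ...]]:
    """Run one step in rollout order and return (action, updated view, call trace)."""
    trace = []
    view = psi(view, gossip)             # 1. update reputation from gossip
    trace.append("psi")
    action = pi(view)                    # 2. act on the updated reputation
    trace.append("pi")
    view = phi(view, coplayer_cooperated)  # 3. revise reputation after interacting
    trace.append("phi")
    return action, view, tuple(trace)
```

The point of the sketch is the sequencing: the policy sees the gossip-updated reputation, and the interaction-based update runs only after the action has been taken.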

If this is right

  • COOPER adapts to various existing reputation systems and different co-players in the donation game and the coin game.
  • Reputation norms and cooperation co-emerge in self-play settings.
  • The results remain robust across diverse social network topologies.
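The figure captions describe scale-free networks with n = 10 and m = 2. One common way to generate such a graph is Barabási–Albert preferential attachment, sketched below using only the standard library. Whether the paper uses this exact generator, and whether its m is the attachment parameter assumed here, is not confirmed by the source; the function name and the star-shaped seed graph are illustrative choices.

```python
import random
from typing import Dict, Set


def barabasi_albert(n: int, m: int, seed: int = 0) -> Dict[int, Set[int]]:
    """Build an undirected preferential-attachment graph as an adjacency dict.

    Assumption: each new node attaches to m distinct existing nodes, chosen
    with probability proportional to their current degree.
    """
    if not 1 <= m < n:
        raise ValueError("require 1 <= m < n")
    rng = random.Random(seed)
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}

    # Seed graph: a star on nodes 0..m, so every initial node has degree >= 1.
    for i in range(1, m + 1):
        adj[0].add(i)
        adj[i].add(0)

    # Each node appears once per incident edge, so a uniform draw from this
    # list is a degree-proportional draw over nodes.
    stubs = [u for u in adj for _ in adj[u]]

    for new in range(m + 1, n):
        targets: Set[int] = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj
```

Calling `barabasi_albert(10, 2)` yields a connected 10-node graph in which early nodes tend to become hubs and late nodes tend to remain leaves, the distinction that the Figure 8 caption draws between hub and leaf agents.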

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The joint learning structure could be applied to test whether similar emergence occurs in non-grid environments with continuous actions.
  • The observed co-emergence of norms suggests the method might reveal how reputation mechanisms scale with population size.
  • Extensions could examine whether the same module design supports cooperation when agents have heterogeneous perception capabilities.

Load-bearing premise

The deliberately designed modules and data flows in COOPER are sufficient to overcome the latency and noise arising from the entanglement between reputation and policy, enabling stable joint learning from environment rewards alone.

What would settle it

An experiment that increases the entanglement or noise between reputation signals and actions, and in which COOPER fails to produce stable cooperation or accurate reputation assessments.

Figures

Figures reproduced from arXiv: 2606.04359 by Dengji Zhao, Xinwei Song, Xue Feng, Yizhe Huang.

Figure 1: An overview of our method. COOPER agents promote cooperative behavior in multi-agent reinforcement learning by jointly learning a reputation-based policy π and a reputation assignment module, which separately processes gossip-based reputation assignment (ψ) and interaction-based reputation assignment (ϕ). During rollouts, the execution order is ψ → π → ϕ, and in optimization, the order is ψ → ϕ → π to facilit…
Figure 2: COOPER achieves a high cooperation ratio and rewards compared to baselines when …
Figure 3: Scale-free network. Here, a COOPER agent is placed into a population of ALLD-RA agents with a threshold of 0.5. The agents are embedded in a scale-free network with size n = 10, neighbor number m = 2. An example is shown in …
Figure 4: COOPER achieves a high reward compared to baselines and stimulates cooperation in …
Figure 5: COOPER identifies different co-players and achieves a high reward compared to baselines.
Figure 6: COOPER achieves high cooperation ratio in self-play. (a) shows the performance of …
Figure 7: Norm example. In a fully connected network with n = 10, all agents converge to the same reputation norm, whereas in a scale-free network with population size n = 10 and average neighbor m = 2 shown in …
Figure 8: Hub agent and leaf agent in the scale-free network learn different patterns. (c) presents the …
Figure 9: Ablation study. In this section, we conduct an ablation study in a 10-agent donation game self-play setting on a scale-free network with m = 2. Specifically, we evaluate two ablated versions of COOPER: 1) COOPER without ψ, which lacks the gossip-based reputation assessment and relies solely on interaction experiences, and 2) COOPER without ϕ, which removes the interaction-based assessment module and depen…
Figure 10: Examples of small-world, scale-free, and fully connected networks.
Figure 11: Self-play in donation game b = 0.5, c = 0.3 on various networks with network size n = 10. Following the standard MARL convention, self-play refers to a setting where all agents are COOPER agents, and they jointly learn from scratch without any pre-defined reputation rules or external supervision. We conduct self-play experiments in the donation game, with b = 0.5, c = 0.3, on three classic network struct…
Figure 12: Self-play Cooperation Ratio with 10, 30, 60 agents
Figure 13: Cooperation Level Difference Caused by Initialization
Figure 14: Norm learned in a 10-agent donation game self-play setting. The social network is fully …
Figure 15: Reputation update pattern of hub and leaf agent in scale-free network. Let the co …
Figure 16: Ablation study in different network structures. Both …
Figure 17: COOPER outperforms the opponent shaping baselines in 10 agents self-play.
Figure 18: Strategic gossip hinders cooperation in 10 agents self-play Donation Game.
Original abstract

Reputation, the aggregation of peer assessments diffused through social networks, is a pivotal mechanism for promoting cooperation in social dilemmas ubiquitous to distributed multi-agent systems comprising agents with limited perception and cognitive capabilities. Exploring efficient reputation systems, comprising reputation assessment rules and reputation-based policies, is a long-standing challenge. Previous work assumes predefined reputation assessment rules or models reputation as an intrinsic reward to learn policies, compromising the methods' ability for generalization and adaptation. To address this, we propose a distributed multi-agent reinforcement learning method COOPER (COOPeration with Emergent Reputation), which jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards. Notably, leveraging the underlying mechanisms of reputation, we deliberately design the constituent modules of COOPER and the data flows among them, overcoming the latency and noise in the feedback signal, caused by the deep entanglement between reputation and policy. Experiments on the donation game and the coin game in grid world environments demonstrate that COOPER effectively adapts to various existing reputation systems and co-players. Furthermore, we observe the co-emergence of reputation norms and cooperation in self-play settings. These results hold robustly across diverse social network topologies, underscoring the generalizability and efficacy of our approach.
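To make concrete what a "predefined reputation assessment rule" looks like, here is a sketch of image scoring, a classic hand-written rule from the indirect-reciprocity literature. It is offered as a contrast to COOPER's learned rules, not as part of the paper's method. The function names are hypothetical, and the score bounds and the threshold k are illustrative defaults.

```python
def image_score_update(score: int, cooperated: bool, lo: int = -5, hi: int = 5) -> int:
    """Fixed assessment rule: +1 for helping, -1 for refusing, clipped to [lo, hi]."""
    step = 1 if cooperated else -1
    return max(lo, min(hi, score + step))


def discriminator_policy(recipient_score: int, k: int = 0) -> bool:
    """Fixed action rule: help only recipients whose score is at least k."""
    return recipient_score >= k
```

Both functions are fixed in advance by the designer. In COOPER, by contrast, the abstract states that the counterparts of these two rules are learned jointly from environment rewards.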

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes COOPER, a distributed multi-agent reinforcement learning method that jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards in social dilemma games (donation game and coin game in grid worlds). It claims that deliberately designed constituent modules and data flows overcome the latency and noise arising from the deep entanglement between reputation and policy, enabling adaptation to existing reputation systems and co-players as well as the emergence of norms in self-play across network topologies.

Significance. If the central claim holds with proper validation, the work would demonstrate end-to-end learning of reputation mechanisms in MARL without predefined assessment rules or intrinsic rewards, providing an empirical route to emergent cooperation in distributed systems. The emphasis on handling feedback entanglement via architecture is a potential contribution to credit assignment in multi-agent settings with delayed social signals.

major comments (3)
  1. [Experiments] Experimental results (described in the abstract and implied results section): no quantitative metrics, error bars, statistical significance tests, or learning curves are reported for the donation and coin games, leaving the claim of effective adaptation and robust cooperation across topologies unsupported in detail.
  2. [Method] Method section (description of COOPER modules and data flows): the central claim that the custom modules and flows overcome entanglement-induced latency/noise requires evidence that they, rather than base RL components or reward structure, drive success; no ablation replacing designed flows with standard MARL credit assignment is provided, making attribution load-bearing and unverified.
  3. [Experiments] Self-play experiments: the observation of co-emergence of reputation norms and cooperation is presented without controls for alternative explanations (e.g., network topology alone or base algorithm), which is necessary to substantiate that the joint learning mechanism is responsible.
minor comments (2)
  1. Notation for reputation assessment rules and policy components could be clarified with explicit equations or pseudocode to distinguish learned components from environment signals.
  2. The abstract states that results hold 'robustly' across topologies but provides no table or figure summarizing performance variation by topology; adding one would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, indicating where revisions will be made to strengthen the paper.

  1. Referee: [Experiments] Experimental results (described in the abstract and implied results section): no quantitative metrics, error bars, statistical significance tests, or learning curves are reported for the donation and coin games, leaving the claim of effective adaptation and robust cooperation across topologies unsupported in detail.

    Authors: We acknowledge the validity of this observation. The current manuscript presents the experimental outcomes in a primarily qualitative manner. In the revised version, we will include quantitative metrics such as mean cooperation rates, standard deviations as error bars from multiple independent runs, full learning curves, and appropriate statistical tests to provide rigorous support for the claims. revision: yes

  2. Referee: [Method] Method section (description of COOPER modules and data flows): the central claim that the custom modules and flows overcome entanglement-induced latency/noise requires evidence that they, rather than base RL components or reward structure, drive success; no ablation replacing designed flows with standard MARL credit assignment is provided, making attribution load-bearing and unverified.

    Authors: The referee raises an important point regarding the need to verify the contribution of our designed modules. We agree that ablations are essential. We will add ablation studies in the revision, comparing COOPER against variants using standard MARL credit assignment techniques to demonstrate that our custom data flows are key to handling the entanglement. revision: yes

  3. Referee: [Experiments] Self-play experiments: the observation of co-emergence of reputation norms and cooperation is presented without controls for alternative explanations (e.g., network topology alone or base algorithm), which is necessary to substantiate that the joint learning mechanism is responsible.

    Authors: We concur that additional controls would strengthen the attribution to the joint learning mechanism. We will incorporate control experiments in the revised manuscript, including runs with fixed topologies without our method and base algorithms without joint reputation-policy learning, to isolate the effects. revision: yes
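The quantitative reporting promised in the first response (mean cooperation rates with standard deviations across independent runs) could be computed along the following lines. This is a minimal sketch under the assumption that each run is logged as a sequence of boolean cooperate/defect decisions; the function names and the log format are hypothetical, not taken from the paper.

```python
import statistics
from typing import List, Sequence, Tuple


def cooperation_ratio(actions: Sequence[bool]) -> float:
    """Fraction of logged decisions in one run that were cooperative."""
    if not actions:
        raise ValueError("a run must contain at least one action")
    return sum(actions) / len(actions)


def summarize_runs(runs: List[Sequence[bool]]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of cooperation ratio across runs."""
    ratios = [cooperation_ratio(run) for run in runs]
    mean = statistics.mean(ratios)
    # The sample standard deviation needs at least two runs; report 0.0 otherwise.
    std = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    return mean, std
```

The returned standard deviation can be drawn directly as an error bar around the mean, one pair per method and per network topology.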

Circularity Check

0 steps flagged

No circularity: joint learning from external rewards is self-contained


The paper presents COOPER as a distributed MARL method that learns both reputation assessment rules and policies directly from environment rewards, with the architecture deliberately designed to handle feedback latency and noise. No equations, fitted parameters, or self-citations are shown reducing any claimed result to its own inputs by construction. The derivation chain consists of standard RL updates plus custom modules whose contribution is asserted via design and experiments rather than tautological redefinition or self-referential prediction. This is the normal case of an empirical method whose validity rests on external reward signals and observed outcomes, not internal re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard MARL components and the assumption that environment rewards suffice for joint learning.

pith-pipeline@v0.9.1-grok · 5765 in / 1123 out tokens · 30751 ms · 2026-06-28T04:20:52.684650+00:00 · methodology

