pith. machine review for the scientific record. sign in

arxiv: 2604.08103 · v1 · submitted 2026-04-09 · ⚛️ physics.comp-ph

Recognition: 3 theorem links

· Lean Theorem

Reinforcement learning with reputation-based adaptive exploration promotes the evolution of cooperation

An Li, Chaoqian Wang, Hongwei Zheng, Longzhao Liu, Shaoting Tang, Wenqiang Zhu, Xin Wang, Yishen Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3

classification ⚛️ physics.comp-ph
keywords reinforcement learningQ-learningreputationcooperationevolutionary gamesadaptive explorationmulti-agent systems
0
0 comments X

The pith

Coupling exploration to local reputation in Q-learning promotes cooperation in evolutionary games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a Q-learning model in which agents adapt their exploration rates according to differences in local reputation and apply asymmetric reputation updates that depend on current status. Simulations show that the reputation-coupled exploration and the asymmetric updates each increase cooperation when introduced alone, and that combining them produces a stronger effect. The joint mechanism works by lowering exploration for high-reputation agents while raising it for low-reputation agents, and by magnifying the reputation payoff from cooperation at low status while magnifying the penalty from defection at high status. Readers would care because the model illustrates how social evaluation can steer individual learning toward collective cooperation without requiring fixed rules or external enforcement.

Core claim

Each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making high reputation agents explore less and low reputation agents explore more, while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status.

What carries the argument

Q-learning with exploration rates tied to local reputation differences together with asymmetric state-dependent reputation updates.

If this is right

  • Cooperation levels rise further when the two mechanisms operate together than when either is used alone.
  • High-reputation agents become more likely to exploit known cooperative strategies while low-reputation agents continue to sample alternatives.
  • Reputation payoffs become larger for cooperation when an agent has low status and larger for defection when an agent has high status.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coupling could be tested in settings where reputation is observed with noise or delay.
  • The approach suggests a route for designing multi-agent systems in which status signals naturally reduce wasteful exploration once good strategies are found.
  • Real-world reputation platforms might be examined to see whether they produce analogous exploration patterns among users.

Load-bearing premise

Agents can accurately perceive and respond to differences in local reputation when they adjust their exploration rates.

What would settle it

Simulations that keep exploration fixed and use symmetric reputation updates would show no comparable rise in cooperation levels.

Figures

Figures reproduced from arXiv: 2604.08103 by An Li, Chaoqian Wang, Hongwei Zheng, Longzhao Liu, Shaoting Tang, Wenqiang Zhu, Xin Wang, Yishen Jiang.

Figure 1
Figure 1. Figure 1: FIG. 1. Adaptive exploration and asymmetric reputation updating independently and directionally reshape the evolution [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: FIG. 2. Synergistic effect between adaptive exploration and asymmetric reputation. (a) Heat map of the fraction of cooperation [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: FIG. 4. Reputation concern governs the cooperation regime. (a) Bar chart of the fraction of cooperation [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: FIG. 5. Spatiotemporal evolution of strategy and reputation for different reputation concern. Snapshots of strategy (top row in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: FIG. 6. Impact of baseline exploration rate. We show the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Multi-agent reinforcement learning serves as an effective tool for studying strategy adaptation in evolutionary games. Although prior work has integrated Q-learning with reputation mechanisms to promote cooperation, most existing algorithms adopt fixed exploration rates and overlook the influence of social context on exploratory behavior. In practice, individuals may adjust their willingness to explore based on their reputation and perceived social standing. To address this, we propose a Q-learning model that couples exploration rates with local reputation differences and incorporates asymmetric, state-dependent reputation updates. Our results show that each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making ``high reputation--low exploration, low reputation--high exploration'', while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status. This study thus offers insights into how social evaluation can shape learning behavior in complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Q-learning model in multi-agent reinforcement learning for evolutionary games. It couples exploration rates to local reputation differences and incorporates asymmetric, state-dependent reputation updates. Simulations are claimed to show that each mechanism independently promotes cooperation and that their combination produces a reinforcing effect through the mapping of high reputation to low exploration (and vice versa), together with status-dependent amplification of cooperative gains at low status and defection penalties at high status.

Significance. If the simulation results are robust, the work would provide a concrete demonstration of how social-evaluation mechanisms can shape exploratory behavior in RL agents and thereby influence the evolution of cooperation. The combination of adaptive exploration and asymmetric reputation updates is a novel modeling choice that could inform both evolutionary game theory and multi-agent RL design.

major comments (2)
  1. [Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.
  2. [Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.
minor comments (1)
  1. [Model] Clarify the precise evolutionary game (e.g., Prisoner's Dilemma parameters) and the exact functional form of the reputation-update rule in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's potential significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to strengthen the presentation and robustness of the results.

read point-by-point responses
  1. Referee: [Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.

    Authors: We agree that the headline claim of a reinforcing effect would be more general if shown to be robust to the precise form of the reputation update. The asymmetry in our model is motivated by the social intuition that reputation gains from cooperation are more salient when an agent has low status, while defection penalties are amplified at high status. Nevertheless, to address the concern that the synergy might be an artifact of this choice, we have performed additional simulations with both symmetric updates and reversed asymmetry. These results will be added to the revised manuscript (new figure and accompanying text) to demonstrate that the reinforcing interaction between adaptive exploration and reputation persists, albeit with quantitative differences in the level of cooperation achieved. revision: yes

  2. Referee: [Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.

    Authors: We thank the referee for noting this presentational gap. Although the simulation protocol is described in the main text, we acknowledge that the abstract and the opening of the Methods section did not list the parameters explicitly. In the revised manuscript we have added a dedicated parameter table (population size N=1000, Prisoner's Dilemma payoffs with benefit-to-cost ratio b/c=1.5, learning rate α=0.1, discount factor γ=0.9, 50 independent runs per condition) and have included error bars together with two-sided t-tests or Wilcoxon tests on all key comparisons. Baseline results for standard Q-learning with fixed ε-greedy exploration are already present but are now referenced more explicitly in the main figures. revision: yes

Circularity Check

0 steps flagged

No circularity; simulation outcomes are independent of any self-referential derivation.

full rationale

The paper introduces a Q-learning agent model that couples exploration rates to local reputation differences and applies asymmetric state-dependent reputation updates, then reports cooperation levels from multi-agent simulations. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the reported reinforcing effect is an observed numerical outcome under the stated rules rather than a tautological restatement of the model definition itself. The derivation chain consists of standard RL updates plus explicitly chosen functional forms for reputation and exploration, none of which are justified solely by prior work from the same authors or by re-labeling known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities are identifiable. The model likely relies on standard Q-learning hyperparameters and game payoff matrices, but these are not detailed here.

pith-pipeline@v0.9.0 · 5461 in / 1174 out tokens · 47609 ms · 2026-05-10T17:56:32.945184+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

69 extracted references · 4 canonical work pages

  1. [1]

    D. G. Rand and M. A. Nowak, Human cooperation, Trends in Cognitive Sciences17, 413 (2013)

  2. [2]

    Axelrod and W

    R. Axelrod and W. D. Hamilton, The evolution of coop- eration, Science211, 1390 (1981)

  3. [3]

    Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

    K. Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

  4. [4]

    P. A. Van Lange,Social dilemmas: Understanding human cooperation(OUP USA, 2014)

  5. [5]

    Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

    E. Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

  6. [6]

    J. M. Smith and G. R. Price, The logic of animal conflict, Nature246, 15 (1973). 11

  7. [7]

    P. D. Taylor and L. B. Jonker, Evolutionary stable strate- gies and game dynamics, Mathematical Biosciences40, 145 (1978)

  8. [8]

    Ohtsuki, C

    H. Ohtsuki, C. Hauert, E. Lieberman, and M. A. Nowak, A simple rule for the evolution of cooperation on graphs and social networks, Nature441, 502 (2006)

  9. [9]

    Perc and A

    M. Perc and A. Szolnoki, Coevolutionary games—a mini review, BioSystems99, 109 (2010)

  10. [10]

    M. Perc, J. J. Jordan, D. G. Rand, Z. Wang, S. Boc- caletti, and A. Szolnoki, Statistical physics of human co- operation, Physics Reports687, 1 (2017)

  11. [11]

    C. Wang, M. Perc, and A. Szolnoki, Evolutionary dynam- ics of any multiplayer game on regular graphs, Nature Communications15, 5349 (2024)

  12. [12]

    Wang and A

    C. Wang and A. Szolnoki, Evolution of cooperation un- der a generalized death-birth process, Physical Review E 107, 024303 (2023)

  13. [13]

    Wang and A

    C. Wang and A. Szolnoki, Inertia in spatial public goods games under weak selection, Applied Mathematics and Computation449, 127941 (2023)

  14. [14]

    C. Wang, W. Zhu, and A. Szolnoki, The conflict between self-interaction and updating passivity in the evolution of cooperation, Chaos, Solitons & Fractals173, 113667 (2023)

  15. [15]

    C. Wang, W. Zhu, and A. Szolnoki, When greediness and self-confidence meet in a social dilemma, Physica A625, 129033 (2023)

  16. [16]

    Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

    R. Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

  17. [17]

    Szab´ o and C

    G. Szab´ o and C. T˝ oke, Evolutionary prisoner’s dilemma game on a square lattice, Physical Review E58, 69 (1998)

  18. [18]

    M. A. Nowak,Evolutionary dynamics: exploring the equations of life(Harvard University Press, 2006)

  19. [19]

    Sigmund, C

    K. Sigmund, C. Hauert, and M. A. Nowak, Reward and punishment, Proceedings of the National Academy of Sci- ences98, 10757 (2001)

  20. [20]

    Szolnoki and M

    A. Szolnoki and M. Perc, Reward and cooperation in the spatial public goods game, Europhysics Letters92, 38003 (2010)

  21. [21]

    Szolnoki, G

    A. Szolnoki, G. Szab´ o, and M. Perc, Phase diagrams for the spatial public goods game with pool punishment, Physical Review E83, 036101 (2011)

  22. [22]

    W. Zhu, Q. Pan, S. Song, and M. He, Effects of exposure- based reward and punishment on the evolution of coop- eration in prisoner’s dilemma game, Chaos, Solitons & Fractals172, 113519 (2023)

  23. [23]

    T. A. Han, M. H. Duong, and M. Perc, Evolutionary mechanisms that promote cooperation may not promote social welfare, Journal of the Royal Society Interface21, 20240547 (2024)

  24. [24]

    L. Zhou, B. Wu, J. Du, and L. Wang, Aspiration dynam- ics generate robust predictions in heterogeneous popula- tions, Nature Communications12, 3250 (2021)

  25. [25]

    F. Chen, L. Zhou, and L. Wang, Cooperation among un- equal players with aspiration-driven learning, Journal of the Royal Society Interface21, 20230723 (2024)

  26. [26]

    J. S. Weitz, C. Eksin, K. Paarporn, S. P. Brown, and W. C. Ratcliff, An oscillating tragedy of the commons in replicator dynamics with game-environment feedback, Proceedings of the National Academy of Sciences113, E7518 (2016)

  27. [27]

    A. R. Tilman, J. B. Plotkin, and E. Ak¸ cay, Evolutionary games with environmental feedbacks, Nature communi- cations11, 915 (2020)

  28. [28]

    Wang and F

    X. Wang and F. Fu, Eco-evolutionary dynamics with en- vironmental feedback: Cooperation in a changing world, Europhysics Letters132, 10001 (2020)

  29. [29]

    F. Fu, C. Hauert, M. A. Nowak, and L. Wang, Reputation-based partner choice promotes cooperation in social networks, Physical Review E78, 026117 (2008)

  30. [30]

    F. P. Santos, F. C. Santos, and J. M. Pacheco, Social norm complexity and past reputations in the evolution of cooperation, Nature555, 242 (2018)

  31. [31]

    C. Xia, J. Wang, M. Perc, and Z. Wang, Reputation and reciprocity, Physics of Life Reviews46, 8 (2023)

  32. [32]

    Wang and C

    J. Wang and C. Xia, Reputation evaluation and its im- pact on the human cooperation—a recent survey, Euro- physics Letters141, 21001 (2023)

  33. [33]

    Ohtsuki and Y

    H. Ohtsuki and Y. Iwasa, How should we define good- ness?—reputation dynamics in indirect reciprocity, Jour- nal of Theoretical Biology231, 107 (2004)

  34. [34]

    Ohtsuki and Y

    H. Ohtsuki and Y. Iwasa, The leading eight: social norms that can maintain cooperation by indirect reciprocity, Journal of theoretical biology239, 435 (2006)

  35. [35]

    Hilbe, L

    C. Hilbe, L. Schmid, J. Tkadlec, K. Chatterjee, and M. A. Nowak, Indirect reciprocity with private, noisy, and incomplete information, Proceedings of the National Academy of Sciences115, 12241 (2018)

  36. [36]

    M. Wei, X. Wang, L. Liu, H. Zheng, Y. Jiang, Y. Hao, Z. Zheng, F. Fu, and S. Tang, Indirect reciprocity in the public goods game with collective reputations, Journal of the Royal Society Interface22, 20240827 (2025)

  37. [37]

    M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity by image scoring, Nature393, 573 (1998)

  38. [38]

    M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity, Nature437, 1291 (2005)

  39. [39]

    W. Zhu, X. Wang, C. Wang, L. Liu, H. Zheng, and S. Tang, Reputation-based synergy and discount- ing mechanism promotes cooperation, New Journal of Physics26, 033046 (2024)

  40. [40]

    J. J. Skowronski and D. E. Carlston, Negativity and ex- tremity biases in impression formation: A review of ex- planations, Psychological Bulletin105, 131 (1989)

  41. [41]

    S. T. Fiske,Social beings: Core motives in social psychol- ogy(John Wiley & Sons, 2018)

  42. [42]

    R. F. Baumeister, E. Bratslavsky, C. Finkenauer, and K. D. Vohs, Bad is stronger than good, Review of general psychology5, 323 (2001)

  43. [43]

    I. S. Lim and N. Masuda, To trust or not to trust: Evolu- tionary dynamics of an asymmetric n-player trust game, IEEE Transactions on Evolutionary Computation28, 117 (2023)

  44. [44]

    A. R. Fragale, B. Rosen, C. Xu, and I. Merideth, The higher they are, the harder they fall: The effects of wrongdoer status on observer punishment recommenda- tions and intentionality attributions, Organizational Be- havior and Human Decision Processes108, 53 (2009)

  45. [45]

    Y. Dong, S. Sun, C. Xia, and M. Perc, Second-order rep- utation promotes cooperation in the spatial prisoner’s dilemma game, IEEE Access7, 82532 (2019)

  46. [46]

    Q. Chen, X. Peng, H. Kang, Y. Shen, and X. Sun, The impact of historical-behavior-based asymmetric reputa- tion and deposit mechanisms on the evolutionary spatial public goods game, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0293944 (2025)

  47. [47]

    Koster, M

    R. Koster, M. Pˆ ıslar, A. Tacchetti, J. Balaguer, L. Liu, R. Elie, O. P. Hauser, K. Tuyls, M. Botvinick, and 12 C. Summerfield, Deep reinforcement learning can pro- mote sustainable human behaviour in a common-pool resource problem, Nature Communications16, 2824 (2025)

  48. [48]

    K. R. McKee, A. Tacchetti, M. A. Bakker, J. Balaguer, L. Campbell-Gillingham, R. Everett, and M. Botvinick, Scaffolding cooperation in human groups with deep re- inforcement learning, Nature Human Behaviour7, 1787 (2023)

  49. [49]

    L. Wang, D. Jia, L. Zhang, P. Zhu, M. Perc, L. Shi, and Z. Wang, L´ evy noise promotes cooperation in the pris- oner’s dilemma game with reinforcement learning, Non- linear Dynamics108, 1837 (2022)

  50. [50]

    L. Fan, Z. Song, L. Wang, Y. Liu, and Z. Wang, Incorpo- rating social payoff into reinforcement learning promotes cooperation, Chaos: An Interdisciplinary Journal of Non- linear Science32, 10.1063/5.0093996 (2022)

  51. [51]

    Y. Geng, Y. Liu, Y. Lu, C. Shen, and L. Shi, Re- inforcement learning explains various conditional coop- eration, Applied Mathematics and Computation427, 127182 (2022)

  52. [52]

    Y. Xu, J. Wang, J. Chen, D. Zhao, M. ¨Ozer, C. Xia, and M. Perc, Reinforcement learning and collective coopera- tion on higher-order networks, Knowledge-Based Systems 301, 112326 (2024)

  53. [53]

    Mintz and F

    B. Mintz and F. Fu, Evolutionary multi-agent rein- forcement learning in group social dilemmas, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0246332 (2025)

  54. [54]

    Xie and A

    K. Xie and A. Szolnoki, Reinforcement learning in evo- lutionary game theory: A brief review of recent devel- opments, Applied Mathematics and Computation510, 129685 (2026)

  55. [55]

    Hou, Y.-S

    Y. Hou, Y.-S. Ong, L. Feng, and J. M. Zurada, An evo- lutionary transfer reinforcement learning framework for multiagent systems, IEEE Transactions on Evolutionary Computation21, 601 (2017)

  56. [56]

    Zou and C

    K. Zou and C. Huang, Incorporating reputation into re- inforcement learning can promote cooperation on hyper- graphs, Chaos, Solitons & Fractals186, 115203 (2024)

  57. [57]

    Ren and X.-J

    T. Ren and X.-J. Zeng, Reputation-based interaction promotes cooperation with reinforcement learning, IEEE Transactions on Evolutionary Computation28, 1177 (2023)

  58. [58]

    Xie and A

    K. Xie and A. Szolnoki, Reputation in public goods coop- eration under double q-learning protocol, Chaos, Solitons & Fractals196, 116398 (2025)

  59. [59]

    T. Ren, X. Yao, Y. Li, and X.-J. Zeng, Bottom-up reputation promotes cooperation with multi-agent re- inforcement learning, arXiv preprint arXiv:2502.01971 10.48550/arXiv.2502.01971 (2025)

  60. [60]

    Y. Zhu, B. Xing, and C. Xia, Q-learning update with second-order reputation promotes the evolution of trust within structured populations, Chaos, Solitons & Frac- tals199, 116653 (2025)

  61. [61]

    Zhang and X

    Q. Zhang and X. Zhang, Q-learning driven cooperative evolution with dual-reputation incentive mechanisms, Applied Mathematics and Computation507, 129590 (2025)

  62. [62]

    C. J. Watkins and P. Dayan, Q-learning, Machine Learn- ing8, 279 (1992)

  63. [63]

    R. S. Sutton, A. G. Barto,et al.,Reinforcement learn- ing: an introduction, 2nd edn. Adaptive computation and machine learning, Vol. 1 (MIT press Cambridge, 2018)

  64. [64]

    Tokic and G

    M. Tokic and G. Palm, Value-difference based explo- ration: adaptive control between epsilon-greedy and softmax, inAnnual conference on artificial intelligence (Springer, 2011) pp. 335–346

  65. [65]

    S. Shen, X. Zhang, A. Xu, and T. Duan, An adaptive exploration mechanism for q-learning in spatial public goods games, Chaos, Solitons & Fractals189, 115705 (2024)

  66. [66]

    Milinski, D

    M. Milinski, D. Semmann, and H.-J. Krambeck, Repu- tation helps solve the ‘tragedy of the commons’, Nature 415, 424 (2002)

  67. [67]

    Fudenberg and D

    D. Fudenberg and D. K. Levine, Maintaining a reputation when strategies are imperfectly observed, The Review of Economic Studies59, 561 (1992)

  68. [68]

    M. A. Nowak and R. M. May, Evolutionary games and spatial chaos, nature359, 826 (1992)

  69. [69]

    W. Zhu, Q. Pan, and M. He, Exposure-based reputa- tion mechanism promotes the evolution of cooperation, Chaos, Solitons & Fractals160, 112205 (2022)