pith. sign in

arxiv: 2605.26448 · v1 · pith:B3XRPX7Fnew · submitted 2026-05-26 · 💻 cs.MA · cs.GT· cs.NE

Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

Pith reviewed 2026-07-01 16:53 UTC · model grok-4.3

classification 💻 cs.MA cs.GTcs.NE
keywords LLM constitutionspublic goods gameadversarial co-evolutionmulti-agent systemscooperation and defectionevolutionary searchred-teamingfitness functions
0
0 comments X

The pith

LLM constitutions co-evolve to a stable near-parity equilibrium in public goods games when fitness couples the factions and evaluation uses enough seeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether natural-language constitutions for LLM agents can undergo adversarial co-evolution, with one side trying to cooperate and the other to free-ride in a public goods game. It checks two uncharacterized properties: whether the fitness function actually creates pressure between the sides and whether the mutation process stays reliable when the goal is to produce specialists. With a score-advantage fitness and five evaluation seeds per generation, both sides converge to similar performance near 0.78 across different multipliers, while independent scoring produces no such pressure. The resulting Red constitutions act as readable test cases that expose weaknesses in cooperative setups. This matters for anyone building multi-agent systems that must hold up against defection.

Core claim

In the Public Goods Game, Blue cooperators and Red free-riders co-evolve their constitutions over 30 generations and reach a near-parity equilibrium at S approximately 0.78 that holds across multipliers m in {1.2, 1.5, 2.0, 3.0}. Independent per-faction scoring leaves outcomes uncoupled with correlation +0.088 and generates no adversarial pressure, while a score-advantage target S_own minus S_opp restores the pressure. Under pure-adversary fitness, K=2 evaluation seeds cause mode regression but K=5 sustains strong specialists across all generations. Adversarial co-evolution of natural-language constitutions is therefore feasible only under coupled fitness and adequate evaluation budget, and

What carries the argument

Adversarial constitutional co-evolution of LLM agents in a Public Goods Game driven by a score-advantage fitness function and LLM-based mutation.

If this is right

  • Both factions reach stable performance parity only when fitness explicitly compares their scores.
  • Score-advantage fitness is required to create measurable adversarial pressure between constitutions.
  • Five evaluation seeds per generation prevent regression into weak modes for the Red specialist.
  • Evolved Red constitutions supply readable test cases for checking future cooperative LLM designs.
  • The near-parity outcome holds across a range of public-goods multipliers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-evolution setup could be applied to grid-world or other spatial environments to check whether parity still emerges.
  • Red constitutions might expose recurring patterns in how LLMs fail at cooperation that could guide alignment techniques.
  • Longer runs or larger populations might require proportionally larger evaluation budgets to avoid regression.
  • Transfer tests could check whether constitutions evolved in the public goods game remain effective when placed in new games.

Load-bearing premise

The LLM mutation operator must continue to produce reliable changes even when the objective is specialized adversarial performance rather than cooperation.

What would settle it

Running the 30-generation co-evolution with K=5 seeds and finding that Red performance regresses relative to Blue across repeated trials would falsify the claim that adequate evaluation budget sustains specialists.

Figures

Figures reproduced from arXiv: 2605.26448 by Alice Saito, Arth Singh, Eiji Kamioka, Hershraj Niranjani, Machiko Hirota, Phan Xuan Tan, Takehiro Takayanagi, Ujwal Kumar.

Figure 1
Figure 1. Figure 1: PGG co-evolution over 30 generations. (a) Blue (cooperators) and Red (free-riders) stability scores; both improve and converge near S ≈ 0.78. (b) Blue advantage (SB − SR): large early lead collapses toward zero as Red’s constitution learns to survive punishment, settling at near-parity. less material to drift toward. Environment structure, not just fitness signal, shapes whether adversarial search remains … view at source ↗
Figure 2
Figure 2. Figure 2: Information asymmetry (Experiment 3). Red advantage [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: C ∗ fragility (Experiment 1). (a) Raw stability scores: Blue (fixed C ∗ ) holds steadily across 30 generations of Red adversarial evolution. (b) Red advantage SR − SB remains negative throughout; C ∗ retains its lead under sustained zero-sum adversarial search at this compute budget [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Full co-evolution arms race (Experiment 2) under score-advantage fitness. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in {1.2, 1.5, 2.0, 3.0}; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical simulation study of adversarial co-evolution between Blue (cooperator) and Red (free-rider) LLM constitutions over 30 generations in a Public Goods Game (PGG) with multipliers m in {1.2,1.5,2.0,3.0} and a spatial grid-world. It claims three findings: (1) both factions converge to a near-parity equilibrium at S≈0.78; (2) per-faction scoring yields uncoupled outcomes (corr(S_B,S_R)=+0.088) with no adversarial pressure while the coupled target S_own−S_opp restores it; (3) evaluation seed count K=2 induces mode regression while K=5 sustains specialist behavior. The central conclusion is that adversarial constitutional co-evolution is feasible only under coupled fitness and adequate evaluation budget, with evolved Red constitutions serving as red-team artifacts.

Significance. If the reported equilibria and conditional feasibility hold, the work supplies directly testable red-team artifacts and characterizes the two properties left open by prior LLM-guided evolutionary search for cooperative constitutions. The explicit empirical tests of fitness coupling and mutation reliability under specialist objectives constitute a clear advance for multi-agent alignment research addressing goal conflicts.

major comments (2)
  1. [Results, finding (2)] Results, finding (2): the reported correlation corr(S_B, S_R)=+0.088 is presented without sample size, number of independent runs, p-value, or confidence interval, which is load-bearing for the claim that per-faction scoring produces statistically uncoupled outcomes and no adversarial pressure.
  2. [Results, finding (3)] Results, finding (3): the claim that K=2 produces mode regression while K=5 sustains specialist behavior for all 30 generations lacks reported variance, number of runs, or exclusion criteria, which is load-bearing for the assertion that adequate evaluation budget is required to avoid regression.
minor comments (2)
  1. [Abstract] Abstract: the symbol S is used without an explicit definition or reference to its computation in the PGG payoff structure.
  2. [Methods] The manuscript should include a table or figure caption that reports the exact number of generations, population size, and LLM model version used in all runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and for identifying the need for additional statistical detail in the reported results. We address both major comments below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Results, finding (2)] Results, finding (2): the reported correlation corr(S_B, S_R)=+0.088 is presented without sample size, number of independent runs, p-value, or confidence interval, which is load-bearing for the claim that per-faction scoring produces statistically uncoupled outcomes and no adversarial pressure.

    Authors: We agree that the statistical details are required to substantiate the claim. The correlation value was obtained from the full set of independent evolutionary runs performed in the study. In the revised manuscript we will explicitly state the sample size (number of independent runs), report the associated p-value, and include a confidence interval to confirm that the outcomes remain statistically uncoupled under per-faction scoring. revision: yes

  2. Referee: [Results, finding (3)] Results, finding (3): the claim that K=2 produces mode regression while K=5 sustains specialist behavior for all 30 generations lacks reported variance, number of runs, or exclusion criteria, which is load-bearing for the assertion that adequate evaluation budget is required to avoid regression.

    Authors: We agree that variance, run count, and any exclusion criteria should be reported. The K=2 versus K=5 comparison was conducted across multiple independent runs; we will add the observed variance, the exact number of runs, and a clear statement of exclusion criteria (if any) to the revised text so that the requirement for adequate evaluation budget is fully supported. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This paper reports an empirical simulation study of adversarial co-evolution between LLM-generated constitutions in a Public Goods Game and spatial grid-world. All headline results (equilibrium scores near 0.78, correlation under different fitness targets, effect of evaluation seed count K) are direct outputs of running the described evolutionary process for 30 generations under explicitly varied conditions. No equations, uniqueness theorems, or predictions are derived; the work contains no mathematical chain that could reduce outputs to inputs by construction, and the two uncharacterized properties are addressed via independent empirical tests rather than self-referential definitions or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the tested multiplier values and seed counts; these appear to be experimental controls rather than fitted quantities.

pith-pipeline@v0.9.1-grok · 5850 in / 1059 out tokens · 35225 ms · 2026-07-01T16:53:40.109894+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  2. [2]

    Evolving ai collectives to enhance human diversity and enable self-regulation.arXiv preprint arXiv:2402.12590, 2024

    Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, and James Evans. Evolving ai collectives to enhance human diversity and enable self-regulation.arXiv preprint arXiv:2402.12590, 2024

  3. [3]

    The coming crisis of multi-agent misalignment: Ai alignment must be a dynamic and social process.arXiv preprint arXiv:2506.01080, 2025

    Florian Carichon, Aditi Khandelwal, Marylou Fauchard, and Golnoosh Farnadi. The coming crisis of multi-agent misalignment: Ai alignment must be a dynamic and social process.arXiv preprint arXiv:2506.01080, 2025

  4. [4]

    Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn

    Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Sören Mindermann, Evan Hubinger, Ethan Perez, and Kevin K. Troy. Agentic misalignment: How llms could be insider threats.arXiv preprint arXiv:2510.05179, 2025

  5. [5]

    Klowden, T

    Ujwal Kumar, Alice Saito, Hershraj Niranjani, Rayan Yessou, and Phan Xuan Tan. Evolving interpretable constitutions for multi-agent coordination.arXiv preprint arXiv:2602.00755, 2026

  6. [6]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  7. [7]

    Hamilton

    Robert Axelrod and William D. Hamilton. The evolution of cooperation.Science, 211(4489):1390–1396, 1981

  8. [8]

    Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel

    Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. InProceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 464–473, 2017

  9. [9]

    Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al

    Edward Hughes, Joel Z. Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity aversion improves cooperation in intertemporal social dilemmas. InAdvances in Neural Information Processing Systems (NeurIPS), 2018

  10. [10]

    Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, D. J. Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

  11. [11]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior.arXiv preprint arXiv:2304.03442, 2023

  12. [12]

    Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search.Nature, 529(7587):484–489, 2016

  13. [13]

    Dota 2 with Large Scale Deep Reinforcement Learning

    OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, et al. Dota 2 with large scale deep reinforcement learning.arXiv preprint arXiv:1912.06680, 2019

  14. [14]

    Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired Open-Ended Trailblazer (POET): End- lessly generating increasingly complex and diverse learning environments and their solutions.arXiv preprint arXiv:1901.01753, 2019

  15. [15]

    Udaya Sankar, Vishisht Srihari Rao, Mayank Ratan Bhardwaj, and Y

    V . Udaya Sankar, Vishisht Srihari Rao, Mayank Ratan Bhardwaj, and Y . Narahari. Deep learning meets mechanism design: Key results and some novel applications.arXiv preprint arXiv:2401.05683, 2024

  16. [16]

    An interpretable automated mechanism design framework with large language models.arXiv preprint arXiv:2502.12203, 2025

    Jiayuan Liu, Mingyu Guo, and Vincent Conitzer. An interpretable automated mechanism design framework with large language models.arXiv preprint arXiv:2502.12203, 2025

  17. [17]

    Human-centred mechanism design with democratic AI.Nature Human Behaviour, 6(10):1398–1407, 2022

    Raphael Koster, Jan Balaguer, Andrea Tacchetti, Ari Weinstein, Tina Zhu, Oliver Hauser, Duncan Williams, Lucy Campbell-Gillingham, Phoebe Thacker, Matthew Botvinick, and Christopher Summerfield. Human-centred mechanism design with democratic AI.Nature Human Behaviour, 6(10):1398–1407, 2022

  18. [18]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, 2024

  19. [19]

    AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

    AlphaEvolve Team. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms. Google DeepMind Blog, 2025

  20. [20]

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models.arXiv preprint arXiv:2206.08896, 2022. 9 APREPRINT- MAY27, 2026

  21. [21]

    OpenEvolve: An open-source evolutionary coding agent

    Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent. https://github.com/ algorithmicsuperintelligence/openevolve, 2025

  22. [22]

    Illuminating search spaces by mapping elites

    Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909, 2015

  23. [23]

    Cooperation and punishment in public goods experiments.American Economic Review, 90(4):980–994, 2000

    Ernst Fehr and Simon Gächter. Cooperation and punishment in public goods experiments.American Economic Review, 90(4):980–994, 2000

  24. [24]

    V olunteering as Red Queen mechanism for cooperation in public goods games.Science, 296(5570):1129–1132, 2002

    Christoph Hauert, Silvia De Monte, Josef Hofbauer, and Karl Sigmund. V olunteering as Red Queen mechanism for cooperation in public goods games.Science, 296(5570):1129–1132, 2002

  25. [25]

    Rand, Anna Dreber, Tore Ellingsen, Drew Fudenberg, and Martin A

    David G. Rand, Anna Dreber, Tore Ellingsen, Drew Fudenberg, and Martin A. Nowak. Positive interactions promote public cooperation.Science, 325(5945):1272–1275, 2009

  26. [26]

    win” from killing Blue agents is indistinguishable from a Red “win

    OpenAI. GPT-OSS-120B.https://openrouter.ai/openai/gpt-oss-120b, 2025. 10 APREPRINT- MAY27, 2026 Appendix A Formal Algorithm and Optimization Definition The framework summary in Section 5.1 omitted formal details for brevity. Here we give them explicitly. Definition A.1(Co-Evolution).Given environment E, initial constitutions C 0 B, C0 R, and scoring funct...