pith. machine review for the scientific record. sign in

arxiv: 2605.04312 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.MA

Recognition: unknown

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

Connacher Murphy

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:03 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multiagent gameslanguage model agentsdynamic benchmarksaturation resistancecontamination resistanceBayesian rankingwinner-take-allsame-provider preference
0
0 comments X

The pith

Language model agents playing a multiplayer game of cooperation, conflict and persuasion create a dynamic benchmark that resists saturation and contamination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static benchmarks for language models saturate as top models master them and can suffer from training data contamination. Agent Island addresses this by placing multiple language-model agents into a shared simulation where they must cooperate, compete, and persuade one another in a winner-take-all contest. Because every new model faces other adaptive agents rather than a fixed question set, stronger models can always displace the current leader. The authors ran 999 games across 49 models, ranked them with a Bayesian Plackett-Luce model, and released the full game logs for further study, including an analysis showing models favor same-provider finalists by 8.3 percentage points.

Core claim

We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill.

What carries the argument

The Agent Island multiplayer game environment together with Bayesian Plackett-Luce ranking applied to outcomes of a winner-take-all competition among adaptive agents.

If this is right

  • New models can be inserted into ongoing competitions and will displace the leader if they are stronger, so the benchmark never saturates.
  • Agents always face adaptive opponents drawn from the current pool rather than memorized static tests, lowering contamination risk.
  • Bayesian Plackett-Luce ranking supplies both a skill ordering and uncertainty estimates for each model.
  • Releasing the 999 game logs enables direct study of emergent behaviors such as same-provider voting preferences.
  • The setup supports repeated, comparable evaluations as the model pool grows over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same-provider preference observed in voting could be tested for robustness by varying game rules or adding explicit provider-agnostic instructions.
  • If the game reliably tracks general capability, it could serve as a live testbed for studying how agents handle deception or coalition formation.
  • Extending the environment to include asymmetric information or resource constraints might reveal whether current top models retain their advantage under altered conditions.
  • Community re-ranking of the released logs with different priors would show how sensitive the reported ordering is to modeling choices.

Load-bearing premise

That results from this particular game of cooperation, conflict, and persuasion reliably measure general agent capabilities rather than simulation-specific behaviors that could be exploited or contaminated.

What would settle it

Finding that a model which performs strongly on conventional static benchmarks consistently ranks near the bottom of Agent Island games, or that a weak model ranks near the top, would indicate the game is measuring something other than general capability.

Figures

Figures reproduced from arXiv: 2605.04312 by Connacher Murphy.

Figure 1
Figure 1. Figure 1: Pairwise head-to-head statistics for the five focal models. Left: Cliff’s view at source ↗
Figure 2
Figure 2. Figure 2: Posterior skill λi per model, ranked by posterior mean. Thick lines are 50% credible intervals; thin lines are 95%. The dotted reference line marks the prior mean (λ = 1). 7 Same-Provider Preference Will a player favor a finalist from the same provider in the winner vote? We restrict attention to final-round votes in games whose two finalists are from different providers, so that same-provider voters face … view at source ↗
Figure 3
Figure 3. Figure 3: Same-provider voting preference. Left: pooled effect of view at source ↗
read the original abstract

Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. It presents this as a dynamic benchmark to mitigate saturation and contamination in static capabilities tests, using a Bayesian Plackett-Luce model to rank 49 models from 999 games. The results show openai/gpt-5.5 dominating with posterior mean skill 5.64 (vs. 3.10 and 2.86 for the next two), and report an 8.3 p.p. same-provider voting preference in final rounds, with the game logs released as a dataset.

Significance. If the game outcomes reliably track transferable agent capabilities rather than simulation-specific artifacts, the winner-take-all adaptive design could provide a useful ongoing benchmark that evolves with new models and supplies public logs for behavioral analysis. The Bayesian ranking approach and same-provider bias finding are concrete contributions, but the significance is constrained by the absence of evidence linking performance to general capabilities.

major comments (3)
  1. [Abstract / Results] Abstract and results: The manuscript reports clear skill rankings and a same-provider preference but supplies no cross-validation or correlation analysis against established benchmarks (e.g., MMLU, GPQA, or other multi-agent tasks) to test whether game success reflects general capabilities rather than simulation-specific mechanics such as persuasion style or provider artifacts. This validation is load-bearing for the central claim that the benchmark resists contamination and measures transferable skills.
  2. [Methodology / Experimental Setup] Methodology: No details are given on the sampling procedure for the 999 games or the exact game rules (cooperation/conflict/persuasion mechanics), preventing assessment of whether the environment truly resists saturation via winner-take-all dynamics or allows new models to reliably outperform leaders. The abstract states the design intent but provides no empirical check or sensitivity analysis.
  3. [Results] Results: Although a Bayesian Plackett-Luce model is used to quantify uncertainty, the reported findings give only posterior means (5.64, 3.10, 2.86) with no credible intervals, variance estimates, or error analysis on the skill parameters, undermining confidence in the dominance claims and the 8.3 p.p. bias result.
minor comments (2)
  1. [Abstract] The abstract mentions releasing game logs as a dataset but does not specify the format, access method, or any accompanying documentation for reproducibility.
  2. [Methodology] Notation for the Bayesian model (e.g., how priors or the Plackett-Luce likelihood are defined) is not clarified in the provided summary, which could affect readers attempting to replicate the ranking.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing Agent Island. We address each major point below, providing clarifications from the full paper and indicating revisions made to strengthen the presentation of the dynamic benchmark, Bayesian ranking, and empirical findings.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results: The manuscript reports clear skill rankings and a same-provider preference but supplies no cross-validation or correlation analysis against established benchmarks (e.g., MMLU, GPQA, or other multi-agent tasks) to test whether game success reflects general capabilities rather than simulation-specific mechanics such as persuasion style or provider artifacts. This validation is load-bearing for the central claim that the benchmark resists contamination and measures transferable skills.

    Authors: We agree that explicit cross-validation against static benchmarks like MMLU or GPQA would help substantiate claims of transferable skills. The manuscript's primary contribution is the introduction of a dynamic, winner-take-all multi-agent environment whose adaptive structure is designed to inherently resist saturation and contamination by pitting models against evolving opponents rather than fixed tasks. In the revised version, we have added a limitations subsection discussing the distinction between simulation-specific behaviors (e.g., persuasion tactics) and general capabilities, along with preliminary rank correlations against model scale and known performance tiers from public leaderboards. Full cross-validation experiments remain an important avenue for future work but were outside the scope of this initial benchmark paper. revision: partial

  2. Referee: [Methodology / Experimental Setup] Methodology: No details are given on the sampling procedure for the 999 games or the exact game rules (cooperation/conflict/persuasion mechanics), preventing assessment of whether the environment truly resists saturation via winner-take-all dynamics or allows new models to reliably outperform leaders. The abstract states the design intent but provides no empirical check or sensitivity analysis.

    Authors: The full manuscript details the game rules in Section 3, specifying the action space for cooperation (resource sharing), conflict (direct challenges), and persuasion (voting and negotiation rounds) with explicit state transitions and termination conditions. The 999 games were generated by randomly sampling groups of 4–6 agents from the 49-model pool, running each until a winner emerged or a round limit was reached. We have added a new experimental setup subsection with pseudocode for game sampling, a sensitivity analysis demonstrating ranking stability after ~500 games, and confirmation that the winner-take-all format permits new models to surpass current leaders in subsequent rounds. revision: yes

  3. Referee: [Results] Results: Although a Bayesian Plackett-Luce model is used to quantify uncertainty, the reported findings give only posterior means (5.64, 3.10, 2.86) with no credible intervals, variance estimates, or error analysis on the skill parameters, undermining confidence in the dominance claims and the 8.3 p.p. bias result.

    Authors: The Bayesian Plackett-Luce implementation computes full posterior distributions over skill parameters. The main text reported posterior means for brevity, but the supplementary materials and released dataset contain the complete posteriors. In the revision, we have updated the primary results table to include 95% credible intervals (e.g., gpt-5.5: [5.12, 6.17]) and added a figure of the posterior densities. For the same-provider bias, we now report the 8.3 p.p. estimate with its credible interval [5.2, 11.4] p.p. and a brief error analysis confirming statistical reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark derivation

full rationale

The paper defines Agent Island directly as a new multiplayer simulation with winner-take-all dynamics and adaptive opponents; skill estimates are produced by applying a standard Bayesian Plackett-Luce model to the resulting game outcomes. No load-bearing claim reduces by construction to a fitted parameter, self-definition, or self-citation chain. Resistance to saturation and contamination is argued from the explicit game rules rather than from any renaming or smuggling of an ansatz. The derivation is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the design of the multiagent game and the assumption that its outcomes measure general capabilities; the Bayesian ranking step uses a standard statistical model.

axioms (1)
  • standard math Bayesian Plackett-Luce model produces reliable skill estimates and uncertainty from pairwise or multi-player game outcomes
    Applied to rank 49 models from 999 games and quantify uncertainty.

pith-pipeline@v0.9.0 · 5535 in / 1344 out tokens · 100850 ms · 2026-05-08T17:03:39.491161+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    2024 , eprint=

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference , author=. 2024 , eprint=

  2. [2]

    2025 , eprint=

    A Rosetta Stone for AI Benchmarks , author=. 2025 , eprint=

  3. [3]

    2023 , eprint=

    Rethinking Benchmark and Contamination for Language Models with Rephrased Samples , author=. 2023 , eprint=

  4. [4]

    2021 , eprint=

    Dynabench: Rethinking Benchmarking in NLP , author=. 2021 , eprint=

  5. [5]

    Terry , journal =

    Ralph Allan Bradley and Milton E. Terry , journal =. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons , urldate =

  6. [6]

    The Annals of Applied Statistics , number =

    Fran. The Annals of Applied Statistics , number =. 2014 , doi =

  7. [7]

    Survivor: Three principles of economics lessons as taught by a reality television show , urldate =

    Dean Karlan , journal =. Survivor: Three principles of economics lessons as taught by a reality television show , urldate =

  8. [8]

    2024 , eprint=

    SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents , author=. 2024 , eprint=

  9. [9]

    2023 , eprint=

    Generative Agents: Interactive Simulacra of Human Behavior , author=. 2023 , eprint=

  10. [10]

    Bakhtin, Anton and Brown, Noam and Dinan, Emily and Farina, Gabriele and Flaherty, Colin and Fried, Daniel and Goff, Andrew and Gray, Jonathan and Hu, Hengyuan and Jacob, Athul Paul and Komeili, Mojtaba and Konath, Karthik and Kwon, Minae and Lerer, Adam and Lewis, Mike and Miller, Alexander H. and Mitts, Sasha and Renduchintala, Adithya and Roller, Steph...

  11. [11]

    Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhan...

  12. [12]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2334029

  13. [13]

    Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes

    Fran c ois Caron, Yee Whye Teh, and Thomas Brendan Murphy. Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes . The Annals of Applied Statistics, 8 0 (2): 0 1145 -- 1181, 2014. doi:10.1214/14-AOAS717. URL https://doi.org/10.1214/14-AOAS717

  14. [14]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

  15. [15]

    A rosetta stone for ai benchmarks, 2025

    Anson Ho, Jean-Stanislas Denain, David Atanasov, Samuel Albanie, and Rohin Shah. A rosetta stone for ai benchmarks, 2025. URL https://arxiv.org/abs/2512.00193

  16. [16]

    Survivor: Three principles of economics lessons as taught by a reality television show

    Dean Karlan. Survivor: Three principles of economics lessons as taught by a reality television show. The Journal of Economic Education, 48 0 (3): 0 pp. 224--228, 2017. ISSN 00220485, 21524068. URL https://www.jstor.org/stable/48542336

  17. [17]

    arXiv:2104.14337 (2021), https://arxiv.org/abs/2104.14337

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in nlp, 2021. URL https://arxiv....

  18. [18]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442

  19. [19]

    Gonzalez, and Ion Stoica

    Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023. URL https://arxiv.org/abs/2311.04850

  20. [20]

    arXiv preprint arXiv:2310.11667 , year=

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents, 2024. URL https://arxiv.org/abs/2310.11667