arXiv: 2606.08340 · v1 · pith:NYDY6C2Tnew · submitted 2026-06-06 · 💻 cs.AI · cs.LG · cs.MA

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

Pith reviewed 2026-06-27 19:17 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA
keywords multi-agent coordination · language agents · LLM benchmark · open-ended environments · coordination bottleneck · MARL comparison · procedural generation · communication ablation

The pith

Current LLM agents average only about 6 percent normalised return on open-ended multi-agent coordination tasks, and the results show that individual competence does not produce coordination competence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ALEM, a benchmark that places language agents in a long-horizon survival environment with procedurally generated coordination tasks, soft role specialisation, and controllable difficulty. When thirteen frontier models are tested zero-shot in homogeneous teams, they average roughly 6 percent normalised return, with MARL agents trained for a billion steps serving as reference points. Performance splits along two reward components: GPT-5.4-High earns strong base-task reward yet scores much lower on coordination reward, while Gemini-3.1-Pro-High approaches the MARL reference on the hardest coordination setting. Ablations isolate communication as the largest single contributor to coordination success, with memory and reasoning mattering mainly when they support multi-step shared plans. The results therefore treat coordination as a measurable bottleneck that is distinct from single-agent capabilities.

Core claim

Current LLM agents remain far from solving ALEM, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities.

What carries the argument

ALEM, a JAX-based benchmark embedding procedurally generated coordination tasks, soft specialisation, communication, and controllable difficulty into a long-horizon survival world with exploration, crafting, trading, and combat.

If this is right

  • Models must be evaluated on separate base-task and coordination reward components rather than aggregate return alone.
  • Adding explicit communication channels produces the largest immediate gain in team performance.
  • Memory and reasoning modules improve results only when they are used to sustain multi-step shared plans across agents.
  • Trained MARL agents remain the performance ceiling, so zero-shot LLM teams have substantial room for improvement.
  • The benchmark supplies a controlled testbed for training agents that allocate roles and execute joint plans.
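The first point above can be illustrated with a minimal Python sketch of a decomposed score. It assumes each episode logs a base reward, a coordination reward, and the maximum achievable reward in each category, which matches how the figure captions describe Base%, Coord.% and Total%. The `EpisodeReturn` schema, the function name, and the numbers are hypothetical and are not taken from the ALEM codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeReturn:
    """Raw team return for one episode, split by reward source (hypothetical schema)."""
    base: float       # reward earned from base-task achievements
    coord: float      # reward earned from coordination tasks
    max_base: float   # maximum achievable base reward in this episode
    max_coord: float  # maximum achievable coordination reward in this episode


def normalised_scores(episodes):
    """Return mean Base%, Coord.% and Total% over a list of EpisodeReturn."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    # Each component is normalised by the maximum achievable in its own category.
    base_pct = sum(100.0 * e.base / e.max_base for e in episodes) / n
    coord_pct = sum(100.0 * e.coord / e.max_coord for e in episodes) / n
    # Total% normalises the summed reward by the summed maximum.
    total_pct = sum(
        100.0 * (e.base + e.coord) / (e.max_base + e.max_coord) for e in episodes
    ) / n
    return {"base_pct": base_pct, "coord_pct": coord_pct, "total_pct": total_pct}


# Two made-up teams with identical Total% but opposite reward profiles.
team_a = [EpisodeReturn(base=8.0, coord=0.0, max_base=10.0, max_coord=10.0)]
team_b = [EpisodeReturn(base=2.0, coord=6.0, max_base=10.0, max_coord=10.0)]
scores_a = normalised_scores(team_a)
scores_b = normalised_scores(team_b)
```

In this toy example both teams score the same Total%, yet team A earns nothing from coordination while team B earns most of its reward there. Aggregate return alone would hide that difference.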

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Single-agent scaling curves alone are unlikely to close the coordination gap without targeted multi-agent training regimes.
  • ALEM-style environments could be used to generate synthetic coordination trajectories for fine-tuning or reinforcement learning from human feedback.
  • If the separation between task and coordination competence holds, hybrid systems that pair a strong single-agent planner with a lightweight coordinator may outperform end-to-end language agents.
  • Real-world deployments in collaborative robotics or game environments will need similar decomposed reward signals to diagnose coordination failures.

Load-bearing premise

The procedurally generated tasks and controllable difficulty settings inside ALEM capture the coordination demands that matter for real deployment of language agents.

What would settle it

A single new model that scores above 50 percent normalised return on the hardest ALEM setting while scoring below 10 percent on matched single-agent versions of the same tasks would falsify the claim that coordination forms a distinct bottleneck.

Figures

Figures reproduced from arXiv: 2606.08340 by Aidan Scannell, Alexander Rutherford, Amos Storkey, Andras Szecsenyi, Cameron Barker, Davide Paglieri, Elliot J. Crowley, Henry Gouk, Kale-ab Abebe Tessera, Tim Rocktäschel.

Figure 1: ALEM extends Craftax-like open worlds [37, 40] into a controllable multi-agent coordination benchmark. Top: procedurally generated levels with sampled coordination tasks (• 2-agent sync, ■ all-agent sync, ♦ handover). Bottom: coordinated mining, construction, combat, and crafting examples. By resampling tasks and coordination structure each episode, ALEM evaluates agents’ ability to infer coordination need…
Figure 2: Coordination coupling spectrum in ALEM. ALEM tasks vary the temporal separation $\Delta t = |t - t'|$ between coupled actions $A^i_t$ and $A^j_{t'}$. We show representative examples: long-range (2a, large $\Delta t$), where one agent gathers wood used to craft a pickaxe, enabling another agent to mine stone later; handover (2b, small $\Delta t$), where agent $i$ initiates a mining task that agent $j$ must complete within a short window;…
Figure 3: ALEM difficulty settings and MARL baselines. Left: The same generated world under Easy, Medium, and Hard settings, with coordination sites marked as • 2-agent sync, ■ all-agent sync, and ♦ handover. Increasing α preserves the layout and coordination opportunities, but tightens execution constraints through more all-agent requirements, shorter handover windows, and stronger specialisation. Right: 1B-step tr…
Figure 4: Diagnostic breakdown of zero-shot homogeneous team failures in ALEM. (A) Coordination reward coverage by coordination type, averaged across difficulties. Cell values report the percentage of the maximum attainable reward within each coordination category. (B) Local cooperative event counts per evaluation across all difficulties. (C) Mean episode length versus Total% reward on Medium difficulty, showing t…
Figure 5: Harness and team-composition ablations on Hard. Left: we ablate communication, scratchpad memory, and reasoning for Gemini-3.1-Pro-High and Gemma-4-31B-it. Bars show Coord.% and Base%, each normalised by the maximum achievable reward in that category, with 95% bootstrap confidence intervals. Right: we compare heterogeneous teams against their constituent homogeneous baselines. Bars show mean Total% with 95…
Figure 6: Example pixel-based observation in ALEM. Top bar: teammate status: role icon (forager/miner/warrior), health, and a coloured channel badge showing the discrete broadcast message each teammate sent at this step (badge absent when the agent did not communicate). Centre: the agent’s local world view: coordination overlays mark blocks and sites that require joint action, a solid border with an agent-count labe…
Figure 7: ALEM procedurally generates diverse worlds and coordination layouts. Three independently sampled episodes at the same medium difficulty show variation in both terrain layout and the spatial placement of coordination tasks (■ 2-agent sync, ■ all-agent sync, ♦ handover)
Figure 8: Zero-shot homogeneous coordination in ALEM. We evaluate modern LLMs on Easy, …
Figure 9: Achievement-tier coverage across coordination difficulty. Each cell reports the mean …
Figure 10: Different causes of death for different agents, averaged across easy, medium and hard.
Figure 11: Performance vs active model size.
Figure 12: Percentage success rate for parsing actions, averaged across easy, medium and hard.
Figure 14: Simulation throughput (SPS) during a full IPPO training step across 1–8 agents. The …
Figure 15: Extended MARL ablation on hard ALEM. HyperMARL-IPPO is trained for 3B environment steps across five seeds. Curves show seed means and shaded bands show 95% bootstrap confidence intervals. Performance does not saturate: coordination and total reward continue improving late in training, and total reward remains well below the maximum achievable score.
Figure 16: MARL 100m Results. MARL baseline performance over 5 training seeds, reported as the mean percentage of max achievable reward with 95% CIs and decomposed into coordination (dark) and base (light) reward.
Figure 17: Learning curves during training for MARL baselines across coordination difficulty, trained for 100m steps. Individual panels (a)-(i) break down performance by environment difficulty (Easy, Medium, Hard) and reward type (Base, Coordination, Total). Curves show the mean across 5 independent training seeds; shaded regions are 95% bootstrap confidence intervals. Base and coordination scores are normalised by …
Figure 18: Learning curves during training for MARL baselines across coordination difficulty, trained for one billion steps. Individual panels (a)-(i) break down performance by environment difficulty (Easy, Medium, Hard) and reward type (Base, Coordination, Total). Curves show the mean across 5 independent training seeds; shaded regions are 95% bootstrap confidence intervals. Base and coordination scores are normali…
Figure 19: Environment calibration. Base reward for Gemma-4-31B-it across settings sharing the same underlying world. Cooperative achievements are excluded so all settings are directly comparable. Error bars show 95% bootstrap CIs.
Original abstract

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce ALEM, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. ALEM embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving ALEM, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. ALEM makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces ALEM, a JAX-based benchmark for open-ended multi-agent coordination in language agents built on Craftax-like dynamics. It embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable difficulty into a long-horizon survival world. Zero-shot evaluation of 13 LLMs yields ~6% average normalised return; contrasts (e.g., Gemini-3.1-Pro-High nearing MARL on hardest settings while GPT-5.4-High shows strong base-task but low coordination reward) and ablations on communication/memory/reasoning support the claim that coordination is a distinct bottleneck separate from single-agent capabilities. Code is released at https://github.com/alem-world/alem-env.

Significance. If the benchmark's reward decomposition and procedural mechanics validly isolate multi-agent coordination demands, the work would supply a reproducible, controllable testbed that exposes a gap not captured by single-agent or structured MARL evaluations, with the open-source JAX implementation providing a concrete strength for follow-on research.

major comments (2)
  1. [Abstract] Abstract: the claim that 'individual task competence does not imply coordination competence' rests on the reported contrast between base-task reward and coordination reward (e.g., GPT-5.4-High). Without single-agent ablations or non-coordinating strategy baselines demonstrating that base-task rewards cannot be obtained independently of the soft-specialisation and trading mechanics, the observed gap could reflect the procedural generator's reward partitioning rather than a distinct agent limitation.
  2. [Experimental results] Experimental results (implied §4–5): the concrete performance numbers (~6% normalised return, model-specific contrasts, ablation effects) are presented without error bars, episode counts, data-exclusion rules, or full protocol for task generation and difficulty settings, undermining assessment of whether the claimed distinctions between models and between base vs. coordination reward are statistically reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions we will make to address the concerns raised.

  1. Referee: [Abstract] Abstract: the claim that 'individual task competence does not imply coordination competence' rests on the reported contrast between base-task reward and coordination reward (e.g., GPT-5.4-High). Without single-agent ablations or non-coordinating strategy baselines demonstrating that base-task rewards cannot be obtained independently of the soft-specialisation and trading mechanics, the observed gap could reflect the procedural generator's reward partitioning rather than a distinct agent limitation.

    Authors: We agree that additional evidence would strengthen the claim that the observed gap reflects a distinct agent limitation rather than an artifact of the reward design. The paper's reward decomposition is intended to separate base-task performance from coordination-specific rewards through the procedural generation of tasks requiring soft specialisation and trading. However, to directly address this, we will include single-agent ablations (evaluating agents in isolation on base tasks) and non-coordinating strategy baselines in the revised manuscript. This will demonstrate that base-task rewards can indeed be achieved without coordination mechanics, supporting that the gap in multi-agent settings is due to coordination challenges. revision: yes

  2. Referee: [Experimental results] Experimental results (implied §4–5): the concrete performance numbers (~6% normalised return, model-specific contrasts, ablation effects) are presented without error bars, episode counts, data-exclusion rules, or full protocol for task generation and difficulty settings, undermining assessment of whether the claimed distinctions between models and between base vs. coordination reward are statistically reliable.

    Authors: We acknowledge the importance of reporting statistical details for reproducibility and reliability assessment. The current manuscript presents average normalised returns but omits error bars and episode counts. We will revise the experimental results section to include error bars (e.g., standard error across episodes), the number of episodes evaluated per model, data-exclusion rules if any, and a detailed protocol for task generation and difficulty settings. This will allow readers to better assess the statistical significance of the model contrasts and ablation effects. revision: yes
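The statistical reporting discussed in this exchange can be sketched in a few lines of Python. The example below computes a percentile bootstrap confidence interval for a mean per-episode score, the kind of interval the figure captions describe as "95% bootstrap confidence intervals". The function name, the resample count, and the episode scores are illustrative assumptions; the paper's actual evaluation protocol is not given in the text above.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). The seed is fixed so the result is reproducible.
    """
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    # Resample episodes with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


# Made-up per-episode Total% scores for one model (illustrative only).
episode_scores = [4.0, 7.5, 5.0, 6.5, 3.0, 8.0, 6.0, 5.5]
mean, lower, upper = bootstrap_ci(episode_scores)
```

Reporting `lower` and `upper` next to each mean, together with the number of episodes, would let a reader judge whether two models' intervals overlap before reading a gap between them as real.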

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation


The paper introduces the ALEM benchmark and reports direct empirical results from running 13 LLMs and trained MARL agents on procedurally generated tasks. No equations, fitted parameters, or derivations are present in the provided text. Claims such as 'individual task competence does not imply coordination competence' rest on observed performance contrasts rather than any reduction to inputs by construction. No self-citations or ansatzes are invoked as load-bearing steps. The evaluation is self-contained against external agent runs and does not rename known results or smuggle assumptions via prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of the new benchmark tasks as proxies for real coordination demands and on the assumption that zero-shot prompting plus MARL baselines provide a fair comparison; no free parameters are fitted to produce the headline numbers.

axioms (2)
  • Domain assumption: Language models can be deployed as zero-shot autonomous agents in interactive environments through standard prompting without task-specific fine-tuning.
    The evaluation protocol in the abstract relies on this to measure LLM performance.
  • Domain assumption: MARL agents trained for one billion steps constitute a meaningful performance ceiling for the coordination tasks in ALEM.
    Used as reference points for interpreting LLM results.
invented entities (1)
  • ALEM benchmark environment (no independent evidence)
    purpose: To provide procedurally generated open-ended multi-agent coordination tasks with controllable difficulty.
    Newly constructed in this work; no independent evidence outside the paper.

pith-pipeline@v0.9.1-grok · 5841 in / 1564 out tokens · 26906 ms · 2026-06-27T19:17:04.480328+00:00 · methodology

Reference graph

Works this paper leans on

81 extracted references · 6 linked inside Pith

  [1] John P Agapiou, Alexander Sasha Vezhnevets, Edgar A Duéñez-Guzmán, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael Köster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, et al. Melting pot 2.0. arXiv preprint arXiv:2211.13746, 2022.

  [2] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 2021.

  [3] Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: evaluating and analyzing multi-agent coordination abilities in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8038–8057, 2025.

  [4] Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020.

  [5] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013.

  [6] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Yash Katariya, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

  [7] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

  [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

  [9] Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep blue. Artificial intelligence, 134(1-2):57–83, 2002.

  [10] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems, 32, 2019.

  [11] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview...

  [12] Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.

  [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

  [14] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023.

  [15] Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob Foerster, and Shimon Whiteson. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:37567–37593, 2023.

  [16] Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning. In The Thirteenth International Conference on Learning Representations, 2025.

  [17] Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, and Jakob Nicolaus Foerster. Overcookedv2: Rethinking overcooked for zero-shot coordination. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=hlvLM3GX8R

  [18] Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026.

  [19] Google DeepMind. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, April 2026. Accessed: 2026-04-25.

  [20] Thomas Grady, Kip Parker, Iliyan Zarov, Henry Course, Chengxi Taylor, and Ross Taylor. Kellybench: A benchmark for long-horizon sequential decision making. arXiv preprint arXiv:2604.27865, 2026.

  [21] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  [22] Florian Grötschla, Luis Müller, Jan Tönshoff, Mikhail Galkin, and Bryan Perozzi. Agentsnet: Coordination and collaborative reasoning in multi-agent llms. arXiv preprint arXiv:2507.08616, 2025.

  [23] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: a survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 8048–8057, 2024.

  [24] Danijar Hafner. Benchmarking the spectrum of agent capabilities. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=1W0z96MFEoH

  [25] Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenčiak, et al. Multi-agent risks from advanced ai. arXiv preprint arXiv:2502.14143, 2025.

  [26] Eric A Hansen, Daniel S Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, volume 4, pages 709–715, 2004.

  [27] Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, and Nazneen Rajani. yc-bench: Benchmarking ai agents for long-term planning and consistent execution. arXiv preprint arXiv:2604.01212, 2026.

  [28] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations, 2023.

  [29] Edward Hughes, Michael D Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Position: Open-endedness is essential for artificial superhuman intelligence. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 20597–20616....

  [30] Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, and Eugene Vinitsky. Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2408.01584

  31. [31]

    Measuring ai ability to complete long tasks.arXiv preprint arXiv:2503.14499, 352, 2025

    Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney V on Arx, et al. Measuring ai ability to complete long tasks.arXiv preprint arXiv:2503.14499, 352, 2025

  32. [32]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  33. [33]

    Scalable evaluation of multi-agent reinforcement learning with melting pot

    Joel Z Leibo, Edgar A Dueñez-Guzman, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. InInternational conference on machine learning, pages 6187–6199. PMLR, 2021

  34. [34]

    Stateful active facilitator: Co- ordination and environmental heterogeneity in cooperative multi-agent reinforcement learn- ing

    Dianbo Liu, Vedant Shah, Oussama Boussif, Cristian Meo, Anirudh Goyal, Tianmin Shu, Michael Curtis Mozer, Nicolas Heess, and Yoshua Bengio. Stateful active facilitator: Co- ordination and environmental heterogeneity in cooperative multi-agent reinforcement learn- ing. InThe Eleventh International Conference on Learning Representations, 2023. URL https://o...

  35. [35]

    Towards end-to-end automation of ai research. Nature, 651(8107):914–919, 2026

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research. Nature, 651(8107):914–919, 2026

  36. [36]

    The interdisciplinary study of coordination. ACM Computing Surveys (CSUR), 26(1):87–119, 1994

    Thomas W Malone and Kevin Crowston. The interdisciplinary study of coordination. ACM Computing Surveys (CSUR), 26(1):87–119, 1994

  37. [37]

    Craftax: A lightning-fast benchmark for open-ended reinforcement learning

    Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In International Conference on Machine Learning (ICML), 2024

  38. [38]

    The influence of scaffolds on coordination scaling laws in LLM agents

    Mariana Meireles, Niklas Lauffer, Rupali Bhati, and Cameron Allen. The influence of scaffolds on coordination scaling laws in LLM agents. In Workshop on Scaling Environments for Agents. URL https://openreview.net/forum?id=E9whrbtgUA

  40. [40]

    Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  41. [41]

    Multi-agent craftax: Benchmarking open-ended multi-agent reinforcement learning at the hyperscale

    Bassel Al Omari, Michael Matthews, Alexander Rutherford, and Jakob Nicolaus Foerster. Multi-agent craftax: Benchmarking open-ended multi-agent reinforcement learning at the hyperscale. arXiv preprint arXiv:2511.04904, 2025

  42. [42]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026

  43. [43]

    BALROG: Benchmarking agentic LLM and VLM reasoning on games

    Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. BALROG: Benchmarking agentic LLM and VLM reasoning on games. In The Thirteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=fp6t3F669F

  45. [45]

    Qwen3.6-Plus: Towards real world agents, April 2026

    Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/blog?id=qwen3.6

  46. [46]

    Jaxmarl: Multi-agent rl environments and algorithms in jax. Advances in Neural Information Processing Systems, 37:50925–50951, 2024

    Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian S de Witt, et al. Jaxmarl: Multi-agent rl environments and algorithms in jax. Advances in Neural Information Processing Systems, 37:50925–50951, 2024

  47. [47]

    Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, pages 2186–2188, Richland, ...

  48. [48]

    The Strategy of Conflict: with a new Preface by the Author

    Thomas C Schelling. The Strategy of Conflict: with a new Preface by the Author. Harvard University Press, 1980

  49. [49]

    Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

  50. [50]

    The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity

    Parshin Shojaee, Seyed Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=Y...

  51. [51]

    The illusion of diminishing returns: Measuring long horizon execution in LLMs

    Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3lm8lWYxiq

  52. [52]

    The stag hunt and the evolution of social structure

    Brian Skyrms. The stag hunt and the evolution of social structure. Cambridge University Press, 2004

  53. [53]

    Neural mmo: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019

    Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural mmo: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019

  54. [54]

    Neural mmo 2.0: a massively multi-task addition to massively multi-agent learning. Advances in Neural Information Processing Systems, 36:50094–50104, 2023

    Joseph Suarez, David Bloomin, Kyoung Whan Choe, Hao Xiang Li, Ryan Sullivan, Nishaanth Kanna, Daniel Scott, Rose Shuman, Herbie Bradley, Louis Castricato, et al. Neural mmo 2.0: a massively multi-task addition to massively multi-agent learning. Advances in Neural Information Processing Systems, 36:50094–50104, 2023

  55. [55]

    Collab-overcooked: Benchmarking and evaluating large language models as collaborative agents

    Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, and Xiaojie Wang. Collab-overcooked: Benchmarking and evaluating large language models as collaborative agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4922–4951, 2025

  56. [56]

    Qwen3.5: Accelerating productivity with native multimodal agents, February

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February. URL https://qwen.ai/blog?id=qwen3.5

  58. [58]

    Hypermarl: Adaptive hypernetworks for multi-agent rl

    Kale-ab Abebe Tessera, Arrasy Rahman, Amos Storkey, and Stefano V Albrecht. Hypermarl: Adaptive hypernetworks for multi-agent rl. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=56CgYnf9Dr

  59. [59]

    Probing dec-POMDP reasoning in cooperative MARL

    Kale-ab Abebe Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, and Amos Storkey. Probing dec-POMDP reasoning in cooperative MARL. In The 25th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Oral, 2026

  60. [60]

    Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  61. [61]

    Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  62. [62]

    Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=ehfRiF0R3a

  63. [63]

    Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971, 2024

    Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971, 2024.

  64. [64]

    Odysseybench: Evaluating llm agents on long-horizon complex office application workflows. arXiv preprint arXiv:2508.09124, 2025

    Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, and Saravan Rajmohan. Odysseybench: Evaluating llm agents on long-horizon complex office application workflows. arXiv preprint arXiv:2508.09124, 2025

  65. [65]

    The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025

  66. [66]

    LLM-powered decentralized generative agents with adaptive hierarchical knowledge graph for cooperative planning

    Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido Botran, and Carlee Joe-Wong. LLM-powered decentralized generative agents with adaptive hierarchical knowledge graph for cooperative planning. In The First MARW: Multi-Agent AI in the Real World Workshop at AAAI 2025, 2025. URL https://openreview.net/forum?id=l9QUw0oUTa

  67. [67]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  68. [68]

    An efficient open world environment for multi-agent social learning. arXiv preprint arXiv:2508.15679, 2025

    Eric Ye, Ren Tao, and Natasha Jaques. An efficient open world environment for multi-agent social learning. arXiv preprint arXiv:2508.15679, 2025

  69. [69]

    The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022

  70. [70]

    Deepplanning: Benchmarking long-horizon agentic planning with verifiable constraints. arXiv preprint arXiv:2601.18137, 2026

    Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin. Deepplanning: Benchmarking long-horizon agentic planning with verifiable constraints. arXiv preprint arXiv:2601.18137, 2026

  71. [71]

    Ama-bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026

    Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. Ama-bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026

  72. [72]

    Multiagentbench: Evaluating the collaboration and competition of llm agents

    Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Daisy Zhe Wang, Zhenhailong Wang, Cheng Qian, Robert Tang, Heng Ji, et al. Multiagentbench: Evaluating the collaboration and competition of llm agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8580–8622, 2025.

  73. [73]

    By default it also optionally includes the full action catalogue with natural-language descriptions, game and coordination mechanics

    System prompt: The system prompt is shared across all time steps for an agent and defines the agent identity and role, the team size, the objective, and the rules needed to act in the environment. By default it also optionally includes the full action catalogue with natural-language descriptions, game and coordination mechanics

  74. [74]

    Observation from k step(s) ago

    Observation and Action History: The observation and action history is a rolling sequence of recent messages from the environment and the agent itself. The observation messages contain previous observations, labelled as “Observation from k step(s) ago”, and the action messages contain the action previously taken by the agent. When memory and communication...

  75. [75]

    The agent is then prompted to respond with an action

    Current Observation and Action Space: The current observation message contains textual descriptions of: achievement progress, current level, nearby terrain, items and enemies, visible coordination opportunities and requirements, teammate status, agent stats, vitals and inventory. The agent is then prompted to respond with an action. By default, the call to...

  76. [76]

    Gather wood -> place a table -> craft a wood pickaxe; craft a wood sword early if combat is likely

  77. [77]

    Mine stone and coal -> place a furnace -> craft iron tools and iron armour

  78. [78]

    The ladder only becomes usable after enough monsters on that level have been killed

    To descend: stand on the `ladder_down` tile (visible in your observation when close) and use the Descend action. The ladder only becomes usable after enough monsters on that level have been killed. Only one agent needs to use Descend/Ascend -- all teammates are teleported with them. </game_rules> <achievements> ## Achievements Collect Wood Place Table Eat C...

  79. [79]

    (Required) Exactly one action from the available action list: <action>YOUR_CHOSEN_ACTION</action>

  80. [80]

    Teammates can only act on what you tell them

    (Optional) Broadcast to teammates, up to 400 chars. Teammates can only act on what you tell them. Be specific (e.g. 'Dig on tree next turn', 'Ladder at 5NE', 'Need 2 wood'). Reply to teammates' requests. <communication>YOUR_MESSAGE</communication>
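The response format quoted in the last two items (exactly one `<action>` tag, plus an optional `<communication>` broadcast of up to 400 characters) can be parsed roughly as follows. This is a minimal sketch, not the benchmark's own parser: the names `parse_reply` and `ParsedReply` are hypothetical, and the handling of invalid replies (returning `None` for the action) and of over-long messages (truncation rather than rejection) are assumptions made only for illustration.

```python
import re
from typing import NamedTuple, Optional, Sequence

# Broadcast cap taken from the quoted prompt text ("up to 400 chars").
MAX_MESSAGE_CHARS = 400

_ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)
_COMM_RE = re.compile(r"<communication>(.*?)</communication>", re.DOTALL)


class ParsedReply(NamedTuple):
    """Hypothetical container for one parsed agent reply."""
    action: Optional[str]   # None if the reply is invalid (assumption)
    message: Optional[str]  # None if no broadcast was sent


def parse_reply(reply: str, available_actions: Sequence[str]) -> ParsedReply:
    """Extract the action and the optional broadcast from a raw model reply."""
    actions = [a.strip() for a in _ACTION_RE.findall(reply)]
    # The prompt requires exactly one action from the available action list.
    if len(actions) == 1 and actions[0] in available_actions:
        action = actions[0]
    else:
        action = None

    match = _COMM_RE.search(reply)
    message = None
    if match:
        # Assumption: over-long broadcasts are truncated, not rejected.
        text = match.group(1).strip()[:MAX_MESSAGE_CHARS]
        message = text or None
    return ParsedReply(action, message)


if __name__ == "__main__":
    raw = "<action>Descend</action> <communication>Ladder at 5NE</communication>"
    print(parse_reply(raw, ["Descend", "Noop"]))
```

Returning `None` for a malformed action leaves the fallback policy (for example a no-op or a retry) to the caller, since the excerpt here does not say how the benchmark treats invalid replies.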

Showing first 80 references.