
arxiv: 2606.04202 · v1 · pith:YEIRMKZBnew · submitted 2026-06-02 · 💻 cs.AI

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

Pith reviewed 2026-06-28 09:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · multi-agent coordination · StarCraft · deception · natural language communication · benchmark · decentralized control · partial observability

The pith

SMAC-Talk extends the StarCraft Multi-Agent Challenge with a natural language channel to test how LLM agents coordinate and handle deception in decentralized settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SMAC-Talk as a benchmark that adds natural language communication to the existing StarCraft Multi-Agent Challenge environment. This setup lets researchers evaluate LLM-based agents under decentralized control, partial observability, and long time horizons while they exchange messages to cooperate or face embedded deceptive communicators. Experiments with four models from the Qwen3.5 family measure how reasoning structure, memory, and model scale influence coordination success. A sympathetic reader cares because effective multi-agent work with language is a required capability for LLMs that must operate alongside other agents rather than alone.

Core claim

SMAC-Talk supplies a natural language extension of SMAC that includes a communication channel for probing agent coordination and trust, supports construction of deceptive-communicator scenarios that disrupt allies through messages alone, and provides three benchmark agents whose performance with Qwen3.5 models reveals measurable effects of reasoning structure, memory, and scale on cooperative outcomes.

What carries the argument

The natural language communication channel added to the SMAC environment, which enables both coordination messages and embedded deception scenarios.
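
To make the idea of a communication channel concrete, here is a toy sketch in Python. It is not SMAC-Talk's implementation: the class names, the broadcast-to-all-allies delivery rule, and the read-once inbox are all assumptions made for illustration. It shows the property the deception scenarios depend on: a receiver sees only a sender id, a step, and text, and nothing in the channel marks a message as honest or deceptive.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    sender: int   # id of the unit that wrote the message
    step: int     # environment step at which it was sent
    text: str     # free-form natural language content


@dataclass
class BroadcastChannel:
    """Hypothetical channel: each message goes to all other registered agents."""
    inboxes: Dict[int, List[Message]] = field(default_factory=dict)

    def register(self, agent_id: int) -> None:
        self.inboxes[agent_id] = []

    def send(self, sender: int, step: int, text: str) -> None:
        msg = Message(sender, step, text)
        for agent_id, inbox in self.inboxes.items():
            if agent_id != sender:  # senders do not receive their own message
                inbox.append(msg)

    def read(self, agent_id: int) -> List[Message]:
        """Return and clear the pending messages for one agent."""
        pending = self.inboxes[agent_id]
        self.inboxes[agent_id] = []
        return pending


channel = BroadcastChannel()
for agent_id in (0, 1, 2):
    channel.register(agent_id)

channel.send(0, step=1, text="Two enemies north of me, attacking the weaker one.")
# The next message could come from a deceptive communicator; the channel cannot tell.
channel.send(2, step=1, text="Enemies spotted far south, everyone head there.")

for msg in channel.read(1):
    print(f"[step {msg.step}] agent {msg.sender}: {msg.text}")
```

Because the channel carries no trust signal, any resistance to deception has to come from the receiving agent's own reasoning over its observations and message history.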

If this is right

  • LLM agents can be systematically compared on their ability to maintain cooperation under partial information and long horizons.
  • Differences in reasoning structure and memory usage produce observable changes in multi-agent success rates.
  • Model scale correlates with improved handling of both honest coordination and resistance to deceptive messages.
  • The benchmark separates language-based trust from the underlying game mechanics, allowing targeted study of communication quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This benchmark could be used to test whether current LLMs can form reliable teams without external oversight in open-ended tasks.
  • Extending the deceptive scenarios to include more subtle misinformation patterns would expose additional failure modes in language-based trust.
  • The environment offers a concrete way to measure whether scaling laws observed in single-agent settings continue to hold when agents must negotiate shared goals through dialogue.

Load-bearing premise

The constructed communication channel and deceptive scenarios serve as a faithful test of coordination and trust rather than being dominated by artifacts of the StarCraft rules, prompt choices, or the specific agent implementations.

What would settle it

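Running the same coordination tasks with the deceptive communicator removed and finding that success rates and information-sharing patterns remain statistically unchanged. That outcome would mean the deceiver's messages have no measurable effect, so the scenarios would not be testing trust at all.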
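
One way to operationalize "statistically unchanged" for success rates is a permutation test on per-episode win/loss outcomes. The sketch below is an illustration only: the function name, the episode counts, and the outcome lists are invented placeholders, not results from the paper, and the paper does not say which test (if any) the authors use.

```python
import random
from typing import List, Tuple


def permutation_test(wins_a: List[int], wins_b: List[int],
                     n_resamples: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Two-sided permutation test on the difference in win rate.

    wins_a, wins_b: 0/1 episode outcomes for the two conditions.
    Returns (observed difference in win rate, p-value).
    """
    rng = random.Random(seed)
    n_a = len(wins_a)
    observed = sum(wins_a) / n_a - sum(wins_b) / len(wins_b)
    pooled = list(wins_a) + list(wins_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign condition labels at random
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Placeholder outcomes (1 = win, 0 = loss), made up for illustration.
with_deceiver = [1] * 12 + [0] * 28
without_deceiver = [1] * 26 + [0] * 14

diff, p_value = permutation_test(with_deceiver, without_deceiver)
print(f"win-rate difference = {diff:+.2f}, p = {p_value:.4f}")
```

A large p-value here would not by itself prove the two conditions are equivalent; with few episodes the test simply lacks power, so the number of runs matters as much as the test.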

Figures

Figures reproduced from arXiv: 2606.04202 by Homayoun Najjaran, Joel Sol.

Figure 1: SMAC-Talk Environment Diagram. Code is available at https://anonymous.4open.science/r/SMAC-Talk-C345/README.md

Original abstract

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge (SMAC) for evaluating LLM-based agents. The environment retains SMAC's decentralized control, partial observability, and long-horizon decisions while adding a natural language communication channel used to construct coordination and embedded deception scenarios. Three agent types are provided and benchmarked on four Qwen3.5 models to examine effects of reasoning structure, memory, and model scale on coordination performance; the environment is released as an open benchmark.

Significance. If the reported effects hold under rigorous controls, SMAC-Talk would supply a reproducible, open benchmark extending an established multi-agent testbed to language-mediated coordination and deception. The provision of agent implementations and the focus on an existing environment aid reproducibility and community adoption for LLM multi-agent research.

major comments (3)
  1. [Experiments] Experiments section: the manuscript states that experiments 'show effects of reasoning structure, memory, and scale on coordination performance' yet supplies no quantitative metrics, performance tables, error bars, run counts, or statistical details. This absence is load-bearing for the central empirical claim.
  2. [Environment Design / Evaluation Scenarios] Environment and Evaluation Scenarios: no ablation is reported that disables the natural language channel (while retaining the observation/action interface and SMAC mechanics) to isolate whether performance differences arise from language-mediated coordination rather than prompt serialization artifacts or residual game dynamics. This control is required to support the claim that the channel provides a faithful probe of coordination and trust.
  3. [Evaluation Scenarios] Deceptive-communicator scenarios: the construction assumes language is the dominant deception vector, but without the language-ablated control the results cannot rule out that any observed disruption stems from the underlying partial-observability mechanics rather than the embedded deceiver.
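
The control requested in major comments 2 and 3 amounts to a 2×2 design: channel enabled or disabled, crossed with deceiver present or absent. The Python sketch below shows how the four conditions could be enumerated and how the language-attributable part of the deceiver's effect could be read off as a difference of differences. The function names and the win rates are hypothetical placeholders, not values from the manuscript.

```python
from itertools import product
from typing import Dict, List, Tuple

Condition = Tuple[bool, bool]  # (channel_enabled, deceiver_present)


def factorial_conditions() -> List[Condition]:
    """All four combinations of channel on/off and deceiver present/absent."""
    return list(product([True, False], repeat=2))


def language_attributable_drop(win_rate: Dict[Condition, float]) -> float:
    """Drop in win rate caused by the deceiver beyond what occurs without language.

    With the channel disabled the deceiver cannot send messages, so any drop
    measured there reflects non-language factors only.
    """
    drop_with_channel = win_rate[(True, False)] - win_rate[(True, True)]
    drop_without_channel = win_rate[(False, False)] - win_rate[(False, True)]
    return drop_with_channel - drop_without_channel


# Placeholder win rates, invented for illustration only.
example = {
    (True, False): 0.70,   # channel on, no deceiver
    (True, True): 0.40,    # channel on, deceiver present
    (False, False): 0.50,  # channel off, no deceiver
    (False, True): 0.45,   # channel off, deceiver present
}

for condition in factorial_conditions():
    print(condition, example[condition])
print(f"language-attributable drop: {language_attributable_drop(example):.2f}")
```

A value near zero would suggest the disruption comes from the base game mechanics rather than from deceptive messages, which is the alternative explanation the referee raises.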
minor comments (2)
  1. [Abstract] Abstract: states that benchmarking was performed but does not preview any concrete outcomes or metrics.
  2. [Agents] Notation for the three agent types and the exact prompt templates used for state serialization should be defined explicitly in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on SMAC-Talk. We agree that the empirical claims require stronger quantitative support and that language-ablated controls are needed to isolate the contribution of the communication channel. We address each major comment below and will incorporate the requested changes in the revised manuscript.

Point-by-point responses
  1. Referee: Experiments section: the manuscript states that experiments 'show effects of reasoning structure, memory, and scale on coordination performance' yet supplies no quantitative metrics, performance tables, error bars, run counts, or statistical details. This absence is load-bearing for the central empirical claim.

    Authors: We acknowledge that the current manuscript does not include the detailed quantitative results, tables, error bars, run counts, or statistical tests needed to support the claims. This was an omission in the presentation of the experiments. In the revised version we will add comprehensive performance tables (including mean coordination metrics such as win rate or cumulative reward), standard deviations as error bars, the number of independent runs per configuration (minimum of five), and appropriate statistical comparisons to substantiate the effects of reasoning structure, memory, and model scale.
    revision: yes

  2. Referee: Environment and Evaluation Scenarios: no ablation is reported that disables the natural language channel (while retaining the observation/action interface and SMAC mechanics) to isolate whether performance differences arise from language-mediated coordination rather than prompt serialization artifacts or residual game dynamics. This control is required to support the claim that the channel provides a faithful probe of coordination and trust.

    Authors: We agree that an ablation disabling the natural language channel while preserving the rest of the SMAC observation and action interface is necessary. We will run and report this control experiment in the revision, directly comparing agent performance with and without the communication channel to demonstrate that coordination gains are attributable to language-mediated interaction rather than other factors.
    revision: yes

  3. Referee: Deceptive-communicator scenarios: the construction assumes language is the dominant deception vector, but without the language-ablated control the results cannot rule out that any observed disruption stems from the underlying partial-observability mechanics rather than the embedded deceiver.

    Authors: We concur that the language-ablated control is also required for the deceptive scenarios. The same ablation will be applied to these settings in the revised manuscript so that any performance drop can be attributed to deceptive language use rather than the base partial-observability mechanics of SMAC.
    revision: yes
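
The reporting promised in the first response (a mean, a standard deviation, and a run count per configuration, with at least five runs) can be sketched in a few lines of Python. The configuration labels and win rates below are invented placeholders, and the helper names are hypothetical; only the five-run minimum comes from the rebuttal text.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

MIN_RUNS = 5  # minimum number of independent runs stated in the rebuttal

Row = Tuple[str, int, float, float]  # (configuration, runs, mean, sample std)


def summarize(runs_by_config: Dict[str, List[float]]) -> List[Row]:
    """Compute per-configuration run count, mean, and sample standard deviation."""
    rows = []
    for config, values in sorted(runs_by_config.items()):
        if len(values) < MIN_RUNS:
            raise ValueError(f"{config}: {len(values)} runs, need at least {MIN_RUNS}")
        rows.append((config, len(values), mean(values), stdev(values)))
    return rows


def format_table(rows: List[Row]) -> str:
    """Render the summary rows as a fixed-width plain-text table."""
    lines = [f"{'configuration':<16}{'runs':>6}{'mean':>8}{'std':>8}"]
    for config, n, m, s in rows:
        lines.append(f"{config:<16}{n:>6}{m:>8.3f}{s:>8.3f}")
    return "\n".join(lines)


# Placeholder per-run win rates, made up for illustration.
example_runs = {
    "config-1": [0.20, 0.40, 0.60, 0.80, 1.00],
    "config-2": [0.55, 0.60, 0.50, 0.65, 0.60, 0.58],
}
print(format_table(summarize(example_runs)))
```

Raising an error on under-sampled configurations, rather than silently reporting them, keeps a table from mixing well-supported and poorly supported cells.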

Circularity Check

0 steps flagged

No circularity: benchmark and scenarios defined independently of any fitted parameters or self-referential derivations.

Full rationale

The paper introduces SMAC-Talk by extending the existing SMAC environment with a natural language channel and constructs evaluation scenarios (including deceptive communicators) as explicit design choices. It then runs external Qwen3.5 models through three provided agent architectures and reports empirical coordination results. No equations, parameter fits, predictions derived from subsets of data, or self-citation chains appear in the derivation of the central claims. The benchmark definition and results are obtained by executing independent models on the released environment, satisfying the criteria for a self-contained empirical contribution with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark-introduction paper; it defines an environment and runs existing models rather than deriving results from mathematical axioms or fitted parameters.

pith-pipeline@v0.9.1-grok · 5709 in / 1278 out tokens · 29960 ms · 2026-06-28T09:40:59.020737+00:00 · methodology

