AVA: Attentive VLM Agent for Mastering StarCraft II

Bernard Ghanem; Guohao Li; Weiyu Ma; Yuqian Fu; Zecheng Zhang

arxiv: 2503.05383 · v7 · submitted 2025-03-07 · 💻 cs.AI · cs.MA

Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, Guohao Li

Pith reviewed 2026-05-23 01:13 UTC · model grok-4.3

keywords StarCraft II · Vision-Language Models · Multi-Agent Reinforcement Learning · Benchmark · Zero-shot · Game AI · Multimodal Agents

The pith

Vision-language models reach 75-90% zero-shot win rates in StarCraft II while trained MARL peaks at 19.3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AVACraft, a multimodal benchmark for StarCraft II that supplies RGB visuals, natural language observations, and structured state data so the same scenarios can test both multi-agent reinforcement learning and vision-language models. It runs six MARL algorithms for 5 million training steps and evaluates multiple VLMs in zero-shot mode across 21 scenarios that cover micromanagement, coordination, and strategic planning. The central result is that VLMs deliver 75-90% win rates with decisions that match human reasoning, while the strongest MARL method reaches only 19.3%. This comparison exposes concrete differences in training cost, performance ceiling, and interpretability between the two paradigms.

Core claim

AVACraft shows that vision-language models can solve a wide range of StarCraft II tasks at high win rates through zero-shot prompting on multimodal inputs, whereas multi-agent reinforcement learning algorithms trained for millions of steps remain limited to low win rates even with strong backbones.

What carries the argument

The AVACraft benchmark that supplies identical RGB, natural-language, and structured-state observations to both MARL training loops and VLM zero-shot inference across the same 21 scenarios.
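
To make the three observation formats concrete, here is a minimal sketch of one observation bundle and a helper that hands an agent only the modalities it is configured to use. The class name, field names, and `select` helper are assumptions made for illustration; they are not taken from the AVACraft code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List

# Hypothetical modality names; the real benchmark may label them differently.
MODALITIES = frozenset({"rgb", "text", "state"})


@dataclass(frozen=True)
class MultimodalObservation:
    """Illustrative container for one timestep (not AVACraft's actual API)."""
    rgb: List[List[List[int]]]   # H x W x 3 pixel grid
    text: str                    # natural-language description of the scene
    state: Dict[str, Any]        # structured game state


def select(obs: MultimodalObservation, modalities: Iterable[str]) -> Dict[str, Any]:
    """Return only the requested modalities, so input choices are explicit."""
    requested = set(modalities)
    unknown = requested - MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return {name: getattr(obs, name) for name in sorted(requested)}


obs = MultimodalObservation(
    rgb=[[[0, 0, 0], [255, 255, 255]]],          # dummy 1 x 2 frame
    text="Two marines face one zergling.",       # made-up description
    state={"ally_units": 2, "enemy_units": 1},   # made-up state
)
pixels_only = select(obs, {"rgb"})
pixels_and_text = select(obs, {"rgb", "text"})
```

Routing every agent through one selection step like this keeps it visible which inputs each paradigm actually receives.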

If this is right

Zero-shot VLM agents can reach higher performance ceilings than MARL agents trained for millions of steps in these environments.
VLM decisions align more closely with human strategies, increasing interpretability of agent behavior.
VLM agents require no environment-specific training, lowering the computational cost of deployment.
The benchmark format makes direct trade-offs between training efficiency, final performance, and interpretability measurable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attentive mechanisms inside VLMs may be especially useful for long-horizon coordination tasks that current MARL struggles with.
The same multimodal setup could be applied to other real-time strategy games to test whether the performance gap holds.
Hybrid agents that combine VLM reasoning with occasional MARL fine-tuning might close remaining gaps in specific micromanagement situations.

Load-bearing premise

The 21 scenarios and three observation formats represent the full range of StarCraft II challenges without giving an unintended advantage to zero-shot VLM prompting over trained MARL methods.

What would settle it

Train the MARL algorithms on the same RGB images and natural-language descriptions given to the VLMs and check whether their win rates remain below 75% across the 21 scenarios.
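
One way to lay out such a control is to enumerate every paradigm and input-modality combination and then pick the matched conditions. The sketch below is only an illustration of that bookkeeping; the paradigm labels, modality names, and function names are assumptions, and nothing here reflects experiments the paper actually ran.

```python
from itertools import combinations, product

PARADIGMS = ("MARL", "VLM")
MODALITIES = ("rgb", "text", "state")  # hypothetical labels


def modality_subsets(modalities):
    """Yield every non-empty subset of the available modalities."""
    for size in range(1, len(modalities) + 1):
        for subset in combinations(modalities, size):
            yield frozenset(subset)


def experiment_grid():
    """All (paradigm, modality subset) conditions."""
    return list(product(PARADIGMS, modality_subsets(MODALITIES)))


def matched_conditions(grid, modalities):
    """Conditions in which each paradigm sees exactly the given modalities."""
    wanted = frozenset(modalities)
    return [(paradigm, mods) for paradigm, mods in grid if mods == wanted]


grid = experiment_grid()
# The control described above: both paradigms on RGB plus language.
control = matched_conditions(grid, {"rgb", "text"})
```

Comparing win rates only within a matched pair separates the effect of the extra input channel from the effect of the paradigm.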

Original abstract

We introduce AVACraft, a multimodal StarCraft II benchmark supporting both Multi-Agent Reinforcement Learning (MARL) and Vision-Language Model (VLM) paradigms. Unlike SMAC-family environments that rely on abstract state representations and exclude VLMs, AVACraft provides RGB visuals, natural language observations, and structured state information, enabling systematic comparison between training-based and zero-shot methods across 21 scenarios spanning micromanagement, coordination, and strategic planning. We establish comprehensive baselines: six MARL algorithms (IQL, QMIX, QTRAN, VDN, MAPPO, IPPO) with Swin-Transformer backbones trained for 5M steps, and multiple VLMs including proprietary (GPT-4o) and open-source (Qwen3-VL) models. Results reveal complementary strengths: MARL peaks at 19.3% win rate after 5M steps, while VLMs achieve 75-90% zero-shot with human-aligned decisions, exposing trade-offs between training efficiency, performance ceilings, interpretability, and deployment cost. Code: https://github.com/camel-ai/VLM-Play-StarCraft2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AVACraft gives a multimodal StarCraft benchmark for VLM vs MARL comparison, but the reported gap rests on asymmetric inputs that favor the VLMs.

The paper's real addition is AVACraft, a StarCraft II setup that supplies RGB, natural language observations, and structured states across 21 scenarios so both MARL and VLM agents can be tested on the same tasks. Prior SMAC environments stayed with abstract states and left VLMs out, so this fills a practical gap for direct comparison of trained agents against zero-shot models like GPT-4o and Qwen3-VL.

The baselines are straightforward: six MARL algorithms with Swin-Transformer backbones run for 5M steps, and the VLMs run zero-shot. The headline numbers show MARL maxing at 19.3% while VLMs reach 75-90% with more human-like behavior. That contrast is worth seeing even if the absolute values need checking.

The soft spot is the input asymmetry the stress-test flags. VLMs get language descriptions on top of visuals; the MARL runs get only pixels. If those descriptions encode task structure that is easy to read but hard to discover from images alone, the performance difference reflects the extra channel more than zero-shot power versus sample efficiency. The abstract also skips variance across runs, statistical tests, and any controls for prompting bias or scenario selection. Without those, the central claim is hard to trust at face value.

This is useful for groups already working on multimodal agents or RTS benchmarks who want a shared testbed. It is coherent enough on its own terms to deserve peer review so the experimental details can be examined, but the current write-up would need tighter controls on inputs and reporting before the gap can be taken as settled.

Referee Report

3 major / 1 minor

Summary. The paper introduces AVACraft, a multimodal StarCraft II benchmark providing RGB visuals, natural language observations, and structured states across 21 scenarios to enable direct comparison of MARL and VLM agents. It establishes baselines showing six MARL algorithms (IQL, QMIX, etc.) with Swin-Transformer backbones reaching a peak win rate of 19.3% after 5M steps, while VLMs (GPT-4o, Qwen3-VL) achieve 75-90% zero-shot win rates with human-aligned decisions, highlighting trade-offs in training efficiency, performance, and interpretability.

Significance. If the central performance comparison holds under controlled conditions, the work provides a useful new benchmark for evaluating zero-shot VLM capabilities against sample-inefficient MARL in a complex multi-agent domain, potentially guiding research on hybrid approaches that combine prompting with learning. The explicit multimodal design is a strength for enabling such comparisons.

major comments (3)

[Abstract] Abstract and results presentation: the headline claims of 75-90% VLM win rates versus 19.3% MARL peak lack any mention of evaluation episode counts, variance across runs, statistical significance tests, or controls for prompting variability, which directly undermines verification of the performance gap.
[Benchmark description / Experimental setup] Benchmark and experimental setup: it is unclear from the description whether the six MARL baselines receive the natural language observations (in addition to RGB) or are restricted to visual input only; if the latter, the reported gap may reflect input asymmetry rather than zero-shot capability versus trained policies.
[Scenarios] Scenario selection: the criteria used to choose the 21 scenarios and to ensure they do not inadvertently favor language-based reasoning over pixel-based discovery are not specified, which is load-bearing for the claim that the results generalize beyond the chosen testbed.
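
The first major comment asks for episode counts and variance. As a rough illustration of why they matter, the sketch below computes a Wilson score interval for a win rate. The episode counts (19 of 100 and 75 of 100) are hypothetical stand-ins chosen to sit near the headline figures; the paper's actual evaluation counts are not given on this page.

```python
import math


def wilson_interval(wins: int, episodes: int, z: float = 1.96):
    """Wilson score interval for a binomial win rate (z=1.96 for about 95%)."""
    if episodes <= 0 or not 0 <= wins <= episodes:
        raise ValueError("need episodes > 0 and 0 <= wins <= episodes")
    p = wins / episodes
    denom = 1 + z * z / episodes
    center = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes + z * z / (4 * episodes ** 2)) / denom
    return center - half, center + half


# Hypothetical counts, NOT reported by the paper.
marl_lo, marl_hi = wilson_interval(19, 100)
vlm_lo, vlm_hi = wilson_interval(75, 100)

# The gap is only "clear" at this level if the intervals do not overlap.
gap_is_clear = marl_hi < vlm_lo
```

With far fewer episodes per scenario the intervals widen considerably, which is why the reported percentages are hard to interpret without the counts.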

minor comments (1)

[Code and reproducibility] The GitHub link is given but the text provides no explicit hyperparameters for the MARL runs or the exact prompting templates used for VLMs, which would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions that will be made to the manuscript.

Referee: [Abstract] Abstract and results presentation: the headline claims of 75-90% VLM win rates versus 19.3% MARL peak lack any mention of evaluation episode counts, variance across runs, statistical significance tests, or controls for prompting variability, which directly undermines verification of the performance gap.

Authors: We agree that the abstract and results presentation would benefit from greater detail on the evaluation protocol. In the revised manuscript we will update the abstract to reference the evaluation episode counts and include variance measures in the main results tables. We will also document the fixed prompting strategy used across VLMs. While formal statistical significance tests were not performed, the magnitude and consistency of the observed gaps will be noted explicitly as supporting evidence. revision: yes

Referee: [Benchmark description / Experimental setup] Benchmark and experimental setup: it is unclear from the description whether the six MARL baselines receive the natural language observations (in addition to RGB) or are restricted to visual input only; if the latter, the reported gap may reflect input asymmetry rather than zero-shot capability versus trained policies.

Authors: The MARL baselines receive only RGB visual input processed by the Swin-Transformer backbone, consistent with standard vision-based MARL practice. Natural language observations are provided in the benchmark to support VLM agents. We will add an explicit clarification in the experimental setup section stating the input modalities used for each paradigm. revision: yes

Referee: [Scenarios] Scenario selection: the criteria used to choose the 21 scenarios and to ensure they do not inadvertently favor language-based reasoning over pixel-based discovery are not specified, which is load-bearing for the claim that the results generalize beyond the chosen testbed.

Authors: We will expand the benchmark description to specify the selection criteria. The 21 scenarios were chosen to cover micromanagement, coordination, and strategic planning tasks drawn from prior MARL literature while ensuring sufficient visual complexity; we will add text explaining how this selection balances requirements for visual perception versus higher-level reasoning. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; purely empirical benchmark results

The paper introduces a new benchmark (AVACraft) and reports direct empirical win-rate measurements for MARL baselines (trained 5M steps) versus zero-shot VLM agents across 21 scenarios. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text or abstract. The central claim (VLM 75-90% vs MARL 19.3%) is a raw performance comparison, not a reduction of any output to its own inputs by construction. Self-citations, if present, are not load-bearing for any claimed derivation. This is the standard case of an empirical methods paper with no circularity risk.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central comparison rests on the unstated assumption that the provided observation modalities and scenario set are unbiased.

pith-pipeline@v0.9.0 · 5740 in / 1052 out tokens · 39706 ms · 2026-05-23T01:13:24.201030+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
cs.AI 2026-06 unverdicted novelty 7.0

RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM perfo...

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
cs.CV 2026-04 unverdicted novelty 7.0

EgoEsportsQA is a new egocentric video QA benchmark from esports matches that shows state-of-the-art Video-LLMs reach only 71.58% accuracy and struggle more with tactical reasoning than basic perception.

Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
cs.CV 2026-04 unverdicted novelty 6.0

Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.