GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

Cong Lin; Jialin Liu; Julian Togelius; Muhammad Umair Nasir; Philip Bontrager; Yuchen Li

arxiv: 2508.08501 · v3 · pith:O22N6UK6new · submitted 2025-08-11 · 💻 cs.AI

Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius

Pith reviewed 2026-05-21 22:34 UTC · model grok-4.3

classification 💻 cs.AI

keywords: GVGAI-LLM · large language models · spatial reasoning · video game AI · zero-shot evaluation · agentic behavior · benchmarking LLMs

The pith

Large language models show consistent limitations in spatial reasoning and planning when tested on 118 diverse video games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GVGAI-LLM as a benchmark for assessing LLMs' reasoning abilities using arcade-style games represented by ASCII grids. Evaluations across many games highlight that models often make spatial and logical mistakes in zero-shot settings. Techniques like structured prompting provide some help, yet the overall performance remains low. This setup allows for creating new games to keep testing fresh and focuses on agent-like behavior in dynamic environments.

Core claim

Through zero-shot evaluations on 118 games, LLMs exhibit persistent spatial and logical errors in handling game scenes and planning actions, with structured prompting and spatial grounding yielding only partial improvements, leaving the benchmark largely unsolved.

What carries the argument

The GVGAI-LLM benchmark, which converts game states to compact ASCII character grids and measures performance via meaningful step ratio, step efficiency, and overall score.

If this is right

Models need improved spatial awareness to succeed in these game tasks.
The ability to generate infinite new games supports long-term evaluation without overfitting.
Current prompting methods can mitigate some but not all reasoning failures.
The benchmark highlights gaps in basic planning for language model agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integrating visual processing with language models could address the spatial shortcomings observed.
This benchmark could be used to track progress in LLM capabilities over time as new models emerge.
Similar game-based tests might apply to other AI systems beyond language models.

Load-bearing premise

ASCII representations of game scenes combined with the three performance metrics truly test general reasoning rather than superficial text understanding.

What would settle it

Finding an LLM that completes a large majority of the 118 games with few spatial or logical errors in zero-shot mode would challenge the reported limitations.

Original abstract

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a video game description language that enables the rapid creation of new games (including rules and levels), helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across 118 games with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. Although these interventions lead to partial improvements, the benchmark remains very far from being solved. GVGAI-LLM serves as a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and spatial reasoning. Furthermore, its ability to generate infinite benchmarks, both manually and procedurally, provides a scalable framework for longitudinal evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GVGAI-LLM adapts an existing game framework for LLMs with ASCII states and procedural generation, but the results do not cleanly separate spatial reasoning limits from input parsing issues.

Colleague, the paper's main contribution is a new benchmark that ports the GVGAI engine to LLMs. It uses compact ASCII grids for game states and the framework's built-in tools for quick game creation, including procedural levels, to produce an effectively infinite set of tests. They run zero-shot evaluations across 118 games and track three straightforward metrics: meaningful step ratio, step efficiency, and overall score. The work also checks a couple of prompting fixes and reports partial gains without claiming the problems are solved.

That setup is practical and directly addresses overfitting risks that plague many static LLM benchmarks. The metrics are easy to interpret, which helps when comparing models or tracking incremental progress on agentic tasks. The authors are clear that the benchmark stays far from solved, which keeps the claims grounded.

The soft spot sits in the input representation. Every state arrives as an ASCII character grid, so spatial and logical errors could come from failing to parse the layout in text rather than from deeper reasoning shortfalls. The metrics treat invalid or off-grid moves as failures without further breakdown, and there are no ablations using coordinate lists, object-centric descriptions, or other encodings. That leaves the central claim about persistent spatial reasoning limits resting on thinner evidence than it first appears.

The paper is aimed at researchers working on LLM agents and evaluation methods for planning and spatial tasks. A reader who wants a reproducible, scalable game-based testbed beyond standard text suites will find usable material here. It deserves peer review because the benchmark construction is concrete and the evaluation scope is broad, even if the interpretation of errors needs tighter controls.
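The letter's point about failures being counted "without further breakdown" can be made concrete with a small sketch. The code below is an illustration only, not the benchmark's scoring code: the action names, the wall symbol `w`, and the `Outcome` categories are all hypothetical choices made here to show how a failed move might be split into off-grid, blocked, and unrecognised cases.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome categories; not taken from the paper."""
    VALID = "valid"
    OFF_GRID = "off_grid"
    BLOCKED = "blocked"
    UNKNOWN_ACTION = "unknown_action"


# Assumed action vocabulary: (row delta, column delta) per action name.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def classify_move(grid, pos, action, wall="w"):
    """Classify one proposed move on a list-of-strings ASCII grid."""
    if action not in MOVES:
        return Outcome.UNKNOWN_ACTION
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    # The row check short-circuits, so grid[r] is only read for valid rows.
    if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
        return Outcome.OFF_GRID
    if grid[r][c] == wall:
        return Outcome.BLOCKED
    return Outcome.VALID


def tally(grid, steps):
    """Count outcomes over an iterable of (position, action) pairs."""
    return Counter(classify_move(grid, pos, act).value for pos, act in steps)
```

A per-category tally of this kind would let a reader see whether a model mostly walks off the map, mostly walks into walls, or mostly emits actions outside the allowed set, which are different failure modes that a single pass/fail count hides.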

Referee Report

1 major / 1 minor

Summary. The manuscript introduces GVGAI-LLM, a benchmark built on the General Video Game AI (GVGAI) framework for evaluating large language model agents. It includes 118 arcade-style games represented as compact ASCII character grids, with new interpretable metrics such as meaningful step ratio, step efficiency, and overall score. Through zero-shot evaluations, the authors claim to reveal persistent limitations in LLMs' spatial reasoning and basic planning abilities, with some partial improvements from structured prompting and spatial grounding techniques, while noting that the benchmark remains far from solved. The framework supports infinite game generation to prevent overfitting.

Significance. If the central findings hold after addressing methodological concerns, this benchmark could serve as an important testbed for assessing agentic and spatial reasoning capabilities in LLMs, distinct from many existing text-based or mathematical benchmarks. The procedural generation aspect and emphasis on reproducibility are positive features that could support longitudinal studies of model progress.

major comments (1)

[Evaluation Setup and State Representation] The paper relies exclusively on ASCII character grids for game state input without providing ablations using alternative encodings such as coordinate lists, object-centric descriptions, or natural language summaries. Since the metrics (meaningful step ratio, step efficiency, overall score) treat invalid or off-grid actions as failures, it is unclear whether observed spatial and logical errors stem from reasoning limitations or from difficulties in parsing and interpreting the textual grid format. This directly impacts the support for the claim of persistent limitations in spatial reasoning and planning.
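To illustrate the kind of ablation this comment asks for, here is a minimal sketch that re-encodes one ASCII grid as a coordinate list and as an object-centric description. It is a hypothetical example, not the paper's code: the grid, the `.` empty-cell symbol, and the `LEGEND` mapping from characters to object names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical symbol legend; the benchmark's real character mapping is not given here.
LEGEND = {"w": "wall", "A": "avatar", "k": "key", "g": "goal"}

# Hypothetical 4x5 level used only for this example.
GRID = """
wwwww
wA.kw
w..gw
wwwww
"""


def grid_to_coordinates(grid, empty="."):
    """Return (symbol, row, col) for every non-empty cell, in row-major order."""
    coords = []
    for r, line in enumerate(grid.strip().splitlines()):
        for c, ch in enumerate(line):
            if ch != empty:
                coords.append((ch, r, c))
    return coords


def object_centric(grid, legend=LEGEND, empty="."):
    """Group cell positions by object name, one text line per object type."""
    groups = defaultdict(list)
    for ch, r, c in grid_to_coordinates(grid, empty):
        # Fall back to the raw character when the legend has no entry for it.
        groups[legend.get(ch, ch)].append((r, c))
    return "\n".join(f"{name}: {cells}" for name, cells in sorted(groups.items()))
```

Feeding the same state to a model in all three forms (raw grid, coordinate list, object-centric text) and comparing the metrics would help separate grid-parsing difficulty from planning difficulty, which is the distinction the comment says the current evidence cannot make.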

minor comments (1)

[Abstract] The abstract summarizes findings from evaluations on 118 games but does not include any quantitative results, error bars, or specific details on game selection and metric computation, which would help readers assess the strength of the claims immediately.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and outline revisions to strengthen the work.

Referee: [Evaluation Setup and State Representation] The paper relies exclusively on ASCII character grids for game state input without providing ablations using alternative encodings such as coordinate lists, object-centric descriptions, or natural language summaries. Since the metrics (meaningful step ratio, step efficiency, overall score) treat invalid or off-grid actions as failures, it is unclear whether observed spatial and logical errors stem from reasoning limitations or from difficulties in parsing and interpreting the textual grid format. This directly impacts the support for the claim of persistent limitations in spatial reasoning and planning.

Authors: We appreciate the referee raising this methodological point. GVGAI-LLM adopts the native ASCII grid representation from the GVGAI framework because it is compact, directly encodes spatial layout, and aligns with how the original framework presents states to agents. All evaluated models receive identical inputs, allowing relative comparisons of performance across the 118 games. The interpretable metrics (e.g., meaningful step ratio) are intended to go beyond raw validity and capture whether actions reflect coherent planning. That said, we acknowledge that without explicit ablations on alternative encodings such as coordinate lists or object-centric descriptions, it remains difficult to fully isolate parsing difficulties from deeper spatial reasoning deficits. In the revised manuscript we will add a dedicated subsection under Evaluation Setup that (1) justifies the ASCII choice on grounds of efficiency and fidelity to the GVGAI standard, (2) discusses the possibility that some errors may arise from textual parsing, and (3) explicitly flags the need for future ablations with richer or more structured state representations. This addition will qualify our claims appropriately while preserving the core finding that current LLMs exhibit persistent difficulties on these tasks.

Revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark and metrics defined independently of evaluation outcomes

The paper introduces GVGAI-LLM as a new benchmark built on the external GVGAI framework, represents states via ASCII grids, and defines three explicit metrics (meaningful step ratio, step efficiency, overall score). Claims of LLM limitations in spatial reasoning and planning are derived directly from zero-shot performance measurements across 118 games rather than from any fitted parameters, self-citations, or prior author results. No derivation chain reduces to its own inputs by construction; the evaluation outcomes serve as independent evidence. This is the normal case for an empirical benchmark paper with external foundations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about LLM text processing and the validity of game-based metrics for measuring reasoning; no free parameters or new invented entities are introduced.

axioms (2)

domain assumption: ASCII grids are sufficient to convey spatial game state to LLMs for decision making
The entire evaluation pipeline depends on this representation choice.

domain assumption: The metrics meaningful step ratio, step efficiency, and overall score capture reasoning and planning ability
These are the basis for all claims about model limitations.

pith-pipeline@v0.9.0 · 5776 in / 1275 out tokens · 36979 ms · 2026-05-21T22:34:05.743743+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Each game scene is represented by a compact set of ASCII characters... meaningful step ratio, step efficiency, and overall score

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: spatial grounding failures... coordinate confusion... hallucinated proximity in sparse layouts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.