Zhang, Thomas L

Alex L · 2025 · cs.AI · arXiv 2505.18134

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

citation-role summary

background 1 dataset 1

citation-polarity summary

background 2

representative citing papers

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

MineExplorer is a new benchmark for MLLM agents' open-world exploration in Minecraft, using task filtering, ReAct formulation, and multi-agent synthesis to create reliable multi-hop instances.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

RTA trains a VLM as a progress ordinal scorer via GRPO on shuffled expert frames and uses Spearman rank correlation with temporal indices as a bounded RL reward, matching or exceeding prior video reward methods on discrete and continuous control benchmarks.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

OmniGameArena is a unified UE5 benchmark with 12 games and the IDC harness for cold-start scores and improvement dynamics of VLM agents.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

cs.CV · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

citing papers explorer

Showing 2 of 2 citing papers after filters.

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

cs.CL · 2026-05-29 · unverdicted · none · ref 6 · internal anchor

MineExplorer is a new benchmark for MLLM agents' open-world exploration in Minecraft, using task filtering, ReAct formulation, and multi-agent synthesis to create reliable multi-hop instances.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · none · ref 115 · internal anchor

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
