GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
GameWorld benchmark shows even top multimodal AI agents fall far short of human performance on 34 video games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GameWorld supplies a benchmark of 34 diverse games and 170 tasks inside browser environments, together with two standardized agent interfaces (direct keyboard-and-mouse control and semantic action parsing), and demonstrates through repeated evaluation of 18 model-interface combinations that current multimodal agents remain far from human capabilities while exposing specific challenges in real-time interaction, memory use, and action validity.
What carries the argument
GameWorld benchmark: 34 games paired with 170 state-verifiable tasks, supporting two agent interfaces (direct control and semantic action parsing) inside a browser environment for outcome-based scoring.
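A minimal sketch, in Python, of what outcome-based scoring with state-verifiable tasks could look like; `Task`, `run_task`, the `env` wrapper, and the agent API are hypothetical names, since the paper describes the protocol but not this code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One GameWorld-style task: a game, a goal, and a success predicate.

    Hypothetical schema: the paper specifies only that each of the 170
    tasks is paired with a state-verifiable metric for outcome-based
    evaluation.
    """
    game_id: str
    instruction: str
    verify_state: Callable[[dict], bool]  # True iff the goal state is reached
    max_steps: int = 500

def run_task(env, agent, task: Task) -> bool:
    """Score one task by outcome: success is decided by the final game
    state, not by heuristics over the agent's action trace."""
    obs = env.reset(task.game_id)
    for _ in range(task.max_steps):
        action = agent.act(obs, task.instruction)  # screenshot + goal -> action
        obs = env.step(action)                     # closed-loop browser interaction
        if task.verify_state(env.get_state()):
            return True
    return False

def success_rate(env, agent, tasks: list[Task]) -> float:
    """Aggregate outcome metric across the benchmark."""
    return sum(run_task(env, agent, t) for t in tasks) / len(tasks)
```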
If this is right
- Agents must improve handling of latency, sparse rewards, and irreversible errors within closed interaction loops.
- Semantic action parsing offers a cleaner interface than raw controls for generalist multimodal models (see the sketch after this list).
- Repeated full-benchmark reruns provide a stable baseline for tracking future progress.
- Targeted studies on context memory and real-time constraints identify concrete bottlenecks for agent design.
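To make the interface distinction concrete, here is a minimal sketch of how deterministic semantic action parsing differs from direct keyboard/mouse control; the action grammar, `parse_semantic_action`, and the keymap are illustrative assumptions, since the paper specifies only that the parsing is deterministic.

```python
import re

# Hypothetical semantic action space for a platformer-style game.
SEMANTIC_ACTIONS = {"move_left", "move_right", "jump", "noop"}

def parse_semantic_action(model_output: str) -> str:
    """Deterministic Semantic Action Parsing (sketch): map the model's
    free-form text onto a fixed action vocabulary, rejecting anything
    outside it. Invalid outputs become explicit no-ops rather than
    malformed key events."""
    match = re.search(r"action:\s*(\w+)", model_output.lower())
    action = match.group(1) if match else "noop"
    return action if action in SEMANTIC_ACTIONS else "noop"

def to_raw_controls(action: str) -> list[tuple[str, str]]:
    """The direct computer-use interface, by contrast, emits raw key events."""
    keymap = {
        "move_left": [("keydown", "ArrowLeft"), ("keyup", "ArrowLeft")],
        "move_right": [("keydown", "ArrowRight"), ("keyup", "ArrowRight")],
        "jump": [("keydown", "Space"), ("keyup", "Space")],
        "noop": [],
    }
    return keymap[action]

print(parse_semantic_action("Action: JUMP over the gap"))  # -> "jump"
```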
Where Pith is reading between the lines
- The benchmark's verifiable metrics could support automated training loops that let agents improve through repeated self-play.
- If the games capture core requirements of embodied interaction, similar evaluation pipelines might transfer to robotic or simulation-based tasks.
- Persistent gaps suggest that simply scaling current models will not close the distance without new mechanisms for long-horizon planning and fine motor control.
Load-bearing premise
The chosen 34 games, 170 tasks, browser setting, and semantic action parsing together constitute a representative test of general multimodal agent abilities without large interface biases or gaps in real-world interaction challenges.
What would settle it
A follow-up run in which the highest-scoring agent reaches human-comparable success rates on at least 80 percent of the 170 tasks (136 or more) across multiple full-benchmark evaluations would indicate the performance gap is closing; consistent sub-human results despite new models would indicate the gap persists.
Original abstract
Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GameWorld, a benchmark for standardized evaluation of multimodal LLM agents as game players in browser environments. It covers 34 games and 170 tasks with state-verifiable outcome metrics, studies two interfaces (direct computer-use via keyboard/mouse emission and semantic actions via deterministic parsing), evaluates 18 model-interface pairs, and reports that even the strongest agents remain far below human performance levels. Additional experiments address benchmark robustness via repeated reruns, real-time interaction, context-memory sensitivity, and action validity.
Significance. If the human comparisons and interface controls are fairly matched, GameWorld supplies a reproducible, verifiable testbed that directly targets the perception-planning-control loop required for embodied agents. The use of closed-loop browser environments, deterministic parsing, and outcome-based verification addresses common heterogeneity problems in game-agent evaluation and provides a concrete platform for measuring progress toward generalist multimodal agents.
Major comments (2)
- [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.
- [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.
Minor comments (2)
- [Abstract] Abstract and methods: Exact human baseline collection protocol, number of human trials, and any error bars or variance measures are not reported, even though the abstract cites 'performance gaps and robustness from repeated reruns.'
- [Introduction] The project page is referenced but the paper should explicitly state which evaluation details (full task lists, human protocols, raw logs) are only available online versus contained in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of GameWorld's contributions. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.
Point-by-point responses
Referee: [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.
Authors: We appreciate this critical observation on ensuring fair human baselines. All human performance data were in fact collected inside the identical browser environment using the same two interfaces (direct keyboard/mouse emission and semantic action parsing) with matched rendering, latency, and control constraints. We regret that this protocol was not stated explicitly in the original Results section. We have revised the manuscript to add a dedicated subsection describing the human evaluation procedure, including participant instructions, interface screenshots, and explicit confirmation that no native desktop controls were used. This change directly resolves the concern and reinforces that the reported performance gap reflects agent limitations in perception, planning, and control. Revision: yes.
Referee: [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.
Authors: We agree that greater transparency on task selection strengthens the benchmark's claims. We have expanded Section 3 with a new subsection titled 'Task Selection Criteria and Benchmark Design.' It now details: genre coverage across 8 categories (action, puzzle, strategy, simulation, etc.) with explicit game examples; difficulty calibration via pilot human playtests and rule-based agent runs to span easy-to-hard tasks with associated completion-time statistics; and bias mitigation by verifying every task is solvable under both interfaces through deterministic simulations and excluding any game where one interface confers an inherent advantage. These additions make the representativeness of the 170 tasks explicit and support the evaluation of generalist multimodal capabilities. Revision: yes.
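To illustrate the bias-mitigation step the rebuttal describes, here is a hedged sketch of per-task metadata with a dual-interface solvability filter; the field names and genre labels are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

# Illustrative genre labels; the rebuttal cites 8 categories, four shown here.
GENRES = {"action", "puzzle", "strategy", "simulation"}

@dataclass
class TaskMeta:
    """Hypothetical per-task metadata for benchmark construction."""
    game: str
    genre: str
    difficulty: float        # e.g., calibrated from pilot human completion times
    solvable_direct: bool    # verified solvable via keyboard/mouse controls
    solvable_semantic: bool  # verified solvable via semantic action parsing

def interface_unbiased(tasks: list[TaskMeta]) -> list[TaskMeta]:
    """Keep only tasks solvable under both interfaces, so that neither
    interface confers an inherent advantage."""
    return [t for t in tasks if t.solvable_direct and t.solvable_semantic]
```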
Circularity Check
No circularity: empirical benchmark with independent human comparisons
Full rationale
The paper introduces GameWorld as a new benchmark with 34 games, 170 tasks, and two agent interfaces (computer-use and semantic-action), then reports direct empirical results across 18 model-interface pairs. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the abstract or described methods. The central claim that agents are far from human capabilities rests on outcome-based metrics and repeated reruns for robustness, not on any reduction to inputs by construction. Human baselines are positioned as external reference points, with the benchmark itself offered as an independent, reproducible evaluation tool via the project page.
Forward citations
Cited by 1 Pith paper
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond: Proposes a levels × laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...