GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
GameWorld benchmark shows even top multimodal AI agents fall far short of human performance on 34 video games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GameWorld supplies a benchmark of 34 diverse games and 170 tasks inside browser environments, together with two standardized agent interfaces (direct keyboard-and-mouse control and semantic action parsing), and demonstrates through repeated evaluation of 18 model-interface combinations that current multimodal agents remain far from human capabilities while exposing specific challenges in real-time interaction, memory use, and action validity.
What carries the argument
GameWorld benchmark: 34 games paired with 170 state-verifiable tasks, supporting two agent interfaces (direct control and semantic action parsing) inside a browser environment for outcome-based scoring.
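A minimal sketch, in Python, of what outcome-based scoring with state-verifiable tasks could look like; `Task`, `run_task`, the `env` wrapper, and the agent API are hypothetical names, since the paper describes the protocol but not this code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One GameWorld-style task: a game, a goal, and a success predicate.

    Hypothetical schema: the paper specifies only that each of the 170
    tasks is paired with a state-verifiable metric for outcome-based
    evaluation.
    """
    game_id: str
    instruction: str
    verify_state: Callable[[dict], bool]  # True iff the goal state is reached
    max_steps: int = 500

def run_task(env, agent, task: Task) -> bool:
    """Score one task by outcome: success is decided by the final game
    state, not by heuristics over the agent's action trace."""
    obs = env.reset(task.game_id)
    for _ in range(task.max_steps):
        action = agent.act(obs, task.instruction)  # screenshot + goal -> action
        obs = env.step(action)                     # closed-loop browser interaction
        if task.verify_state(env.get_state()):
            return True
    return False

def success_rate(env, agent, tasks: list[Task]) -> float:
    """Aggregate outcome metric across the benchmark."""
    return sum(run_task(env, agent, t) for t in tasks) / len(tasks)
```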
If this is right
- Agents must improve handling of latency, sparse rewards, and irreversible errors within closed interaction loops.
- Semantic action parsing offers a cleaner interface than raw controls for generalist multimodal models (see the sketch after this list).
- Repeated full-benchmark reruns provide a stable baseline for tracking future progress.
- Targeted studies on context memory and real-time constraints identify concrete bottlenecks for agent design.
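To make the interface distinction concrete, here is a minimal sketch of how deterministic semantic action parsing differs from direct keyboard/mouse control; the action grammar, `parse_semantic_action`, and the keymap are illustrative assumptions, since the paper specifies only that the parsing is deterministic.

```python
import re

# Hypothetical semantic action space for a platformer-style game.
SEMANTIC_ACTIONS = {"move_left", "move_right", "jump", "noop"}

def parse_semantic_action(model_output: str) -> str:
    """Deterministic Semantic Action Parsing (sketch): map the model's
    free-form text onto a fixed action vocabulary, rejecting anything
    outside it. Invalid outputs become explicit no-ops rather than
    malformed key events."""
    match = re.search(r"action:\s*(\w+)", model_output.lower())
    action = match.group(1) if match else "noop"
    return action if action in SEMANTIC_ACTIONS else "noop"

def to_raw_controls(action: str) -> list[tuple[str, str]]:
    """The direct computer-use interface, by contrast, emits raw key events."""
    keymap = {
        "move_left": [("keydown", "ArrowLeft"), ("keyup", "ArrowLeft")],
        "move_right": [("keydown", "ArrowRight"), ("keyup", "ArrowRight")],
        "jump": [("keydown", "Space"), ("keyup", "Space")],
        "noop": [],
    }
    return keymap[action]

print(parse_semantic_action("Action: JUMP over the gap"))  # -> "jump"
```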
Where Pith is reading between the lines
- The benchmark's verifiable metrics could support automated training loops that let agents improve through repeated self-play.
- If the games capture core requirements of embodied interaction, similar evaluation pipelines might transfer to robotic or simulation-based tasks.
- Persistent gaps suggest that simply scaling current models will not close the distance without new mechanisms for long-horizon planning and fine motor control.
Load-bearing premise
The chosen 34 games, 170 tasks, browser setting, and semantic action parsing together constitute a representative test of general multimodal agent abilities without large interface biases or gaps in real-world interaction challenges.
What would settle it
A follow-up run in which the highest-scoring agent reaches human-comparable success rates on at least 80 percent of the 170 tasks (136 or more) across multiple full-benchmark evaluations would indicate the performance gap is closing; consistent sub-human results despite new models would indicate the gap persists.
Original abstract
Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GameWorld, a benchmark for standardized evaluation of multimodal LLM agents as game players in browser environments. It covers 34 games and 170 tasks with state-verifiable outcome metrics, studies two interfaces (direct computer-use via keyboard/mouse emission and semantic actions via deterministic parsing), evaluates 18 model-interface pairs, and reports that even the strongest agents remain far below human performance levels. Additional experiments address benchmark robustness via repeated reruns, real-time interaction, context-memory sensitivity, and action validity.
Significance. If the human comparisons and interface controls are fairly matched, GameWorld supplies a reproducible, verifiable testbed that directly targets the perception-planning-control loop required for embodied agents. The use of closed-loop browser environments, deterministic parsing, and outcome-based verification addresses common heterogeneity problems in game-agent evaluation and provides a concrete platform for measuring progress toward generalist multimodal agents.
Major comments (2)
- [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.
- [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.
Minor comments (2)
- [Abstract] Abstract and methods: Exact human baseline collection protocol, number of human trials, and any error bars or variance measures are not reported, even though the abstract cites 'performance gaps and robustness from repeated reruns.'
- [Introduction] The project page is referenced but the paper should explicitly state which evaluation details (full task lists, human protocols, raw logs) are only available online versus contained in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of GameWorld's contributions. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.
Point-by-point responses
Referee: [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.
Authors: We appreciate this critical observation on ensuring fair human baselines. All human performance data were in fact collected inside the identical browser environment using the same two interfaces (direct keyboard/mouse emission and semantic action parsing) with matched rendering, latency, and control constraints. We regret that this protocol was not stated explicitly in the original Results section. We have revised the manuscript to add a dedicated subsection describing the human evaluation procedure, including participant instructions, interface screenshots, and explicit confirmation that no native desktop controls were used. This change directly resolves the concern and reinforces that the reported performance gap reflects agent limitations in perception, planning, and control. Revision: yes.
Referee: [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.
Authors: We agree that greater transparency on task selection strengthens the benchmark's claims. We have expanded Section 3 with a new subsection titled 'Task Selection Criteria and Benchmark Design.' It now details: genre coverage across 8 categories (action, puzzle, strategy, simulation, etc.) with explicit game examples; difficulty calibration via pilot human playtests and rule-based agent runs to span easy-to-hard tasks with associated completion-time statistics; and bias mitigation by verifying every task is solvable under both interfaces through deterministic simulations and excluding any game where one interface confers an inherent advantage. These additions make the representativeness of the 170 tasks explicit and support the evaluation of generalist multimodal capabilities. Revision: yes.
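To illustrate the bias-mitigation step the rebuttal describes, here is a hedged sketch of per-task metadata with a dual-interface solvability filter; the field names and genre labels are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

# Illustrative genre labels; the rebuttal cites 8 categories, four shown here.
GENRES = {"action", "puzzle", "strategy", "simulation"}

@dataclass
class TaskMeta:
    """Hypothetical per-task metadata for benchmark construction."""
    game: str
    genre: str
    difficulty: float        # e.g., calibrated from pilot human completion times
    solvable_direct: bool    # verified solvable via keyboard/mouse controls
    solvable_semantic: bool  # verified solvable via semantic action parsing

def interface_unbiased(tasks: list[TaskMeta]) -> list[TaskMeta]:
    """Keep only tasks solvable under both interfaces, so that neither
    interface confers an inherent advantage."""
    return [t for t in tasks if t.solvable_direct and t.solvable_semantic]
```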
Circularity Check
No circularity: empirical benchmark with independent human comparisons
Full rationale
The paper introduces GameWorld as a new benchmark with 34 games, 170 tasks, and two agent interfaces (computer-use and semantic-action), then reports direct empirical results across 18 model-interface pairs. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the abstract or described methods. The central claim that agents are far from human capabilities rests on outcome-based metrics and repeated reruns for robustness, not on any reduction to inputs by construction. Human baselines are positioned as external reference points, with the benchmark itself offered as an independent, reproducible evaluation tool via the project page.
Forward citations
Cited by 1 Pith paper
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond: Proposes a levels × laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...