pith. sign in

arxiv: 2506.03610 · v3 · submitted 2025-06-04 · 💻 cs.AI

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Pith reviewed 2026-05-19 11:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentsvideo game benchmarkfine-tuning datasetagentic strategiesgame genresevaluation frameworkModel Context ProtocolLLM battle arenas
0
0 comments X

The pith

Orak benchmark supplies tools and data to train and test LLM agents across 12 video game genres.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Orak to overcome gaps in existing benchmarks that fail to test LLM agents across varied game genres or to study key agent components and fine-tuning needs. It covers 12 popular games from all major genres and supplies a plug-and-play interface plus a dataset of expert gameplay trajectories for adapting general LLMs into game agents. Evaluations include leaderboards, direct LLM-versus-LLM competitions, and targeted experiments on input types, strategies, and fine-tuning. A sympathetic reader would value this because it turns isolated game tests into a repeatable platform that can reveal what actually improves agent performance in practice.

Core claim

Orak establishes a foundation towards versatile gaming agents by providing a united evaluation framework including game leaderboards, LLM battle arenas, and ablation studies of input modality, agentic strategies, and fine-tuning effects, supported by a plug-and-play interface built on Model Context Protocol and a released fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres.

What carries the argument

The plug-and-play interface built on Model Context Protocol that supports systematic and reproducible studies of agentic modules across the 12 games.

If this is right

  • Leaderboards can rank LLM agents by performance across multiple game genres in one shared setting.
  • Ablation experiments can isolate which input modalities and agent strategies produce the largest gains.
  • The expert trajectory dataset enables fine-tuning that converts general LLMs into stronger game players.
  • LLM battle arenas make head-to-head comparisons between different agent designs straightforward.
  • The overall setup allows researchers to run controlled tests on how agents handle diverse gameplay demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A shared benchmark of this form could make incremental progress in game agents easier to track and compare over time.
  • It may support later work on whether skills learned in one genre transfer to others without retraining.
  • Adding measures of human preference or enjoyment to the evaluations could reveal what players actually value in agent behavior.

Load-bearing premise

The plug-and-play interface built on Model Context Protocol supports systematic and reproducible studies of agentic modules across the 12 games without significant technical barriers or loss of fidelity in gameplay representation.

What would settle it

A study that finds the interface demands game-specific fixes or produces non-reproducible agent behaviors across the 12 titles would show the unified framework does not deliver consistent evaluations.

Figures

Figures reproduced from arXiv: 2506.03610 by Ameya S. Mahabaleshwarkar, Beongjun Choi, Bilal Kartal, Byeong-Uk Lee, Dongmin Park, Inkyu Park, Jaewoo Ahn, Jaewoong Cho, Jaeyoung Hwang, Jonghyun Lee, Junhyuck Kim, Kangwook Lee, Keon Lee, Minkyu Kim, Pritam Biswas, Yoshi Suhara.

Figure 1
Figure 1. Figure 1: Overview of Orak, a benchmark designed to evaluate LLM agents across 12 real-world [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Evaluation pipeline of Orak. Game scores are computed via [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: LLM capabilities required to play 12 games in Orak. The color theme (red, yellow, etc) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Match outcomes and Elo ratings for LLMs in two competitive environments. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Two of playable characters in Street Fighter III. Environment. Street Fighter III [53] is a 2D competitive fighting game, known for precise controls, deep mechanics, and a diverse ros￾ter of characters. Each character features unique moves, combos, and super arts, requiring precise timing and strategic decision-making. Players aim to defeat their opponent through a mix of normal attacks, special moves, and… view at source ↗
Figure 6
Figure 6. Figure 6: Character detection using YOLOv11 model [75] in Street Fighter III. Observation-to-Text Conversion. The Diambra environment offers a convenient interface for ex￾tracting the game state from Street Fighter III. Through this interface, we obtain the latest game frame at a resolution of 224×384, along with key state information such as remaining time, player and opponent health, super bar gauge, super count, … view at source ↗
Figure 7
Figure 7. Figure 7: Planning prompt for ‘reflection-planning’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Action inference prompt for ‘reflection-planning’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Reflection prompt for ‘reflection-planning’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Screenshot and assets of Super Mario. Environment. Super Mario (1985 Super Mario Bros) [54] is a side-scrolling game where the player controls Mario to avoid obstacles, defeat monsters, and reach the flag. In this environ￾ment, Mario progresses through the game using directional key controls (e.g., ‘left’ and ‘right’ keys) and jump actions. Mario should either destroy or traverse obstacles (e.g., bricks, … view at source ↗
Figure 11
Figure 11. Figure 11: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Screenshot of Episode 1: The First Turnabout. Environment. Ace Attorney [55] is a courtroom adven￾ture game where players act as defense attorneys, gather evidence, and cross-examine witnesses. We target the first episode of Phoenix Wright: Ace Attorney Trilogy on Steam (see [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗
Figure 14
Figure 14. Figure 14: For GPT-4o and Gemini-2.5-pro, using both text and image inputs underperforms compared [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Action inference system prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Action inference user prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p035_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p038_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p040_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Action Space. Following the BurnySc2/python-sc2 implementation, We define the action space as a discrete set of 72 high-level commands specifically for the Protoss race. These include unit training (e.g., Probes, Zealots, and Stalkers), building construction (e.g., Pylons and Gateways), research upgrades, scouting, multi-unit attacks or retreats, and special abilities (e.g., Chrono Boost). A complete list… view at source ↗
Figure 20
Figure 20. Figure 20: Action inference prompt for ‘zero-shot’ agent playing StarCraft II. [PITH_FULL_IMAGE:figures/full_fig_p043_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p045_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Level 1 of Baba Is You. Environment. Baba Is You is a puzzle game in which players must discover and understand ev￾ery rule and mechanic on their own, apart from the basic movement keys (‘left’, ‘right’, ‘up’, and ‘down’) [64]. The game’s defining feature is that the text tiles forming the rules can be pushed around, allowing the player to rewrite those rules on the fly. Every valid rule sentence must con… view at source ↗
Figure 23
Figure 23. Figure 23: Action inference prompt for ‘zero-shot’ agent playing [PITH_FULL_IMAGE:figures/full_fig_p048_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Action inference prompt for ‘zero-shot’ agent playing 2048. [PITH_FULL_IMAGE:figures/full_fig_p050_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Three different runs for o3 zero-shot agent playing [PITH_FULL_IMAGE:figures/full_fig_p052_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Output of OpenAI’s o3 model right before producing the 1024 tile in [PITH_FULL_IMAGE:figures/full_fig_p052_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Example of original system prompt and augmented system prompt in [PITH_FULL_IMAGE:figures/full_fig_p053_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Comparison of seen (left) and unseen (right) scenarios for six games. [PITH_FULL_IMAGE:figures/full_fig_p056_28.png] view at source ↗
read the original abstract

Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fix{ablation studies} of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Orak, a benchmark for training and evaluating LLM agents across 12 video games spanning major genres. It features a plug-and-play interface based on the Model Context Protocol (MCP) to enable systematic studies of agentic modules, releases a fine-tuning dataset of expert gameplay trajectories, and provides a unified evaluation framework including game leaderboards, LLM battle arenas, and ablation studies on input modalities, agentic strategies, and fine-tuning effects.

Significance. If the central claims hold, Orak would address key gaps in existing game benchmarks by offering diversity across genres, support for agentic module analysis, and fine-tuning resources. The open release of code on GitHub and datasets on Hugging Face, combined with the framework for leaderboards and ablations, represents a concrete strength that could enable reproducible progress toward versatile LLM gaming agents.

major comments (1)
  1. [Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.
minor comments (2)
  1. [Abstract] Abstract: The text contains a LaTeX placeholder 'and ablations studies' that should be replaced with the intended phrasing.
  2. [Introduction or Benchmark Description] The manuscript would benefit from an explicit table or section listing the 12 games, their genres, and key mechanics to allow readers to assess coverage and potential fidelity issues directly.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, particularly on the importance of substantiating the fidelity claims for the MCP interface. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.

    Authors: We appreciate this observation and agree that explicit examples and discussion of fidelity are necessary to support the benchmark's claims. The MCP interface uses carefully engineered text serializations that include discretized positions, velocities, object states, and relevant environmental variables for each game to capture core mechanics. However, the current manuscript describes these representations at a high level in the methods and provides limited game-specific details without concrete serialized examples or quantitative fidelity checks against visual or human baselines. In the revised version we will add a new subsection with concrete state-representation examples drawn from at least two games that involve continuous physics and spatial reasoning, together with a short discussion of design choices made to mitigate information loss. We will also note any remaining limitations for visual-cue-heavy titles. These additions will directly strengthen the abstract claim and the supporting evidence for the modality ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: Orak is a new benchmark resource whose contributions are self-contained

full rationale

The paper introduces a benchmark suite, MCP-based interface, expert trajectories dataset, and empirical leaderboards/ablation studies across 12 games. No derivation chain, equations, or first-principles predictions are claimed; the central outputs (leaderboards, fine-tuning effects, modality comparisons) are direct experimental results on the released resources rather than reductions of fitted parameters or self-citations. The MCP interface and game coverage are presented as engineering contributions whose fidelity is evaluated empirically, not assumed via prior self-referential theorems. This matches the default expectation for a resource/benchmark paper and receives the lowest circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the creation of new benchmark infrastructure and data rather than additional free parameters or new physical entities.

axioms (1)
  • domain assumption LLM agents can benefit from fine-tuning on expert trajectories
    The paper relies on this to justify releasing the dataset as turning general LLMs into effective game agents.

pith-pipeline@v0.9.0 · 5812 in / 1271 out tokens · 43613 ms · 2026-05-19T11:57:13.465660+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  2. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  3. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

    cs.AI 2026-04 unverdicted novelty 6.0

    STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.

  4. RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...

  5. Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

    cs.AI 2025-08 unverdicted novelty 6.0

    Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weakness...

  6. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  7. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 6 Pith papers · 20 internal anchors

  1. [1]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  2. [2]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  3. [3]

    Measuring Coding Challenge Competence With APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

  4. [4]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Bench- marking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024

  5. [5]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

  6. [6]

    WebCanvas: Benchmarking Web Agents in Online Environments

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024

  7. [7]

    St- webagentbench: A benchmark for evaluating safety and trustworthiness in web agents

    Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, and Segev Shlomov. St- webagentbench: A benchmark for evaluating safety and trustworthiness in web agents. arXiv preprint arXiv:2410.06703, 2024

  8. [8]

    Towards a realistic long-term benchmark for open-web research agents

    Peter Mühlbacher, Nikos I Bosse, and Lawrence Phillips. Towards a realistic long-term benchmark for open-web research agents. arXiv preprint arXiv:2409.14913, 2024

  9. [9]

    Datascibench: An llm agent benchmark for data science

    Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, and Yisong Yue. Datascibench: An llm agent benchmark for data science. arXiv preprint arXiv:2502.13897, 2025

  10. [10]

    Introducing nvidia ace for games - spark life into virtual charac- ters with generative ai

    NVIDIA. Introducing nvidia ace for games - spark life into virtual charac- ters with generative ai. https://www.nvidia.com/en-us/geforce/news/ nvidia-ace-for-games-generative-ai-npcs/ , 2025. Accessed: 2025-05-13

  11. [11]

    A survey on large language model-based game agents

    Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Tekin, Gaowen Liu, Ramana Kompella, and Ling Liu. A survey on large language model-based game agents. arXiv preprint arXiv:2404.02039, 2024

  12. [12]

    Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions. arXiv preprint arXiv:2503.23278, 2025

  13. [13]

    Inter- active fiction games: A colossal adventure

    Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Inter- active fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020

  14. [14]

    game over

    Chen Feng Tsai, Xiaochen Zhou, Sierra S Liu, Jing Li, Mo Yu, and Hongyuan Mei. Can large language models play text games well? current state-of-the-art and open questions. arXiv preprint arXiv:2304.02868, 2023

  15. [15]

    Adapt: As-needed decomposition and planning with language models.arXiv preprint arXiv:2311.05772, 2023

    Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023

  16. [16]

    Chessgpt: Bridging policy learning and language modeling

    Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. Chessgpt: Bridging policy learning and language modeling. Advances in Neural Information Processing Systems, 36:7216–7262, 2023. 10

  17. [17]

    The nethack learning environment

    Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The nethack learning environment. Advances in Neural Information Processing Systems, 33:7671–7684, 2020

  18. [18]

    arXiv preprint arXiv:2109.06780 , year=

    Danijar Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv:2109.06780, 2021

  19. [19]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022

  20. [20]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  21. [21]

    Civrealm: A learning and reasoning odyssey in civilization for decision-making agents

    Siyuan Qi, Shuo Chen, Yexin Li, Xiangyu Kong, Junqi Wang, Bangcheng Yang, Pring Wong, Yifan Zhong, Xiaoyuan Zhang, Zhaowei Zhang, et al. Civrealm: A learning and reasoning odyssey in civilization for decision-making agents. arXiv preprint arXiv:2401.10568, 2024

  22. [22]

    Pokéllmon: A human-parity agent for pokémon battles with large language models

    Sihao Hu, Tiansheng Huang, and Ling Liu. Pokéllmon: A human-parity agent for pokémon battles with large language models. arXiv preprint arXiv:2402.01118, 2024

  23. [23]

    Large language models play starcraft ii: Benchmarks and a chain of summarization approach

    Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang. Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386–133442, 2024

  24. [24]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  25. [25]

    J., L AM, M

    Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Michael R Lyu. How far are we on the decision- making of llms? evaluating llms’ gaming ability in multi-agent environments.arXiv preprint arXiv:2403.11807, 2024

  26. [26]

    Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

    Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv preprint arXiv:2406.06613, 2024

  27. [27]

    Gamearena: Evaluating llm reasoning through live computer games

    Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, and Hao Zhang. Gamearena: Evaluating llm reasoning through live computer games. arXiv preprint arXiv:2412.06394, 2024

  28. [28]

    Mitchell, and Yuanzhi Li

    Yue Wu, Xuan Tang, Tom M Mitchell, and Yuanzhi Li. Smartplay: A benchmark for llms as intelligent agents. arXiv preprint arXiv:2310.01557, 2023

  29. [29]

    Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

    Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, et al. Balrog: Bench- marking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

  30. [30]

    Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

    Xinyu Wang, Bohan Zhuang, and Qi Wu. Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

  31. [31]

    Karlsson, Bo An, and Zongqing Lu

    Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

  32. [32]

    V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

    Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, and Lijuan Wang. V-mage: A game evaluation framework for assessing visual-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148, 2025

  33. [33]

    DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

    Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, and Liquan Xiao. Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047, 2025. 11

  34. [34]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  35. [35]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  36. [36]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  37. [37]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  38. [38]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  39. [39]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pages 9118–9147. PMLR, 2022

  40. [40]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

  41. [41]

    Large language models as commonsense knowledge for large-scale task planning

    Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems, 36:31967– 31987, 2023

  42. [42]

    Fireact: Toward language agent fine-tuning

    Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023

  43. [43]

    Agenttuning: Enabling generalized agent abilities for llms

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023

  44. [44]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881, 2024

  45. [45]

    Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories

    Yifan Song, Weimin Xiong, Xiutian Zhao, Dawei Zhu, Wenhao Wu, Ke Wang, Cheng Li, Wei Peng, and Sujian Li. Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories. arXiv preprint arXiv:2410.07706, 2024

  46. [46]

    Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

    Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments. arXiv preprint arXiv:2501.10893, 2025

  47. [47]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

  48. [48]

    Agile: A novel reinforcement learning framework of llm agents

    Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. Agile: A novel reinforcement learning framework of llm agents. arXiv preprint arXiv:2405.14751, 2024

  49. [49]

    Reinforcement learning for long-horizon interactive llm agents.arXiv preprint arXiv:2502.01600, 2025

    Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025. 12

  50. [50]

    Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024

  51. [51]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

  52. [52]

    Gonzalez, and Ion Stoica

    Shiyi Cao, Sumanth Hegde, Dacheng Li, Tyler Griggs, Shu Liu, Eric Tang, Jiayi Pan, Xingyao Wang, Akshay Malik, Graham Neubig, Kourosh Hakhamaneshi, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Skyrl-v0: Train real-world long-horizon agents via reinforcement learning, 2025

  53. [53]

    Street Fighter III: 3rd Strike

    Capcom. Street Fighter III: 3rd Strike. https://streetfighter.fandom.com/wiki/ Street_Fighter_III:_3rd_Strike, 1997. Accessed: 2025-05-12

  54. [54]

    Super Mario Bros for OpenAI Gym

    Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/ gym-super-mario-bros , 2018. Accessed: 2025-05-12

  55. [55]

    Phoenix Wright: Ace Attorney

    Capcom. Phoenix Wright: Ace Attorney. https://aceattorney.fandom.com/wiki/ Phoenix_Wright:_Ace_Attorney, 2001. Accessed: 2025-05-12

  56. [56]

    Her Story

    Sam Barlow. Her Story. https://www.herstorygame.com, 2015. Accessed: 2025-05-12

  57. [57]

    Pokémon Red Version

    Game Freak. Pokémon Red Version. https://pokemon.fandom.com/wiki/Pok%C3% A9mon_Red_and_Blue_Versions, 1996. Accessed: 2025-05-12

  58. [58]

    Darkest Dungeon

    Red Hook Studios. Darkest Dungeon. https://www.darkestdungeon.com, 2016. Accessed: 2025-05-12

  59. [59]

    Minecraft

    Mojang Studios. Minecraft. https://www.minecraft.net, 2011. Accessed: 2025-05-12

  60. [60]

    PrismarineJS/mineflayer: Create Minecraft bots with a powerful, stable, and high-level JavaScript API

    PrismarineJS contributors. PrismarineJS/mineflayer: Create Minecraft bots with a powerful, stable, and high-level JavaScript API. https://github.com/PrismarineJS/mineflayer,

  61. [61]

    Accessed: 2025-05-01

  62. [62]

    Stardew Valley

    ConcernedApe. Stardew Valley. https://www.stardewvalley.net, 2016. Accessed: 2025- 05-12

  63. [63]

    StarCraft II

    Blizzard Entertainment. StarCraft II. https://starcraft2.com, 2010. Accessed: 2025-05- 12

  64. [64]

    Slay the Spire

    MegaCrit. Slay the Spire. https://www.megacrit.com, 2017. Accessed: 2025-05-12

  65. [65]

    Baba is you

    Hempuli. Baba is you. https://hempuli.com/baba/, 2019. Accessed: 2025-05-12

  66. [66]

    Gabriele Cirulli. 2048. https://play2048.co/, 2014. Accessed: 2025-05-12

  67. [67]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  68. [68]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

  69. [69]

    Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

    Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach. arXiv preprint arXiv:2408.11796, 2024

  70. [70]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 13

  71. [71]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  72. [72]

    Claude 3.7 sonnet: Our most capable model yet

    Anthropic. Claude 3.7 sonnet: Our most capable model yet. https://www.anthropic.com/ news/claude-3-7-sonnet , 2025. Accessed: 2025-05-08

  73. [73]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  74. [74]

    List of video game genres

    Wikipedia contributors. List of video game genres. https://en.wikipedia.org/wiki/ List_of_video_game_genres, 2025. Accessed: 2025-05-22

  75. [75]

    DIAMBRA: Reinforcement Learning Platform for Competitive Video Games

    DIAMBRA. DIAMBRA: Reinforcement Learning Platform for Competitive Video Games. https://www.diambra.ai/, 2025. Accessed: 2025-05-22

  76. [76]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

  77. [77]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952

  78. [78]

    Super Mario Bros for OpenAI Gym

    Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/ gym-super-mario-bros , 2018. Accessed: 2025-05-21

  79. [79]

    Harmony: A library for patching, replacing and decorating .net and mono methods during runtime

    Andreas Pardeike. Harmony: A library for patching, replacing and decorating .net and mono methods during runtime. https://github.com/pardeike/Harmony, 2025. Accessed: 2025- 05-21

  80. [80]

    Bepinex: Unity / xna game patcher and plugin framework

    BepInEx Contributors. Bepinex: Unity / xna game patcher and plugin framework. https: //github.com/BepInEx/BepInEx, 2025. Accessed: 2025-05-21

Showing first 80 references.