Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Ameya S. Mahabaleshwarkar; Beongjun Choi; Bilal Kartal; Byeong-Uk Lee; Dongmin Park; Inkyu Park; Jaewoo Ahn; Jaewoong Cho; Jaeyoung Hwang; Jonghyun Lee

arxiv: 2506.03610 · v3 · submitted 2025-06-04 · 💻 cs.AI

Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho

Pith reviewed 2026-05-19 11:57 UTC · model grok-4.3

keywords: LLM agents · video game benchmark · fine-tuning dataset · agentic strategies · game genres · evaluation framework · Model Context Protocol · LLM battle arenas

The pith

The Orak benchmark supplies tools and data to train and test LLM agents across 12 video games spanning all major genres.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Orak to overcome gaps in existing benchmarks that fail to test LLM agents across varied game genres or to study key agent components and fine-tuning needs. It covers 12 popular games from all major genres and supplies a plug-and-play interface plus a dataset of expert gameplay trajectories for adapting general LLMs into game agents. Evaluations include leaderboards, direct LLM-versus-LLM competitions, and targeted experiments on input types, strategies, and fine-tuning. A sympathetic reader would value this because it turns isolated game tests into a repeatable platform that can reveal what actually improves agent performance in practice.

Core claim

Orak establishes a foundation for versatile gaming agents by providing a unified evaluation framework that includes game leaderboards, LLM battle arenas, and ablation studies of input modality, agentic strategies, and fine-tuning effects. The framework is supported by a plug-and-play interface built on Model Context Protocol and a released fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres.

What carries the argument

The plug-and-play interface built on Model Context Protocol that supports systematic and reproducible studies of agentic modules across the 12 games.
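To make the plug-and-play idea concrete, here is a minimal sketch of how a game wrapper and an agentic module could be decoupled behind two small contracts, so either side can be swapped without touching the other. It does not use the MCP SDK and is not Orak's actual interface: `GameEnv`, `Agent`, `CorridorEnv`, `AlwaysRightAgent`, and `run_episode` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class GameEnv(Protocol):
    """Hypothetical contract a game wrapper would expose (not Orak's API)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...


class Agent(Protocol):
    """Hypothetical contract for an agentic module (zero-shot, reflection, ...)."""

    def act(self, observation: str) -> str: ...


@dataclass
class CorridorEnv:
    """Toy stand-in for a game: walk right until the flag is reached."""

    length: int = 5
    position: int = 0

    def _observe(self) -> str:
        return f"position={self.position} flag={self.length}"

    def reset(self) -> str:
        self.position = 0
        return self._observe()

    def step(self, action: str) -> tuple[str, float, bool]:
        if action == "right":
            self.position = min(self.position + 1, self.length)
        elif action == "left":
            self.position = max(self.position - 1, 0)
        done = self.position == self.length
        # Reward is 1.0 only on the step that reaches the flag.
        return self._observe(), (1.0 if done else 0.0), done


class AlwaysRightAgent:
    """Placeholder policy; a real agent would query an LLM here."""

    def act(self, observation: str) -> str:
        return "right"


def run_episode(env: GameEnv, agent: Agent, max_steps: int = 50) -> float:
    """Run one episode and return the accumulated score."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        observation, reward, done = env.step(agent.act(observation))
        total += reward
        if done:
            break
    return total


score = run_episode(CorridorEnv(length=5), AlwaysRightAgent())
```

Because `run_episode` depends only on the two protocols, the same loop could drive any environment or agent that satisfies them, which is the property a reproducible cross-game study needs.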

If this is right

Leaderboards can rank LLM agents by performance across multiple game genres in one shared setting.
Ablation experiments can isolate which input modalities and agent strategies produce the largest gains.
The expert trajectory dataset enables fine-tuning that converts general LLMs into stronger game players.
LLM battle arenas make head-to-head comparisons between different agent designs straightforward.
The overall setup allows researchers to run controlled tests on how agents handle diverse gameplay demands.
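The battle-arena results are reported as Elo ratings (see the Figure 4 caption). As a rough illustration of how head-to-head outcomes turn into ratings, the sketch below applies the standard Elo update to a small match log. The K-factor of 32, the starting rating of 1500, and the names `model_a` and `model_b` are assumptions for illustration; this page does not state the settings the paper used.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical match log: (player A, player B, score for A).
matches = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.5),
    ("model_b", "model_a", 0.0),
]

ratings = {"model_a": 1500.0, "model_b": 1500.0}  # assumed starting rating
for a, b, score_a in matches:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
```

The update is zero-sum, so the total rating across both players stays constant; only the gap between them moves.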

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A shared benchmark of this form could make incremental progress in game agents easier to track and compare over time.
It may support later work on whether skills learned in one genre transfer to others without retraining.
Adding measures of human preference or enjoyment to the evaluations could reveal what players actually value in agent behavior.

Load-bearing premise

The plug-and-play interface built on Model Context Protocol supports systematic and reproducible studies of agentic modules across the 12 games without significant technical barriers or loss of fidelity in gameplay representation.

What would settle it

A study that finds the interface demands game-specific fixes or produces non-reproducible agent behaviors across the 12 titles would show the unified framework does not deliver consistent evaluations.

Figures

Figures reproduced from arXiv: 2506.03610 by Ameya S. Mahabaleshwarkar, Beongjun Choi, Bilal Kartal, Byeong-Uk Lee, Dongmin Park, Inkyu Park, Jaewoo Ahn, Jaewoong Cho, Jaeyoung Hwang, Jonghyun Lee, Junhyuck Kim, Kangwook Lee, Keon Lee, Minkyu Kim, Pritam Biswas, Yoshi Suhara.

**Figure 2.** Evaluation pipeline of Orak. Game scores are computed via …

**Figure 3.** LLM capabilities required to play 12 games in Orak. The color theme (red, yellow, etc) …

**Figure 4.** Match outcomes and Elo ratings for LLMs in two competitive environments.

**Figure 5.** Two of playable characters in Street Fighter III. Environment. Street Fighter III [53] is a 2D competitive fighting game, known for precise controls, deep mechanics, and a diverse roster of characters. Each character features unique moves, combos, and super arts, requiring precise timing and strategic decision-making. Players aim to defeat their opponent through a mix of normal attacks, special moves, and…

**Figure 6.** Character detection using YOLOv11 model [75] in Street Fighter III. Observation-to-Text Conversion. The Diambra environment offers a convenient interface for extracting the game state from Street Fighter III. Through this interface, we obtain the latest game frame at a resolution of 224×384, along with key state information such as remaining time, player and opponent health, super bar gauge, super count, …

**Figure 7.** Planning prompt for ‘reflection-planning’ agent playing …

**Figure 8.** Action inference prompt for ‘reflection-planning’ agent playing …

**Figure 9.** Reflection prompt for ‘reflection-planning’ agent playing …

**Figure 10.** Screenshot and assets of Super Mario. Environment. Super Mario (1985 Super Mario Bros) [54] is a side-scrolling game where the player controls Mario to avoid obstacles, defeat monsters, and reach the flag. In this environment, Mario progresses through the game using directional key controls (e.g., ‘left’ and ‘right’ keys) and jump actions. Mario should either destroy or traverse obstacles (e.g., bricks, …

**Figure 11.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 12.** Screenshot of Episode 1: The First Turnabout. Environment. Ace Attorney [55] is a courtroom adventure game where players act as defense attorneys, gather evidence, and cross-examine witnesses. We target the first episode of Phoenix Wright: Ace Attorney Trilogy on Steam (see …

**Figure 13.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 14.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 14.** For GPT-4o and Gemini-2.5-pro, using both text and image inputs underperforms compared …

**Figure 15.** Action inference system prompt for ‘zero-shot’ agent playing …

**Figure 16.** Action inference user prompt for ‘zero-shot’ agent playing …

**Figure 17.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 18.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 19.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 20.** Action Space. Following the BurnySc2/python-sc2 implementation, we define the action space as a discrete set of 72 high-level commands specifically for the Protoss race. These include unit training (e.g., Probes, Zealots, and Stalkers), building construction (e.g., Pylons and Gateways), research upgrades, scouting, multi-unit attacks or retreats, and special abilities (e.g., Chrono Boost). A complete list…

**Figure 20.** Action inference prompt for ‘zero-shot’ agent playing StarCraft II.

**Figure 21.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 22.** Level 1 of Baba Is You. Environment. Baba Is You is a puzzle game in which players must discover and understand every rule and mechanic on their own, apart from the basic movement keys (‘left’, ‘right’, ‘up’, and ‘down’) [64]. The game’s defining feature is that the text tiles forming the rules can be pushed around, allowing the player to rewrite those rules on the fly. Every valid rule sentence must con…

**Figure 23.** Action inference prompt for ‘zero-shot’ agent playing …

**Figure 24.** Action inference prompt for ‘zero-shot’ agent playing 2048.

**Figure 25.** Three different runs for o3 zero-shot agent playing …

**Figure 26.** Output of OpenAI’s o3 model right before producing the 1024 tile in …

**Figure 27.** Example of original system prompt and augmented system prompt in …

**Figure 28.** Comparison of seen (left) and unseen (right) scenarios for six games.

Original abstract

Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fix{ablation studies} of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Orak, a benchmark for training and evaluating LLM agents across 12 video games spanning major genres. It features a plug-and-play interface based on the Model Context Protocol (MCP) to enable systematic studies of agentic modules, releases a fine-tuning dataset of expert gameplay trajectories, and provides a unified evaluation framework including game leaderboards, LLM battle arenas, and ablation studies on input modalities, agentic strategies, and fine-tuning effects.

Significance. If the central claims hold, Orak would address key gaps in existing game benchmarks by offering diversity across genres, support for agentic module analysis, and fine-tuning resources. The open release of code on GitHub and datasets on Hugging Face, combined with the framework for leaderboards and ablations, represents a concrete strength that could enable reproducible progress toward versatile LLM gaming agents.

major comments (1)

[Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.

minor comments (2)

[Abstract] Abstract: The text contains a leftover LaTeX macro, '\fix{ablation studies}', which should be replaced with the intended phrasing ('ablation studies').
[Introduction or Benchmark Description] The manuscript would benefit from an explicit table or section listing the 12 games, their genres, and key mechanics to allow readers to assess coverage and potential fidelity issues directly.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, particularly on the importance of substantiating the fidelity claims for the MCP interface. We address the major comment point by point below.

Point-by-point responses

Referee: [Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.

Authors: We appreciate this observation and agree that explicit examples and discussion of fidelity are necessary to support the benchmark's claims. The MCP interface uses carefully engineered text serializations that include discretized positions, velocities, object states, and relevant environmental variables for each game to capture core mechanics. However, the current manuscript describes these representations at a high level in the methods and provides limited game-specific details without concrete serialized examples or quantitative fidelity checks against visual or human baselines. In the revised version we will add a new subsection with concrete state-representation examples drawn from at least two games that involve continuous physics and spatial reasoning, together with a short discussion of design choices made to mitigate information loss. We will also note any remaining limitations for visual-cue-heavy titles. These additions will directly strengthen the abstract claim and the supporting evidence for the modality ablations.

revision: yes
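The fidelity concern in this exchange can be illustrated with a small sketch of the kind of text serialization the rebuttal describes (discretized positions, velocities, object states). Everything here is hypothetical: the `Entity` and `GameState` fields, the 16-unit cell size, and the output format are assumptions for illustration and are not taken from Orak's code. The example also shows the referee's point directly, since two distinct continuous states collapse to the same text.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical game object with continuous position and velocity."""

    name: str
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


@dataclass
class GameState:
    """Hypothetical raw state as a game wrapper might expose it."""

    time_left: int
    player: Entity
    objects: list[Entity] = field(default_factory=list)


def to_cell(value: float, cell_size: float) -> int:
    """Discretize a continuous coordinate into a grid-cell index."""
    return int(value // cell_size)


def serialize(state: GameState, cell_size: float = 16.0) -> str:
    """Render the state as text an LLM agent could read (assumed format)."""
    lines = [f"time_left: {state.time_left}"]
    for entity in [state.player, *state.objects]:
        cx = to_cell(entity.x, cell_size)
        cy = to_cell(entity.y, cell_size)
        lines.append(
            f"{entity.name}: cell=({cx}, {cy}) "
            f"velocity=({entity.vx:+.1f}, {entity.vy:+.1f})"
        )
    return "\n".join(lines)


brick = Entity("brick", 64.0, 48.0)
state_a = GameState(300, Entity("player", 33.0, 80.0, vx=1.5), [brick])
state_b = GameState(300, Entity("player", 47.9, 80.0, vx=1.5), [brick])

text_a = serialize(state_a)
text_b = serialize(state_b)
# The player is almost 15 units further right in state_b, yet both
# states fall in the same grid cell and serialize identically.
same_text = text_a == text_b
```

A fidelity check of the sort the referee asks for could start from exactly this kind of comparison: count how often states that differ in ways relevant to the next action map to identical serialized text.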

Circularity Check

0 steps flagged

No circularity: Orak is a new benchmark resource whose contributions are self-contained

Full rationale

The paper introduces a benchmark suite, MCP-based interface, expert trajectories dataset, and empirical leaderboards/ablation studies across 12 games. No derivation chain, equations, or first-principles predictions are claimed; the central outputs (leaderboards, fine-tuning effects, modality comparisons) are direct experimental results on the released resources rather than reductions of fitted parameters or self-citations. The MCP interface and game coverage are presented as engineering contributions whose fidelity is evaluated empirically, not assumed via prior self-referential theorems. This matches the default expectation for a resource/benchmark paper and receives the lowest circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the creation of new benchmark infrastructure and data rather than additional free parameters or new physical entities.

axioms (1)

domain assumption: LLM agents can benefit from fine-tuning on expert trajectories.
The paper relies on this to justify its claim that the released dataset turns general LLMs into effective game agents.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI · 2026-04 · unverdicted · novelty 7.0
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
cs.AI · 2026-04 · unverdicted · novelty 6.0
STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
cs.CL · 2026-04 · unverdicted · novelty 6.0
RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
cs.AI · 2025-08 · unverdicted · novelty 6.0
Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weakness...

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV · 2026-05 · unverdicted · novelty 5.0
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV · 2026-05 · unverdicted · novelty 3.0
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 6 Pith papers · 20 internal anchors

[1]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

work page 2024
[2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Bench- marking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

WebCanvas: Benchmarking Web Agents in Online Environments

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024

work page internal anchor Pith review arXiv 2024
[7]

St- webagentbench: A benchmark for evaluating safety and trustworthiness in web agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, and Segev Shlomov. St- webagentbench: A benchmark for evaluating safety and trustworthiness in web agents. arXiv preprint arXiv:2410.06703, 2024

work page internal anchor Pith review arXiv 2024
[8]

Towards a realistic long-term benchmark for open-web research agents

Peter Mühlbacher, Nikos I Bosse, and Lawrence Phillips. Towards a realistic long-term benchmark for open-web research agents. arXiv preprint arXiv:2409.14913, 2024

work page arXiv 2024
[9]

Datascibench: An llm agent benchmark for data science

Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, and Yisong Yue. Datascibench: An llm agent benchmark for data science. arXiv preprint arXiv:2502.13897, 2025

work page arXiv 2025
[10]

Introducing nvidia ace for games - spark life into virtual charac- ters with generative ai

NVIDIA. Introducing nvidia ace for games - spark life into virtual charac- ters with generative ai. https://www.nvidia.com/en-us/geforce/news/ nvidia-ace-for-games-generative-ai-npcs/ , 2025. Accessed: 2025-05-13

work page 2025
[11]

A survey on large language model-based game agents

Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Tekin, Gaowen Liu, Ramana Kompella, and Ling Liu. A survey on large language model-based game agents. arXiv preprint arXiv:2404.02039, 2024

work page arXiv 2024
[12]

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions. arXiv preprint arXiv:2503.23278, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Inter- active fiction games: A colossal adventure

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Inter- active fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020

work page 2020
[14]

game over

Chen Feng Tsai, Xiaochen Zhou, Sierra S Liu, Jing Li, Mo Yu, and Hongyuan Mei. Can large language models play text games well? current state-of-the-art and open questions. arXiv preprint arXiv:2304.02868, 2023

work page arXiv 2023
[15]

Adapt: As-needed decomposition and planning with language models.arXiv preprint arXiv:2311.05772, 2023

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023

work page arXiv 2023
[16]

Chessgpt: Bridging policy learning and language modeling

Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. Chessgpt: Bridging policy learning and language modeling. Advances in Neural Information Processing Systems, 36:7216–7262, 2023. 10

work page 2023
[17]

The nethack learning environment

Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The nethack learning environment. Advances in Neural Information Processing Systems, 33:7671–7684, 2020

work page 2020
[18]

arXiv preprint arXiv:2109.06780 , year=

Danijar Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv:2109.06780, 2021

work page arXiv 2021
[19]

Minedojo: Building open-ended embodied agents with internet-scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022

work page 2022
[20]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Civrealm: A learning and reasoning odyssey in civilization for decision-making agents

Siyuan Qi, Shuo Chen, Yexin Li, Xiangyu Kong, Junqi Wang, Bangcheng Yang, Pring Wong, Yifan Zhong, Xiaoyuan Zhang, Zhaowei Zhang, et al. Civrealm: A learning and reasoning odyssey in civilization for decision-making agents. arXiv preprint arXiv:2401.10568, 2024

work page arXiv 2024
[22]

Pokéllmon: A human-parity agent for pokémon battles with large language models

Sihao Hu, Tiansheng Huang, and Ling Liu. Pokéllmon: A human-parity agent for pokémon battles with large language models. arXiv preprint arXiv:2402.01118, 2024

work page arXiv 2024
[23]

Large language models play starcraft ii: Benchmarks and a chain of summarization approach

Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang. Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386–133442, 2024

work page 2024
[24]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[25]

J., L AM, M

Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Michael R Lyu. How far are we on the decision- making of llms? evaluating llms’ gaming ability in multi-agent environments.arXiv preprint arXiv:2403.11807, 2024

work page arXiv 2024
[26]

Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv preprint arXiv:2406.06613, 2024

work page arXiv 2024
[27]

Gamearena: Evaluating llm reasoning through live computer games

Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, and Hao Zhang. Gamearena: Evaluating llm reasoning through live computer games. arXiv preprint arXiv:2412.06394, 2024

work page arXiv 2024
[28]

Mitchell, and Yuanzhi Li

Yue Wu, Xuan Tang, Tom M Mitchell, and Yuanzhi Li. Smartplay: A benchmark for llms as intelligent agents. arXiv preprint arXiv:2310.01557, 2023

work page arXiv 2023
[29]

Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, et al. Balrog: Bench- marking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

work page arXiv 2024
[30]

Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

Xinyu Wang, Bohan Zhuang, and Qi Wu. Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

work page arXiv 2025
[31]

Karlsson, Bo An, and Zongqing Lu

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

work page arXiv 2024
[32]

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, and Lijuan Wang. V-mage: A game evaluation framework for assessing visual-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, and Liquan Xiao. Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[35]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

[36]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

[37]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

[38]

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

[39]

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pages 9118–9147. PMLR, 2022

[40]

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

[41]

Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems, 36:31967–31987, 2023

[42]

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023

[43]

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023

[44]

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881, 2024

[45]

Yifan Song, Weimin Xiong, Xiutian Zhao, Dawei Zhu, Wenhao Wu, Ke Wang, Cheng Li, Wei Peng, and Sujian Li. Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories. arXiv preprint arXiv:2410.07706, 2024

[46]

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments. arXiv preprint arXiv:2501.10893, 2025

[47]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

[48]

Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. Agile: A novel reinforcement learning framework of llm agents. arXiv preprint arXiv:2405.14751, 2024

[49]

Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025

[50]

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024

[51]

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

[52]

Shiyi Cao, Sumanth Hegde, Dacheng Li, Tyler Griggs, Shu Liu, Eric Tang, Jiayi Pan, Xingyao Wang, Akshay Malik, Graham Neubig, Kourosh Hakhamaneshi, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Skyrl-v0: Train real-world long-horizon agents via reinforcement learning, 2025

[53]

Capcom. Street Fighter III: 3rd Strike. https://streetfighter.fandom.com/wiki/Street_Fighter_III:_3rd_Strike, 1997. Accessed: 2025-05-12

[54]

Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/gym-super-mario-bros, 2018. Accessed: 2025-05-12

[55]

Capcom. Phoenix Wright: Ace Attorney. https://aceattorney.fandom.com/wiki/Phoenix_Wright:_Ace_Attorney, 2001. Accessed: 2025-05-12

[56]

Sam Barlow. Her Story. https://www.herstorygame.com, 2015. Accessed: 2025-05-12

[57]

Game Freak. Pokémon Red Version. https://pokemon.fandom.com/wiki/Pok%C3%A9mon_Red_and_Blue_Versions, 1996. Accessed: 2025-05-12

[58]

Red Hook Studios. Darkest Dungeon. https://www.darkestdungeon.com, 2016. Accessed: 2025-05-12

[59]

Mojang Studios. Minecraft. https://www.minecraft.net, 2011. Accessed: 2025-05-12

[60]

PrismarineJS contributors. PrismarineJS/mineflayer: Create Minecraft bots with a powerful, stable, and high-level JavaScript API. https://github.com/PrismarineJS/mineflayer. Accessed: 2025-05-01

[62]

ConcernedApe. Stardew Valley. https://www.stardewvalley.net, 2016. Accessed: 2025-05-12

[63]

Blizzard Entertainment. StarCraft II. https://starcraft2.com, 2010. Accessed: 2025-05-12

[64]

MegaCrit. Slay the Spire. https://www.megacrit.com, 2017. Accessed: 2025-05-12

[65]

Hempuli. Baba is you. https://hempuli.com/baba/, 2019. Accessed: 2025-05-12

[66]

Gabriele Cirulli. 2048. https://play2048.co/, 2014. Accessed: 2025-05-12

[67]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[68]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[69]

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach. arXiv preprint arXiv:2408.11796, 2024

[70]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[71]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[72]

Anthropic. Claude 3.7 sonnet: Our most capable model yet. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2025-05-08

[73]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[74]

Wikipedia contributors. List of video game genres. https://en.wikipedia.org/wiki/List_of_video_game_genres, 2025. Accessed: 2025-05-22

[75]

DIAMBRA. DIAMBRA: Reinforcement Learning Platform for Competitive Video Games. https://www.diambra.ai/, 2025. Accessed: 2025-05-22

[76]

Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

[77]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952

[78]

Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/gym-super-mario-bros, 2018. Accessed: 2025-05-21

[79]

Andreas Pardeike. Harmony: A library for patching, replacing and decorating .net and mono methods during runtime. https://github.com/pardeike/Harmony, 2025. Accessed: 2025-05-21

[80]

BepInEx Contributors. Bepinex: Unity / xna game patcher and plugin framework. https://github.com/BepInEx/BepInEx, 2025. Accessed: 2025-05-21

[29] [29]

Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, et al. Balrog: Bench- marking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024

work page arXiv 2024

[30] [30]

Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

Xinyu Wang, Bohan Zhuang, and Qi Wu. Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

work page arXiv 2025

[31] [31]

Karlsson, Bo An, and Zongqing Lu

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024

work page arXiv 2024

[32] [32]

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, and Lijuan Wang. V-mage: A game evaluation framework for assessing visual-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, and Liquan Xiao. Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022

[35] [35]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

work page 2023

[36] [36]

Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

work page 2023

[37] [37]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

work page 2023

[38] [38]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pages 9118–9147. PMLR, 2022

work page 2022

[40] [40]

Llm-planner: Few-shot grounded planning for embodied agents with large language models

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

work page 2023

[41] [41]

Large language models as commonsense knowledge for large-scale task planning

Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems, 36:31967– 31987, 2023

work page 2023

[42] [42]

Fireact: Toward language agent fine-tuning

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023

work page arXiv 2023

[43] [43]

Agenttuning: Enabling generalized agent abilities for llms

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023

work page arXiv 2023

[44] [44]

Agent-flan: Designing data and methods of effective agent tuning for large language models

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881, 2024

work page arXiv 2024

[45] [45]

Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories

Yifan Song, Weimin Xiong, Xiutian Zhao, Dawei Zhu, Wenhao Wu, Ke Wang, Cheng Li, Wei Peng, and Sujian Li. Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories. arXiv preprint arXiv:2410.07706, 2024

work page arXiv 2024

[46] [46]

Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments. arXiv preprint arXiv:2501.10893, 2025

work page arXiv 2025

[47] [47]

Executable code actions elicit better llm agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[48] [48]

Agile: A novel reinforcement learning framework of llm agents

Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. Agile: A novel reinforcement learning framework of llm agents. arXiv preprint arXiv:2405.14751, 2024

work page arXiv 2024

[49] [49]

Reinforcement learning for long-horizon interactive llm agents.arXiv preprint arXiv:2502.01600, 2025

Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025. 12

work page arXiv 2025

[50] [50]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024

work page internal anchor Pith review arXiv 2024

[51] [51]

Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

work page arXiv 2024

[52] [52]

Gonzalez, and Ion Stoica

Shiyi Cao, Sumanth Hegde, Dacheng Li, Tyler Griggs, Shu Liu, Eric Tang, Jiayi Pan, Xingyao Wang, Akshay Malik, Graham Neubig, Kourosh Hakhamaneshi, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Skyrl-v0: Train real-world long-horizon agents via reinforcement learning, 2025

work page 2025

[53] [53]

Street Fighter III: 3rd Strike

Capcom. Street Fighter III: 3rd Strike. https://streetfighter.fandom.com/wiki/ Street_Fighter_III:_3rd_Strike, 1997. Accessed: 2025-05-12

work page 1997

[54] [54]

Super Mario Bros for OpenAI Gym

Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/ gym-super-mario-bros , 2018. Accessed: 2025-05-12

work page 2018

[55] [55]

Phoenix Wright: Ace Attorney

Capcom. Phoenix Wright: Ace Attorney. https://aceattorney.fandom.com/wiki/ Phoenix_Wright:_Ace_Attorney, 2001. Accessed: 2025-05-12

work page 2001

[56] [56]

Her Story

Sam Barlow. Her Story. https://www.herstorygame.com, 2015. Accessed: 2025-05-12

work page 2015

[57] [57]

Pokémon Red Version

Game Freak. Pokémon Red Version. https://pokemon.fandom.com/wiki/Pok%C3% A9mon_Red_and_Blue_Versions, 1996. Accessed: 2025-05-12

work page 1996

[58] [58]

Darkest Dungeon

Red Hook Studios. Darkest Dungeon. https://www.darkestdungeon.com, 2016. Accessed: 2025-05-12

work page 2016

[59] [59]

Minecraft

Mojang Studios. Minecraft. https://www.minecraft.net, 2011. Accessed: 2025-05-12

work page 2011

[60] [60]

PrismarineJS/mineflayer: Create Minecraft bots with a powerful, stable, and high-level JavaScript API

PrismarineJS contributors. PrismarineJS/mineflayer: Create Minecraft bots with a powerful, stable, and high-level JavaScript API. https://github.com/PrismarineJS/mineflayer,

work page

[61] [61]

Accessed: 2025-05-01

work page 2025

[62] [62]

Stardew Valley

ConcernedApe. Stardew Valley. https://www.stardewvalley.net, 2016. Accessed: 2025- 05-12

work page 2016

[63] [63]

StarCraft II

Blizzard Entertainment. StarCraft II. https://starcraft2.com, 2010. Accessed: 2025-05- 12

work page 2010

[64] [64]

Slay the Spire

MegaCrit. Slay the Spire. https://www.megacrit.com, 2017. Accessed: 2025-05-12

work page 2017

[65] [65]

Baba is you

Hempuli. Baba is you. https://hempuli.com/baba/, 2019. Accessed: 2025-05-12

work page 2019

[66] [66]

Gabriele Cirulli. 2048. https://play2048.co/, 2014. Accessed: 2025-05-12

work page 2048

[67] [67]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [68]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[69] [69]

LLM Pruning and Distillation in Practice: The Minitron Approach

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. LLM pruning and distillation in practice: The Minitron approach. arXiv preprint arXiv:2408.11796, 2024

work page arXiv 2024

[70] [70]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[71] [71]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[72] [72]

Claude 3.7 sonnet: Our most capable model yet

Anthropic. Claude 3.7 Sonnet: Our most capable model yet. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2025-05-08

work page 2025

[73] [73]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

List of video game genres

Wikipedia contributors. List of video game genres. https://en.wikipedia.org/wiki/List_of_video_game_genres, 2025. Accessed: 2025-05-22

work page 2025

[75] [75]

DIAMBRA: Reinforcement Learning Platform for Competitive Video Games

DIAMBRA. DIAMBRA: Reinforcement Learning Platform for Competitive Video Games. https://www.diambra.ai/, 2025. Accessed: 2025-05-22

work page 2025

[76] [76]

YOLOv11: An Overview of the Key Architectural Enhancements

Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[77] [77]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952

work page 1952

[78] [78]

Super Mario Bros for OpenAI Gym

Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/gym-super-mario-bros, 2018. Accessed: 2025-05-21

work page 2018

[79] [79]

Harmony: A library for patching, replacing and decorating .net and mono methods during runtime

Andreas Pardeike. Harmony: A library for patching, replacing and decorating .net and mono methods during runtime. https://github.com/pardeike/Harmony, 2025. Accessed: 2025-05-21

work page 2025

[80] [80]

Bepinex: Unity / xna game patcher and plugin framework

BepInEx Contributors. Bepinex: Unity / xna game patcher and plugin framework. https://github.com/BepInEx/BepInEx, 2025. Accessed: 2025-05-21

work page 2025