Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Pith reviewed 2026-05-19 11:57 UTC · model grok-4.3
The pith
Orak benchmark supplies tools and data to train and test LLM agents across 12 video game genres.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Orak establishes a foundation towards versatile gaming agents by providing a united evaluation framework including game leaderboards, LLM battle arenas, and ablation studies of input modality, agentic strategies, and fine-tuning effects, supported by a plug-and-play interface built on Model Context Protocol and a released fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres.
What carries the argument
The plug-and-play interface built on Model Context Protocol that supports systematic and reproducible studies of agentic modules across the 12 games.
If this is right
- Leaderboards can rank LLM agents by performance across multiple game genres in one shared setting.
- Ablation experiments can isolate which input modalities and agent strategies produce the largest gains.
- The expert trajectory dataset enables fine-tuning that converts general LLMs into stronger game players.
- LLM battle arenas make head-to-head comparisons between different agent designs straightforward.
- The overall setup allows researchers to run controlled tests on how agents handle diverse gameplay demands.
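The head-to-head comparison mentioned above needs some rule for turning pairwise results into a ranking. This summary does not state which rule Orak uses, so the following is only a sketch under the assumption that a Bradley-Terry model is fitted to win counts. The function name `bradley_terry` and the agent labels are hypothetical.

```python
from collections import defaultdict


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(a, b)] is the number of matches in which agent a beat agent b.
    Returns a dict of strengths normalised to sum to 1.
    Assumption: this is an illustrative ranking rule, not Orak's confirmed one.
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 / len(players) for p in players}

    won = defaultdict(float)     # total wins per agent
    played = defaultdict(float)  # matches per unordered pair
    for (winner, loser), n in wins.items():
        won[winner] += n
        played[frozenset((winner, loser))] += n

    for _ in range(iters):
        updated = {}
        for p in players:
            # MM step: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                played[frozenset((p, q))] / (strength[p] + strength[q])
                for q in players
                if q != p and frozenset((p, q)) in played
            )
            updated[p] = won[p] / denom if denom > 0 else strength[p]
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


if __name__ == "__main__":
    arena = {
        ("A", "B"): 8, ("B", "A"): 2,
        ("B", "C"): 7, ("C", "B"): 3,
        ("A", "C"): 9, ("C", "A"): 1,
    }
    ranking = sorted(bradley_terry(arena).items(), key=lambda kv: -kv[1])
    for name, s in ranking:
        print(f"{name}: {s:.3f}")
```

A model of this kind only needs win counts, so it works even when not every pair of agents has played the same number of matches. An agent with no wins at all collapses toward zero strength, which is a known limitation of the unregularised fit.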
Where Pith is reading between the lines
- A shared benchmark of this form could make incremental progress in game agents easier to track and compare over time.
- It may support later work on whether skills learned in one genre transfer to others without retraining.
- Adding measures of human preference or enjoyment to the evaluations could reveal what players actually value in agent behavior.
Load-bearing premise
The plug-and-play interface built on Model Context Protocol supports systematic and reproducible studies of agentic modules across the 12 games without significant technical barriers or loss of fidelity in gameplay representation.
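To make the premise concrete, here is a minimal sketch of what "plug-and-play" could mean in practice: every game exposes the same text-in, text-out surface, and agentic modules are swappable functions applied before the policy acts. It deliberately does not use any MCP library; `GameEnv`, `CountdownGame`, `run_episode`, and the module signature are all hypothetical names chosen for illustration, not Orak's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class GameEnv(Protocol):
    """Hypothetical uniform surface that every game would expose."""

    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...


@dataclass
class CountdownGame:
    """Toy stand-in for a real game: send 'dec' until the counter hits zero."""

    start: int = 3
    counter: int = field(default=0, init=False)

    def reset(self) -> str:
        self.counter = self.start
        return f"counter={self.counter}"

    def step(self, action: str) -> tuple[str, float, bool]:
        if action == "dec":
            self.counter -= 1
        done = self.counter <= 0
        return f"counter={self.counter}", (1.0 if done else 0.0), done


# An agentic module rewrites the context the policy sees (e.g. adds a plan).
Module = Callable[[str], str]


def run_episode(env: GameEnv, modules: list[Module],
                policy: Callable[[str], str], max_steps: int = 20) -> float:
    """Run one episode; modules can be added or removed without touching env."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        context = obs
        for module in modules:
            context = module(context)
        obs, reward, done = env.step(policy(context))
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    add_hint = lambda ctx: ctx + " | hint: decrement"
    print(run_episode(CountdownGame(), [add_hint], lambda ctx: "dec"))
```

The point of the design is that an ablation becomes a change to the `modules` list only. Whether the real interface achieves this for all 12 games without game-specific patches is exactly what the premise asserts and what the next section proposes to test.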
What would settle it
A study that finds the interface demands game-specific fixes or produces non-reproducible agent behaviors across the 12 titles would show the unified framework does not deliver consistent evaluations.
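One simple form such a study could take is a replay check: run the same agent with the same seed twice and compare trajectory fingerprints. The sketch below uses a toy seeded rollout in place of a real game; `rollout` and `fingerprint` are hypothetical helpers, and a real check would also need to pin the LLM's sampling settings, which this toy does not model.

```python
import hashlib
import random


def rollout(seed: int, steps: int = 10) -> list[tuple[str, int]]:
    """Toy seeded episode: a random walk standing in for agent plus game."""
    rng = random.Random(seed)
    state = 0
    trajectory = []
    for _ in range(steps):
        action = rng.choice(["left", "right"])
        state += 1 if action == "right" else -1
        trajectory.append((action, state))
    return trajectory


def fingerprint(trajectory: list[tuple[str, int]]) -> str:
    """Stable hash of a trajectory, suitable for comparing two runs."""
    return hashlib.sha256(repr(trajectory).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first, second = fingerprint(rollout(0)), fingerprint(rollout(0))
    print("reproducible" if first == second else "diverged")
```

If fingerprints from identical seeds diverge on some titles but not others, that would point to game-specific nondeterminism rather than to the agent.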
Original abstract
Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fix{ablation studies} of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Orak, a benchmark for training and evaluating LLM agents across 12 video games spanning major genres. It features a plug-and-play interface based on the Model Context Protocol (MCP) to enable systematic studies of agentic modules, releases a fine-tuning dataset of expert gameplay trajectories, and provides a unified evaluation framework including game leaderboards, LLM battle arenas, and ablation studies on input modalities, agentic strategies, and fine-tuning effects.
Significance. If the central claims hold, Orak would address key gaps in existing game benchmarks by offering diversity across genres, support for agentic module analysis, and fine-tuning resources. The open release of code on GitHub and datasets on Hugging Face, combined with the framework for leaderboards and ablations, represents a concrete strength that could enable reproducible progress toward versatile LLM gaming agents.
major comments (1)
- [Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.
minor comments (2)
- [Abstract] Abstract: The text contains a leftover LaTeX macro, '\fix{ablation studies}', that should be replaced with the intended phrasing, 'ablation studies'.
- [Introduction or Benchmark Description] The manuscript would benefit from an explicit table or section listing the 12 games, their genres, and key mechanics to allow readers to assess coverage and potential fidelity issues directly.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, particularly on the importance of substantiating the fidelity claims for the MCP interface. We address the major comment point by point below.
Referee: [Abstract] Abstract: The claim that the MCP-based plug-and-play interface 'supports systematic and reproducible studies of agentic modules in varied game scenarios' without 'significant loss of fidelity in gameplay representation' is load-bearing for the benchmark's validity and the modality ablations. For the subset of the 12 games whose core mechanics involve continuous physics, spatial reasoning, or non-textual visual cues, the text serialization process could omit information that human or visual agents rely on; the manuscript provides no concrete examples of state representations or fidelity validation for such titles.
Authors: We appreciate this observation and agree that explicit examples and discussion of fidelity are necessary to support the benchmark's claims. The MCP interface uses carefully engineered text serializations that include discretized positions, velocities, object states, and relevant environmental variables for each game to capture core mechanics. However, the current manuscript describes these representations at a high level in the methods and provides limited game-specific details without concrete serialized examples or quantitative fidelity checks against visual or human baselines. In the revised version we will add a new subsection with concrete state-representation examples drawn from at least two games that involve continuous physics and spatial reasoning, together with a short discussion of design choices made to mitigate information loss. We will also note any remaining limitations for visual-cue-heavy titles. These additions will directly strengthen the abstract claim and the supporting evidence for the modality ablations. revision: yes
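The rebuttal describes serializations built from discretized positions, velocities, and object states, but gives no example. The following is a hedged sketch of what such a serializer could look like and of how discretization loses information. The `Entity` fields, the 16-unit cell size, and the output format are assumptions made for illustration, not the format Orak uses.

```python
import math
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical continuous game object."""

    name: str
    x: float
    y: float
    vx: float
    vy: float


def serialize(entities: list[Entity], cell: float = 16.0,
              v_step: float = 1.0) -> str:
    """Render entities as text with grid-cell positions and rounded velocities.

    Positions are floored onto a grid of side `cell`; velocities are rounded
    to the nearest multiple of `v_step`. Both steps discard information.
    """
    lines = []
    for e in entities:
        gx = math.floor(e.x / cell)
        gy = math.floor(e.y / cell)
        dvx = round(e.vx / v_step)
        dvy = round(e.vy / v_step)
        lines.append(f"{e.name}: cell=({gx},{gy}) vel=({dvx},{dvy})")
    return "\n".join(lines)


if __name__ == "__main__":
    near_edge = [Entity("player", 33.0, 8.0, 2.4, -0.6)]
    far_edge = [Entity("player", 47.0, 8.0, 2.4, -0.6)]
    print(serialize(near_edge))
    # Two physically different states can collapse to the same text.
    print(serialize(near_edge) == serialize(far_edge))
```

In this toy setup the two states differ by 14 units in x yet serialize identically, which is the kind of fidelity loss the referee asks the authors to quantify for physics-heavy titles. A finer grid reduces the loss at the cost of longer prompts.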
Circularity Check
No circularity: Orak is a new benchmark resource whose contributions are self-contained
The paper introduces a benchmark suite, MCP-based interface, expert trajectories dataset, and empirical leaderboards/ablation studies across 12 games. No derivation chain, equations, or first-principles predictions are claimed; the central outputs (leaderboards, fine-tuning effects, modality comparisons) are direct experimental results on the released resources rather than reductions of fitted parameters or self-citations. The MCP interface and game coverage are presented as engineering contributions whose fidelity is evaluated empirically, not assumed via prior self-referential theorems. This matches the default expectation for a resource/benchmark paper and receives the lowest circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can benefit from fine-tuning on expert trajectories
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
- Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
  COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
- Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
  Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
- Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
  STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.
- RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
  RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...
- Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
  Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weakness...
- Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
  The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
- Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
  This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.