pith. machine review for the scientific record.

arxiv: 2604.18394 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

OpenGame: Open Agentic Coding for Games

Jiaming Han, Jinyuan Hu, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Qianyin Xiao, Ruize Ma, Tianshuo Peng, Xiangyu Yue, Yaozhi Zheng, Yilei Jiang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:08 UTC · model grok-4.3

classification 💻 cs.SE
keywords agentic coding · game generation · code LLM · web games · reinforcement learning · software agents · multi-file code · interactive applications

The pith

OpenGame uses reusable template and debug skills plus a specialized code model to generate fully playable web games from text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard LLMs can handle isolated code tasks but collapse when building games because they lose track of scene wiring, state across files, and real-time logic. The paper claims its OpenGame framework fixes this with a Game Skill made of a Template Skill that reuses stable project skeletons and a Debug Skill that tracks proven fixes, all driven by the GameCoder-27B model trained through pre-training, fine-tuning, and reinforcement learning from game execution. It also supplies OpenGame-Bench, which runs games in a headless browser and uses vision-language models to score build health, visual usability, and how well the result matches the original intent. On 150 varied prompts the system produces working games where earlier approaches fail. This would let code agents move from single-file fixes to complete interactive systems.

Core claim

OpenGame is the first open-source agentic framework for end-to-end web game creation. Its core is Game Skill, a reusable capability built from a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes, together enabling stable architectures and systematic repair of integration errors. This is powered by GameCoder-27B, a 27B-parameter code LLM trained in three stages of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. The framework is evaluated with OpenGame-Bench, which measures Build Health, Visual Usability, and Intent Alignment via headless browser runs and VLM judging.

What carries the argument

Game Skill, composed of Template Skill for evolving project skeletons and Debug Skill for maintaining protocols of verified fixes, powered by GameCoder-27B through continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning.
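
The reviewed text describes these skills only informally (the referee's second minor comment below asks for pseudocode of exactly these structures). A minimal Python sketch of what the two evolving stores might look like; every name and field here is a hypothetical reconstruction, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TemplateSkill:
    """Hypothetical library entry: one reusable project skeleton per archetype."""
    archetype: str              # e.g. "platformer", "tower_defense"
    files: dict[str, str]       # relative path -> boilerplate source
    uses: int = 0               # bumped each time a generation succeeds from this skeleton

@dataclass
class DebugRule:
    """Hypothetical entry in the 'living protocol' of verified fixes."""
    symptom: str                # error signature observed at build or run time
    fix: str                    # repair instruction that resolved it
    verified_count: int = 1     # times the fix was re-confirmed by execution

@dataclass
class GameSkill:
    templates: list[TemplateSkill] = field(default_factory=list)
    debug_protocol: list[DebugRule] = field(default_factory=list)

    def record_fix(self, symptom: str, fix: str) -> None:
        # strengthen an existing rule, or admit a new one after verification
        for rule in self.debug_protocol:
            if rule.symptom == symptom:
                rule.verified_count += 1
                return
        self.debug_protocol.append(DebugRule(symptom, fix))
```

The design point the sketch tries to capture is that both stores grow from execution outcomes rather than being hand-authored: skeletons are promoted by successful reuse, and a fix enters the protocol only once re-execution has verified it.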

If this is right

  • Agentic code systems can now produce complete, playable multi-file projects instead of breaking on cross-file inconsistencies.
  • Evaluating interactive applications requires running them in headless browsers and judging output with vision-language models rather than static code checks.
  • Specialized three-stage training with execution feedback improves reliability on tightly coupled real-time systems like games (a toy reward sketch follows this list).
  • Reusable skill libraries that grow from experience can help maintain architectural consistency across successive generations of code.
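
The third bullet above turns on execution feedback reaching the training signal. A toy sketch of an execution-grounded reward of the kind such a pipeline might use; the gating and the weights are illustrative assumptions, not values from the paper:

```python
def execution_reward(build_ok: bool, console_errors: int, vlm_score: float) -> float:
    """Toy shaped reward: gate on a successful build, then blend runtime
    cleanliness with a VLM judge score in [0, 1]."""
    if not build_ok:
        return 0.0  # a project that does not compile earns nothing
    runtime_health = 1.0 / (1.0 + console_errors)  # decays with runtime errors
    return 0.3 + 0.4 * runtime_health + 0.3 * vlm_score
```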

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same template-plus-debug pattern could apply to other domains that demand consistent state across many files, such as web apps or simulations.
  • If the debug skill generalizes beyond games, it may lower the amount of human debugging needed for large agent-generated codebases.
  • Open-sourcing the framework and benchmark invites community extensions to new game engines or additional interactive application types.

Load-bearing premise

That the combination of Template Skill, Debug Skill, and execution-grounded reinforcement learning on GameCoder-27B can systematically resolve integration errors and logical incoherence in multi-file game projects rather than merely patching syntax.

What would settle it

Running the 150 generated games in a real browser and checking whether they execute without crashes, maintain consistent visuals and logic, and match the prompt intent according to the Build Health, Visual Usability, and Intent Alignment scores.
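
A minimal sketch of the Build Health half of that check, assuming Playwright as the headless driver (the paper says only "headless browser"; the tool choice, URL, and five-second run window here are illustrative):

```python
from playwright.sync_api import sync_playwright

def build_health(url: str = "http://localhost:3000") -> tuple[bool, list[str]]:
    """Load a generated game headlessly; collect console and page errors."""
    errors: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("console", lambda m: errors.append(m.text) if m.type == "error" else None)
        page.on("pageerror", lambda e: errors.append(str(e)))
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(5000)        # let the game loop run for a few seconds
        page.screenshot(path="frame.png")  # frame kept for downstream VLM judging
        browser.close()
    return (len(errors) == 0, errors)
```

Visual Usability and Intent Alignment would then be scored by passing captured frames, together with the original prompt, to a VLM judge.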

Figures

Figures reproduced from arXiv: 2604.18394 by Jiaming Han, Jinyuan Hu, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Qianyin Xiao, Ruize Ma, Tianshuo Peng, Xiangyu Yue, Yaozhi Zheng, Yilei Jiang.

Figure 1: End-to-end agentic game generation with OpenGame. Diverse users provide natural language specifications to autonomously create fully playable 2D games across distinct genres (e.g., action, educational, and tower defense). Each generated project features a complete game lifecycle seamlessly integrated with multimodal visual and audio assets.
Figure 2: The OpenGame architecture. The framework integrates three coupled components: (a) a multi-stage code-model training pipeline that establishes engine-specific priors, (b) an autonomous agent workflow that translates natural-language game ideas into runnable projects through a structured six-phase process, and (c) an agent-evolution module that continuously refines structural scaffolding (Template Skill) and …
Figure 4: Intent Alignment (IA) scores across different …
Original abstract

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes - together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces OpenGame, an open-source agentic framework for generating fully playable web games from high-level natural language prompts. Its core contribution is the Game Skill, which combines a Template Skill (for evolving project skeletons) and a Debug Skill (for maintaining verified fix protocols) to scaffold architectures and repair cross-file integration errors. This is powered by GameCoder-27B, a 27B-parameter code LLM trained via continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. The authors also present OpenGame-Bench, an evaluation pipeline that assesses Build Health through headless browser execution alongside Visual Usability and Intent Alignment via VLM judging. They report state-of-the-art results across 150 diverse game prompts, claiming that prior LLMs collapse on multi-file inconsistencies while OpenGame produces end-to-end playable games.

Significance. If the central claims hold, the work would be a meaningful step forward in agentic coding for complex, interactive systems. Demonstrating reliable handling of cross-file state, scene wiring, and logical coherence in game engines goes beyond isolated code tasks and could influence broader efforts in building real-world applications with LLMs. The open release of the framework, specialized model, and benchmark would provide concrete resources for the community, particularly if the execution-grounded RL and evolving skills prove reproducible.

major comments (3)
  1. [OpenGame-Bench] OpenGame-Bench section: the evaluation delegates Visual Usability and Intent Alignment to VLM judging while using headless execution only for Build Health. This setup risks missing persistent cross-file logical incoherence (e.g., state desynchronization or scene wiring errors) that manifests only during interactive play rather than static or visual inspection, directly weakening the claim that the Template + Debug Skill pipeline systematically resolves integration errors instead of patching surface issues.
  2. [Results] Results and experimental setup: the SOTA claim on 150 prompts lacks reported baselines (including whether prior models received equivalent skills or RL stages), error bars, data splits, or handling of post-hoc fixes. Without these, it is unclear whether the reported gains reflect genuine advances in repair capability or differences in evaluation protocol.
  3. [GameCoder-27B] GameCoder-27B training pipeline: the three-stage process (continual pre-training, SFT, execution-grounded RL) is described at a high level, but no ablation results or quantitative metrics per stage are provided to show that the RL component specifically improves resolution of multi-file inconsistencies over the base model.
minor comments (2)
  1. [Introduction] The abstract and introduction use the term 'Game Skill' without an early formal definition or diagram showing how Template Skill and Debug Skill interact during agent execution.
  2. [Game Skill] Notation for the evolving library of project skeletons and the 'living protocol' of fixes is introduced informally; a small table or pseudocode example would clarify the data structures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions where the manuscript will be updated to improve clarity and rigor.

read point-by-point responses
  1. Referee: [OpenGame-Bench] OpenGame-Bench section: the evaluation delegates Visual Usability and Intent Alignment to VLM judging while using headless execution only for Build Health. This setup risks missing persistent cross-file logical incoherence (e.g., state desynchronization or scene wiring errors) that manifests only during interactive play rather than static or visual inspection, directly weakening the claim that the Template + Debug Skill pipeline systematically resolves integration errors instead of patching surface issues.

    Authors: We acknowledge that headless browser execution primarily surfaces runtime errors and crashes, while VLM-based judging evaluates static and dynamic visual frames; neither fully substitutes for extended interactive playtesting that could reveal subtle state desynchronization. The Debug Skill protocol is grounded in repeated execution feedback to target integration errors, but we agree the current pipeline does not exhaustively rule out all latent logical issues. In revision we will add an explicit limitations paragraph in the OpenGame-Bench section discussing this gap and outlining planned extensions to interactive testing harnesses. revision: partial

  2. Referee: [Results] Results and experimental setup: the SOTA claim on 150 prompts lacks reported baselines (including whether prior models received equivalent skills or RL stages), error bars, data splits, or handling of post-hoc fixes. Without these, it is unclear whether the reported gains reflect genuine advances in repair capability or differences in evaluation protocol.

    Authors: The 150-prompt evaluation applies the identical OpenGame-Bench protocol to both our system and prior LLMs/code agents; the baselines therefore do not receive the Game Skill framework or GameCoder-27B, which constitute the core contributions. We will revise the experimental section to (i) explicitly tabulate baseline configurations, (ii) report standard deviations across repeated runs, (iii) describe the prompt selection and split procedure, and (iv) confirm that all reported outcomes derive from fully autonomous agent trajectories without manual post-processing. revision: yes

  3. Referee: [GameCoder-27B] GameCoder-27B training pipeline: the three-stage process (continual pre-training, SFT, execution-grounded RL) is described at a high level, but no ablation results or quantitative metrics per stage are provided to show that the RL component specifically improves resolution of multi-file inconsistencies over the base model.

    Authors: We will expand the GameCoder-27B section with available per-stage quantitative metrics (perplexity reduction after continual pre-training, pass@1 improvements after SFT, and reward curves from the execution-grounded RL phase). Comprehensive ablations that isolate the RL stage’s contribution specifically to multi-file inconsistency resolution are not present in the current manuscript; we will note this limitation and flag it as future work. revision: partial
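
For context, pass@1 in such reports typically refers to the unbiased estimator introduced with HumanEval (Chen et al., 2021); a minimal sketch, assuming that standard definition applies here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 sampled programs per prompt, 5 passing all checks:
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```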

Circularity Check

0 steps flagged

No significant circularity; new framework, model, and benchmark are independent contributions

full rationale

The paper's central claims rest on the introduction of OpenGame (with its Template Skill and Debug Skill), the GameCoder-27B three-stage training pipeline, and the new OpenGame-Bench evaluation pipeline, followed by empirical results on 150 diverse prompts. No load-bearing self-citations appear in the provided text, no predictions or results reduce by construction to fitted inputs or prior self-referential definitions, and the evaluation metrics (headless execution plus VLM judging) are presented as newly defined rather than tautological. The derivation chain is self-contained empirical work on novel artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that game development failures are primarily due to cross-file inconsistencies that can be addressed by reusable skills and execution feedback rather than fundamental limitations in current LLMs.

axioms (2)
  • domain assumption: LLMs can be specialized for game engines via continual pre-training, SFT, and execution-grounded RL
    Invoked in the description of the GameCoder-27B training pipeline
  • domain assumption: Headless browser execution plus VLM judging can reliably score interactive playability
    Basis for OpenGame-Bench scoring of Build Health, Visual Usability, and Intent Alignment
invented entities (2)
  • Game Skill (no independent evidence)
    purpose: Reusable evolving capability for scaffolding and debugging game projects
    Core component of the framework, composed of Template Skill and Debug Skill
  • GameCoder-27B (no independent evidence)
    purpose: Specialized code LLM for game engine mastery
    Model powering the agentic framework after three-stage training

pith-pipeline@v0.9.0 · 5614 in / 1543 out tokens · 29890 ms · 2026-05-10T04:08:12.686910+00:00 · methodology

