pith. machine review for the scientific record.

arxiv: 2605.07442 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 1 theorem link


GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

Borui Wan, Chaobo Jia, Guangming Sheng, Hong Xu, Ruipeng Wan, Ting Sun, Weihao Tan, Yuxuan Tong

Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM game generation · automated verification · keypoint-based testing · runtime state injection · game correctness · parallel verification · VeriGame dataset

The pith

GameGen-Verifier verifies LLM-generated games by decomposing specifications into keypoints tested through runtime state injection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based game generation requires reliable verification because a game can look correct yet fail on core mechanics during extended play. Agent-based verifiers that simulate full gameplay are slow, limited in coverage, and dependent on the agent's skill. GameGen-Verifier addresses this by splitting the specification into verifiable keypoints, each handled by an independent unit that patches the game into a specific state, runs a short bounded test, and checks the result. On a dataset of 100 games, this achieves up to 92.2 percent agreement with human judgments, versus 58.8 percent for the baseline, while cutting verification time by up to 16.6 times.

Core claim

We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.

What carries the argument

Independent verification units that use runtime state injection to patch the game into target states for testing individual keypoints from the specification.
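The unit structure can be sketched in miniature. Everything below is a hypothetical illustration, not the authors' implementation: the toy `Game`, its state keys, and the two keypoints are invented for this sketch. Each unit patches a fresh runtime into a target state, runs a bounded action script, and checks one assertion; because units share nothing, they parallelize trivially.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Game:
    # Hypothetical toy runtime: a state dict plus one step rule.
    state: dict = field(default_factory=lambda: {"hp": 10, "phase": "play"})

    def step(self, action: str) -> None:
        if action == "hit":
            self.state["hp"] -= 3
        if self.state["hp"] <= 0:
            self.state["phase"] = "game_over"

@dataclass
class KeypointUnit:
    name: str
    inject: dict                     # runtime state injection target
    actions: list                    # bounded interaction script
    check: Callable[[dict], bool]    # keypoint assertion

    def run(self) -> tuple[str, bool]:
        game = Game()                # fresh, isolated instance per unit
        game.state.update(self.inject)
        for action in self.actions:
            game.step(action)
        return self.name, self.check(game.state)

units = [
    KeypointUnit("lethal hit ends the game", {"hp": 2}, ["hit"],
                 lambda s: s["phase"] == "game_over"),
    KeypointUnit("non-lethal hit stays in play", {"hp": 10}, ["hit"],
                 lambda s: s["phase"] == "play" and s["hp"] == 7),
]

# Independent units run concurrently; verdicts map keypoint -> pass/fail.
with ThreadPoolExecutor() as pool:
    verdicts = dict(pool.map(KeypointUnit.run, units))

print(verdicts)
```

Note how the first unit never plays down from full health: the injection starts it two hit points from death, which is exactly the reachability shortcut the paper claims over full-gameplay agents.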

If this is right

  • Verification time no longer depends on reaching distant game states through full play.
  • Critical mechanics such as state updates, interaction rules, and phase transitions can be checked directly.
  • The approach supports concurrency across verification units for better scalability.
  • Results are less sensitive to the performance of any single verification agent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This verification strategy may extend to other domains involving long-horizon interactive behaviors, such as AI-controlled simulations.
  • Combining keypoint tests with the original game code could reveal more about how well the LLM captures intended mechanics.
  • The parallel nature suggests potential for real-time verification during game generation processes.
  • Limitations in handling emergent interactions could be addressed by adding dependency graphs between keypoints in future extensions.

Load-bearing premise

Game specifications decompose into independent keypoints such that isolated tests capture every important mechanic, without missing failures that arise only from how those mechanics interact in actual play.

What would settle it

Test GameGen-Verifier on games containing bugs that manifest only through interactions between multiple keypoints during continuous play, and observe whether its isolated tests miss them.
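A minimal, purely hypothetical example of such a missed bug: each keypoint passes when verified on an injected state in isolation, but continuous play composes the two mechanics and violates the spec. The toy game and its "heal" bug are our construction, not an example from the paper.

```python
# Hypothetical toy game with an interaction bug: heal ignores the phase.
def step(s: dict, action: str) -> dict:
    if action == "hit" and s["phase"] == "play":
        s["hp"] -= 3
        if s["hp"] <= 0:
            s["phase"] = "game_over"
    if action == "heal":              # bug: no phase guard
        s["hp"] = 10
        s["phase"] = "play"
    return s

# Isolated keypoint checks, each on its own injected state: both pass.
assert step({"hp": 2, "phase": "play"}, "hit")["phase"] == "game_over"
assert step({"hp": 1, "phase": "play"}, "heal")["hp"] == 10

# Continuous play composes them: a finished game silently un-finishes.
s = {"hp": 2, "phase": "play"}
step(s, "hit")                        # phase -> "game_over"
step(s, "heal")                       # bug fires: phase -> "play" again
print(s["phase"])
```

A keypoint-level verifier that never chains "hit" and "heal" in one run would report this game correct; that is the failure mode the experiment above would probe.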

Figures

Figures reproduced from arXiv: 2605.07442 by Borui Wan, Chaobo Jia, Guangming Sheng, Hong Xu, Ruipeng Wan, Ting Sun, Weihao Tan, Yuxuan Tong.

Figure 1. A user specification (1) drives a game generation agent that synthesizes worlds, mechanics, …
Figure 2. Applying AaaV to game verification forces the agent through long gameplay sequences to …
Figure 3. GameGen-Verifier extracts verifiable keypoints from a natural-language specification and …
read the original abstract

LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims to introduce GameGen-Verifier, a verification method for LLM-generated games that decomposes specifications into keypoints and uses runtime state injection for parallel, bounded verification of each keypoint via the GGV-Harness. On a dataset of 100 games, it reports achieving 92.2% accuracy against human judgments (vs. 58.8% for baseline) and up to 16.6x reduction in wall-clock time.

Significance. If the results hold, this could be a significant advance in automated verification for generated interactive content, overcoming the reachability and coverage limitations of agent-based verifiers. The use of parallel keypoint verification and runtime injection is a novel approach that could generalize to other domains requiring long-horizon correctness checks. The multi-genre dataset provides some breadth to the evaluation.

major comments (3)
  1. [Abstract] The headline claim of 92.2% accuracy requires that keypoints derived from the spec can be verified in isolation without missing violations from mechanic interactions or injection side effects. No coverage argument or completeness proof for the decomposition is supplied in the abstract or elsewhere, which is load-bearing for the reliability of the method.
  2. [§4 Experiments] The empirical results lack implementation details, error analysis, statistical tests, or description of keypoint extraction and injection mechanics. With a modest dataset of 100 games and no failure cases shown, it is difficult to assess the robustness of the accuracy and speedup numbers.
  3. [§3 Method] The GGV-Harness is described as providing concurrency management and runtime isolation, but there is no discussion of how injected states preserve cross-keypoint invariants or handle state-dependent side effects, which could invalidate the independent verification assumption.
minor comments (1)
  1. The notation for 'keypoints' and 'verification units' could be clarified with a formal definition or example early in the paper.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the positive assessment of the potential significance of GameGen-Verifier and address each major comment below with specific plans for revision.

read point-by-point responses
  1. Referee: [Abstract] The headline claim of 92.2% accuracy requires that keypoints derived from the spec can be verified in isolation without missing violations from mechanic interactions or injection side effects. No coverage argument or completeness proof for the decomposition is supplied in the abstract or elsewhere, which is load-bearing for the reliability of the method.

    Authors: We agree that the reliability of the 92.2% figure rests on the validity of independent keypoint verification. Section 3 of the manuscript describes the decomposition as extracting atomic assertions from the natural-language specification (state updates, interaction rules, and phase transitions), with each keypoint grounded to a bounded interaction in an isolated runtime. While we do not supply a formal completeness proof—the decomposition is heuristic and driven by the structure of game specifications—we provide empirical grounding through direct comparison to human judgments on the full VeriGame dataset. To strengthen the presentation, we will revise the abstract to explicitly note the empirical validation and add a limitations subsection discussing the assumptions of the decomposition, including potential missed interactions between mechanics. revision: partial

  2. Referee: [§4 Experiments] The empirical results lack implementation details, error analysis, statistical tests, or description of keypoint extraction and injection mechanics. With a modest dataset of 100 games and no failure cases shown, it is difficult to assess the robustness of the accuracy and speedup numbers.

    Authors: We accept that additional transparency is required. In the revised §4 we will include: pseudocode and concrete examples for keypoint extraction and runtime state injection; a dedicated error analysis subsection with representative failure cases (both false positives and false negatives); statistical significance testing (McNemar’s test for accuracy and paired t-tests for wall-clock time); and implementation details of the GGV-Harness (concurrency model, isolation primitives, and fault-recovery logic). Although the dataset comprises 100 games across seven genres, we will also add an explicit discussion of dataset size as a limitation and report per-genre breakdowns. revision: yes

  3. Referee: [§3 Method] The GGV-Harness is described as providing concurrency management and runtime isolation, but there is no discussion of how injected states preserve cross-keypoint invariants or handle state-dependent side effects, which could invalidate the independent verification assumption.

    Authors: The GGV-Harness creates a fresh, isolated game instance for each keypoint verification and performs state injection through deterministic patches that target only the variables referenced by that keypoint. This design intentionally resets cross-keypoint state to avoid carry-over. We acknowledge that an explicit treatment of invariants and side effects is missing. We will add a new paragraph in §3.2 that (a) formalizes the per-verification reset protocol, (b) explains how injection is scoped to prevent unintended global mutations, and (c) discusses residual risks of state-dependent side effects together with the mitigation strategies employed in the current implementation. revision: yes
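The reset-and-scoping protocol described in this response could look roughly like the following. This is a sketch under our own assumptions, not the authors' code: the baseline state, the variable names, and the rejection rule for out-of-scope patches are all hypothetical.

```python
import copy

# Hypothetical baseline state shared by all verifications of one game.
BASE_STATE = {"hp": 10, "score": 0, "phase": "play", "inventory": []}

def scoped_inject(patch: dict) -> dict:
    """Fresh instance per verification: deep-copy the baseline, then
    apply a patch restricted to declared state variables only."""
    unknown = set(patch) - set(BASE_STATE)
    if unknown:
        raise KeyError(f"patch touches undeclared variables: {unknown}")
    state = copy.deepcopy(BASE_STATE)
    state.update(copy.deepcopy(patch))
    return state

s1 = scoped_inject({"hp": 1})
s1["inventory"].append("sword")   # mutate one instance freely...
s2 = scoped_inject({"score": 99})
print(s2["inventory"])            # ...the next unit sees a clean baseline
```

The deep copy is what makes the per-verification reset credible: mutable sub-state (here, `inventory`) cannot leak between units, which is the carry-over risk the referee raised.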

Circularity Check

0 steps flagged

No circularity; empirical evaluation is independent of method definition

full rationale

The paper describes a keypoint decomposition and runtime-injection verification harness as a design choice for checking LLM-generated games, then reports measured accuracy (92.2%) against human judgments on the external VeriGame dataset of 100 games, compared to a separate Agent-as-a-Verifier baseline. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the method or its performance numbers by construction. The central results are falsifiable external benchmarks rather than tautological predictions or renamings of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unproven assumption that specifications decompose cleanly into independent keypoints and that runtime injection faithfully reproduces target states without side effects; no free parameters are introduced, and the two invented entities are methodological constructs rather than physical ones.

axioms (1)
  • domain assumption Game correctness can be decomposed into independent verifiable keypoints that together cover all critical mechanics.
    Invoked in the decomposition step described in the abstract.
invented entities (2)
  • Keypoint no independent evidence
    purpose: Atomic verifiable unit extracted from a game specification
    New conceptual unit introduced to enable targeted verification.
  • GGV-Harness no independent evidence
    purpose: Scalable runtime for concurrent verification units with isolation and recovery
    New implementation artifact required to realize the parallel verification.

pith-pipeline@v0.9.0 · 5543 in / 1454 out tokens · 42063 ms · 2026-05-11T02:09:50.691453+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
