GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3
The pith
GameGen-Verifier verifies LLM-generated games by decomposing specifications into keypoints tested through runtime state injection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
What carries the argument
Independent verification units that use runtime state injection to patch the game into target states for testing individual keypoints from the specification.
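The unit described above can be sketched as a small function. Every name below (Keypoint, verify_keypoint, step) is an illustrative assumption under a toy dict-based game state, since the paper does not expose its actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Keypoint:
    name: str
    target_state: dict                  # state to patch into the runtime
    interaction: list[str]              # bounded sequence of input events
    assertion: Callable[[dict], bool]   # predicate on the resulting state

def verify_keypoint(game_state: dict, kp: Keypoint,
                    step: Callable[[dict, str], dict]) -> bool:
    # 1. Patch the runtime directly into the target state (state injection),
    #    skipping the gameplay otherwise needed to reach it.
    state = {**game_state, **kp.target_state}
    # 2. Execute a bounded interaction.
    for event in kp.interaction:
        state = step(state, event)
    # 3. Judge the outcome against the keypoint assertion.
    return kp.assertion(state)
```

Because each call touches only its own copied state, many such units can be checked independently, which is what makes the parallelism claim plausible.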
If this is right
- Verification time no longer depends on reaching distant game states through full play.
- Critical mechanics such as state updates, interaction rules, and phase transitions can be checked directly.
- The approach supports concurrency across verification units for better scalability.
- Results are less sensitive to the performance of any single verification agent.
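A minimal sketch of what concurrency management with fault recovery across verification units could look like; the harness internals are not described in the abstract, so run_units and the verdict labels are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_units(units, verify, max_workers=8):
    """Run each verification unit concurrently; a crash in one unit is
    recorded as an 'error' verdict instead of aborting the whole batch."""
    verdicts = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(verify, u): u for u in units}
        for fut in as_completed(futures):
            unit = futures[fut]
            try:
                verdicts[unit] = "pass" if fut.result() else "fail"
            except Exception:
                # fault recovery: the failure stays scoped to this unit
                verdicts[unit] = "error"
    return verdicts
```

The key property is that verdicts are per-unit, so one hung or crashing game instance cannot poison the rest of the batch.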
Where Pith is reading between the lines
- This verification strategy may extend to other domains involving long-horizon interactive behaviors, such as AI-controlled simulations.
- Combining keypoint tests with the original game code could reveal more about how well the LLM captures intended mechanics.
- The parallel nature suggests potential for real-time verification during game generation processes.
- Limitations in handling emergent interactions could be addressed by adding dependency graphs between keypoints in future extensions.
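The dependency-graph extension suggested in the last point could be sketched with the standard library's topological sorter; the keypoint names and dependencies below are invented purely for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph between keypoints: a keypoint is only
# verified after the keypoints it depends on, so interaction-sensitive
# tests run against already-validated prerequisites.
deps = {
    "phase_transition": {"state_update"},
    "win_condition": {"phase_transition", "interaction_rule"},
}
order = list(TopologicalSorter(deps).static_order())
# prerequisites precede dependents, e.g. "state_update" comes before
# "phase_transition", which comes before "win_condition"
```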
Load-bearing premise
Game specifications decompose into independent keypoints such that isolated tests capture all important mechanics, without missing failures that arise only when those mechanics interact in actual play.
What would settle it
Test GameGen-Verifier on games containing bugs that manifest only through interactions between multiple keypoints during continuous play; if its isolated tests miss bugs that full playthroughs catch, the premise fails.
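A toy illustration of the failure mode this probe targets, assuming nothing about the paper's implementation: a mechanic whose isolated keypoint test passes can still violate a spec invariant when applied repeatedly in continuous play.

```python
MAX_SPEED = 10   # hypothetical spec invariant: speed never exceeds 10

def apply_powerup(state):
    # keypoint test in isolation: "powerup doubles speed" passes
    state["speed"] *= 2
    return state

state = {"speed": 4}
apply_powerup(state)   # speed = 8, invariant still holds
apply_powerup(state)   # speed = 16, invariant violated; no single
                       # keypoint test exercises stacked powerups
```

An isolated unit that injects speed = 4 and applies one powerup sees correct behavior; only the composed trajectory exposes the bug.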
Original abstract
LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce GameGen-Verifier, a verification method for LLM-generated games that decomposes specifications into keypoints and uses runtime state injection for parallel, bounded verification of each keypoint via the GGV-Harness. On a dataset of 100 games, it reports achieving 92.2% accuracy against human judgments (vs. 58.8% for baseline) and up to 16.6x reduction in wall-clock time.
Significance. If the results hold, this could be a significant advance in automated verification for generated interactive content, overcoming the reachability and coverage limitations of agent-based verifiers. The use of parallel keypoint verification and runtime injection is a novel approach that could generalize to other domains requiring long-horizon correctness checks. The multi-genre dataset provides some breadth to the evaluation.
major comments (3)
- [Abstract] The headline claim of 92.2% accuracy requires that keypoints derived from the spec can be verified in isolation without missing violations from mechanic interactions or injection side effects. No coverage argument or completeness proof for the decomposition is supplied in the abstract or elsewhere, which is load-bearing for the reliability of the method.
- [§4 Experiments] The empirical results lack implementation details, error analysis, statistical tests, or description of keypoint extraction and injection mechanics. With a modest dataset of 100 games and no failure cases shown, it is difficult to assess the robustness of the accuracy and speedup numbers.
- [§3 Method] The GGV-Harness is described as providing concurrency management and runtime isolation, but there is no discussion of how injected states preserve cross-keypoint invariants or handle state-dependent side effects, which could invalidate the independent verification assumption.
minor comments (1)
- The notation for 'keypoints' and 'verification units' could be clarified with a formal definition or example early in the paper.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We appreciate the positive assessment of the potential significance of GameGen-Verifier and address each major comment below with specific plans for revision.
Point-by-point responses
Referee: [Abstract] The headline claim of 92.2% accuracy requires that keypoints derived from the spec can be verified in isolation without missing violations from mechanic interactions or injection side effects. No coverage argument or completeness proof for the decomposition is supplied in the abstract or elsewhere, which is load-bearing for the reliability of the method.
Authors: We agree that the reliability of the 92.2% figure rests on the validity of independent keypoint verification. Section 3 of the manuscript describes the decomposition as extracting atomic assertions from the natural-language specification (state updates, interaction rules, and phase transitions), with each keypoint grounded to a bounded interaction in an isolated runtime. While we do not supply a formal completeness proof—the decomposition is heuristic and driven by the structure of game specifications—we provide empirical grounding through direct comparison to human judgments on the full VeriGame dataset. To strengthen the presentation, we will revise the abstract to explicitly note the empirical validation and add a limitations subsection discussing the assumptions of the decomposition, including potential missed interactions between mechanics. revision: partial
Referee: [§4 Experiments] The empirical results lack implementation details, error analysis, statistical tests, or description of keypoint extraction and injection mechanics. With a modest dataset of 100 games and no failure cases shown, it is difficult to assess the robustness of the accuracy and speedup numbers.
Authors: We accept that additional transparency is required. In the revised §4 we will include: pseudocode and concrete examples for keypoint extraction and runtime state injection; a dedicated error analysis subsection with representative failure cases (both false positives and false negatives); statistical significance testing (McNemar’s test for accuracy and paired t-tests for wall-clock time); and implementation details of the GGV-Harness (concurrency model, isolation primitives, and fault-recovery logic). Although the dataset comprises 100 games across seven genres, we will also add an explicit discussion of dataset size as a limitation and report per-genre breakdowns. revision: yes
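For the proposed McNemar's test, an exact stdlib-only version could look like this (illustrative, not the authors' analysis code); b and c are the discordant counts, i.e. the games where exactly one of the two verifiers matches the human label:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c,
    under the null that discordances split 50/50 (Binomial(b+c, 0.5))."""
    n = b + c
    if n == 0:
        return 1.0   # no disagreement between the paired methods
    k = min(b, c)
    # lower tail P(X <= k); doubling gives the symmetric two-sided p-value
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With, say, 1 game where only the baseline is right and 9 where only the new verifier is right, the p-value is 22/1024, about 0.021.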
Referee: [§3 Method] The GGV-Harness is described as providing concurrency management and runtime isolation, but there is no discussion of how injected states preserve cross-keypoint invariants or handle state-dependent side effects, which could invalidate the independent verification assumption.
Authors: The GGV-Harness creates a fresh, isolated game instance for each keypoint verification and performs state injection through deterministic patches that target only the variables referenced by that keypoint. This design intentionally resets cross-keypoint state to avoid carry-over. We acknowledge that an explicit treatment of invariants and side effects is missing. We will add a new paragraph in §3.2 that (a) formalizes the per-verification reset protocol, (b) explains how injection is scoped to prevent unintended global mutations, and (c) discusses residual risks of state-dependent side effects together with the mitigation strategies employed in the current implementation. revision: yes
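The reset and scoping protocol the authors describe could be sketched as follows; fresh_instance and scoped_inject are hypothetical names, not the GGV-Harness API:

```python
import copy

def fresh_instance(initial_state: dict) -> dict:
    # per-verification reset: each keypoint gets a deep copy of the
    # initial game state, so nothing carries over between units
    return copy.deepcopy(initial_state)

def scoped_inject(state: dict, patch: dict, allowed: set) -> dict:
    # injection is scoped to the variables the keypoint references,
    # preventing unintended global mutations
    illegal = set(patch) - allowed
    if illegal:
        raise ValueError(f"patch outside keypoint scope: {illegal}")
    state.update(patch)
    return state
```

This captures the mechanism but not the residual risk the referee raises: a patch that is lexically in scope can still have semantic side effects on variables it does not name.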
Circularity Check
No circularity; empirical evaluation is independent of method definition
full rationale
The paper describes a keypoint decomposition and runtime-injection verification harness as a design choice for checking LLM-generated games, then reports measured accuracy (92.2%) against human judgments on the external VeriGame dataset of 100 games, compared to a separate Agent-as-a-Verifier baseline. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the method or its performance numbers by construction. The central results are falsifiable external benchmarks rather than tautological predictions or renamings of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Game correctness can be decomposed into independent verifiable keypoints that together cover all critical mechanics.
invented entities (2)
- Keypoint: no independent evidence
- GGV-Harness: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction), relevance unclear; matched on: "decomposes a specification into verifiable keypoints and grounds them into independent verification units... patches the game runtime into a concrete target state, executes a bounded interaction"