{"paper":{"title":"Diffusion Models Are Real-Time Game Engines","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM.","cross_cats":["cs.AI","cs.CV"],"primary_cat":"cs.LG","authors_text":"Dani Valevski, Moab Arar, Shlomi Fruchter, Yaniv Leviathan","submitted_at":"2024-08-27T07:46:07Z","abstract_excerpt":"We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random ch"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality, running at 20 frames per second on a single TPU while remaining stable over extended multi-minute play sessions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That conditioning augmentations and decoder fine-tuning will continue to prevent error accumulation and visual drift during extended auto-regressive rollouts beyond the tested multi-minute sessions, without additional mechanisms for long-term memory or consistency.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e1c4e5388cd14c16d131c71be4683ad24d218003939845bddcb1b7dcb68b3678"},"source":{"id":"2408.14837","kind":"arxiv","version":2},"verdict":{"id":"00e83f96-7c0a-47d6-a5cb-1667429aeae4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T12:01:01.639050Z","strongest_claim":"GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality, running at 20 frames per second on a single TPU while remaining stable over extended multi-minute play sessions.","one_line_summary":"A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That conditioning augmentations and decoder fine-tuning will continue to prevent error accumulation and visual drift during extended auto-regressive rollouts beyond the tested multi-minute sessions, without additional mechanisms for long-term memory or consistency.","pith_extraction_headline":"A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM."},"references":{"count":88,"sample":[{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , volume=","work_id":"bd768442-f658-45cd-8a6d-dec95c5eaa2b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=","work_id":"575b0292-9b85-4da9-a1f5-5a4768b0f754","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","work_id":"0a105815-ff2e-43ce-8566-966cdcae1af4","ref_index":3,"cited_arxiv_id":"2206.10789","is_internal_anchor":true},{"doi":"","year":2022,"title":"The Tenth International Conference on Learning Representations,","work_id":"7ec62083-2e3c-4a9c-84bb-f15142ac7b26","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Fast high-resolution image synthesis with latent adversarial diffusion distillation","work_id":"94329895-d192-430f-a517-12f01ae2de5e","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":88,"snapshot_sha256":"1528e491aceefefa9033b41e73fe427d7f05de460717ca050244942f9e38caee","internal_anchors":15},"formal_canon":{"evidence_count":3,"snapshot_sha256":"20add8aa95c3bac462ed980a09db219a776b3c636e9b714ac352a763baf0ceca"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}