pith. sign in

arxiv: 2605.15256 · v1 · pith:VBDWMOHKnew · submitted 2026-05-14 · 💻 cs.CV

ReactiveGWM: Steering NPC in Reactive Game World Models

Pith reviewed 2026-05-19 16:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords ReactiveGWMNPC interactionsgame world modelszero-shot transfercross-attention modulesdiffusion modelsplayer-NPC interactionsstrategy adherence
0
0 comments X

The pith

ReactiveGWM decouples player controls from NPC behaviors to enable zero-shot strategy transfer across game world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current game world models simulate environments from a player-centric view but treat NPCs as background pixels, missing the physical understanding of action-induced reactivities. ReactiveGWM explicitly separates player actions, injected as a lightweight additive bias into the diffusion backbone, from NPC responses such as Offense, Control, and Defense that are grounded by cross-attention modules. These modules learn a representation of interactive logic that does not tie to any single game. A sympathetic reader would care because this turns passive video renderers into active simulation engines where users can direct NPC strategies on demand and move the modules straight into existing models of other titles without retraining.

Core claim

ReactiveGWM synthesizes dynamic interactions between the player and NPC by explicitly decoupling player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: the learned modules can be plugged directly into off-the-shelf, unannotated world models of different games, instantly unlocking steerable NPC interactions without any domain-specific retraining.

What carries the argument

Cross-attention modules that ground high-level NPC responses and learn a game-agnostic representation of interactive logic for direct transfer.

If this is right

  • The model maintains fine-grain player controllability while producing robust, prompt-aligned NPC strategy adherence.
  • Steerable NPC interactions become available in any off-the-shelf unannotated world model without retraining.
  • The approach scales to strategy-rich player-NPC engagements across multiple game titles.
  • High-level NPC behaviors like Offense, Control, and Defense can be directed independently of low-level player inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling pattern could be tested in other interactive simulation settings where one agent must respond to another without retraining per domain.
  • If the interactive logic proves largely universal, similar modules might later support prompt-based control in multi-agent or robotics environments.
  • Extending the set of NPC response categories beyond Offense, Control, and Defense could be checked for continued transfer performance.

Load-bearing premise

The cross-attention modules learn a game-agnostic representation of interactive logic that transfers directly into different games without retraining or fine-tuning.

What would settle it

Plugging the cross-attention modules into an off-the-shelf world model of a new game and observing that NPC responses no longer match the prompted strategies, such as producing wrong defensive actions when an offense prompt is given.

Figures

Figures reproduced from arXiv: 2605.15256 by Danze Chen, Xingyi Yang, Yeying Jin, Yinhan Zhang, Zeqing Wang, Zhaohu Xing, Zizhao Tong.

Figure 1
Figure 1. Figure 1: Visualization of the steerable NPC executing distinct strategies in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the data construction and strategy annotation pipeline. Each sample consists [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DiT block with action module. To condition frame generation on discrete actions at ∈ {0, 1} K (where K=10 for both SF2 and SF3), we adopt a lightweight additive bias mechanism instead of introduc￾ing heavy adapters or cross-attention modules [17, 29]. Let T denote the number of input video frames and Tv the temporal compression ratio of the VAE, so that the latent temporal length is f = T /Tv. The raw butt… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of ReactiveGWM training and training-free transfer to the different game. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Action control in ReactiveGWMbase. The player-controlled character is denoted by a ▲ triangle. The specific action button mappings for each game are detailed in Appendix A. Preserved Control and Fidelity. Crucially, empowering NPC autonomy does not compromise core mechanics. For single-action testing, ReactiveGWM maintains near-perfect Action Control (e.g., 100.0% Move-Acc and Att-Acc in SF3) and visual qu… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison between the vanilla model and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Execution of active behaviors. The NPC (indicated by the ▲ triangle) is successfully guided to perform specific defined actions. Visual Preservation [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of the steerable NPC executing distinct strategies in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of LingBot-World and Matrix-Game-3.0 on SF2. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: User Study Part 1 in Player Action Following. Mean participant scores (± SEM) on a 1–5 Likert scale, where higher scores indicate better action following. therefore directly measures how faithfully the generated NPC carries out the prompted strategy: a higher accuracy means the strategy condition successfully controls the NPC’s behaviour. As shown in [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: User Study Part 2 in NPC Strategy Following. Per-class and overall human strategy￾classification accuracy. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactivities. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grain player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ReactiveGWM, a reactive game world model for synthesizing dynamic player-NPC interactions in diffusion-based simulators. It decouples player controls (injected via additive bias into the diffusion backbone) from NPC behaviors (modeled via cross-attention modules for high-level strategies like Offense, Control, Defense). The key innovation is that these modules learn a game-agnostic representation of interactive logic, enabling zero-shot transfer by plugging the modules into off-the-shelf world models of different games without retraining. The approach is evaluated on two Street Fighter games, claiming maintained player controllability and robust prompt-aligned NPC strategy adherence.

Significance. If the central claims hold, this work could significantly advance game world models by turning them into reactive simulation engines capable of modeling NPC reactivities. The decoupling and plug-and-play transfer mechanism offers a scalable way to add steerable NPC interactions across games, which is a notable contribution if the generalization to dissimilar games is demonstrated. The absence of quantitative results in the provided description makes it difficult to assess the practical impact at this stage.

major comments (2)
  1. [Abstract] Abstract: The claim that the cross-attention modules 'learn a game-agnostic representation of interactive logic' enabling 'zero-shot strategy transfer' to 'different games' lacks supporting evidence. The evaluation is described only for two Street Fighter games, which share near-identical action spaces, visual styles, and physics. No results or experiments are provided for transfer to world models with substantially different dynamics, such as varying physics or control vocabularies, which is necessary to substantiate the broader claim.
  2. [Abstract] Abstract: No quantitative metrics, ablation studies, or error analysis are mentioned despite the abstract stating that ReactiveGWM 'maintains fine-grain player controllability while achieving robust, prompt-aligned NPC strategy adherence'. This makes it impossible to verify the strength of the results or the robustness of the decoupling approach.
minor comments (1)
  1. [Abstract] The description of the architecture could benefit from more detail on how the lightweight additive bias for player actions is implemented and how it interacts with the cross-attention modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the scope of our claims and the presentation of results. We address each major comment below and describe the revisions planned for the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the cross-attention modules 'learn a game-agnostic representation of interactive logic' enabling 'zero-shot strategy transfer' to 'different games' lacks supporting evidence. The evaluation is described only for two Street Fighter games, which share near-identical action spaces, visual styles, and physics. No results or experiments are provided for transfer to world models with substantially different dynamics, such as varying physics or control vocabularies, which is necessary to substantiate the broader claim.

    Authors: We appreciate this observation on the breadth of the generalization claim. Our experiments demonstrate zero-shot transfer of the cross-attention modules between two Street Fighter variants without retraining, where the games differ in character-specific movesets and minor visual/physics details. This supports high-level strategy decoupling from low-level game specifics. However, we acknowledge that these games are not substantially dissimilar in core dynamics. To strengthen the evidence, we will add new experiments transferring the modules to an off-the-shelf world model from a different genre (e.g., a platformer with distinct control vocabulary and physics) and include the results in the revised manuscript. revision: yes

  2. Referee: [Abstract] Abstract: No quantitative metrics, ablation studies, or error analysis are mentioned despite the abstract stating that ReactiveGWM 'maintains fine-grain player controllability while achieving robust, prompt-aligned NPC strategy adherence'. This makes it impossible to verify the strength of the results or the robustness of the decoupling approach.

    Authors: We agree that the abstract would benefit from explicit reference to supporting evidence. The full manuscript reports quantitative results, including player controllability via action prediction accuracy and NPC strategy adherence via prompt alignment metrics, as well as ablation studies on the additive bias and cross-attention components and error analysis for edge cases. We will revise the abstract to summarize these key quantitative findings and metrics to better substantiate the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: ReactiveGWM is presented as a new architectural construction

full rationale

The paper introduces ReactiveGWM by describing a decoupling of player controls (via additive bias) from NPC behaviors (via cross-attention modules) and asserts that the modules learn a game-agnostic representation enabling zero-shot transfer. No equations, derivations, or self-citations are provided in the text that reduce the transfer claim or the agnostic representation to a fitted quantity defined by the model itself or to prior author work. The approach is framed as a novel construction evaluated on Street Fighter variants rather than a closed derivation chain that collapses to its inputs by construction. This is the most common honest finding for model-proposal papers without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to assumptions stated or implied there; no explicit free parameters, axioms, or invented entities are detailed beyond the architectural choices.

axioms (1)
  • domain assumption Cross-attention modules can learn a game-agnostic representation of interactive logic
    The transfer claim rests on this unproven generalization property of the learned modules.

pith-pipeline@v0.9.0 · 5771 in / 1271 out tokens · 63106 ms · 2026-05-19T16:27:56.770859+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 16 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  2. [2]

    An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.arXiv:1803.01271, 2018

  3. [3]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  6. [6]

    Street fighter II: Champion Edition

    Capcom. Street fighter II: Champion Edition. Video game, 1992. Arcade

  7. [7]

    Street fighter Alpha 3

    Capcom. Street fighter Alpha 3. Video game, 1998. Arcade

  8. [8]

    Control-a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning.arXiv preprint arXiv:2305.13840, 2023

    Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning.arXiv preprint arXiv:2305.13840, 2023

  9. [9]

    Oasis: A universe in a transformer.URL: https://oasis-model

    Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer.URL: https://oasis-model. github. io, 2(3):6, 2024

  10. [10]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

  11. [11]

    Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018

  12. [12]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018. 10

  13. [13]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  14. [14]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

  15. [15]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020

  16. [16]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

  17. [17]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model, 2026

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 2.0: An open-source real-time and streaming interactive world model, 2026

  18. [18]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8153–8163, 2024

  19. [19]

    Animate anyone 2: High-fidelity character image animation with environment affordance

    Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10207–10217, 2025

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  21. [21]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

  22. [22]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

  23. [23]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

  24. [24]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  25. [25]

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yu...

  26. [26]

    Stable retro, a maintained fork of openai’s gym-retro

    Mathieu Poliquin. Stable retro, a maintained fork of openai’s gym-retro. https://github.com/ Farama-Foundation/stable-retro, 2026

  27. [27]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  28. [28]

    Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

  29. [29]

    Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory

    Skywork AI Matrix-Game Team. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. Technical report, 2026. 11

  30. [30]

    Vision bridge transformer at scale.arXiv preprint arXiv:2511.23199, 2025

    Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, and Xinchao Wang. Vision bridge transformer at scale.arXiv preprint arXiv:2511.23199, 2025

  31. [31]

    Dai, Anja Hauth, and et al

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, and et al. Gemini: A family of highly capable multimodal models, 2025

  32. [32]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  33. [33]

    Advancing open-source world models, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models, 2026

  34. [34]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

  35. [35]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  36. [36]

    Videocomposer: Compositional video synthesis with motion controllability

    Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36:7594–7611, 2023

  37. [37]

    Minute-long videos with dual parallelisms.Proceedings of the AAAI Conference on Artificial Intelligence, 40(12):10358– 10366, Mar

    Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, and Xinchao Wang. Minute-long videos with dual parallelisms.Proceedings of the AAAI Conference on Artificial Intelligence, 40(12):10358– 10366, Mar. 2026

  38. [38]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

  39. [39]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  40. [40]

    Daydreamer: World models for physical robot learning

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InConference on robot learning, pages 2226–2240. PMLR, 2023

  41. [41]

    Draganything: Motion control for anything using entity representation

    Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024

  42. [42]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024

  43. [43]

    Direct-a-video: Customized video generation with user-directed camera movement and object motion

    Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

  44. [44]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  45. [45]

    DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Drag- nuwa: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023

  46. [46]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11590–11599, October 2025. 12

  47. [47]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  48. [48]

    Tora: Trajectory-oriented diffusion transformer for video generation

    Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2063–2073, 2025

  49. [49]

    Offense" — actively pressures the player: (i) sustained forward movement toward the player across multiple frames, OR (ii) TWO OR MORE distinct close-range attacks

    Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. InEuropean Conference on Computer Vision, pages 145–162. Springer, 2024. 13 A Data construction This appendix details the data construction pipeline introduced i...

  50. [50]

    Sonic Boom? -> Control

  51. [51]

    Extended distance crouch/zoning posture? -> Control

  52. [52]

    Sustained forward movement OR >=2 close-range attacks? -> Offense

  53. [53]

    npc_side

    Otherwise -> Defense EDGE CASES: - Post-match KO/defeat/victory animation for most of clip -> npc_visible=false - Rendering broken or NPC missing/unidentifiable -> npc_visible=false - Ryu fireball does NOT count as Guile Sonic Boom Output EXACTLY this JSON object: { "npc_side": "left" or "right", "npc_visible": true or false, "category": "Control" | "Defe...