Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3
The pith
Natural language conditioning enables simultaneous multi-entity control and cross-entity concept transfer in video world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents Incantation as the first interactive video world model that treats natural language as the per-latent-frame (0.25 s) action interface. It pairs a pretrained bidirectional video backbone with frame-local text cross-attention to support simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. Real-time long-horizon streaming is enabled by ODE-initialized Self-Forcing distillation together with a RoPE-decoupled sliding KV-cache. The system outperforms the Action-Index baseline on cross-entity transfer (89 percent versus 43 percent) and out-of-vocabulary prompts (90 percent versus 0 percent), and its 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts.
What carries the argument
Frame-local text cross-attention applied to each latent frame of a pretrained bidirectional video backbone, which injects natural language instructions independently per frame to drive multi-entity actions.
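To make the mechanism concrete, here is a minimal sketch of frame-local cross-attention in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the learned query, key, and value projections are replaced by the identity, there is a single head, and the function name `frame_local_cross_attention` is hypothetical. The point it shows is structural: visual tokens in latent frame f attend only to the prompt tokens assigned to frame f.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_local_cross_attention(visual, text):
    """Apply text cross-attention independently inside each latent frame.

    visual[f] is the list of visual token vectors for latent frame f.
    text[f] is the list of prompt token vectors for that same frame.
    Projections are the identity here (an assumption made for brevity).
    """
    outputs = []
    for visual_tokens, text_tokens in zip(visual, text):
        frame_out = []
        for query in visual_tokens:
            scale = math.sqrt(len(query))
            scores = [sum(q * k for q, k in zip(query, key)) / scale
                      for key in text_tokens]
            weights = softmax(scores)
            context = [sum(w * key[d] for w, key in zip(weights, text_tokens))
                       for d in range(len(query))]
            # Residual connection: the prompt modulates the token, it does not replace it.
            frame_out.append([q + c for q, c in zip(query, context)])
        outputs.append(frame_out)
    return outputs
```

Because the loop never mixes prompts across frames, editing the prompt for one latent frame leaves the conditioning of every other frame untouched; any temporal coupling would have to come from the backbone's own self-attention.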
If this is right
- The model surpasses the Action-Index baseline on cross-entity transfer (89 percent versus 43 percent) and on out-of-vocabulary prompts (90 percent versus 0 percent).
- Its 2-step student sustains real-time generation at 19.7 FPS at 480p, with stable FVD across 2-hour rollouts.
- The same architecture and training recipe carry over to The King of Fighters, with only the per-entity action vocabulary slots changed.
- A preview dataset of Elden Ring player-boss combat scenes with structured action metadata has been released to support further training.
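The page does not describe how the RoPE-decoupled sliding KV-cache works internally, so the following is only one plausible reading, sketched in plain Python. The assumptions are: keys are cached without rotation, rotary positions are assigned by slot within the window at attention time, and a single rotation frequency `theta` stands in for the per-pair frequencies of real RoPE. The class name `SlidingKVCache` is hypothetical.

```python
import math
from collections import deque


def rope(vec, pos, theta=0.1):
    """Rotate each (even, odd) pair of vec by pos * theta (single frequency, simplified)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


class SlidingKVCache:
    """Fixed-size cache; positions depend on the slot in the window, not on stream time."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)    # stored unrotated
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # When full, the deque silently evicts the oldest entry.
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        """Attend over the window; assumes at least one cached entry."""
        newest = len(self.keys) - 1
        q = rope(query, newest)  # the query sits at the newest slot
        scale = math.sqrt(len(query))
        scores = [sum(a * b for a, b in zip(q, rope(k, slot))) / scale
                  for slot, k in enumerate(self.keys)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w / total * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]
```

Under this reading, memory stays bounded by the window size and rotary positions never grow past it, which is one way a model could avoid position extrapolation during very long rollouts.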
Where Pith is reading between the lines
- The language interface could extend beyond games to domains like robotic simulation or animated storytelling where users describe behaviors for multiple agents.
- It opens the possibility of creating novel scenarios by describing entity interactions in everyday words rather than predefined controls.
- Strong performance on out-of-vocabulary prompts suggests the approach may handle instructions that go beyond the training distribution.
- Stable long-horizon coherence indicates the method could support extended interactive sessions without frequent resets.
Load-bearing premise
That adding frame-local text cross-attention to a pretrained video backbone is enough to achieve multi-entity control and cross-entity concept transfer while preserving visual fidelity and temporal coherence over long sequences.
What would settle it
Generate a video sequence in which two distinct entities receive contradictory natural language instructions within the same 0.25 s latent frame, then check whether the output shows coherent, separate actions for each entity rather than merged or incoherent motion.
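A test like this needs a per-latent-frame prompt schedule. The sketch below shows one way to build it, assuming prompts are plain strings that join per-entity clauses with a semicolon. That prompt format, the entity names, and the names `FramePrompt` and `build_schedule` are all hypothetical; only the 0.25 s interval comes from the paper.

```python
from dataclasses import dataclass

LATENT_FRAME_SECONDS = 0.25  # conditioning interval stated in the paper


@dataclass(frozen=True)
class FramePrompt:
    frame_index: int
    start_seconds: float
    text: str


def build_schedule(instructions, num_frames):
    """Expand per-entity action changes into one prompt per latent frame.

    instructions maps an entity name to a list of (first_frame, action)
    pairs sorted by first_frame; an action holds until the next change.
    """
    schedule = []
    for frame in range(num_frames):
        clauses = []
        for entity, changes in instructions.items():
            current = None
            for first_frame, action in changes:
                if first_frame <= frame:
                    current = action
            if current is not None:
                clauses.append(f"{entity} {current}")
        schedule.append(
            FramePrompt(frame, frame * LATENT_FRAME_SECONDS, "; ".join(clauses))
        )
    return schedule


# Contradictory directions for two entities, starting in the same latent frame.
contradictory = build_schedule(
    {
        "the player": [(0, "rolls left"), (2, "blocks")],
        "the boss": [(0, "leaps right")],
    },
    num_frames=8,
)
```

Each `FramePrompt.text` would then be fed as the conditioning for its latent frame, and the rollout inspected for whether each entity follows its own clause.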
Original abstract
Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Incantation, an interactive video world model that uses natural language prompts for per-latent-frame (0.25 s) conditioning on a pretrained bidirectional video backbone via frame-local text cross-attention. It claims this interface enables simultaneous multi-entity control and concept-level cross-entity transfer beyond fixed action indices or rendering pipelines, demonstrated through superior performance on Elden Ring and King of Fighters scenes (89% cross-entity transfer vs. 43% baseline; 90% OOV vs. 0%). Additional contributions include ODE-initialized Self-Forcing distillation for real-time long-horizon streaming at 19.7 FPS with stable FVD over 2-hour rollouts, and release of a preview dataset subset.
Significance. If the empirical claims hold under rigorous controls, the work would represent a meaningful advance in video world models by replacing rigid action interfaces with expressive natural language, potentially enabling broader generalization and multi-entity interactions. Strengths include the dataset release for reproducibility, the engineering for real-time performance, and direct comparisons showing gains on transfer and OOV tasks. The significance is limited by the current lack of verification details that would confirm the gains arise from the NL interface rather than backbone statistics.
major comments (2)
- [§4] §4 (Experiments) and associated tables: the central claim of simultaneous independent multi-entity control and concept-level cross-entity transfer rests on aggregate metrics (89% transfer, 90% OOV) without reported dataset size, number of evaluation episodes, statistical significance tests, per-entity fidelity breakdowns, or controls for prompt difficulty. This leaves open whether the observed gains isolate the NL interface or reflect global scene statistics from the pretrained backbone.
- [§3.1] §3.1 (Architecture, frame-local text cross-attention): the conditioning mechanism applies text cross-attention to global frame features without entity-specific tokens, phrase-to-entity binding, attention masks, or visual grounding (e.g., segmentation). This design risks non-independent control where conditioning for one entity bleeds into others, undermining the claim that the NL interface itself enables independent multi-entity control beyond the backbone's implicit statistics.
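The bleed concern in the second comment can be quantified if two things are available: the cross-attention weights, and a region of visual tokens for each entity. The sketch below assumes both. The regions would have to come from an external segmenter, since the architecture itself uses none, and the function name `phrase_mass_by_region` is hypothetical. It reports what share of the attention mass on one entity's phrase comes from visual tokens in each region.

```python
def phrase_mass_by_region(attn, phrase_tokens, regions):
    """Share of a phrase's attention mass contributed by each visual region.

    attn[n][t] is the weight visual token n places on text token t.
    phrase_tokens lists the text-token indices of one entity's instruction.
    regions maps a region name to the set of visual-token indices it covers.
    """
    mass = {
        name: sum(attn[n][t] for n in indices for t in phrase_tokens)
        for name, indices in regions.items()
    }
    total = sum(mass.values())
    return {name: (m / total if total else 0.0) for name, m in mass.items()}


# Toy example: text token 0 is the player's phrase, token 1 the boss's phrase.
toy_attn = [
    [0.9, 0.1],  # visual token 0 (player region)
    [0.9, 0.1],  # visual token 1 (player region)
    [0.1, 0.9],  # visual token 2 (boss region)
    [0.1, 0.9],  # visual token 3 (boss region)
]
toy_regions = {"player": {0, 1}, "boss": {2, 3}}
player_phrase_share = phrase_mass_by_region(toy_attn, [0], toy_regions)
```

A share close to 1.0 for the matching region would be consistent with separated control; a share near the region's area fraction would suggest the phrase conditions the frame globally.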
minor comments (2)
- [Abstract and §2] The abstract and §2 mention 'per-latent-frame (0.25 s)' conditioning but do not specify the exact latent frame rate or how it aligns with the video backbone's temporal resolution; a brief equation or diagram would clarify.
- [Figures] Figure captions for rollout examples could include quantitative FVD values per sequence length to better support the 'stable over 2-hour rollouts' claim.
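The timing relations the first minor comment asks for follow directly from the stated 0.25 s interval. In the block below, $f_{\text{video}}$ (the backbone's pixel frame rate) and $k$ (pixel frames per latent frame) are symbols introduced here for illustration; the page does not report $f_{\text{video}}$, so $k$ is left symbolic.

```latex
\begin{align*}
r_{\text{latent}} &= \frac{1}{0.25\ \text{s}} = 4\ \text{latent frames per second},\\
k &= f_{\text{video}} \times 0.25\ \text{s},\\
N_{2\,\text{h}} &= 7200\ \text{s} \times 4\ \text{s}^{-1} = 28{,}800\ \text{latent frames}.
\end{align*}
```

A 2-hour rollout therefore involves 28,800 conditioning steps, which is the horizon over which the FVD stability claim would need to be reported.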
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the empirical and architectural descriptions.
- Referee: [§4] §4 (Experiments) and associated tables: the central claim of simultaneous independent multi-entity control and concept-level cross-entity transfer rests on aggregate metrics (89% transfer, 90% OOV) without reported dataset size, number of evaluation episodes, statistical significance tests, per-entity fidelity breakdowns, or controls for prompt difficulty. This leaves open whether the observed gains isolate the NL interface or reflect global scene statistics from the pretrained backbone.
  Authors: We agree that the current presentation of results would benefit from greater transparency on the evaluation protocol. In the revised manuscript we will report the exact size of the held-out evaluation set, the number of episodes per metric, and the outcomes of statistical significance tests (e.g., bootstrap confidence intervals or paired tests) for the 89% vs. 43% and 90% vs. 0% differences. We will also add per-entity fidelity tables and a breakdown of prompt difficulty (simple vs. compound instructions) to control for that variable. Because the Action-Index baseline uses the identical pretrained backbone and training distribution, the comparison already isolates the effect of the conditioning interface to a meaningful degree; we will make this point explicit in the revision.
  Revision: yes
- Referee: [§3.1] §3.1 (Architecture, frame-local text cross-attention): the conditioning mechanism applies text cross-attention to global frame features without entity-specific tokens, phrase-to-entity binding, attention masks, or visual grounding (e.g., segmentation). This design risks non-independent control where conditioning for one entity bleeds into others, undermining the claim that the NL interface itself enables independent multi-entity control beyond the backbone's implicit statistics.
  Authors: The architecture deliberately applies frame-local cross-attention to the full set of visual tokens so that a single natural-language prompt can describe multiple entities without requiring explicit segmentation or per-entity tokens at inference time. The pretrained bidirectional backbone already encodes rich multi-entity scene structure; the text cross-attention therefore lets the language prompt modulate those existing representations rather than learning bindings from scratch. Our qualitative rollouts and the large gap versus the Action-Index baseline indicate that control remains largely independent in practice. We will expand §3.1 with a short discussion of implicit phrase-to-entity binding and will include attention-map visualizations in the supplement to illustrate separation of control signals.
  Revision: partial
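The bootstrap confidence intervals mentioned in the first response could be computed along these lines. This is a sketch under loud assumptions: the episode count (100 per system) is hypothetical because the page reports none, the two systems are treated as independent samples rather than paired, and `bootstrap_diff_ci` is an illustrative name. Only the 89 percent and 43 percent success rates come from the paper.

```python
import random


def bootstrap_diff_ci(successes_a, successes_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in success rates (a minus b).

    Each input is a list of 0/1 episode outcomes. The two lists are
    resampled independently, so this is an unpaired comparison.
    """
    rng = random.Random(seed)
    n_a, n_b = len(successes_a), len(successes_b)
    diffs = []
    for _ in range(n_boot):
        rate_a = sum(rng.choices(successes_a, k=n_a)) / n_a
        rate_b = sum(rng.choices(successes_b, k=n_b)) / n_b
        diffs.append(rate_a - rate_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical episode counts: 100 per system, matching the reported rates.
incantation = [1] * 89 + [0] * 11
action_index = [1] * 43 + [0] * 57
low, high = bootstrap_diff_ci(incantation, action_index)
```

If the same prompts were run through both systems, a paired resampling of episode indices would be the more appropriate design, and the interval width would depend heavily on the real episode count.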
Circularity Check
No significant circularity; claims rest on empirical comparisons
The paper describes an empirical architecture (pretrained bidirectional video backbone + frame-local text cross-attention + ODE-initialized Self-Forcing distillation) and reports direct performance numbers against external baselines (89% vs 43% cross-entity transfer, 90% vs 0% OOV prompts, 19.7 FPS). No equations, fitted parameters, or uniqueness theorems are presented that reduce by construction to the inputs or to self-citations. The central claims of multi-entity control and cross-entity transfer are supported by aggregate metrics on held-out rollouts rather than quantities defined in terms of the model's own conditioning variables. This is the normal case of a system paper whose results are falsifiable against stated baselines.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We pair a pretrained bidirectional video backbone with frame-local text cross-attention... ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: per-frame, per-entity natural-language conditioning... cross-entity action transfer
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Table 3: Systematic comparison of interactive video world models. ✓ = supported, ✗ = not supported, ∼ = partial. Multi-entity requires independent and simultaneous control...