
arxiv: 2606.09507 · v1 · pith:HV4OFHFWnew · submitted 2026-06-08 · 💻 cs.CV

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Pith reviewed 2026-06-27 17:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-agent video generation · camera-controllable world model · cross-view consistency · geometry-aware denoising · video world models · multi-view video synthesis

The pith

Prisma-World treats all agent videos as one full-attention sequence and injects relative camera geometry into attention to bias overlapping views toward cross-view consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independent per-agent generation risks creating different versions of the same scene in overlapping views. The paper claims a single model can avoid this by running a joint denoising process that shares evidence across views through attention. Relative camera geometry is injected to bias overlapping regions toward agreement, while multi-agent RoPE keeps agent identities separate yet temporally aligned. An overlap-decaying curriculum and minimap guidance are added to further tighten consistency and spatial structure. Experiments on a new UE5 dataset with precise camera and action annotations are presented as evidence that flexible agent counts and controllable cameras remain possible under these constraints.

Core claim

Prisma-World formulates multi-agent generation as a joint geometry-aware denoising process. All agent videos are processed in one full-attention sequence; multi-agent RoPE distinguishes agent identities while preserving synchronized temporal coordinates; relative camera geometry is injected into attention to bias overlapping viewpoints toward shared scene evidence. An overlap-decaying curriculum and minimap-conditioned guidance are added to strengthen multi-view consistency and global spatial perception. A single model thereby produces high-fidelity videos with flexible agent numbers, camera controllability, and improved cross-view consistency.
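
To make the multi-agent RoPE idea concrete, the sketch below assigns each token an (agent, frame, token) coordinate and flattens all agents into one sequence. It is an illustration only: the function name `build_ma_positions` and the use of a separate integer agent axis are assumptions, and the paper's actual MA-RoPE layout may differ.

```python
from typing import List, Tuple


def build_ma_positions(num_agents: int, num_frames: int,
                       tokens_per_frame: int) -> List[Tuple[int, int, int]]:
    """Return (agent_id, frame_idx, token_idx) for one joint sequence.

    Hypothetical layout: every agent reuses the same frame indices, so
    frames at the same time step share a temporal coordinate, while the
    agent_id axis keeps tokens from different agents distinguishable.
    """
    positions = []
    for agent_id in range(num_agents):
        for frame_idx in range(num_frames):
            for token_idx in range(tokens_per_frame):
                positions.append((agent_id, frame_idx, token_idx))
    return positions


positions = build_ma_positions(num_agents=3, num_frames=4, tokens_per_frame=2)
# One full-attention sequence covers every agent: 3 * 4 * 2 = 24 tokens.
print(len(positions))
```

Because the sequence length is simply the product of the three sizes, changing `num_agents` changes only the length of the joint sequence, which is one plausible reading of how a single model could accept a flexible agent count.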

What carries the argument

A joint geometry-aware denoising process that handles all agent videos in one full-attention sequence, distinguishes agents via multi-agent RoPE, and injects relative camera geometry into attention.
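
The "relative camera geometry" between two views can be illustrated with a standard rigid-transform computation. The sketch below assumes each camera is described by a 4x4 camera-to-world matrix; the helper names are hypothetical, and how the paper turns such a relative pose into an attention bias is not specified on this page.

```python
from typing import List

Matrix = List[List[float]]


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]  # transpose of R
    t = [T[r][3] for r in range(3)]
    t_inv = [-sum(R_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [R_t[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def relative_pose(c2w_i: Matrix, c2w_j: Matrix) -> Matrix:
    """Transform taking points from camera j's frame into camera i's frame."""
    return matmul(invert_rigid(c2w_i), c2w_j)


# Camera i is rotated 90 degrees about z; camera j sits one unit along world y.
c2w_i = [[0.0, -1.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
c2w_j = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
rel = relative_pose(c2w_i, c2w_j)
```

A relative pose depends only on the pair of cameras, not on the choice of world origin, which is why relative rather than absolute geometry is a natural quantity to condition on.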

If this is right

  • One trained model can handle a flexible number of agents without a separate training run per agent count.
  • Camera trajectories can be specified per agent while the shared scene remains consistent.
  • Minimap guidance supplies global structure that individual camera inputs alone do not provide.
  • The same architecture supports complex, composable multi-agent view groups across diverse scenes.
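
The minimap guidance above relies on placing each agent on a top-down map. A minimal sketch of that projection follows, assuming an axis-aligned minimap that covers known world bounds, with row 0 at the top; the function name, the bounds format, and the clamping behaviour are all assumptions made for illustration.

```python
from typing import Tuple


def world_to_minimap(x: float, y: float,
                     bounds: Tuple[float, float, float, float],
                     size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a world (x, y) position to a (row, col) minimap cell.

    bounds is (x_min, y_min, x_max, y_max); size is (width, height).
    Positions outside the bounds are clamped to the nearest edge cell.
    """
    x_min, y_min, x_max, y_max = bounds
    width, height = size
    u = (x - x_min) / (x_max - x_min)   # 0 at left edge, 1 at right edge
    v = (y_max - y) / (y_max - y_min)   # 0 at top edge, 1 at bottom edge
    col = min(width - 1, max(0, int(u * width)))
    row = min(height - 1, max(0, int(v * height)))
    return row, col


center = world_to_minimap(50.0, 50.0, (0.0, 0.0, 100.0, 100.0), (64, 64))
corner = world_to_minimap(100.0, 0.0, (0.0, 0.0, 100.0, 100.0), (64, 64))
```

In a full system the layout feature around the returned cell would be extracted and passed to the corresponding agent's tokens; that step is model-specific and is not sketched here.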

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-attention pattern could be applied to other multi-view tasks such as synchronized 3D reconstruction from multiple moving cameras.
  • If geometry injection proves decisive, similar conditioning might replace post-hoc consistency fixes in existing video generators.
  • Training on the described overlap-decaying schedule may transfer to non-video domains where partial overlaps must be reconciled.

Load-bearing premise

That full-attention sequence processing plus relative camera geometry injection is enough to couple separate denoising trajectories into geometrically consistent outputs.

What would settle it

Generated videos in which overlapping agent views display mismatched object positions, layouts, or appearances in the shared scene region.
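
One way to run such a check, assuming ground-truth camera parameters and scene geometry are available (as they would be in a rendered dataset), is to project the same world point into two agent views and compare the generated content at the two pixel locations. The sketch below shows only the projection step, with a hypothetical pinhole model whose camera looks along +z; it is not the paper's evaluation protocol.

```python
from typing import List, Optional, Sequence, Tuple


def project(point_w: Sequence[float], w2c: List[List[float]],
            fx: float, fy: float, cx: float, cy: float
            ) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates with a pinhole camera.

    w2c is a 3x4 world-to-camera matrix [R | t]. Returns None when the
    point lies behind the camera and is therefore not visible.
    """
    x, y, z = [sum(w2c[r][c] * point_w[c] for c in range(3)) + w2c[r][3]
               for r in range(3)]
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Agent A at the world origin, agent B one unit to its right, same heading.
w2c_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
w2c_b = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
point = (0.0, 0.0, 5.0)
pix_a = project(point, w2c_a, 500.0, 500.0, 320.0, 240.0)
pix_b = project(point, w2c_b, 500.0, 500.0, 320.0, 240.0)
```

If both projections land inside the image, the two pixels depict the same scene point, so a large appearance difference between them in the generated frames would count as evidence of cross-view inconsistency.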

Figures

Figures reproduced from arXiv: 2606.09507 by Dianyi Wang, Huiqiang Sun, Kang Liao, Kun Wang, Sheng Jin, Size Wu, Wei Li, Xingyu Zeng, Yangguang Li, Zhan Peng, Zhiguo Cao, Ziwei Liu.

Figure 1
Figure 1: In this paper, we introduce Prisma-World, a camera-controlled multi-agent world model capable of synthesizing multi-agent videos within complex scenes while preserving multi-view consistency. Furthermore, our model offers the flexibility to specify the number of output agents and supports minimap conditioning to provide explicit local spatial structural guidance for each agent.
Figure 2
Figure 2: Method overview. Prisma-World generates multi-agent videos within a single joint denoising process. Our MA-RoPE keeps frames at the same time step aligned across agents while distinguishing tokens from different agents. The minimap branch provides local spatial guidance by projecting each agent position onto a top-down minimap and injecting the extracted layout feature into the corresponding agent tokens. …
Figure 3
Figure 3: Qualitative comparison. Compared to other baselines, our model can synthesize high-quality multi-agent videos with precise camera control while preserving multi-view consistency.
4 Experiment. 4.1 PrismaDataset. We construct PrismaDataset in Unreal Engine 5 to provide multi-agent videos with accurate camera and action annotations. For each static scene, we use NavMesh in UE5 to identify navigable regions and…
Figure 4
Figure 4: Scalability of agent number. Prisma-World allows for the arbitrary specification of the number of output agents across diverse and complex scenes, while robustly preserving both the visual quality and multi-view consistency in output videos.
Baselines. We compare our proposed method against two baseline frameworks: (1) State-of-the-Art Single-Agent World Models (Lingbot-World [9]): We directly extend the c…
Figure 3
Figure 3: As demonstrated, our approach successfully synthesizes high-quality multi-agent videos while …
Figure 5
Figure 5: Ablation on dynamic shift and minimap.
Training: We discard the progressive curriculum learning strategy, initializing the training phase directly and exclusively with hard samples (w/o odc). (3) Dynamic Noise Shift: We enforce a static noise shift parameter during training, completely disregarding the fluctuating number of agents within the batch (w/o dy-shift). (4) Minimap: We compare the results before …
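
The ablation text for Figure 5 contrasts a static noise shift with one that tracks the number of agents in the batch. The sketch below uses a timestep-shift form that is common in flow-matching samplers, t' = s·t / (1 + (s − 1)·t), and pairs it with a purely hypothetical rule that scales the shift with the square root of the agent count. Neither the formula nor the scaling rule is confirmed as the paper's choice.

```python
import math


def shift_timestep(t: float, shift: float) -> float:
    """Warp a timestep t in [0, 1]; shift > 1 pushes mass toward high noise."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def agent_dependent_shift(num_agents: int, base_shift: float = 3.0) -> float:
    """Hypothetical rule: grow the shift with the square root of agent count.

    base_shift and the sqrt scaling are illustrative assumptions.
    """
    return base_shift * math.sqrt(num_agents)


static = shift_timestep(0.5, 3.0)                        # fixed shift
dynamic = shift_timestep(0.5, agent_dependent_shift(4))  # shift for 4 agents
```

The intuition behind any such rule is that a longer joint sequence carries more redundant signal, so more noise is needed to destroy it to the same degree; the "w/o dy-shift" ablation corresponds to always using the static branch.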
Original abstract

Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.
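
The abstract names an overlap-decaying curriculum without giving its schedule. As a rough illustration, the sketch below lowers a minimum view-overlap threshold linearly over training, so that early batches contain only heavily overlapping view groups and later batches admit harder, low-overlap ones. The linear form, the `start` and `end` values, and the `overlap` field are all assumptions.

```python
from typing import Dict, List


def min_overlap_at(step: int, total_steps: int,
                   start: float = 0.8, end: float = 0.1) -> float:
    """Linearly decay the required overlap from start to end (total_steps > 0)."""
    frac = min(1.0, max(0.0, step / total_steps))
    return start + (end - start) * frac


def eligible(samples: List[Dict[str, float]], step: int,
             total_steps: int) -> List[Dict[str, float]]:
    """Keep samples whose view overlap meets the current threshold."""
    threshold = min_overlap_at(step, total_steps)
    return [s for s in samples if s["overlap"] >= threshold]


samples = [{"overlap": 0.9}, {"overlap": 0.5}, {"overlap": 0.2}]
early = eligible(samples, step=0, total_steps=100)
middle = eligible(samples, step=50, total_steps=100)
late = eligible(samples, step=100, total_steps=100)
```

Training only on the hardest samples from the start is the setting the paper's "w/o odc" ablation appears to describe; in this sketch that would correspond to fixing the threshold at `end` for every step.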

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Prisma-World, a camera-controllable multi-agent video world model that formulates multi-agent generation as a joint geometry-aware denoising process. It processes all agent videos in one full-attention sequence, employs multi-agent RoPE to distinguish identities while preserving temporal synchronization, and injects relative camera geometry into attention to promote cross-view consistency. The framework is augmented with an overlap-decaying curriculum and minimap-conditioned structural guidance. A new large-scale UE5 dataset, PrismaDataset, is introduced with panoramic multi-agent views, flexible agent counts, and precise annotations. The central claim is that a single model generates high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding.

Significance. If the quantitative claims are substantiated, the work would advance multi-agent video world models by explicitly coupling overlapping views through architectural modifications rather than post-hoc corrections, addressing a key limitation in extending single-observer models. The release of PrismaDataset represents a concrete enabling contribution for the community. The combination of full-attention processing with geometry injection offers a scalable path for controllable multi-view generation with potential applications in simulation and robotics.

major comments (2)
  1. [Abstract] The claims of 'improved cross-view consistency' and 'high-fidelity multi-agent videos' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis; this evidence is load-bearing for evaluating whether the joint denoising process delivers the stated gains over independent generation.
  2. [Abstract] The description of relative camera geometry injection into attention (and multi-agent RoPE) does not specify the implementation mechanism (additive encoding, masking, or scaling), leaving open whether this bias is sufficient to couple independent denoising trajectories when agent count varies and overlap is partial, as required by the central consistency claim.
minor comments (1)
  1. [Abstract] The abstract refers to 'Experiments show...' but provides no indication of the evaluation protocol, dataset splits, or consistency metrics used, which would aid immediate assessment of the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The manuscript's Experiments and Methods sections provide the supporting quantitative results and technical specifications. We address each comment below and will revise the abstract to incorporate clarifications.

Point-by-point responses
  1. Referee: [Abstract] The claims of 'improved cross-view consistency' and 'high-fidelity multi-agent videos' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis; this evidence is load-bearing for evaluating whether the joint denoising process delivers the stated gains over independent generation.

    Authors: The abstract is a high-level summary of the work. The full manuscript reports quantitative metrics, baselines (including independent generation), ablation studies on the joint full-attention and geometry components, and consistency error analysis in Section 4. These results substantiate the claims regarding the joint denoising process. We will revise the abstract to briefly reference the empirical gains demonstrated in the experiments. revision: yes

  2. Referee: [Abstract] The description of relative camera geometry injection into attention (and multi-agent RoPE) does not specify the implementation mechanism (additive encoding, masking, or scaling), leaving open whether this bias is sufficient to couple independent denoising trajectories when agent count varies and overlap is partial, as required by the central consistency claim.

    Authors: The abstract provides an overview of the approach. The precise implementation mechanism for injecting relative camera geometry into attention and the multi-agent RoPE design is described in the Methods section of the manuscript. We agree that a brief indication of the mechanism would improve clarity in the abstract regarding how the bias couples trajectories under varying agent counts and partial overlap. We will revise the abstract accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural and dataset contributions are independent of claimed outputs

full rationale

The paper presents an architectural approach (full-attention sequence processing, multi-agent RoPE, relative camera geometry injection) plus an overlap-decaying curriculum and minimap guidance, together with a new PrismaDataset for training and evaluation. No equations, derivations, or fitted parameters are described that would reduce the claimed cross-view consistency gains to quantities defined by the inputs themselves. The central claims rest on the design choices and empirical evaluation on the introduced dataset rather than any self-definitional, fitted-input, or self-citation reduction. This is the expected non-finding for a model-description paper whose consistency improvements are asserted via architecture and data rather than mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; typical latent diffusion training involves many unlisted hyperparameters whose effect on the consistency claim cannot be audited.



Reference graph

Works this paper leans on

69 extracted references · 20 linked inside Pith

  1. [1]

    Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  2. [2]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2(3):6, 2025

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2(3):6, 2025

  3. [3]

    Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

  4. [4]

    Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

  5. [5]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  6. [6]

    Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

    Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

  7. [7]

    Oasis: A universe in a transformer. https://oasis-model.github.io

    Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. https://oasis-model.github.io, 2024

  8. [8]

    Genie3: A New Frontier for World Models

    Genie3: A New Frontier for World Models. Google DeepMind, 2025. https://deepmind.google/models/genie/

  9. [9]

    Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models. arXiv preprint arXiv:26...

  10. [10]

    Sora: Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Sora: Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

  11. [11]

    Seedance 2.0: Advancing video generation for world complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  12. [12]

    Nitrogen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026

    Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, et al. Nitrogen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026

  13. [13]

    Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  14. [14]

    Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. NeurIPS, 38:24195–24228, 2026

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. NeurIPS, 38:24195–24228, 2026

  15. [15]

    Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  16. [16]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, pages 100–111, 2025

  17. [17]

    Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  18. [18]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, pages 7310–7320, 2024

  19. [19]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

  20. [20]

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  21. [21]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH Conference Papers, pages 1–11, 2024

  22. [22]

    Diffusion as shader: 3d-aware video diffusion for versatile video generation control

    Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. In ACM SIGGRAPH Conference Papers, pages 1–12, 2025

  23. [23]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, pages 14834–14844, 2025

  24. [24]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV, pages 313–331, 2024

  25. [25]

    Thinking with camera: A unified multimodal model for camera-centric understanding and generation

    Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, and Chen Change Loy. Thinking with camera: A unified multimodal model for camera-centric understanding and generation. In International Conference on Learning Representations, 2026

  26. [26]

    Cat3d: Create anything in 3d with multi-view diffusion models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024

  27. [27]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, pages 22875–22889, 2025

  28. [28]

    Wonderland: Navigating 3d scenes from a single image

    Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. In CVPR, pages 798–810, 2025

  29. [29]

    Cameras as relative positional encoding. NeurIPS, 38:15984–16009, 2026

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. NeurIPS, 38:15984–16009, 2026

  30. [30]

    Gta: A geometry-aware attention mechanism for multi-view transformers

    Takeru Miyato, Bernhard Jaeger, Max Welling, and Andreas Geiger. Gta: A geometry-aware attention mechanism for multi-view transformers. In ICLR, volume 2024, pages 8172–8208, 2024

  31. [31]

    Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025

    Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025

  32. [32]

    Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455, 2024

    Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, et al. Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455, 2024

  33. [33]

    World models. arXiv preprint arXiv:1803.10122, 2018

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  34. [34]

    Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  35. [35]

    The matrix: Infinite-horizon world generation with real-time moving control. NeurIPS, 38:87318–87344, 2026

    Ruili Feng, Han Zhang, Zhilei Shu, Zhantao Yang, Longxiang Tang, Zhicai Wang, Andy Zheng, Jie Xiao, Zhiheng Liu, Ruihang Chu, et al. The matrix: Infinite-horizon world generation with real-time moving control. NeurIPS, 38:87318–87344, 2026

  36. [36]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In ICLR, volume 2025, pages 73754–73776, 2025

  37. [37]

    Avid: Adapting video diffusion models to world models. arXiv preprint arXiv:2410.12822, 2024

    Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. Avid: Adapting video diffusion models to world models. arXiv preprint arXiv:2410.12822, 2024

  38. [38]

    Diffusion for world modeling: Visual details matter in atari. NeurIPS, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. NeurIPS, 37:58757–58791, 2024

  39. [39]

    Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822, 2025

    Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822, 2025

  40. [40]

    Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437, 2025

    Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, et al. Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437, 2025

  41. [41]

    Inference-time physics alignment of video generative models with latent world models. arXiv preprint arXiv:2601.10553, 2026

    Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Inference-time physics alignment of video generative models with latent world models. arXiv preprint arXiv:2601.10553, 2026

  42. [42]

    Aether: Geometric-aware unified world modeling

    Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. In ICCV, pages 8535–8546, 2025

  43. [43]

    Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

  44. [44]

    Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

    Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

  45. [45]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia Conference Papers, pages 1–11, 2025

  46. [46]

    Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

    Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

  47. [47]

    Captain safari: A world engine. arXiv preprint arXiv:2511.22815, 2025

    Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine. arXiv preprint arXiv:2511.22815, 2025

  48. [48]

    Video world models with long-term spatial memory. NeurIPS, 38:49371–49393, 2026

    Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. NeurIPS, 38:49371–49393, 2026

  49. [49]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 37:24081–24125, 2024

  50. [50]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion. NeurIPS, 38:167283–167308, 2026

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. NeurIPS, 38:167283–167308, 2026

  51. [51]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, pages 22963–22974, 2025

  52. [52]

    Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

  53. [53]

    World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025

  54. [54]

    Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  55. [55]

    Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  56. [56]

    Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

  57. [57]

    Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints

    Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. In ICLR, pages 58038–58060, 2025

  58. [58]

    Collaborative video diffusion: Consistent multi-video generation with camera control. NeurIPS, 37:16240–16271, 2024

    Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. NeurIPS, 37:16240–16271, 2024

  59. [59]

    Cavia: Camera-controllable multi-view video diffusion with view-integrated attention

    Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, and Hao Tang. Cavia: Camera-controllable multi-view video diffusion with view-integrated attention. arXiv preprint arXiv:2410.10774, 2024

  60. [60]

    Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793, 2025

    Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, and Guosheng Lin. Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793, 2025

  61. [61]

    Introducing multiverse: The first ai multiplayer world model, 2025

    Enigma team. Introducing multiverse: The first ai multiplayer world model, 2025. URL https://enigma.inc/blog

  62. [62]

    Solaris: Building a multiplayer video world model in minecraft

    Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie. Solaris: Building a multiplayer video world model in minecraft. arXiv preprint arXiv:2602.22208, 2026

  63. [63]

    Multiworld: Scalable multi-agent multi-view video world models. arXiv preprint arXiv:2604.18564, 2026

    Haoyu Wu, Jiwen Yu, Yingtian Zou, and Xihui Liu. Multiworld: Scalable multi-agent multi-view video world models. arXiv preprint arXiv:2604.18564, 2026

  64. [64]

    Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  65. [65]

    Geocalib: Learning single-image calibration with geometric optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. In European Conference on Computer Vision, pages 1–20, 2024

  66. [66]

    Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  67. [67]

    Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

  68. [68]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018

  69. [69]

    Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025

    Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025