Prisma-World: Camera-Controllable Multi-Agent Video World Model

Dianyi Wang; Huiqiang Sun; Kang Liao; Kun Wang; Sheng Jin; Size Wu; Wei Li; Xingyu Zeng; Yangguang Li; Zhan Peng

arxiv: 2606.09507 · v1 · pith:HV4OFHFWnew · submitted 2026-06-08 · 💻 cs.CV

Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao, Dianyi Wang, Xingyu Zeng, Sheng Jin, Yangguang Li, Zhiguo Cao, Ziwei Liu, Wei Li

Pith reviewed 2026-06-27 17:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-agent video generation · camera-controllable world model · cross-view consistency · geometry-aware denoising · video world models · multi-view video synthesis

The pith

Prisma-World treats all agent videos as one full-attention sequence with camera geometry injection to force cross-view consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independent per-agent generation risks creating different versions of the same scene in overlapping views. The paper claims a single model can avoid this by running a joint denoising process that shares evidence across views through attention. Relative camera positions are injected to bias overlapping regions toward agreement, while multi-agent RoPE keeps agent identities separate yet temporally aligned. An overlap-decaying curriculum and minimap guidance further tighten consistency and spatial structure. Experiments on a new UE5 dataset with precise annotations are presented as evidence that flexible agent counts and controllable cameras remain possible under these constraints.

Core claim

Prisma-World formulates multi-agent generation as a joint geometry-aware denoising process. All agent videos are processed in one full-attention sequence; multi-agent RoPE distinguishes agent identities while preserving synchronized temporal coordinates; relative camera geometry is injected into attention to bias overlapping viewpoints toward shared scene evidence. An overlap-decaying curriculum and minimap-conditioned guidance are added to strengthen multi-view consistency and global spatial perception. A single model thereby produces high-fidelity videos with flexible agent numbers, camera controllability, and improved cross-view consistency.
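To make the multi-agent RoPE idea concrete, here is a minimal sketch, assuming the agent index gets its own rotary axis while the time, row, and column axes are shared across agents. The function names, the one-pair-per-axis layout, and the frequencies are illustrative assumptions; this page does not describe the paper's actual MA-RoPE parameterization.

```python
import math
from itertools import product


def ma_rope_positions(num_agents, num_frames, height, width):
    """Return one (agent, t, y, x) coordinate per token of the joint sequence.

    Assumption: t, y, x are shared across agents, so frames captured at the
    same time step receive identical temporal coordinates.
    """
    return list(product(range(num_agents), range(num_frames),
                        range(height), range(width)))


def rotate(pair, angle):
    """Rotate a 2D feature pair by `angle` radians (the basic RoPE step)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = pair
    return (c * x - s * y, s * x + c * y)


def rope_score(q, k, pos_q, pos_k, freqs):
    """Dot product of rotated query/key, with one 2D pair per axis."""
    score = 0.0
    for q_pair, k_pair, p_q, p_k, f in zip(q, k, pos_q, pos_k, freqs):
        rq = rotate(q_pair, p_q * f)
        rk = rotate(k_pair, p_k * f)
        score += rq[0] * rk[0] + rq[1] * rk[1]
    return score


positions = ma_rope_positions(num_agents=2, num_frames=3, height=1, width=2)

# Hypothetical features and per-axis frequencies (agent, t, y, x).
q = [(1.0, 0.0)] * 4
k = [(0.5, 0.5)] * 4
freqs = (1.0, 0.1, 0.01, 0.01)

# Same time step, different agents, evaluated at two different times.
cross_t1 = rope_score(q, k, (0, 1, 0, 0), (1, 1, 0, 0), freqs)
cross_t5 = rope_score(q, k, (0, 5, 0, 0), (1, 5, 0, 0), freqs)
same_agent = rope_score(q, k, (0, 1, 0, 0), (0, 1, 0, 0), freqs)
```

In this sketch the score between synchronized tokens depends only on the agent offset, not on the absolute time step, while tokens from different agents still score differently from tokens of the same agent.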

What carries the argument

Joint geometry-aware denoising process that processes all agent videos in one full-attention sequence, distinguishes agents via multi-agent RoPE, and injects relative camera geometry into attention.
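One hypothetical way to inject relative camera geometry into attention is an additive logit bias that penalizes key views whose pose is far from the query view. The sketch below assumes top-down poses `(x, y, yaw)` and a made-up `geometry_bias` rule; this page does not state whether the paper uses an additive term, masking, scaling, or a rotary-style encoding, so this is an illustration of the general idea only.

```python
import math


def relative_yaw(yaw_a, yaw_b):
    """Smallest absolute angle between two headings, in radians."""
    diff = (yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


def geometry_bias(cam_a, cam_b, scale=1.0):
    """Hypothetical additive bias: 0 for identical poses, more negative as
    the two cameras move apart or look in different directions."""
    (xa, ya, yaw_a), (xb, yb, yaw_b) = cam_a, cam_b
    distance = math.hypot(xa - xb, ya - yb)
    return -scale * (distance + relative_yaw(yaw_a, yaw_b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(content_logits, query_cam, key_cams, scale=1.0):
    """Add the geometry bias to content logits, then normalize."""
    logits = [
        logit + geometry_bias(query_cam, key_cam, scale)
        for logit, key_cam in zip(content_logits, key_cams)
    ]
    return softmax(logits)


query_cam = (0.0, 0.0, 0.0)
key_cams = [(0.5, 0.0, 0.1), (8.0, 3.0, math.pi)]  # nearby view, distant view
weights = biased_attention_weights([0.0, 0.0], query_cam, key_cams)
```

With equal content logits, the nearby, similarly oriented camera receives most of the attention weight, which is the kind of pull toward shared scene evidence the claim describes.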

If this is right

One trained model can handle any number of agents without separate training runs.
Camera trajectories can be specified per agent while the shared scene remains consistent.
Minimap guidance supplies global structure that individual camera inputs alone do not provide.
The same architecture supports complex, composable multi-agent view groups across diverse scenes.
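As a rough illustration of the minimap guidance mentioned above, the following sketch projects an agent's world position onto a top-down map and picks a local window around it. The world bounds, map size, window radius, and function names are all assumptions made for the example; how the paper extracts and injects layout features is not specified on this page.

```python
def world_to_minimap(pos_xy, world_min, world_max, map_size):
    """Map a world-space (x, y) to integer (col, row) on a top-down minimap.

    Assumes an axis-aligned map whose top row corresponds to the largest y.
    """
    (x, y), (x0, y0), (x1, y1) = pos_xy, world_min, world_max
    width, height = map_size
    u = (x - x0) / (x1 - x0)
    v = (y - y0) / (y1 - y0)
    col = min(max(int(u * width), 0), width - 1)
    row = min(max(int((1.0 - v) * height), 0), height - 1)
    return col, row


def crop_window(center, radius, map_size):
    """Return (left, top, right, bottom) of a window clipped to the map."""
    col, row = center
    width, height = map_size
    return (max(col - radius, 0), max(row - radius, 0),
            min(col + radius + 1, width), min(row + radius + 1, height))


map_size = (64, 64)
bounds_min, bounds_max = (-10.0, -10.0), (10.0, 10.0)

center = world_to_minimap((0.0, 0.0), bounds_min, bounds_max, map_size)
window = crop_window(center, radius=4, map_size=map_size)
```

The cropped window is what a layout encoder would consume for that agent in this hypothetical setup.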

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-attention pattern could be applied to other multi-view tasks such as synchronized 3D reconstruction from multiple moving cameras.
If geometry injection proves decisive, similar conditioning might replace post-hoc consistency fixes in existing video generators.
Training on the described overlap-decaying schedule may transfer to non-video domains where partial overlaps must be reconciled.

Load-bearing premise

That full-attention sequence processing plus relative camera geometry injection is enough to couple separate denoising trajectories into geometrically consistent outputs.

What would settle it

Generated videos in which overlapping agent views display mismatched object positions, layouts, or appearances in the shared scene region.
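A simple way to look for such mismatches, sketched under the assumption that each view yields an object position in its own camera frame and that camera poses are known, is to map both observations into world coordinates and measure their distance. The top-down 2D setup and the function names are hypothetical and are not the paper's evaluation protocol.

```python
import math


def camera_to_world(point_cam, cam_pose):
    """Map a 2D point from a camera's frame to world coordinates.

    cam_pose is (x, y, yaw) in a top-down view; yaw is in radians.
    """
    x, y, yaw = cam_pose
    px, py = point_cam
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py)


def cross_view_disagreement(obs_a, pose_a, obs_b, pose_b):
    """World-space distance between the same object as seen by two agents."""
    return math.dist(camera_to_world(obs_a, pose_a),
                     camera_to_world(obs_b, pose_b))


pose_a = (0.0, 0.0, 0.0)
pose_b = (2.0, 2.0, -math.pi / 2)

# Both agents place the object at world position (2, 0): views agree.
consistent = cross_view_disagreement((2.0, 0.0), pose_a, (2.0, 0.0), pose_b)

# Agent B places the object one unit farther away: views disagree.
inconsistent = cross_view_disagreement((2.0, 0.0), pose_a, (3.0, 0.0), pose_b)
```

A disagreement near zero means the two views place the object at the same world location; larger values flag the kind of cross-view mismatch described above.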

Figures

Figures reproduced from arXiv: 2606.09507 by Dianyi Wang, Huiqiang Sun, Kang Liao, Kun Wang, Sheng Jin, Size Wu, Wei Li, Xingyu Zeng, Yangguang Li, Zhan Peng, Zhiguo Cao, Ziwei Liu.

**Figure 1.** In this paper, we introduce Prisma-World, a camera-controlled multi-agent world model capable of synthesizing multi-agent videos within complex scenes while preserving multi-view consistency. Furthermore, our model offers the flexibility to specify the number of output agents and supports minimap conditioning to provide explicit local spatial structural guidance for each agent.

**Figure 2.** Method overview. Prisma-World generates multi-agent videos within a single joint denoising process. Our MA-RoPE keeps frames at the same time step aligned across agents while distinguishing tokens from different agents. The minimap branch provides local spatial guidance by projecting each agent position onto a top-down minimap and injecting the extracted layout feature into the corresponding agent tokens. …

**Figure 3.** Qualitative comparison. Compared to other baselines, our model can synthesize high-quality multi-agent videos with precise camera control while preserving multi-view consistency.

Excerpt from Section 4.1 (PrismaDataset): We construct PrismaDataset in Unreal Engine 5 to provide multi-agent videos with accurate camera and action annotations. For each static scene, we use NavMesh in UE5 to identify navigable regions and…

**Figure 4.** Scalability of agent number. Prisma-World allows for the arbitrary specification of the number of output agents across diverse and complex scenes, while robustly preserving both the visual quality and multi-view consistency in output videos.

Excerpt from Section 4 (Baselines): We compare our proposed method against two baseline frameworks: (1) State-of-the-Art Single-Agent World Models (Lingbot-World [9]): We directly extend the c…

**Figure 3.** As demonstrated, our approach successfully synthesizes high-quality multi-agent videos while …

**Figure 5.** Ablation on dynamic shift and minimap.

Excerpt from the ablation description: … Training: We discard the progressive curriculum learning strategy, initializing the training phase directly and exclusively with hard samples (w/o odc). (3) Dynamic Noise Shift: We enforce a static noise shift parameter during training, completely disregarding the fluctuating number of agents within the batch (w/o dy-shift). (4) Minimap: We compare the results before …
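The dynamic noise shift named in this ablation can be pictured with the timestep-shift formula commonly used in flow-matching samplers, combined with a made-up rule that grows the shift with the number of agents. Both the convention that `t = 1` is pure noise and the square-root scaling in `dynamic_shift` are assumptions for illustration; the paper's actual rule is not given on this page.

```python
import math


def shift_timestep(t, shift):
    """Shifted timestep in [0, 1]; shift > 1 pushes t toward the noisy end."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def dynamic_shift(num_agents, base_shift=3.0, base_agents=1):
    """Hypothetical rule: scale the shift with the square root of the
    agent-count ratio, so longer joint sequences see more noise."""
    return base_shift * math.sqrt(num_agents / base_agents)


static = shift_timestep(0.5, dynamic_shift(1))   # single-agent setting
dynamic = shift_timestep(0.5, dynamic_shift(4))  # four agents in the batch
```

A static shift would use the same value regardless of agent count, which corresponds to the "w/o dy-shift" variant in the ablation.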

Abstract

Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.
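The overlap-decaying curriculum can be sketched as a schedule that starts with highly overlapping view groups and gradually admits groups with less overlap. The linear decay, the threshold values, and the `overlap` field are assumptions for illustration, as is the reading that low-overlap groups are the hard samples; the abstract does not give the actual schedule.

```python
def overlap_threshold(step, total_steps, start=0.8, end=0.1):
    """Minimum view overlap required of a training sample at a given step.

    Decays linearly from `start` to `end` over training (assumed schedule).
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def eligible_samples(samples, step, total_steps):
    """Keep samples whose overlap meets the current threshold."""
    threshold = overlap_threshold(step, total_steps)
    return [s for s in samples if s["overlap"] >= threshold]


samples = [
    {"id": "easy", "overlap": 0.9},
    {"id": "medium", "overlap": 0.5},
    {"id": "hard", "overlap": 0.2},
]

early = [s["id"] for s in eligible_samples(samples, step=0, total_steps=100)]
late = [s["id"] for s in eligible_samples(samples, step=100, total_steps=100)]
```

Early in training only the high-overlap group qualifies; by the end every group is eligible, so the model meets weakly overlapping views only after it has learned from strongly overlapping ones.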

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prisma-World adds joint full-attention processing with multi-agent RoPE and camera geometry injection to video world models, plus a new UE5 dataset, but the abstract gives no numbers, so the consistency gains remain unproven.

Prisma-World extends single-agent video world models to multiple agents by running all views through one full-attention sequence, using a multi-agent RoPE variant to keep time coordinates aligned while separating agents, and injecting relative camera geometry into attention to push overlapping views toward the same scene content. It adds an overlap-decaying curriculum and minimap guidance, and releases PrismaDataset with panoramic UE5 captures, flexible agent counts, and precise annotations.

The concrete new elements are the RoPE modification and the geometry injection into attention; these are presented as ways to couple denoising trajectories without post-hoc fixes. The dataset is also a clear practical step for anyone who needs annotated multi-view sequences with controllable cameras.

The approach makes sense on paper: independent generation can produce mismatched objects across views, so joint processing with geometric bias is a direct response. The curriculum and minimap elements target global consistency and spatial grounding in a reasonable way.

The main gap is evidence. The abstract claims improved cross-view consistency and fidelity but reports no metrics, baselines, or ablations, so it is impossible to judge whether the geometry signal is strong enough when overlaps are partial or agent count changes. The stress-test concern about whether the attention modifications actually couple the trajectories holds until the results section shows otherwise.

This is for researchers working on controllable video generation or multi-agent simulation. Readers focused on world models or multi-view consistency would find the architectural choices and dataset useful to examine.

It deserves peer review so the quantitative results and implementation details can be checked.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Prisma-World, a camera-controllable multi-agent video world model that formulates multi-agent generation as a joint geometry-aware denoising process. It processes all agent videos in one full-attention sequence, employs multi-agent RoPE to distinguish identities while preserving temporal synchronization, and injects relative camera geometry into attention to promote cross-view consistency. The framework is augmented with an overlap-decaying curriculum and minimap-conditioned structural guidance. A new large-scale UE5 dataset, PrismaDataset, is introduced with panoramic multi-agent views, flexible agent counts, and precise annotations. The central claim is that a single model generates high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding.

Significance. If the quantitative claims are substantiated, the work would advance multi-agent video world models by explicitly coupling overlapping views through architectural modifications rather than post-hoc corrections, addressing a key limitation in extending single-observer models. The release of PrismaDataset represents a concrete enabling contribution for the community. The combination of full-attention processing with geometry injection offers a scalable path for controllable multi-view generation with potential applications in simulation and robotics.

major comments (2)

[Abstract] The claim of 'improved cross-view consistency' and 'high-fidelity multi-agent videos' is asserted without any quantitative metrics, baselines, ablation studies, or error analysis, which is load-bearing for evaluating whether the joint denoising process delivers the stated gains over independent generation.
[Abstract] The description of relative camera geometry injection into attention (and multi-agent RoPE) does not specify the implementation mechanism (additive encoding, masking, or scaling), leaving open whether this bias is sufficient to couple independent denoising trajectories when agent count varies and overlap is partial, as required by the central consistency claim.

minor comments (1)

[Abstract] The abstract refers to 'Experiments show...' but provides no indication of the evaluation protocol, dataset splits, or consistency metrics used, which would aid immediate assessment of the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The manuscript's Experiments and Methods sections provide the supporting quantitative results and technical specifications. We address each comment below and will revise the abstract to incorporate clarifications.

Referee: [Abstract] The claim of 'improved cross-view consistency' and 'high-fidelity multi-agent videos' is asserted without any quantitative metrics, baselines, ablation studies, or error analysis, which is load-bearing for evaluating whether the joint denoising process delivers the stated gains over independent generation.

Authors: The abstract is a high-level summary of the work. The full manuscript reports quantitative metrics, baselines (including independent generation), ablation studies on the joint full-attention and geometry components, and consistency error analysis in Section 4. These results substantiate the claims regarding the joint denoising process. We will revise the abstract to briefly reference the empirical gains demonstrated in the experiments.

revision: yes

Referee: [Abstract] The description of relative camera geometry injection into attention (and multi-agent RoPE) does not specify the implementation mechanism (additive encoding, masking, or scaling), leaving open whether this bias is sufficient to couple independent denoising trajectories when agent count varies and overlap is partial, as required by the central consistency claim.

Authors: The abstract provides an overview of the approach. The precise implementation mechanism for injecting relative camera geometry into attention and the multi-agent RoPE design is described in the Methods section of the manuscript. We agree that a brief indication of the mechanism would improve clarity in the abstract regarding how the bias couples trajectories under varying agent counts and partial overlap. We will revise the abstract accordingly.

revision: yes

Circularity Check

0 steps flagged

No circularity: architectural and dataset contributions are independent of claimed outputs

The paper presents an architectural approach (full-attention sequence processing, multi-agent RoPE, relative camera geometry injection) plus an overlap-decaying curriculum and minimap guidance, together with a new PrismaDataset for training and evaluation. No equations, derivations, or fitted parameters are described that would reduce the claimed cross-view consistency gains to quantities defined by the inputs themselves. The central claims rest on the design choices and empirical evaluation on the introduced dataset rather than any self-definitional, fitted-input, or self-citation reduction. This is the expected non-finding for a model-description paper whose consistency improvements are asserted via architecture and data rather than mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; typical latent diffusion training involves many unlisted hyperparameters whose effect on the consistency claim cannot be audited.

pith-pipeline@v0.9.1-grok · 5830 in / 1089 out tokens · 20803 ms · 2026-06-27T17:01:50.776660+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 20 linked inside Pith

[1]

Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[2]

Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

arXiv 2025
[3]

Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

arXiv 2025
[4]

Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

arXiv 2025
[5]

Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

Pith/arXiv arXiv 2025
[6]

Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

Pith/arXiv arXiv 2026
[7]

Oasis: A universe in a transformer. https://oasis-model.github.io

Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. https://oasis-model.github.io, 2024

2024
[8]

https://deepmind.google/models/genie/

Genie3: A New Frontier for World Models. Google DeepMind, 2025. https://deepmind.google/models/genie/

2025
[9]

Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models. arXiv preprint arXiv:26...

Pith/arXiv arXiv 2026
[10]

Sora: Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Sora: Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

2024
[11]

Seedance 2.0: Advancing video generation for world complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

Pith/arXiv arXiv 2026
[12]

Nitrogen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026

Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, et al. Nitrogen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026

arXiv 2026
[13]

Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025
[14]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. NeurIPS, 38:24195–24228, 2026

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. NeurIPS, 38:24195–24228, 2026

2026
[15]

Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024
[16]

Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, pages 100–111, 2025

2025
[17]

Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[18]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, pages 7310–7320, 2024

2024
[19]

Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022
[20]

Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Pith/arXiv arXiv 2025
[21]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH Conference Papers, pages 1–11, 2024

2024
[22]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. In ACM SIGGRAPH Conference Papers, pages 1–12, 2025

2025
[23]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, pages 14834–14844, 2025

2025
[24]

Generative camera dolly: Extreme monocular dynamic novel view synthesis

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV, pages 313–331, 2024

2024
[25]

Thinking with camera: A unified multimodal model for camera-centric understanding and generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, and Chen Change Loy. Thinking with camera: A unified multimodal model for camera-centric understanding and generation. In International Conference on Learning Representations, 2026

2026
[26]

Cat3d: Create anything in 3d with multi-view diffusion models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024

Pith/arXiv arXiv 2024
[27]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, pages 22875–22889, 2025

2025
[28]

Wonderland: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. In CVPR, pages 798–810, 2025

2025
[29]

Cameras as relative positional encoding. NeurIPS, 38:15984–16009, 2026

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. NeurIPS, 38:15984–16009, 2026

2026
[30]

Gta: A geometry-aware attention mechanism for multi-view transformers

Takeru Miyato, Bernhard Jaeger, Max Welling, and Andreas Geiger. Gta: A geometry-aware attention mechanism for multi-view transformers. In ICLR, volume 2024, pages 8172–8208, 2024

2024
[31]

Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025

Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025

arXiv 2025
[32]

Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455, 2024

Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, et al. Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455, 2024

arXiv 2024
[33]

World models. arXiv preprint arXiv:1803.10122, 2018

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

Pith/arXiv arXiv 2018
[34]

Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023
[35]

The matrix: Infinite-horizon world generation with real-time moving control. NeurIPS, 38:87318–87344, 2026

Ruili Feng, Han Zhang, Zhilei Shu, Zhantao Yang, Longxiang Tang, Zhicai Wang, Andy Zheng, Jie Xiao, Zhiheng Liu, Ruihang Chu, et al. The matrix: Infinite-horizon world generation with real-time moving control. NeurIPS, 38:87318–87344, 2026

2026
[36]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In ICLR, volume 2025, pages 73754–73776, 2025

2025
[37]

Avid: Adapting video diffusion models to world models. arXiv preprint arXiv:2410.12822, 2024

Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. Avid: Adapting video diffusion models to world models. arXiv preprint arXiv:2410.12822, 2024

arXiv 2024
[38]

Diffusion for world modeling: Visual details matter in atari. NeurIPS, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. NeurIPS, 37:58757–58791, 2024

2024
[39]

Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822, 2025

Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822, 2025

arXiv 2025
[40]

Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437, 2025

Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, et al. Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437, 2025

arXiv 2025
[41]

Inference-time physics alignment of video generative models with latent world models. arXiv preprint arXiv:2601.10553, 2026

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Inference-time physics alignment of video generative models with latent world models. arXiv preprint arXiv:2601.10553, 2026

arXiv 2026
[42]

Aether: Geometric-aware unified world modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. In ICCV, pages 8535–8546, 2025

2025
[43]

Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

arXiv 2025
[44]

Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

arXiv 2026
[45]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia Conference Papers, pages 1–11, 2025

2025
[46]

Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

arXiv 2026
[47] Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine. arXiv preprint arXiv:2511.22815, 2025.

[48] Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. NeurIPS, 38:49371–49393, 2026.

[49] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 37:24081–24125, 2024.

[50] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. NeurIPS, 38:167283–167308, 2026.

[51] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, pages 22963–22974, 2025.

[52] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.

[53] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025.

[54] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

[55] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.

[56] Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025.

[57] Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. In ICLR, pages 58038–58060, 2025.

[58] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. NeurIPS, 37:16240–16271, 2024.

[59] Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, and Hao Tang. Cavia: Camera-controllable multi-view video diffusion with view-integrated attention. arXiv preprint arXiv:2410.10774, 2024.

[60] Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, and Guosheng Lin. Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793, 2025.

[61] Enigma team. Introducing multiverse: The first ai multiplayer world model, 2025. URL https://enigma.inc/blog

[62] Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie. Solaris: Building a multiplayer video world model in minecraft. arXiv preprint arXiv:2602.22208, 2026.

[63] Haoyu Wu, Jiwen Yu, Yingtian Zou, and Xihui Liu. Multiworld: Scalable multi-agent multi-view video world models. arXiv preprint arXiv:2604.18564, 2026.

[64] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[65] Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. In European Conference on Computer Vision, pages 1–20, 2024.

[66] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

[67] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[68] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.

[69] Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025.
