Latent Spatial Memory for Video World Models

Bohan Zhuang; Donny Y. Chen; Feng Chen; Haoyu Zhao; Weijie Wang; Yefei He; Yifan Yang; Yuqing Yang; Zeyu Zhang; Zicheng Duan

arxiv: 2606.09828 · v1 · pith:EA254GL6new · submitted 2026-06-08 · 💻 cs.CV

Latent Spatial Memory for Video World Models

Weijie Wang , Haoyu Zhao , Yifan Yang , Feng Chen , Zeyu Zhang , Yefei He , Zicheng Duan , Donny Y. Chen

show 2 more authors

Yuqing Yang Bohan Zhuang

This is my paper

Pith reviewed 2026-06-27 16:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords latent spatial memoryvideo world modelsdiffusion latent space3D consistencydepth-guided back-projectionlatent-space warpingvideo generationspatial cache

0 comments

The pith

Video world models store 3D scene memory directly in diffusion latent space to avoid pixel reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video world models typically rely on explicit RGB point clouds for 3D consistency across frames, but this requires repeated rendering and encoding that is slow and discards latent features. The paper introduces latent spatial memory as a persistent 3D cache built by lifting latent tokens into 3D with depth-guided back-projection. Novel views are then synthesized through direct latent-space warping instead of pixel-space operations. This change removes both the computational cost of round trips through pixels and the associated information loss. A reader would care if the approach delivers the claimed speed and memory gains while matching or exceeding prior quality on standard benchmarks.

Core claim

The central claim is that a latent spatial memory constructed by depth-guided back-projection of diffusion tokens into 3D and queried by direct latent-space warping preserves spatial consistency, eliminates pixel-space information loss, and removes the need for repeated VAE encoding and rendering, resulting in up to 10.57 times faster end-to-end generation and 55 times smaller memory footprint while reaching state-of-the-art WorldScore performance and strong RealEstate10K reconstruction.

What carries the argument

Latent spatial memory: a persistent 3D cache of scene information stored directly in diffusion latent space, built via depth-guided back-projection of latent tokens and queried via latent-space warping.

If this is right

End-to-end video generation runs up to 10.57 times faster than explicit 3D baselines.
Memory footprint drops by a factor of 55 relative to explicit 3D baselines.
State-of-the-art performance is reached on the WorldScore benchmark.
Strong reconstruction quality is maintained on the RealEstate10K dataset.
Both information loss from pixel round trips and the cost of repeated encoding and rendering are eliminated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-memory structure could be tested in non-video generative tasks that require multi-view consistency, such as novel-view synthesis from single images.
If the speed gains hold at larger scales, the method may enable longer coherent video sequences on consumer hardware.
The reliance on the diffusion model's geometric prior suggests the approach may transfer to other latent generative architectures that already encode 3D cues.

Load-bearing premise

Depth-guided back-projection of latent tokens into 3D followed by direct latent-space warping preserves 3D spatial consistency at least as well as explicit RGB point-cloud methods without introducing new artifacts.

What would settle it

An experiment in which the latent-memory method produces measurably lower WorldScore or worse RealEstate10K reconstruction quality than an explicit RGB point-cloud baseline would falsify the central performance claim.

read the original abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce \emph{latent spatial memory} for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to \textbf{10.57}$\times$ faster end-to-end video generation and \textbf{55}$\times$ reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper moves 3D memory into diffusion latents to skip RGB point-cloud overhead and reports large speed and memory gains with no internal contradictions visible.

read the letter

The core advance is storing a persistent 3D cache directly in the diffusion latent space rather than building and rendering explicit RGB point clouds. They lift latent tokens via depth-guided back-projection and query new views by warping in latent space, which removes the repeated VAE encode/decode steps that normally add cost and lose information.

This produces the headline numbers: up to 10.57× faster end-to-end generation and 55× smaller memory footprint versus explicit baselines, plus SOTA on WorldScore and solid reconstruction on RealEstate10K. The approach is clean on paper because it re-uses the geometric prior already present in the diffusion model instead of re-learning it in pixel space.

The main soft spot is that the performance edge depends on the latent space already carrying enough accurate 3D structure for the warping step to avoid new artifacts. The abstract gives no derivation details, ablation tables, or error bars, so it is impossible to judge how tightly the experiments control for that assumption. The claims are not circular; they rest on external benchmarks.

This paper is aimed at researchers building diffusion-based world models who need longer consistent video at lower cost. Anyone working on video synthesis or simulation will find the technique worth testing. It is coherent enough and the efficiency claims are large enough that it deserves a serious referee even if the experiments need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces latent spatial memory for video world models as a persistent 3D cache stored directly in diffusion latent space. Mirage builds this memory via depth-guided back-projection of latent tokens and queries it through direct latent-space warping, avoiding the repeated rendering/VAE encoding of explicit RGB point-cloud baselines. Experiments report up to 10.57× faster end-to-end generation, 55× memory reduction, SOTA results on WorldScore, and strong reconstruction on RealEstate10K.

Significance. If the central claim holds—that latent-space geometric priors suffice to preserve 3D consistency without pixel-space roundtrips—this offers a practical route to scalable video world models by cutting both compute and memory costs while retaining or improving benchmark performance. The approach directly addresses a known bottleneck in explicit 3D memory methods.

major comments (2)

[§3.2] §3.2 (Latent-Space Warping): the claim that direct warping 'preserves 3D spatial consistency at least as well as explicit RGB point-cloud methods' is load-bearing for the efficiency claims, yet the section provides no quantitative comparison of artifact rates or consistency metrics (e.g., depth error or temporal coherence) between latent warping and the RGB baseline under identical depth inputs.
[Table 3] Table 3 (WorldScore results): the SOTA claim rests on the reported scores, but the table does not report standard deviations across seeds or runs; without this, it is impossible to assess whether the gains over the explicit 3D baseline are statistically reliable or sensitive to the particular depth estimator used.

minor comments (2)

[Abstract] The abstract and §1 use 'up to 10.57×' and '55×' without clarifying whether these are best-case or average-case figures across the evaluated sequences.
[Eq. 3] Notation for the latent token lifting operation (Eq. 3) is introduced without an accompanying diagram showing the coordinate transformation from 2D latent grid to 3D.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional quantitative evidence would strengthen the presentation of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [§3.2] §3.2 (Latent-Space Warping): the claim that direct warping 'preserves 3D spatial consistency at least as well as explicit RGB point-cloud methods' is load-bearing for the efficiency claims, yet the section provides no quantitative comparison of artifact rates or consistency metrics (e.g., depth error or temporal coherence) between latent warping and the RGB baseline under identical depth inputs.

Authors: We agree that a direct quantitative comparison of consistency metrics (depth error, temporal coherence, artifact rates) under identical depth inputs would make the claim more robust. The current manuscript relies on end-to-end benchmark superiority (SOTA WorldScore, strong RealEstate10K reconstruction) as indirect evidence that latent warping preserves sufficient 3D consistency for practical generation. To address the gap, we will add a controlled ablation in the revised Section 3.2 that reports these metrics for both latent warping and the RGB baseline using the same depth maps and camera trajectories. revision: yes
Referee: [Table 3] Table 3 (WorldScore results): the SOTA claim rests on the reported scores, but the table does not report standard deviations across seeds or runs; without this, it is impossible to assess whether the gains over the explicit 3D baseline are statistically reliable or sensitive to the particular depth estimator used.

Authors: We concur that standard deviations are necessary to evaluate statistical reliability and sensitivity to the depth estimator. The reported gains are large relative to typical variance in these benchmarks, but this is not a substitute for explicit reporting. In the revised manuscript we will recompute the WorldScore results over multiple random seeds, report means and standard deviations in Table 3, and include a brief note on the depth estimator used across all methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a latent spatial memory method for video world models, constructing memory via depth-guided back-projection in latent space and querying via direct warping. Performance claims (speedup, memory reduction, SOTA on WorldScore/RealEstate10K) rest on external benchmarks and empirical comparison to explicit 3D baselines rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivation steps are shown that reduce to inputs by construction. The approach is self-contained against the stated geometric prior and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5774 in / 1273 out tokens · 19454 ms · 2026-06-27T16:45:46.364825+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WorldOlympiad: Can Your World Model Survive a Triathlon?
cs.CV 2026-06 unverdicted novelty 5.0

WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.

Reference graph

Works this paper leans on

66 extracted references · 22 linked inside Pith · cited by 1 Pith paper

[1]

Sora.https://openai.com/sora/, 2024

OpenAI. Sora.https://openai.com/sora/, 2024

2024
[2]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Pith/arXiv arXiv 2025
[3]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[4]

Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[5]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[6]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[7]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

2024
[8]

Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

Pith/arXiv arXiv 2024
[9]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024
[10]

Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024. 14

arXiv 2024
[11]

Spatia: Video generation with updatable spatial memory

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[12]

Voyager: Long-range and world- consistent video diffusion for explorable 3d scene generation.arXiv preprint arXiv:2506.04225, 2025

Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, et al. Voyager: Long-range and world- consistent video diffusion for explorable 3d scene generation.arXiv preprint arXiv:2506.04225, 2025

arXiv 2025
[13]

Freeman, and Jiajun Wu

Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image.arXiv:2406.09394, 2024

arXiv 2024
[14]

Wonderjourney: Going from anywhere to everywhere

Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. Wonderjourney: Going from anywhere to everywhere. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6658–6667, 2024

2024
[15]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

arXiv 2026
[16]

Chen, and Jiwen Lu

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, and Jiwen Lu. Drivegen3d: Boosting feed-forward driving scene generation with efficient video diffusion.arXiv preprint arXiv:2510.15264, 2025

Pith/arXiv arXiv 2025
[17]

World-r1: Reinforcing 3d constraints for text-to-video generation.arXiv preprint arXiv:2604.24764, 2026

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y Chen, Zhiyuan He, et al. World-r1: Reinforcing 3d constraints for text-to-video generation.arXiv preprint arXiv:2604.24764, 2026

Pith/arXiv arXiv 2026
[18]

Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

arXiv 2025
[19]

Captain safari: A world engine with pose-aligned 3d memory.arXiv preprint arXiv:2511.22815, 2025

Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine with pose-aligned 3d memory.arXiv preprint arXiv:2511.22815, 2025

arXiv 2025
[20]

Depth anything 3: Recovering the visual space from any views.International Conference on Learning Representations (ICLR), 2026

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.International Conference on Learning Representations (ICLR), 2026

2026
[21]

Feed-forward 3d scene modeling: A problem-driven perspective.arXiv preprint arXiv:2604.14025, 2026

Weijie Wang, Qihang Cao, Sensen Gao, Donny Y Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, et al. Feed-forward 3d scene modeling: A problem-driven perspective.arXiv preprint arXiv:2604.14025, 2026

Pith/arXiv arXiv 2026
[22]

Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs

Weijie Wang, Donny Y Chen, Zeyu Zhang, Duochao Shi, Akide Liu, and Bohan Zhuang. Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs. 38:113407–113436, 2026

2026
[23]

Chen, and Bohan Zhuang

Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y. Chen, and Bohan Zhuang. Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction.arXiv preprint arXiv:2509.19297, 2025

Pith/arXiv arXiv 2025
[24]

Chen, and Bohan Zhuang

Weijie Wang, Zimu Li, Jinchuan Shi, Zeyu Zhang, Botao Ye, Marc Pollefeys, Donny Y. Chen, and Bohan Zhuang. Trisplat: Simulation-ready feed-forward 3d scene reconstruction.arXiv preprint arXiv:2605.26115, 2026. 15

Pith/arXiv arXiv 2026
[25]

Adding conditional control to text-to-image diffusion models, 2023

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023

2023
[26]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

Pith/arXiv arXiv 2025
[27]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27713–27724, October 2025

2025
[28]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

arXiv 2025
[29]

Stereo magni- fication: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magni- fication: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Pith/arXiv arXiv 2018
[30]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[31]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[32]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Pith/arXiv arXiv 2023
[33]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024

2024
[34]

Easyanimate: A high-performance long video generation method based on transformer architecture.arXiv preprint arXiv:2405.18991, 2024

Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, and Jun Huang. Easyanimate: A high-performance long video generation method based on transformer architecture.arXiv preprint arXiv:2405.18991, 2024

arXiv 2024
[35]

Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

arXiv 2024
[36]

Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453, 2025

Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, et al. Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453, 2025

arXiv 2025
[37]

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 16

Pith/arXiv arXiv 2024
[38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[39]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024
[40]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

arXiv 2024
[41]

Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

Pith/arXiv arXiv 2025
[42]

Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

Pith/arXiv arXiv 2025
[43]

Progressive autoregressive video diffusion models.arXiv preprint arXiv:2410.08151, 2024

Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models.arXiv preprint arXiv:2410.08151, 2024

arXiv 2024
[44]

Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024
[45]

Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

arXiv 2025
[46]

I2vcontrol-camera: Precise video camera control with adjustable motion strength.arXiv preprint arXiv:2411.06525, 2024

Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, and Qian He. I2vcontrol-camera: Precise video camera control with adjustable motion strength.arXiv preprint arXiv:2411.06525, 2024

arXiv 2024
[47]

Panflow: Decoupled motion control for panoramic video generation

Cheng Zhang, Hanwen Liang, Donny Y Chen, Qianyi Wu, Konstantinos N Plataniotis, Camilo Cruz Gambardella, and Jianfei Cai. Panflow: Decoupled motion control for panoramic video generation. volume 40, pages 12385–12393, 2026

2026
[48]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024
[49]

Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

arXiv 2025
[50]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control.arXiv preprint arXiv:2501.03847, 2025

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu. Diffusion as shader: 3d-aware video diffusion for versatile video generation control.arXiv preprint arXiv:2501.03847, 2025

arXiv 2025
[51]

Gen3c: 3d-informed world- consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 17

2025
[52]

Omnicam: Unified multimodal video generation via camera control.arXiv preprint arXiv:2504.02312, 2025

Xiaoda Yang, Jiayang Xu, Kaixuan Luan, Xinyu Zhan, Hongshun Qiu, Shijun Shi, Hao Li, Shuai Yang, Li Zhang, Checheng Yu, et al. Omnicam: Unified multimodal video generation via camera control.arXiv preprint arXiv:2504.02312, 2025

arXiv 2025
[53]

Invisible stitch: Gener- ating smooth 3d scenes with depth inpainting

Paul Engstler, Andrea Vedaldi, Iro Laina, and Christian Rupprecht. Invisible stitch: Gener- ating smooth 3d scenes with depth inpainting. InArxiv, 2024

2024
[54]

Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis.arXiv preprint arXiv:2503.13265, 2025

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis.arXiv preprint arXiv:2503.13265, 2025

arXiv 2025
[55]

Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

arXiv 2025
[56]

Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

arXiv 2025
[57]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

arXiv 2025
[58]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[59]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[60]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR), 2022

2022
[61]

Flashworld: High-quality 3d scene generation within seconds, 2025

Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds, 2025

2025
[62]

Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

arXiv 2023
[63]

Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

Pith/arXiv arXiv 2025
[64]

MapAnything: Universal feed-forward metric 3D reconstruction, 2025

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebas- tian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction, 2...

Pith/arXiv arXiv 2025
[65]

UniDepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 18

2024
[66]

hole rate

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in- one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17191–17202, 2025. 19 A Geometric Details This appendix spells out the geometric quantities that the main text defers, so that Section 4 stays rea...

2025

[1] [1]

Sora.https://openai.com/sora/, 2024

OpenAI. Sora.https://openai.com/sora/, 2024

2024

[2] [2]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Pith/arXiv arXiv 2025

[3] [3]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[4] [4]

Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024

[5] [5]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023

[6] [6]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024

[7] [7]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

2024

[8] [8]

Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

Pith/arXiv arXiv 2024

[9] [9]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024

[10] [10]

Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024. 14

arXiv 2024

[11] [11]

Spatia: Video generation with updatable spatial memory

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[12] [12]

Voyager: Long-range and world- consistent video diffusion for explorable 3d scene generation.arXiv preprint arXiv:2506.04225, 2025

Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, et al. Voyager: Long-range and world- consistent video diffusion for explorable 3d scene generation.arXiv preprint arXiv:2506.04225, 2025

arXiv 2025

[13] [13]

Freeman, and Jiajun Wu

Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image.arXiv:2406.09394, 2024

arXiv 2024

[14] [14]

Wonderjourney: Going from anywhere to everywhere

Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. Wonderjourney: Going from anywhere to everywhere. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6658–6667, 2024

2024

[15] [15]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

arXiv 2026

[16] [16]

Chen, and Jiwen Lu

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, and Jiwen Lu. Drivegen3d: Boosting feed-forward driving scene generation with efficient video diffusion.arXiv preprint arXiv:2510.15264, 2025

Pith/arXiv arXiv 2025

[17] [17]

World-r1: Reinforcing 3d constraints for text-to-video generation.arXiv preprint arXiv:2604.24764, 2026

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y Chen, Zhiyuan He, et al. World-r1: Reinforcing 3d constraints for text-to-video generation.arXiv preprint arXiv:2604.24764, 2026

Pith/arXiv arXiv 2026

[18] [18]

Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

arXiv 2025

[19] [19]

Captain safari: A world engine with pose-aligned 3d memory.arXiv preprint arXiv:2511.22815, 2025

Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine with pose-aligned 3d memory.arXiv preprint arXiv:2511.22815, 2025

arXiv 2025

[20] [20]

Depth anything 3: Recovering the visual space from any views.International Conference on Learning Representations (ICLR), 2026

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.International Conference on Learning Representations (ICLR), 2026

2026

[21] [21]

Feed-forward 3d scene modeling: A problem-driven perspective.arXiv preprint arXiv:2604.14025, 2026

Weijie Wang, Qihang Cao, Sensen Gao, Donny Y Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, et al. Feed-forward 3d scene modeling: A problem-driven perspective.arXiv preprint arXiv:2604.14025, 2026

Pith/arXiv arXiv 2026

[22] [22]

Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs

Weijie Wang, Donny Y Chen, Zeyu Zhang, Duochao Shi, Akide Liu, and Bohan Zhuang. Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs. 38:113407–113436, 2026

2026

[23] [23]

Chen, and Bohan Zhuang

Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y. Chen, and Bohan Zhuang. Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction.arXiv preprint arXiv:2509.19297, 2025

Pith/arXiv arXiv 2025

[24] [24]

Chen, and Bohan Zhuang

Weijie Wang, Zimu Li, Jinchuan Shi, Zeyu Zhang, Botao Ye, Marc Pollefeys, Donny Y. Chen, and Bohan Zhuang. Trisplat: Simulation-ready feed-forward 3d scene reconstruction.arXiv preprint arXiv:2605.26115, 2026. 15

Pith/arXiv arXiv 2026

[25] [25]

Adding conditional control to text-to-image diffusion models, 2023

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023

2023

[26] [26]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

Pith/arXiv arXiv 2025

[27] [27]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27713–27724, October 2025

2025

[28] [28]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

arXiv 2025

[29] [29]

Stereo magni- fication: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magni- fication: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Pith/arXiv arXiv 2018

[30] [30]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[31] [31]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024

[32] [32]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Pith/arXiv arXiv 2023

[33] [33]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024

2024

[34] [34]

Easyanimate: A high-performance long video generation method based on transformer architecture.arXiv preprint arXiv:2405.18991, 2024

Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, and Jun Huang. Easyanimate: A high-performance long video generation method based on transformer architecture.arXiv preprint arXiv:2405.18991, 2024

arXiv 2024

[35] [35]

Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

arXiv 2024

[36] [36]

Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453, 2025

Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, et al. Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453, 2025

arXiv 2025

[37] [37]

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 16

Pith/arXiv arXiv 2024

[38] [38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[39] [39]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[40] [40]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

arXiv 2024

[41] [41]

Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

Pith/arXiv arXiv 2025

[42] [42]

Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

Pith/arXiv arXiv 2025

[43] [43]

Progressive autoregressive video diffusion models.arXiv preprint arXiv:2410.08151, 2024

Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models.arXiv preprint arXiv:2410.08151, 2024

arXiv 2024

[44] [44]

Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024

[45] [45]

Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

arXiv 2025

[46] [46]

I2vcontrol-camera: Precise video camera control with adjustable motion strength.arXiv preprint arXiv:2411.06525, 2024

Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, and Qian He. I2vcontrol-camera: Precise video camera control with adjustable motion strength.arXiv preprint arXiv:2411.06525, 2024

arXiv 2024

[47] [47]

Panflow: Decoupled motion control for panoramic video generation

Cheng Zhang, Hanwen Liang, Donny Y Chen, Qianyi Wu, Konstantinos N Plataniotis, Camilo Cruz Gambardella, and Jianfei Cai. Panflow: Decoupled motion control for panoramic video generation. volume 40, pages 12385–12393, 2026

2026

[48] [48]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024

[49] [49]

Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

arXiv 2025

[50] [50]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control.arXiv preprint arXiv:2501.03847, 2025

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu. Diffusion as shader: 3d-aware video diffusion for versatile video generation control.arXiv preprint arXiv:2501.03847, 2025

arXiv 2025

[51] [51]

Gen3c: 3d-informed world- consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 17

2025

[52] [52]

Omnicam: Unified multimodal video generation via camera control.arXiv preprint arXiv:2504.02312, 2025

Xiaoda Yang, Jiayang Xu, Kaixuan Luan, Xinyu Zhan, Hongshun Qiu, Shijun Shi, Hao Li, Shuai Yang, Li Zhang, Checheng Yu, et al. Omnicam: Unified multimodal video generation via camera control.arXiv preprint arXiv:2504.02312, 2025

arXiv 2025

[53] [53]

Invisible stitch: Gener- ating smooth 3d scenes with depth inpainting

Paul Engstler, Andrea Vedaldi, Iro Laina, and Christian Rupprecht. Invisible stitch: Gener- ating smooth 3d scenes with depth inpainting. InArxiv, 2024

2024

[54] [54]

Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis.arXiv preprint arXiv:2503.13265, 2025

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis.arXiv preprint arXiv:2503.13265, 2025

arXiv 2025

[55] [55]

Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

arXiv 2025

[56] [56]

Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

arXiv 2025

[57] [57]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

arXiv 2025

[58] [58]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[59] [59]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[60] [60]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR), 2022

2022

[61] [61]

Flashworld: High-quality 3d scene generation within seconds, 2025

Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds, 2025

2025

[62] [62]

Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

arXiv 2023

[63] [63]

Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

Pith/arXiv arXiv 2025

[64] [64]

MapAnything: Universal feed-forward metric 3D reconstruction, 2025

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebas- tian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction, 2...

Pith/arXiv arXiv 2025

[65] [65]

UniDepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 18

2024

[66] [66]

hole rate

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in- one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17191–17202, 2025. 19 A Geometric Details This appendix spells out the geometric quantities that the main text defers, so that Section 4 stays rea...

2025