pith. sign in

arxiv: 2606.30045 · v1 · pith:RKBTOV7Znew · submitted 2026-06-29 · 💻 cs.CV

Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

Pith reviewed 2026-06-30 06:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords neural implicit sceneinteractive video generationscene representationtransformer VAEdiffusion transformercamera-controlled explorationlong-horizon consistencyposed-view data
0
0 comments X

The pith

Interactive world exploration rolls out a fixed Neural Implicit Scene state instead of growing frame latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scene-centric approach to camera-controlled video generation that replaces sequences of latent frames with a single compact implicit representation of the scene. This splits the process into a stochastic update of the scene state and a separate deterministic rendering step driven by camera pose. NeuWorld realizes the idea using a transformer VAE to encode sparse posed views into the Neural Implicit Scene and a diffusion transformer to evolve that state forward using trajectory and history information. The system trains from scratch on public posed-view datasets and produces long video sequences that stay consistent without external encoders or pretrained video models. A reader would care because the factorization directly targets the source of drift and inefficiency in prior frame-by-frame methods.

Core claim

The paper claims that interactive generation factorizes into stochastic transition of a compact Neural Implicit Scene (NIS) and deterministic pose-conditioned rendering given the sampled state; NeuWorld instantiates this with a transformer VAE that learns locally anchored NIS from sparse posed frames and a diffusion transformer that evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history, achieving strong long-horizon consistency with favorable inference efficiency while training from scratch without pretrained backbones.

What carries the argument

Neural Implicit Scene (NIS), a fixed-length renderable implicit state that serves as the rollout variable separating scene transition from observation synthesis.

Load-bearing premise

A transformer VAE can learn locally anchored NIS from sparse posed frames that a diffusion transformer can evolve consistently using only camera trajectories and retrieved history.

What would settle it

Run a long camera trajectory through a known synthetic 3D scene and measure whether rendered views accumulate geometric or appearance drift relative to ground-truth renders from the same poses.

read the original abstract

Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes 'Walking in the Implicit,' a scene-centric paradigm for interactive video generation that replaces rollout over growing frame latents with transitions over a fixed-length renderable implicit state called Neural Implicit Scene (NIS). This factorizes the task into stochastic NIS evolution (via diffusion transformer conditioned on camera trajectories and geometry-aware history) and deterministic pose-conditioned rendering. NeuWorld instantiates the approach with a transformer VAE that learns locally anchored NIS from sparse posed frames, reuses the VAE encoder as a unified conditioner, and is trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, claiming strong long-horizon consistency and favorable inference efficiency.

Significance. If the technical claims hold, the factorization could meaningfully advance interactive world exploration by decoupling state consistency from high-frequency synthesis and by avoiding heterogeneous encoders. The from-scratch training on public data and reuse of the VAE encoder as conditioner are notable strengths that would support reproducibility and simplicity if empirically validated.

major comments (2)
  1. [Abstract, §3] Abstract and §3: the central claim that a transformer VAE produces 'locally anchored' renderable NIS from sparse posed frames, and that a diffusion transformer can evolve this state under camera-trajectory and geometry-aware history conditioning while preserving long-horizon consistency, is the load-bearing premise; however, no equations, loss formulations, architecture diagrams, or ablation results are supplied to verify that the VAE actually yields a compact, renderable state rather than collapsing to frame-like latents.
  2. [Abstract] Abstract: the assertion of 'strong long-horizon consistency' and 'favorable inference efficiency' relative to frame-latent baselines is presented without reference to any quantitative metrics, datasets, or comparison tables, making it impossible to assess whether the factorization delivers the claimed gains.
minor comments (1)
  1. [§3] Notation for NIS and the conditioning mechanisms should be introduced with explicit definitions and dimensionality statements early in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and constructive comments. We address each major point below and will revise the manuscript to improve clarity and support for the central claims.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3: the central claim that a transformer VAE produces 'locally anchored' renderable NIS from sparse posed frames, and that a diffusion transformer can evolve this state under camera-trajectory and geometry-aware history conditioning while preserving long-horizon consistency, is the load-bearing premise; however, no equations, loss formulations, architecture diagrams, or ablation results are supplied to verify that the VAE actually yields a compact, renderable state rather than collapsing to frame-like latents.

    Authors: We agree that explicit verification of the NIS properties strengthens the presentation. Section 3.2 describes the transformer VAE encoder that maps sparse posed frames to a fixed-length NIS via pose-conditioned cross-attention, and the diffusion transformer in §3.3 evolves this state. To address the concern directly, the revision will add the VAE loss formulation (reconstruction plus KL divergence), a dedicated architecture diagram, and an ablation in §4.3 comparing NIS renderability (via novel-view PSNR) against frame-latent collapse. These changes will be incorporated. revision: yes

  2. Referee: [Abstract] Abstract: the assertion of 'strong long-horizon consistency' and 'favorable inference efficiency' relative to frame-latent baselines is presented without reference to any quantitative metrics, datasets, or comparison tables, making it impossible to assess whether the factorization delivers the claimed gains.

    Authors: The abstract summarizes results that are quantified in §4 on datasets including RealEstate10K and ACID, with tables reporting long-horizon metrics (e.g., consistency PSNR over 100+ frames) and inference speed versus frame-latent baselines. We will revise the abstract to include explicit references to these tables and datasets so the claims are grounded from the outset. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces a scene-centric paradigm by defining Neural Implicit Scene (NIS) as a fixed-length renderable implicit state that factorizes stochastic transition from deterministic pose-conditioned rendering. This is instantiated via a transformer VAE and diffusion transformer trained from scratch on public posed-view data, with no equations, loss formulations, or derivations supplied that reduce by construction to fitted inputs, self-definitions, or self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are present in the provided text. The central claims rest on independent training and empirical consistency rather than circular reductions, making the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract does not provide details on free parameters or axioms; the main invented entity is the NIS.

invented entities (1)
  • Neural Implicit Scene (NIS) no independent evidence
    purpose: Fixed-length renderable implicit state for scene representation to factorize generation
    Central new concept proposed in the paper to enable the scene-centric paradigm.

pith-pipeline@v0.9.1-grok · 5722 in / 1108 out tokens · 56421 ms · 2026-06-30T06:11:10.195662+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  2. [2]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. InICLR, 2025

  3. [3]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, pages 15791–15801, 2025

  4. [4]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InICCV, 2025

  5. [5]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  6. [6]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  7. [7]

    Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

  8. [8]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

  9. [9]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, pages 6121–6132, 2025

  10. [10]

    Vmem: Consistent interactive video scene generation with surfel-indexed view memory

    Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InICCV, 2025. 12

  11. [11]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. InICLR, 2025

  12. [12]

    Rayzer: A self-supervised large view synthesis model

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, et al. Rayzer: A self-supervised large view synthesis model. InICCV, 2025

  13. [13]

    Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

  14. [14]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, pages 22160–22169, 2024

  15. [15]

    Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, pages 5855–5864, 2021

  16. [16]

    Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  17. [17]

    Zip-nerf: Anti-aliased grid-based neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InICCV, pages 19697–19705, 2023

  18. [18]

    Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps

    Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. InICCV, pages 14335–14345, 2021

  19. [19]

    Baking neural radiance fields for real-time view synthesis

    Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. InICCV, pages 5875–5884, 2021

  20. [20]

    Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

    Christian Reiser, Rick Szeliski, Dor Verbin, Pratul Srinivasan, Ben Mildenhall, Andreas Geiger, Jon Barron, and Peter Hedman. Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

  21. [21]

    Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs

    Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. InCVPR, pages 5480–5490, 2022

  22. [22]

    Nerf in the wild: Neural radiance fields for unconstrained photo collections

    Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, pages 7210–7219, 2021

  23. [23]

    Nerf–: Neural radiance fields without known camera parameters

    Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. 2021

  24. [24]

    Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction

    Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. InCVPR, pages 5459–5469, 2022

  25. [25]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InCVPR, pages 5501–5510, 2022

  26. [26]

    Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

  27. [27]

    Point-nerf: Point-based neural radiance fields

    Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. InCVPR, pages 5438–5448, 2022

  28. [28]

    Differentiable point-based radiance fields for efficient view synthesis

    Qiang Zhang, Seung-Hwan Baek, Szymon Rusinkiewicz, and Felix Heide. Differentiable point-based radiance fields for efficient view synthesis. InSIGGRAPH Asia, pages 1–12, 2022. 13

  29. [29]

    Neural points: Point cloud representation with neural fields for arbitrary upsampling

    Wanquan Feng, Jin Li, Hongrui Cai, Xiaonan Luo, and Juyong Zhang. Neural points: Point cloud representation with neural fields for arbitrary upsampling. InCVPR, pages 18633–18642, 2022

  30. [30]

    pixelnerf: Neural radiance fields from one or few images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. InCVPR, pages 4578–4587, 2021

  31. [31]

    Ibrnet: Learning multi-view image-based rendering

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, pages 4690–4699, 2021

  32. [32]

    Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

    Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InICCV, pages 14124–14133, 2021

  33. [33]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR, pages 19457–19467, 2024

  34. [34]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, pages 370–386. Springer, 2024

  35. [35]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. InICLR, 2025

  36. [36]

    Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

    Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, and Peidong Liu. Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

  37. [37]

    Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations

    Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. InCVPR, pages 6229–6238, 2022

  38. [38]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  39. [39]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, pages 7310–7320, 2024

  40. [40]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2024

  41. [41]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InICLR, 2024

  42. [42]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  43. [43]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  44. [44]

    Cameractrl: Enabling camera control for text-to-video generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. InICLR, 2025. 14

  45. [45]

    Collaborative video diffusion: Consistent multi-video generation with camera control

    Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024

  46. [46]

    Vd3d: Taming large video diffusion transformers for 3d camera control

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. InICLR, 2025

  47. [47]

    Direct-a-video: Customized video generation with user-directed camera movement and object motion

    Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

  48. [48]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, pages 22875–22889, 2025

  49. [49]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

  50. [50]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. InAdvances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URLhttps: //papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution . https://worldmodels.github.io

  51. [51]

    Learning Interactive Real-World Simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023

  52. [52]

    Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

    Etched Decart. Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

  53. [53]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  54. [54]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  55. [55]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  56. [56]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InCVPR, pages 6613–6623, 2024

  57. [57]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, pages 20697–20709, 2024

  58. [58]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

  59. [59]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  60. [60]

    Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025

    Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lyu. Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025. 15

  61. [61]

    More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

    Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, and Yichao Yan. More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

  62. [62]

    Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

    Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InSIGGRAPH Asia, pages 1–12, 2025

  63. [63]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. InICCV, pages 100–111, 2025

  64. [64]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia, 2025

  65. [65]

    Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

    Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

  66. [66]

    Julius Plucker. Xvii. on a new geometry of space.Philosophical Transactions of the Royal Society of London, (155):725–791, 1865

  67. [67]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. InECCV, pages 694–711. Springer, 2016

  68. [68]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InCVPR, pages 1125–1134, 2017

  69. [69]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR, pages 12873–12883, 2021

  70. [70]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  71. [71]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023

  72. [72]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  73. [73]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  74. [74]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InCVPR, pages 22669–22679, 2023

  75. [75]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  76. [76]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational conference on machine learning, pages 7480–7512. PMLR, 2023

  77. [77]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  78. [78]

    Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

    Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 16

  79. [79]

    Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  80. [80]

    Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025. 17 A Appendix A.1 Additional Analysis of Partial NIS and NIS Space A.1.1 Visual Evidence from Masked Reconstruction Motivation.We provide visual evidence for the key empirical property used by our unified ...