Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

Chengrui Dong; Cong Qiu; Dongxu Wei; Hailong Qin; Hangning Zhou; Mu Yang; Peidong Liu; Zhenhua Du; Zhiqi Li

arxiv: 2606.30045 · v1 · pith:RKBTOV7Znew · submitted 2026-06-29 · 💻 cs.CV

Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

Zhiqi Li , Chengrui Dong , Zhenhua Du , Hangning Zhou , Cong Qiu , Hailong Qin , Mu Yang , Dongxu Wei

show 1 more author

Peidong Liu

This is my paper

Pith reviewed 2026-06-30 06:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords neural implicit sceneinteractive video generationscene representationtransformer VAEdiffusion transformercamera-controlled explorationlong-horizon consistencyposed-view data

0 comments

The pith

Interactive world exploration rolls out a fixed Neural Implicit Scene state instead of growing frame latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scene-centric approach to camera-controlled video generation that replaces sequences of latent frames with a single compact implicit representation of the scene. This splits the process into a stochastic update of the scene state and a separate deterministic rendering step driven by camera pose. NeuWorld realizes the idea using a transformer VAE to encode sparse posed views into the Neural Implicit Scene and a diffusion transformer to evolve that state forward using trajectory and history information. The system trains from scratch on public posed-view datasets and produces long video sequences that stay consistent without external encoders or pretrained video models. A reader would care because the factorization directly targets the source of drift and inefficiency in prior frame-by-frame methods.

Core claim

The paper claims that interactive generation factorizes into stochastic transition of a compact Neural Implicit Scene (NIS) and deterministic pose-conditioned rendering given the sampled state; NeuWorld instantiates this with a transformer VAE that learns locally anchored NIS from sparse posed frames and a diffusion transformer that evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history, achieving strong long-horizon consistency with favorable inference efficiency while training from scratch without pretrained backbones.

What carries the argument

Neural Implicit Scene (NIS), a fixed-length renderable implicit state that serves as the rollout variable separating scene transition from observation synthesis.

Load-bearing premise

A transformer VAE can learn locally anchored NIS from sparse posed frames that a diffusion transformer can evolve consistently using only camera trajectories and retrieved history.

What would settle it

Run a long camera trajectory through a known synthetic 3D scene and measure whether rendered views accumulate geometric or appearance drift relative to ground-truth renders from the same poses.

read the original abstract

Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is swapping frame-latent rollouts for a fixed-length Neural Implicit Scene that gets updated stochastically and rendered deterministically, which is a clean factorization if the implementation holds.

read the letter

The central claim is that interactive generation works better when the state being rolled out is a compact, renderable implicit scene rather than a growing sequence of video latents. NeuWorld implements this with a transformer VAE that encodes sparse posed frames into the NIS and a diffusion transformer that evolves the state given camera trajectories and retrieved history. Reusing the VAE encoder for all conditioning signals keeps the architecture uniform and removes the need for separate video or 3D backbones.

Trained from scratch on public posed-view data, the setup is straightforward and avoids common external dependencies. That choice is worth noting because it tests whether the factorization itself can carry the consistency load.

The soft spot is the gap between the stated goal of long-horizon consistency and the evidence supplied in the abstract. The key assumption—that the VAE produces locally anchored, renderable states the diffusion model can update reliably—remains unexamined without the actual training curves, ablation tables, or rollout examples. If those states drift or the diffusion step introduces artifacts over dozens of steps, the efficiency and consistency advantages disappear.

This work is aimed at people building camera-controlled world models or simulation environments who are already thinking about implicit representations. A reader looking for a concrete alternative to frame-based latent rollouts will see a distinct direction here.

The idea is distinct enough and the training protocol clean enough that it should go to peer review so the experiments can be checked directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes 'Walking in the Implicit,' a scene-centric paradigm for interactive video generation that replaces rollout over growing frame latents with transitions over a fixed-length renderable implicit state called Neural Implicit Scene (NIS). This factorizes the task into stochastic NIS evolution (via diffusion transformer conditioned on camera trajectories and geometry-aware history) and deterministic pose-conditioned rendering. NeuWorld instantiates the approach with a transformer VAE that learns locally anchored NIS from sparse posed frames, reuses the VAE encoder as a unified conditioner, and is trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, claiming strong long-horizon consistency and favorable inference efficiency.

Significance. If the technical claims hold, the factorization could meaningfully advance interactive world exploration by decoupling state consistency from high-frequency synthesis and by avoiding heterogeneous encoders. The from-scratch training on public data and reuse of the VAE encoder as conditioner are notable strengths that would support reproducibility and simplicity if empirically validated.

major comments (2)

[Abstract, §3] Abstract and §3: the central claim that a transformer VAE produces 'locally anchored' renderable NIS from sparse posed frames, and that a diffusion transformer can evolve this state under camera-trajectory and geometry-aware history conditioning while preserving long-horizon consistency, is the load-bearing premise; however, no equations, loss formulations, architecture diagrams, or ablation results are supplied to verify that the VAE actually yields a compact, renderable state rather than collapsing to frame-like latents.
[Abstract] Abstract: the assertion of 'strong long-horizon consistency' and 'favorable inference efficiency' relative to frame-latent baselines is presented without reference to any quantitative metrics, datasets, or comparison tables, making it impossible to assess whether the factorization delivers the claimed gains.

minor comments (1)

[§3] Notation for NIS and the conditioning mechanisms should be introduced with explicit definitions and dimensionality statements early in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and constructive comments. We address each major point below and will revise the manuscript to improve clarity and support for the central claims.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3: the central claim that a transformer VAE produces 'locally anchored' renderable NIS from sparse posed frames, and that a diffusion transformer can evolve this state under camera-trajectory and geometry-aware history conditioning while preserving long-horizon consistency, is the load-bearing premise; however, no equations, loss formulations, architecture diagrams, or ablation results are supplied to verify that the VAE actually yields a compact, renderable state rather than collapsing to frame-like latents.

Authors: We agree that explicit verification of the NIS properties strengthens the presentation. Section 3.2 describes the transformer VAE encoder that maps sparse posed frames to a fixed-length NIS via pose-conditioned cross-attention, and the diffusion transformer in §3.3 evolves this state. To address the concern directly, the revision will add the VAE loss formulation (reconstruction plus KL divergence), a dedicated architecture diagram, and an ablation in §4.3 comparing NIS renderability (via novel-view PSNR) against frame-latent collapse. These changes will be incorporated. revision: yes
Referee: [Abstract] Abstract: the assertion of 'strong long-horizon consistency' and 'favorable inference efficiency' relative to frame-latent baselines is presented without reference to any quantitative metrics, datasets, or comparison tables, making it impossible to assess whether the factorization delivers the claimed gains.

Authors: The abstract summarizes results that are quantified in §4 on datasets including RealEstate10K and ACID, with tables reporting long-horizon metrics (e.g., consistency PSNR over 100+ frames) and inference speed versus frame-latent baselines. We will revise the abstract to include explicit references to these tables and datasets so the claims are grounded from the outset. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces a scene-centric paradigm by defining Neural Implicit Scene (NIS) as a fixed-length renderable implicit state that factorizes stochastic transition from deterministic pose-conditioned rendering. This is instantiated via a transformer VAE and diffusion transformer trained from scratch on public posed-view data, with no equations, loss formulations, or derivations supplied that reduce by construction to fitted inputs, self-definitions, or self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are present in the provided text. The central claims rest on independent training and empirical consistency rather than circular reductions, making the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract does not provide details on free parameters or axioms; the main invented entity is the NIS.

invented entities (1)

Neural Implicit Scene (NIS) no independent evidence
purpose: Fixed-length renderable implicit state for scene representation to factorize generation
Central new concept proposed in the paper to enable the scene-centric paradigm.

pith-pipeline@v0.9.1-grok · 5722 in / 1108 out tokens · 56421 ms · 2026-06-30T06:11:10.195662+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

80 extracted references · 13 canonical work pages · 7 internal anchors

[1]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024
[2]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. InICLR, 2025

2025
[3]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, pages 15791–15801, 2025

2025
[4]

Gamefactory: Creating new games with generative interactive videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InICCV, 2025

2025
[5]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

2021
[6]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023
[7]

Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

work page arXiv 2024
[8]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, pages 6121–6132, 2025

2025
[10]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InICCV, 2025. 12

2025
[11]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. InICLR, 2025

2025
[12]

Rayzer: A self-supervised large view synthesis model

Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, et al. Rayzer: A self-supervised large view synthesis model. InICCV, 2025

2025
[13]

Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

2018
[14]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, pages 22160–22169, 2024

2024
[15]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, pages 5855–5864, 2021

2021
[16]

Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[17]

Zip-nerf: Anti-aliased grid-based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InICCV, pages 19697–19705, 2023

2023
[18]

Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps

Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. InICCV, pages 14335–14345, 2021

2021
[19]

Baking neural radiance fields for real-time view synthesis

Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. InICCV, pages 5875–5884, 2021

2021
[20]

Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

Christian Reiser, Rick Szeliski, Dor Verbin, Pratul Srinivasan, Ben Mildenhall, Andreas Geiger, Jon Barron, and Peter Hedman. Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

2023
[21]

Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs

Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. InCVPR, pages 5480–5490, 2022

2022
[22]

Nerf in the wild: Neural radiance fields for unconstrained photo collections

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, pages 7210–7219, 2021

2021
[23]

Nerf–: Neural radiance fields without known camera parameters

Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. 2021

2021
[24]

Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction

Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. InCVPR, pages 5459–5469, 2022

2022
[25]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InCVPR, pages 5501–5510, 2022

2022
[26]

Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

2022
[27]

Point-nerf: Point-based neural radiance fields

Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. InCVPR, pages 5438–5448, 2022

2022
[28]

Differentiable point-based radiance fields for efficient view synthesis

Qiang Zhang, Seung-Hwan Baek, Szymon Rusinkiewicz, and Felix Heide. Differentiable point-based radiance fields for efficient view synthesis. InSIGGRAPH Asia, pages 1–12, 2022. 13

2022
[29]

Neural points: Point cloud representation with neural fields for arbitrary upsampling

Wanquan Feng, Jin Li, Hongrui Cai, Xiaonan Luo, and Juyong Zhang. Neural points: Point cloud representation with neural fields for arbitrary upsampling. InCVPR, pages 18633–18642, 2022

2022
[30]

pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. InCVPR, pages 4578–4587, 2021

2021
[31]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, pages 4690–4699, 2021

2021
[32]

Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InICCV, pages 14124–14133, 2021

2021
[33]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR, pages 19457–19467, 2024

2024
[34]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, pages 370–386. Springer, 2024

2024
[35]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. InICLR, 2025

2025
[36]

Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, and Peidong Liu. Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

work page arXiv 2025
[37]

Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations

Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. InCVPR, pages 6229–6238, 2022

2022
[38]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, pages 7310–7320, 2024

2024
[40]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2024

2024
[41]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InICLR, 2024

2024
[42]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[43]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

2024
[44]

Cameractrl: Enabling camera control for text-to-video generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. InICLR, 2025. 14

2025
[45]

Collaborative video diffusion: Consistent multi-video generation with camera control

Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024

2024
[46]

Vd3d: Taming large video diffusion transformers for 3d camera control

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. InICLR, 2025

2025
[47]

Direct-a-video: Customized video generation with user-directed camera movement and object motion

Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

2024
[48]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, pages 22875–22889, 2025

2025
[49]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

2024
[50]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. InAdvances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URLhttps: //papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution . https://worldmodels.github.io

2018
[51]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

Etched Decart. Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

2024
[53]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024
[55]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InCVPR, pages 6613–6623, 2024

2024
[57]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, pages 20697–20709, 2024

2024
[58]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

2025
[59]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lyu. Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025. 15

work page arXiv 2025
[61]

More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, and Yichao Yan. More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

work page arXiv 2025
[62]

Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InSIGGRAPH Asia, pages 1–12, 2025

2025
[63]

Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. InICCV, pages 100–111, 2025

2025
[64]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia, 2025

2025
[65]

Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

2025
[66]

Julius Plucker. Xvii. on a new geometry of space.Philosophical Transactions of the Royal Society of London, (155):725–791, 1865
[67]

Perceptual losses for real-time style transfer and super-resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. InECCV, pages 694–711. Springer, 2016

2016
[68]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InCVPR, pages 1125–1134, 2017

2017
[69]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR, pages 12873–12883, 2021

2021
[70]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

2022
[71]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023

2023
[72]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[73]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

2015
[74]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InCVPR, pages 22669–22679, 2023

2023
[75]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

2019
[76]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational conference on machine learning, pages 7480–7512. PMLR, 2023

2023
[77]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[78]

Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 16

work page arXiv 2025
[79]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[80]

Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025. 17 A Appendix A.1 Additional Analysis of Partial NIS and NIS Space A.1.1 Visual Evidence from Masked Reconstruction Motivation.We provide visual evidence for the key empirical property used by our unified ...

work page arXiv 2025

[1] [1]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024

[2] [2]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. InICLR, 2025

2025

[3] [3]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, pages 15791–15801, 2025

2025

[4] [4]

Gamefactory: Creating new games with generative interactive videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InICCV, 2025

2025

[5] [5]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

2021

[6] [6]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023

[7] [7]

Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

work page arXiv 2024

[8] [8]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, pages 6121–6132, 2025

2025

[10] [10]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InICCV, 2025. 12

2025

[11] [11]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. InICLR, 2025

2025

[12] [12]

Rayzer: A self-supervised large view synthesis model

Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, et al. Rayzer: A self-supervised large view synthesis model. InICCV, 2025

2025

[13] [13]

Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

2018

[14] [14]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, pages 22160–22169, 2024

2024

[15] [15]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, pages 5855–5864, 2021

2021

[16] [16]

Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T Barron, and Pratul P Srinivasan. Ref-nerf: Structured view-dependent appearance for neural radiance fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[17] [17]

Zip-nerf: Anti-aliased grid-based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InICCV, pages 19697–19705, 2023

2023

[18] [18]

Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps

Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. InICCV, pages 14335–14345, 2021

2021

[19] [19]

Baking neural radiance fields for real-time view synthesis

Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. InICCV, pages 5875–5884, 2021

2021

[20] [20]

Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

Christian Reiser, Rick Szeliski, Dor Verbin, Pratul Srinivasan, Ben Mildenhall, Andreas Geiger, Jon Barron, and Peter Hedman. Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes.ACM Transactions on Graphics (ToG), 42(4):1–12, 2023

2023

[21] [21]

Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs

Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. InCVPR, pages 5480–5490, 2022

2022

[22] [22]

Nerf in the wild: Neural radiance fields for unconstrained photo collections

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, pages 7210–7219, 2021

2021

[23] [23]

Nerf–: Neural radiance fields without known camera parameters

Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. 2021

2021

[24] [24]

Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction

Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. InCVPR, pages 5459–5469, 2022

2022

[25] [25]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InCVPR, pages 5501–5510, 2022

2022

[26] [26]

Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

2022

[27] [27]

Point-nerf: Point-based neural radiance fields

Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. InCVPR, pages 5438–5448, 2022

2022

[28] [28]

Differentiable point-based radiance fields for efficient view synthesis

Qiang Zhang, Seung-Hwan Baek, Szymon Rusinkiewicz, and Felix Heide. Differentiable point-based radiance fields for efficient view synthesis. InSIGGRAPH Asia, pages 1–12, 2022. 13

2022

[29] [29]

Neural points: Point cloud representation with neural fields for arbitrary upsampling

Wanquan Feng, Jin Li, Hongrui Cai, Xiaonan Luo, and Juyong Zhang. Neural points: Point cloud representation with neural fields for arbitrary upsampling. InCVPR, pages 18633–18642, 2022

2022

[30] [30]

pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. InCVPR, pages 4578–4587, 2021

2021

[31] [31]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, pages 4690–4699, 2021

2021

[32] [32]

Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InICCV, pages 14124–14133, 2021

2021

[33] [33]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR, pages 19457–19467, 2024

2024

[34] [34]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, pages 370–386. Springer, 2024

2024

[35] [35]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. InICLR, 2025

2025

[36] [36]

Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, and Peidong Liu. Vicasplat: A single run is all you need for 3d gaussian splatting and camera estimation from unposed video frames.arXiv preprint arXiv:2503.10286, 2025

work page arXiv 2025

[37] [37]

Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations

Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. InCVPR, pages 6229–6238, 2022

2022

[38] [38]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, pages 7310–7320, 2024

2024

[40] [40]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2024

2024

[41] [41]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InICLR, 2024

2024

[42] [42]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022

[43] [43]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

2024

[44] [44]

Cameractrl: Enabling camera control for text-to-video generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. InICLR, 2025. 14

2025

[45] [45]

Collaborative video diffusion: Consistent multi-video generation with camera control

Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024

2024

[46] [46]

Vd3d: Taming large video diffusion transformers for 3d camera control

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. InICLR, 2025

2025

[47] [47]

Direct-a-video: Customized video generation with user-directed camera movement and object motion

Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

2024

[48] [48]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, pages 22875–22889, 2025

2025

[49] [49]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

2024

[50] [50]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. InAdvances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URLhttps: //papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution . https://worldmodels.github.io

2018

[51] [51]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

Etched Decart. Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

2024

[53] [53]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[55] [55]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InCVPR, pages 6613–6623, 2024

2024

[57] [57]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, pages 20697–20709, 2024

2024

[58] [58]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

2025

[59] [59]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lyu. Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025. 15

work page arXiv 2025

[61] [61]

More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, and Yichao Yan. More: 3d visual geometry reconstruction meets mixture-of- experts.arXiv preprint arXiv:2510.27234, 2025

work page arXiv 2025

[62] [62]

Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InSIGGRAPH Asia, pages 1–12, 2025

2025

[63] [63]

Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. InICCV, pages 100–111, 2025

2025

[64] [64]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia, 2025

2025

[65] [65]

Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.Advances in Neural Information Processing Systems, 2025

2025

[66] [66]

Julius Plucker. Xvii. on a new geometry of space.Philosophical Transactions of the Royal Society of London, (155):725–791, 1865

[67] [67]

Perceptual losses for real-time style transfer and super-resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. InECCV, pages 694–711. Springer, 2016

2016

[68] [68]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InCVPR, pages 1125–1134, 2017

2017

[69] [69]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR, pages 12873–12883, 2021

2021

[70] [70]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

2022

[71] [71]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023

2023

[72] [72]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[73] [73]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

2015

[74] [74]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InCVPR, pages 22669–22679, 2023

2023

[75] [75]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

2019

[76] [76]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational conference on machine learning, pages 7480–7512. PMLR, 2023

2023

[77] [77]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[78] [78]

Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 16

work page arXiv 2025

[79] [79]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[80] [80]

Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025. 17 A Appendix A.1 Additional Analysis of Partial NIS and NIS Space A.1.1 Visual Evidence from Masked Reconstruction Motivation.We provide visual evidence for the key empirical property used by our unified ...

work page arXiv 2025