Recognition: 2 theorem links
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Pith reviewed 2026-05-16 18:16 UTC · model grok-4.3
The pith
From a single image, camera-controlled 4D video is generated by building dynamic 3D Gaussians in one forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames.
What carries the argument
Dynamic 3D Gaussians that jointly encode the scene's static geometry extracted from the input image and the sampled object motions.
Load-bearing premise
A single static image contains enough information to construct an accurate 3D Gaussian representation and to sample plausible, temporally consistent object motions that align with an arbitrary camera trajectory.
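The premise above leans on lifting one image into metric 3D. The review excerpt does not describe the lifting step, but single-image splatting methods in the reference graph (e.g. Splatter Image, Flash3D) typically seed Gaussian centers by back-projecting a monocular depth map; a minimal sketch under that assumption, for a pinhole camera with known intrinsics:

```python
import numpy as np

def unproject_to_gaussian_centers(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map to candidate 3D Gaussian centers.

    Pinhole model: pixel (u, v) with depth d maps to d * K^-1 @ [u, v, 1].
    The review excerpt does not spell out this step; it is the standard way
    single-image splatting pipelines seed Gaussian positions from depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u[i, j] = j, v[i, j] = i
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3), camera frame

# Toy example: flat 2x2 depth map at 2 m, unit focals, principal point at (0, 0).
centers = unproject_to_gaussian_centers(np.full((2, 2), 2.0), 1.0, 1.0, 0.0, 0.0)
```

This only recovers static geometry; whether the same forward pass can also commit to plausible motion is exactly what the premise asserts.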
What would settle it
If videos generated for large camera movements exhibit visible object trajectory errors or geometric drift from the input image, the single-pass construction would be shown to be insufficient.
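One concrete way to run this test: drive the camera far from the input pose, return, re-render from the original viewpoint, and compare against the input image. A hedged sketch using PSNR as the drift probe (the function and any pass/fail threshold are illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, rendered, max_val=1.0):
    """PSNR between the input image and a re-render at the input camera pose.

    Hypothetical drift probe: if the Gaussians drift geometrically during a
    large camera excursion, the re-render at the original pose degrades and
    PSNR drops. Images are float arrays in [0, max_val].
    """
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform 0.1 offset everywhere gives MSE = 0.01, i.e. PSNR = 20 dB.
drift_score = psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.1))
```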
Original abstract
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Pixel-to-4D, a novel framework that, from a single input image, constructs an explicit 3D Gaussian scene representation and samples plausible object motions in one forward pass. This enables fast, camera-controlled image-to-video generation without iterative denoising steps. The authors claim state-of-the-art video quality and inference efficiency on the KITTI, Waymo, RealEstate10K, and DL3DV-10K datasets.
Significance. If the central claims hold, the work would offer a meaningful advance in controllable video synthesis by combining explicit 3D representations with single-pass dynamics prediction, potentially improving both geometric consistency and speed relative to diffusion-based baselines that rely on iterative refinement.
major comments (2)
- [Abstract] The assertion of state-of-the-art results on KITTI, Waymo, RealEstate10K, and DL3DV-10K is presented without any quantitative metrics, tables, ablation studies, or error analysis, leaving the central empirical claim unsupported in the provided text and preventing verification of the reported gains in quality and efficiency.
- [Method] (Dynamic 3D Gaussians construction): The single-image prediction of both static Gaussians and object motion lacks explicit discussion of 3D regularization (e.g., depth supervision, cross-view consistency losses, or multi-view rendering terms). Without such constraints the representation risks being under-determined, directly threatening 3D consistency when rendering along arbitrary camera trajectories that deviate from the training distribution.
minor comments (2)
- [Abstract] The phrase 'samples plausible object motion' is used without defining the motion parameterization or the loss used to train it; a brief clarification would improve readability.
- [Abstract] The project page URL is given but no supplementary video or code link is referenced in the abstract; adding such pointers would aid reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below and indicate the revisions made to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The assertion of state-of-the-art results on KITTI, Waymo, RealEstate10K and DL3DV-10K is presented without any quantitative metrics, tables, ablation studies, or error analysis, leaving the central empirical claim unsupported in the provided text and preventing verification of the reported gains in quality and efficiency.
Authors: We appreciate the referee highlighting the need for stronger support in the abstract. The full manuscript provides detailed quantitative tables, ablation studies, and error analysis in Section 4 across all listed datasets. To directly address the concern, we will revise the abstract to cite key supporting metrics (e.g., superior PSNR/SSIM and inference speed relative to baselines) while keeping it concise. This grounds the SOTA claim even in the summary text. Revision: yes.
- Referee: [Method] (Dynamic 3D Gaussians construction): The single-image prediction of both static Gaussians and object motion lacks explicit discussion of 3D regularization (e.g., depth supervision, cross-view consistency losses, or multi-view rendering terms). Without such constraints the representation risks being under-determined, which directly threatens 3D consistency when rendering along arbitrary camera trajectories that deviate from the training distribution.
Authors: We agree that an explicit discussion of regularization strengthens the method description. Training already leverages dataset-induced constraints from multi-view video data, but we will expand the Method section with a new paragraph detailing the regularization: monocular depth supervision on Gaussian centers, a cross-view consistency term via auxiliary renderings, and multi-view photometric losses. These additions clarify how the single-pass prediction remains well-constrained for novel trajectories. Revision: yes.
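As a rough illustration of the regularizers the rebuttal names, the sketch below combines an L1 monocular-depth term tying Gaussian centers to a depth prior with an L2 photometric term on an auxiliary-view rendering. The function name and the weights `w_d`, `w_p` are hypothetical; the excerpt gives no loss formulation.

```python
import numpy as np

def regularization_losses(pred_depth, mono_depth, render_aux, target_aux,
                          w_d=1.0, w_p=1.0):
    """Hypothetical sketch of the regularizers described in the rebuttal.

    - depth term: L1 between predicted Gaussian-center depths and a
      monocular depth prior (e.g. from a model like Depth Pro).
    - photometric term: L2 between a rendering at an auxiliary viewpoint
      and the corresponding multi-view training frame.
    """
    depth_loss = np.abs(pred_depth - mono_depth).mean()
    photo_loss = ((render_aux - target_aux) ** 2).mean()
    return w_d * depth_loss + w_p * photo_loss

# Depth off by 1 everywhere, auxiliary render exact: total loss is the depth term.
loss = regularization_losses(np.ones((2, 2)), np.zeros((2, 2)),
                             np.zeros((2, 2, 3)), np.zeros((2, 2, 3)))
```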
Circularity Check
No circularity: novel framework presented without self-referential derivations or fitted predictions
full rationale
The provided abstract and description frame the contribution as a new procedural framework that constructs 3D Gaussians and samples motion from a single image in one forward pass. No equations, parameter-fitting steps, or self-citations are exhibited that would reduce any claimed prediction to an input quantity by construction. The method is positioned as independent of prior fitted results from the same authors, with evaluation on external datasets (KITTI, Waymo, RealEstate10K, DL3DV-10K). This satisfies the criteria for a self-contained proposal with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A single image suffices to infer both static 3D geometry and plausible future object motions.
invented entities (1)
- Dynamic 3D Gaussians (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "Each pixel predicts parameters for N ≥ 1 Gaussians: P = {(δ_i, Δ_i, r_i, s_i, σ_i, c_i, v_i, a_i)}"
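The quoted parameterization ends in per-Gaussian velocity and acceleration terms (v_i, a_i). How they enter the dynamics is not stated in the excerpt; a natural reading is a constant-acceleration motion model for the Gaussian centers, sketched here under that assumption:

```python
import numpy as np

def centers_over_time(mu0, v, a, ts):
    """Advance N Gaussian centers under a constant-acceleration model.

    Assumed dynamics (not confirmed by the excerpt):
        mu(t) = mu0 + v * t + 0.5 * a * t^2
    mu0, v, a: (N, 3) arrays; ts: sequence of T timestamps.
    Returns a (T, N, 3) array of centers, one set per frame.
    """
    ts = np.asarray(ts, dtype=float)[:, None, None]  # (T, 1, 1) for broadcasting
    return mu0[None] + v[None] * ts + 0.5 * a[None] * ts ** 2

# One Gaussian moving along x at 1 unit/s while accelerating along y at 2 units/s^2.
traj = centers_over_time(np.array([[0.0, 0.0, 0.0]]),
                         np.array([[1.0, 0.0, 0.0]]),
                         np.array([[0.0, 2.0, 0.0]]),
                         [0.0, 1.0, 2.0])
```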
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, and Paul Guerrero. RenderDiffusion: Image diffusion for 3D reconstruction, inpainting and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12608–12618, 2023.
- [2] Titas Anciukevičius, Fabian Manhardt, Federico Tombari, and Paul Henderson. Denoising diffusion via image-based rendering. In The Twelfth International Conference on Learning Representations, 2024.
- [3] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. AC3D: Analyzing and improving 3D camera control in video diffusion transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22875–22889, 2025.
- [4] A. Blattmann et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [5] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
- [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [7] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [8] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. 2013.
- [9] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- [10] Paul Henderson and Christoph H. Lampert. Unsupervised object-centric video generation and decomposition in 3D. Advances in Neural Information Processing Systems, 33:3106–3117, 2020.
- [11] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
- [12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023.
- [13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
- [14] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J. Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024.
- [15] Zihang Lai, Sifei Liu, Alexei A. Efros, and Xiaolong Wang. Video autoencoder: Self-supervised disentanglement of static 3D structure and motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9730–9740, 2021.
- [16] Teng Li, Guangcong Zheng, Rui Jiang, Shuigen Zhan, Tao Wu, Yehao Lu, Yining Lin, Chuanyun Deng, Yepan Xiong, Min Chen, et al. RealCam-I2V: Real-world image-to-video generation with interactive complex camera control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28785–28796, 2025.
- [17] Jiajing Lin, Zhenzhong Wang, Yongjie Hou, Yuzhou Tang, and Min Jiang. Phy124: Fast physics-driven 4D content generation from a single image. arXiv preprint arXiv:2409.07179, 2024.
- [18] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In Proceedings of the IEEE/CVF …, 2024.
- [19] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36:22226–22246, 2023.
- [20] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. 2023.
- [21] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. 2024.
- [22] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3D: Single image to 3D using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9970–9980, 2024.
- [23] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. DreamDrive: Generative 4D scene modeling from street view images. arXiv preprint arXiv:2501.00601, 2024.
- [24] Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Liang-Chieh Chen, and Henrik Kretzschmar. Waymo Open Dataset: Panoramic video panoptic segmentation. In European Conference on Computer Vision, pages 53–72. Springer, 2022.
- [25] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159.
- [26] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. GEN3C: 3D-informed world-consistent video generation with precise camera control. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6121–6132, 2025.
- [27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [28] Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, and Guosheng Lin. Make-It-4D: Synthesizing a consistent long-term dynamic scene video from a single image. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8167–8175, 2023.
- [29] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. 2024.
- [30] Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhu, Jun Zhang, and Yikai Wang. DimensionX: Create any 3D and 4D scenes from a single image with decoupled video diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13695–13706, 2025.
- [31] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)image-conditioned 3D generative models from 2D data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8863–8873, 2023.
- [32] Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, and Andrea Vedaldi. Flash3D: Feed-forward generalisable 3D scene reconstruction from a single image. arXiv preprint arXiv:2402.03807, 2024.
- [33] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter Image: Ultra-fast single-view 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10208–10217, 2024.
- [34] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. 2024.
- [35] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16773–16783, 2023.
- [36] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
- [37] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. DynamiCrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024.
- [38] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. CamCo: Camera-controllable 3D-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.
- [39] Sudhir Yarram and Junsong Yuan. Forecasting future videos from novel views via disentangled 3D scene representation. In European Conference on Computer Vision, pages 58–76. Springer, 2024.
- [40] Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7094–7104, 2023.
- [41] Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, and Xi Li. CamI2V: Camera-controlled image-to-video diffusion model. arXiv preprint arXiv:2402.00000, 2024.
- [42] Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text- and image-guided 4D scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024.
- [43] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.
- [44] Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. EWA volume splatting. In Proceedings Visualization 2001 (VIS '01), pages 29–538. IEEE, 2001.
discussion (0)