pith · machine review for the scientific record

arXiv:2604.21915 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Vista4D: Video Reshooting with 4D Point Clouds

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reshooting · 4D point clouds · dynamic scene synthesis · camera trajectory control · view synthesis · 4D reconstruction · multiview video

The pith

Vista4D grounds video reshooting in 4D point clouds to support new camera trajectories while preserving original dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to take an input video and generate a new version seen from different camera paths without changing the motion inside it. It does this by first turning the video into a 4D point cloud that tracks both space and time, using static pixel segmentation and 4D reconstruction to hold onto what was already visible. Training the system on reconstructed multiview dynamic data is meant to make it tolerant of the noise and depth errors that appear in real footage. If the approach works, it would let users change viewpoints in existing videos more reliably than current reshooting methods, opening the door to better editing tools for dynamic scenes.
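As a rough illustration of the geometric core of that pipeline, the sketch below unprojects one frame into a world-space point cloud and splats it into a user-defined target camera, returning both the render and the alpha mask marking covered pixels. It is a minimal numpy sketch assuming pinhole intrinsics and known per-frame depth and poses; the function names and camera conventions are illustrative assumptions, not Vista4D's actual code.

```python
# Hypothetical sketch of the 4D grounding step: lift pixels into 3D with depth,
# then z-buffer splat the points into a target camera. Not the paper's code.
import numpy as np

def unproject(rgb, depth, K, cam_to_world):
    """Lift an H x W frame into world-space 3D points with per-point colors."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                      # camera-space directions
    pts_cam = rays * depth.reshape(-1, 1)                # scale rays by depth
    pts_hom = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_hom @ cam_to_world.T)[:, :3], rgb.reshape(-1, 3)

def render(points, colors, K, world_to_cam, H, W):
    """Z-buffer splat into a target view; also returns the alpha mask of
    pixels the point cloud actually covers (the conditioning mask idea)."""
    pts_hom = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_hom @ world_to_cam.T)[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, colors = pts_cam[in_front], colors[in_front]
    proj = pts_cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, z, colors = uv[ok], pts_cam[ok, 2], colors[ok]
    image = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    alpha = np.zeros((H, W), dtype=bool)
    for (x, y), d, c in zip(uv, z, colors):              # nearest point wins
        if d < zbuf[y, x]:
            zbuf[y, x], image[y, x], alpha[y, x] = d, c, True
    return image, alpha
```

Stacking the per-frame clouds over time, with static pixels kept persistent, is what the review calls the 4D grounding.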

Core claim

Vista4D builds a 4D point cloud from the input video with static pixel segmentation and 4D reconstruction to preserve seen content and supply precise camera signals, then trains on reconstructed multiview dynamic data so the model stays robust when point cloud artifacts appear at inference time on real videos.

What carries the argument

4D-grounded point cloud representation built from static pixel segmentation and 4D reconstruction, which explicitly preserves content and provides camera control signals.

Load-bearing premise

Training on reconstructed multiview dynamic data supplies enough robustness against point cloud artifacts and depth estimation errors when the method runs on real-world videos with difficult trajectories.
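A quick way to see why this premise is load-bearing: monocular depth error is invisible in the source view but turns into pixel-space artifacts as soon as the camera moves, so the training distribution must contain such artifacts. The toy probe below (scene, camera, and noise model all invented for illustration) perturbs depth along each camera ray and measures how far points drift in a side view.

```python
# Toy probe: how fast does a rendered conditioning signal degrade as
# monocular depth error grows? Everything here is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])  # toy intrinsics
pts = rng.uniform(-1, 1, (2000, 3)) + [0, 0, 5]            # toy static scene

def project(p, K):
    q = p @ K.T
    return q[:, :2] / q[:, 2:3]

th = np.deg2rad(10)                                        # 10-degree side view
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
true_depth = pts[:, 2]
for sigma in [0.0, 0.05, 0.2]:                             # depth noise levels
    scale = (true_depth + rng.normal(0, sigma, len(pts))) / true_depth
    noisy = pts * scale[:, None]                           # error along the ray
    err = np.linalg.norm(project(noisy @ R.T, K) - project(pts @ R.T, K), axis=1)
    print(f"depth noise sigma={sigma:.2f} -> mean side-view drift {err.mean():.2f} px")
```

The drift grows with both noise level and viewpoint change, which is exactly the regime the premise says the training data must cover.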

What would settle it

A real video with complex motion and inaccurate depth maps on which Vista4D's output exhibits more 4D inconsistencies or visual artifacts than state-of-the-art baselines.

Figures

Figures reproduced from arXiv:2604.21915 by Bolei Zhou, Koichi Namekata, Kuan Heng Lin, Micah Goldblum, Ning Yu, Pablo Salamanca, Paul Debevec, Ryan Burgert, Yash Kant, Yiwei Zhao, Yuancheng Xu, Zhizheng Liu.

Figure 1
Figure 1: 4D-grounded video reshooting. Given an input video, Vista4D re-synthesizes the scene with the same dynamics from different camera trajectories and viewpoints by grounding the input video and target cameras in a 4D point cloud. Vista4D is robust to point cloud artifacts and generalizes to real-world applications such as 4D scene recomposition and dynamic scene expansion.
Figure 2
Figure 2: Overview of Vista4D. Given an input source video, we build a 4D point cloud where static pixels are temporally persistent via segmentation and 4D reconstruction. We then render the point cloud in the target cameras that users define. Lastly, the source video and the point cloud render & alpha mask are jointly processed by the finetuned video diffusion model to generate a video of the same dynamic scene in the target views.
Figure 3
Figure 3: Multiview 4D reconstruction artifacts. (a) Double reprojection [7] first renders the target video point cloud in the source camera, then re-renders it in the target camera to create occluded regions for paired training, thus viewing the target video depth map only from its frontal, artifact-free view. (b) In contrast, rendering the source video point cloud from the target camera with dynamic multiview data exposes real reconstruction artifacts during training.
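A schematic sketch of the two pair-construction orders the caption contrasts; lift_with_depth and render are placeholder callables standing in for depth-based unprojection and point cloud rendering, not the paper's code.

```python
def pair_double_reprojection(tgt_video, tgt_cam, src_cam, lift_with_depth, render):
    # (a) prior recipe [7]: the TARGET video supplies its own depth, so the
    # cloud is only ever seen from its frontal, artifact-free viewpoint;
    # the source-camera detour merely carves out occluded regions.
    cloud = lift_with_depth(tgt_video, tgt_cam)
    detour = render(cloud, src_cam)
    cond = render(lift_with_depth(detour, src_cam), tgt_cam)
    return cond, tgt_video  # (conditioning, supervision)

def pair_multiview(src_video, src_cam, tgt_video, tgt_cam, lift_with_depth, render):
    # (b) Vista4D's recipe: lift the SOURCE view and render it into a second,
    # real camera, so depth errors surface as genuine artifacts in the
    # conditioning while the supervision stays clean ground truth.
    cond = render(lift_with_depth(src_video, src_cam), tgt_cam)
    return cond, tgt_video
```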
Figure 4
Figure 4: Qualitative comparison on real-life monocular videos. We show two video reshooting examples of Vista4D compared to our baselines, TrajectoryCrafter [7], GEN3C [8], EX-4D [9], ReCamMaster [10], and CamCloneMaster [22]. We measure masked (indicated by the prefix “m”) and full PSNR, SSIM, and LPIPS for synthesis quality [60], along with optical flow endpoint error (EPE) for motion quality [70].
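For reference, the masked PSNR and flow EPE quoted above are conventionally computed as below; this is a generic sketch, not the paper's evaluation code (SSIM and LPIPS require their usual library implementations).

```python
import numpy as np

def masked_psnr(pred, gt, mask, peak=1.0):
    """PSNR restricted to pixels where mask is True (the 'm' prefix);
    pred/gt are H x W x 3 float arrays in [0, peak]."""
    mse = np.mean((pred[mask] - gt[mask]) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def flow_epe(flow_pred, flow_gt):
    """Mean endpoint error between H x W x 2 optical flow fields,
    the EPE motion-quality metric referenced in the caption."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()
```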
Figure 5
Figure 5: Qualitative comparison on novel-view synthesis. We show two samples of Vista4D compared with our baselines on the iPhone dataset [60].
Figure 7
Figure 7: Dynamic scene expansion. With our 4D-grounded temporally-persistent point cloud, Vista4D can do video reshooting with additional scene information from casual scene captures or alternate angles by doing joint 4D reconstruction of these frames with the source video. Doing so reduces video model hallucinations and provides stronger control beyond the source video.
Figure 9
Figure 9: Long video inference with memory. Vista4D can reshoot long videos by doing inference in chunks. By registering static pixels of newly generated chunks back into the temporally-persistent 4D point cloud, Vista4D maintains an explicit, 4D-grounded memory of generated content. We showcase this above as the camera arcs around the scene, indicated with color-matched yellow and pink boxes.
Figure 10
Figure 10: More qualitative comparison on real-life monocular videos, part 1/2. We show two more video reshooting examples of Vista4D compared to baselines: ReCamMaster [10], CamCloneMaster [22], EX-4D [9], TrajectoryCrafter [7], and GEN3C [8]. We encourage viewing these comparisons as videos on our project page, which also contains more comparisons.
Figure 11
Figure 11: More qualitative comparison on real-life monocular videos, part 2/2. We show two more video reshooting examples of Vista4D compared to baselines: ReCamMaster [10], CamCloneMaster [22], EX-4D [9], TrajectoryCrafter [7], and GEN3C [8]. We encourage viewing these comparisons as videos on our project page, which also contains more comparisons.
Figure 12
Figure 12: Video reshooting results at 720p. We show video reshooting results of Vista4D with our 1280 × 720 finetuned checkpoint. More 720p results of our method can be found as videos on our project page.
Figure 13
Figure 13: More dynamic scene expansion results. We show more dynamic scene expansion results, where we incorporate additional scene information from casual scene captures by doing joint 4D reconstruction of these frames with the source video. We encourage viewing these results (and more) as videos on our project page.
Figure 14
Figure 14: More 4D scene recomposition results. We show more 4D scene recomposition results by directly manipulating the 4D point cloud. To prevent conditioning conflicts between the unedited source video and the render of the edited point cloud, we instead condition on the edited source video, which is just the edited point cloud rendered from the source cameras. We encourage viewing these results (and more) as videos on our project page.
Figure 15
Figure 15: More long video inference results. We show more results of inference on long videos. To do so, we chunk the source video into 49-frame clips and run inference clip-by-clip. To explicitly preserve generated content, we continuously integrate the point cloud from the newly generated video into the existing one after each inference pass, which is visualized in the point cloud renders above. We encourage viewing these results (and more) as videos on our project page.
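The chunked loop this caption describes can be summarized as below; reshoot_chunk, reconstruct_static_points, and merge are hypothetical placeholders for the diffusion model call, static-pixel 4D reconstruction, and point cloud registration, not the paper's code.

```python
def reshoot_long_video(frames, cameras, cloud, reshoot_chunk,
                       reconstruct_static_points, merge, chunk=49):
    """Reshoot a long video clip-by-clip while growing a persistent
    point-cloud memory of already-generated content."""
    out = []
    for i in range(0, len(frames), chunk):
        clip, cams = frames[i:i + chunk], cameras[i:i + chunk]
        gen = reshoot_chunk(clip, cams, cloud)  # condition on current memory
        # register static pixels of the new clip back into the persistent cloud
        cloud = merge(cloud, reconstruct_static_points(gen, cams))
        out.extend(gen)
    return out, cloud
```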
Figure 16
Figure 16: Model architecture. The diagram shows the model architecture of Vista4D; the fire icon indicates trainable parameters. We build upon Wan2.1-T2V-14B [2], and we omit timestep conditioning, text prompt to token embedding, modulation, layer normalization, output unshuffle, and diffusion model denoising in the diagram for simplicity. All patchify layers are initialized from the base video model besides…
Figure 17
Figure 17: Camera design UI. The screenshot shows our current camera design UI, built on top of Viser [80]. Users can set camera intrinsics and extrinsics keyframes and interpolation tension/smoothness, while being able to preview the 4D reconstructed point cloud from their defined target cameras in real time.
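Keyframe camera interpolation of the kind the UI exposes can be sketched as below, assuming scipy is available; linear interpolation of camera centers plus quaternion slerp for orientations is one plausible choice, not necessarily the UI's, and the tension/smoothness controls in the caption are omitted.

```python
# Minimal sketch: two camera keyframes interpolated over one 49-frame clip.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

key_t = np.array([0.0, 1.0])                       # keyframe times
key_pos = np.array([[0, 0, 0], [1.0, 0.2, 0]])     # keyframe camera centers
key_rot = Rotation.from_euler("y", [0, 30], degrees=True)

ts = np.linspace(0, 1, 49)                         # one 49-frame clip
rots = Slerp(key_t, key_rot)(ts)                   # spherical rotation interp
poss = (1 - ts)[:, None] * key_pos[0] + ts[:, None] * key_pos[1]

target_cams = []
for R, c in zip(rots.as_matrix(), poss):
    w2c = np.eye(4)
    w2c[:3, :3] = R.T                              # rotate world into camera
    w2c[:3, 3] = -R.T @ c                          # move camera center to origin
    target_cams.append(w2c)                        # one world-to-camera per frame
```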
Figure 18
Figure 18: User study. The left screenshot shows an example video-camera pair of our user study, where users are asked to select their preferred method/option on three dimensions: source video preservation (Q1), camera control accuracy (Q2), and overall video fidelity (Q3). The right screenshot shows our user study UI, highlighting the source video, point cloud render, and corresponding output video as users hover on…
Figure 19
Figure 19: Ablation on depth artifacts and source video conditioning. We show ablation samples on training with depth artifacts (we simulate training without depth artifacts by always doing double reprojection for point cloud rendering [7]) and on source video conditioning (comparing our in-context/frame-concatenated source video conditioning with no source video and with source video injected via cross-attention). Both examples…
Figure 20
Figure 20: Ablation on static pixel temporal persistence. We show ablation samples on training with and without static pixel temporal persistence. Both examples above show the no-temporal-persistence model struggling to preserve seen content from the source video, such as the snow and rock mountain (left) and the metal fence and the road beyond it (right). The model without temporal persistence also exhibits inaccurate…
Original abstract

We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: https://eyeline-labs.github.io/Vista4D

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Vista4D, a video reshooting framework that constructs a 4D point cloud representation from an input video using static pixel segmentation and 4D reconstruction. This representation grounds the scene dynamics and provides camera signals, allowing re-synthesis from new trajectories. The method trains on reconstructed multiview dynamic data to gain robustness against point cloud artifacts at inference time on real monocular videos. It claims superior 4D consistency, camera control, and visual quality over baselines across varied videos and paths, plus generalization to applications such as dynamic scene expansion and 4D scene recomposition.

Significance. If the robustness and generalization claims hold, the work would advance practical 4D video editing and novel-view synthesis for dynamic real-world scenes by explicitly addressing depth artifacts and camera control failures that plague prior methods. The training strategy on multiview reconstructions to mitigate monocular failure modes represents a concrete engineering contribution that could influence downstream applications in film, VR, and content creation.

major comments (2)
  1. [Abstract] Abstract and results: The central generalization claim—that training exclusively on reconstructed multiview dynamic data confers robustness to the artifact distribution arising from monocular depth estimation on real-world videos with challenging trajectories—lacks supporting quantitative evidence such as ablations on noise level, trajectory difficulty, or failure-case breakdowns. This assumption is load-bearing for the headline result of improved 4D consistency on real inputs.
  2. [Abstract] Abstract: The assertions of 'improved 4D consistency, camera control, and visual quality' and generalization to real-world applications are presented without reference to specific metrics, error bars, statistical controls, or baseline comparisons, preventing verification of the data-to-claim link.
minor comments (1)
  1. [Abstract] The project page link is referenced for results, code, and models; ensure the manuscript itself contains sufficient self-contained quantitative tables or figures so that claims can be assessed without external resources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and have revised the manuscript to strengthen the presentation of our claims and supporting evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results: The central generalization claim—that training exclusively on reconstructed multiview dynamic data confers robustness to the artifact distribution arising from monocular depth estimation on real-world videos with challenging trajectories—lacks supporting quantitative evidence such as ablations on noise level, trajectory difficulty, or failure-case breakdowns. This assumption is load-bearing for the headline result of improved 4D consistency on real inputs.

    Authors: We agree that the abstract would benefit from more direct quantitative support for the generalization claim. The manuscript demonstrates robustness via comprehensive evaluations on real monocular videos with varied trajectories and point cloud artifacts, but we acknowledge the value of targeted ablations. We have added new quantitative ablations on point cloud noise levels, trajectory difficulty variations, and failure-case breakdowns in the revised manuscript to directly substantiate the contribution of multiview training. revision: yes

  2. Referee: [Abstract] Abstract: The assertions of 'improved 4D consistency, camera control, and visual quality' and generalization to real-world applications are presented without reference to specific metrics, error bars, statistical controls, or baseline comparisons, preventing verification of the data-to-claim link.

    Authors: We concur that the abstract's high-level claims would be strengthened by explicit links to the underlying evidence. The detailed metrics, error bars, statistical controls, and baseline comparisons are provided in the Experiments section. We have revised the abstract to reference these specific results and comparisons more directly, improving the connection between claims and data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent training and evaluation claims

Full rationale

The paper describes Vista4D as an empirical framework that builds a 4D point cloud from input video via static segmentation and 4D reconstruction, then trains a model on reconstructed multiview dynamic data to achieve robustness at inference time on real videos. No equations, derivations, or parameter-fitting steps are presented that reduce the claimed 4D consistency, camera control, or generalization performance to quantities defined by the authors' own fitted values or self-citations. The central claims rest on the training distribution providing robustness, which is an empirical assumption rather than a self-referential reduction; the method is presented as a practical pipeline whose success is measured against external baselines and real-world applications, not by construction from its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the 4D point cloud representation and the training procedure; no explicit free parameters, mathematical axioms, or newly postulated physical entities are stated in the abstract. The 4D point cloud is treated as a constructed representation rather than an invented entity requiring independent evidence.

pith-pipeline@v0.9.0 · 5537 in / 1093 out tokens · 44335 ms · 2026-05-09T22:04:02.440164+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

92 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024.

  2. [2]

Wan: Open and advanced large-scale video generative models, 2025

Team Wan. Wan: Open and advanced large-scale video generative models, 2025.

  3. [3]

    Video models are zero-shot learners and reasoners, 2025

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners, 2025.

  4. [4]

    Hunyuanvideo: A systematic framework for large video generative models, 2025

    Weijie Kong et al. Hunyuanvideo: A systematic framework for large video generative models, 2025

  5. [5]

    World simulation with video foundation models for physical ai, 2025

    NVIDIA. World simulation with video foundation models for physical ai, 2025

  6. [6]

    Mochi 1, 2024

Genmo Team. Mochi 1, 2024.

  7. [7]

TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models

Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, 2025.

  8. [8]

GEN3C: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. GEN3C: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025.

  9. [9]

    EX-4D: Extreme viewpoint 4d video synthesis via depth watertight mesh, 2025

Tao Hu, Haoyang Peng, Xiao Liu, and Yuewen Ma. EX-4D: Extreme viewpoint 4d video synthesis via depth watertight mesh, 2025.

  10. [10]

    ReCamMaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. In ICCV, 2025.

  11. [11]

    Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025.

  12. [12]

Reangle-A-Video: 4d video generation as video-to-video translation

Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-A-Video: 4d video generation as video-to-video translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11164–11175, October 2025.

  13. [13]

NVS-Solver: Video diffusion model as zero-shot novel view synthesizer

Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. NVS-Solver: Video diffusion model as zero-shot novel view synthesizer. In ICLR, 2025.

  14. [14]

    Wristworld: Generating wrist-views via 4d world models for robotic manipulation, 2025

Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. Wristworld: Generating wrist-views via 4d world models for robotic manipulation, 2025.

  15. [15]

    Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In CVPR, pages 22831–22840, June 2025.

  16. [16]

    Depthcrafter: Generating consistent long depth sequences for open-world videos

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. In CVPR, pages 2005–2015, June 2025.

  17. [17]

    Geometrycrafter: Consistent geometry estimation for open-world videos with diffusion priors

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, and Ying Shan. Geometrycrafter: Consistent geometry estimation for open-world videos with diffusion priors. In ICCV, pages 6632–6644, October 2025.

  18. [18]

iNVS: Repurposing diffusion inpainters for novel view synthesis

Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, and Igor Gilitschenski. iNVS: Repurposing diffusion inpainters for novel view synthesis. In SIGGRAPH Asia, 2023.

  19. [19]

    Multidiff: Consistent novel view synthesis from a single image

Norman Müller, Katja Schwarz, Barbara Rössle, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. Multidiff: Consistent novel view synthesis from a single image. In CVPR, pages 10258–10268, June 2024.

  20. [20]

    Trajectory attention for fine-grained video motion control

Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. In ICLR, 2025.

  21. [21]

Generative camera dolly: Extreme monocular dynamic novel view synthesis

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV, 2024.

  22. [22]

    CamCloneMaster: Enabling reference-based camera control for video generation

Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Tianfan Xue. CamCloneMaster: Enabling reference-based camera control for video generation. In SIGGRAPH Asia, 2025.

  23. [23]

Stable virtual camera: Generative view synthesis with diffusion models

Jensen (Jinghao) Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint arXiv:2503.14489, 2025.

  24. [24]

    Spad: Spatially aware multi-view diffusers

Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. Spad: Spatially aware multi-view diffusers. In CVPR, 2024.

  25. [25]

Zero-1-to-3: Zero-shot novel view synthesis from a single image

Ruoshi Liu et al. Zero-1-to-3: Zero-shot novel view synthesis from a single image. arXiv preprint arXiv:2303.11328, 2023.

  26. [26]

    MVDream: Multi-view diffusion for 3D generation

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. In Proc. ICLR, 2024.

  27. [27]

SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion

Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024.

  28. [28]

    Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025.

  29. [29]

Vd3d: Taming large video diffusion transformers for 3d camera control

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. Proc. ICLR, 2025.

  30. [30]

SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency

Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470, 2024.

  31. [31]

    Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.

  32. [32]

    Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In CVPR, pages 21686–21697, 2024.

  33. [33]

    Global Structure-from-Motion Revisited

Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024.

  34. [34]

Flashdepth: Real-time streaming video depth estimation at 2k resolution

Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, and Paul Debevec. Flashdepth: Real-time streaming video depth estimation at 2k resolution. arXiv preprint arXiv:2504.07093, 2025.

  35. [35]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.

  36. [36]

    Unidepthv2: Universal monocular metric depth estimation made simpler

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025.

  37. [37]

    Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In CVPR, pages 10486–10496, June 2025.

  38. [38]

Uni4d: Unifying visual foundation models for 4d modeling from a single video

David Yifan Yao, Albert J. Zhai, and Shenlong Wang. Uni4d: Unifying visual foundation models for 4d modeling from a single video. In CVPR, pages 1116–1126, June 2025.

  39. [39]

    Vipe: Video pose engine for 3d geometric perception, 2025

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. Vipe: Video pose engine for 3d geometric perception, 2025.

  40. [40]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021.

  41. [41]

    Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

  42. [42]

    VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.

  43. [43]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  44. [44]

    π3: Scalable permutation-equivariant visual geometry learning, 2025

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning, 2025.

  45. [45]

Monst3r: A simple approach for estimating geometry in the presence of motion

Junyi Zhang et al. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.

  46. [46]

Streaming 4d visual geometry transformer

D. Zhuo et al. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025.

  47. [47]

Geo4d: Leveraging video generators for geometric 4d scene reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. arXiv preprint arXiv:2504.07961, 2025.

  48. [48]

MoSca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

Jiahui Lei, Yijia Weng, Adam W. Harley, Leonidas Guibas, and Kostas Daniilidis. MoSca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In CVPR, pages 6165–6177, June 2025.

  49. [49]

    Shape of motion: 4d reconstruction from a single video

Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. In ICCV, pages 9660–9672, October 2025.

  50. [50]

Gflow: Recovering 4d world from monocular video

Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, and Xinchao Wang. Gflow: Recovering 4d world from monocular video. In AAAI. AAAI Press, 2025. ISBN 978-1-57735-897-8.

  51. [51]

    SAM 2: Segment anything in images and videos, 2024

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos, 2024.

  52. [52]

    Grounding DINO 1.5: Advance the “edge” of open-set object detection, 2024

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, and Lei Zhang. Grounding DINO 1.5: Advance the “edge” of open-set object detection, 2024.

  53. [53]

    Grounded SAM: Assembling open-world models for diverse visual tasks, 2024

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks, 2024.

  54. [54]

Collaborative video diffusion: Consistent multi-video generation with camera control

Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. In Adv. Neural Inform. Process. Syst., volume 37, pages 16240–16271, 2024.

  55. [55]

    Cameractrl: Enabling camera control for video diffusion models

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025.

  56. [56]

Virtually being: Customizing camera-controllable video diffusion models with multi-view performance captures

Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel, Yiwei Zhao, Ryan Burgert, Mingming He, Oliver Hermann, Oliver Pilarski, Rahul Garg, Paul Debevec, and Ning Yu. Virtually being: Customizing camera-controllable video diffusion models with multi-view performance captures. In SIGGRAPH Asia, 2025.

  57. [57]

Superglue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, June 2020.

  58. [58]

    Superpoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshops, June 2018.

  59. [59]

Pippo: High-resolution multi-view humans from a single image

Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, and Timur Bagautdinov. Pippo: High-resolution multi-view humans from a single image. In CVPR, 2025.

  60. [60]

    Monocular dynamic view synthesis: A reality check

Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. In Adv. Neural Inform. Process. Syst., volume 35, pages 33768–33780. Curran Associates, Inc., 2022.

  61. [61]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, June 2024.

  62. [62]

Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness, 2025

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness, 2025.

  63. [63]

Flow matching for generative modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

  64. [64]

    Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, October 2023.

  65. [65]

OpenVid-1M: A large-scale high-quality dataset for text-to-video generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. In The Thirteenth International Conference on Learning Representations, 2025.

  66. [66]

    Recognize anything: A strong image tagging model

Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize Anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023.

  67. [67]

    The Llama 3 herd of models, 2024

Llama Team. The Llama 3 herd of models, 2024.

  68. [68]

    A benchmark dataset and evaluation methodology for video object segmentation

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, June 2016.

  69. [69]

    Pexels: Free stock photos & videos, 2025

Pexels. Pexels: Free stock photos & videos, 2025.

  70. [70]

    Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, and Ning Yu. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In CVPR, pages 13–23, June 2025.

  71. [71]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

  72. [72]

Towards accurate generative models of video: A new metric & challenges, 2019

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019.

  73. [73]

    Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Int. Conf. Machine Learn., volume 139 of Proceedings of Machine Learning Research, 2021.

  74. [74]

    Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In CVPR, 2025.

  75. [75]

Least-squares estimation of transformation parameters between two point patterns

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991.

  76. [76]

    CogVLM2: Visual language models for image and video understanding, 2024

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, and Jie Tang. CogVLM2: Visual language models for image and video understanding, 2024.

  77. [77]

    CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025.

  78. [78]

    PySceneDetect, 2025

Brandon Castellano. PySceneDetect, 2025.

  79. [79]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.

  80. [80]

Viser: Imperative, web-based 3d visualization in python, 2025

Brent Yi, Chung Min Kim, Justin Kerr, Gina Wu, Rebecca Feng, Anthony Zhang, Jonas Kulhanek, Hongsuk Choi, Yi Ma, Matthew Tancik, and Angjoo Kanazawa. Viser: Imperative, web-based 3d visualization in python, 2025.

Showing first 80 references.