Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion
Pith reviewed 2026-06-28 22:39 UTC · model grok-4.3
The pith
A complete 3D cache of foreground objects extracted from video supplies explicit geometric guidance that lets video diffusion models sustain consistency through large camera motions and heavy occlusions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By capturing the entire 3D volume of foreground entities rather than just their visible shells, the 3D cache extracted via lifting models such as SAM3D injects holistic spatial priors into the video diffusion model. A Soft Spatial-Aligned Injection mechanism together with minimal fine-tuning lets the model use this cache as geometric scaffolding while retaining its original priors. Masked normal maps serve as a cross-modal bridge for 3D-free data curation. The resulting framework produces videos that remain spatiotemporally consistent under large viewpoint shifts and severe occlusions, and it removes perspective ambiguities arising from structural holes, erroneous facades, reflections, and r
What carries the argument
The explicitly editable 3D cache produced by 3D lifting models, combined with the Soft Spatial-Aligned Injection mechanism that feeds it into the diffusion process.
If this is right
- Precise, decoupled control becomes possible over both camera trajectories and multi-entity motions.
- Spatiotemporal consistency is preserved under large camera shifts and severe occlusions where prior diffusion-only methods fail.
- Perspective ambiguities from structural holes, erroneous facades, reflections, and refractions are removed by separating geometry from appearance.
- The 3D cache serves as a scaffold that overcomes breakdowns caused by over-reliance on implicit diffusion priors.
Where Pith is reading between the lines
- Because the cache is explicitly editable, the same pipeline could support interactive video editing where a user modifies object shapes or positions before re-rendering.
- The separation of geometry and appearance might allow the 3D cache to be swapped between different video diffusion backbones without retraining the appearance model.
- If the lifting model can operate on partial or noisy input, the method could extend to real-time applications where only a few frames are available.
Load-bearing premise
The 3D lifting step produces an accurate, complete, and editable 3D representation of the scene objects that can be used directly as guidance without injecting its own geometric errors.
What would settle it
Generate videos from the same input sequences both with and without the 3D-cache injection in scenes containing rapid camera movement or full occlusions; if the version without the cache shows measurably fewer structural errors or collapses, the claim that the cache supplies essential complementary context would be falsified.
Figures
read the original abstract
While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi-wu-leo.github.io/real2sam2real
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Real2SAM2Real, a framework that uses 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache from input video as geometric scaffolding for Video Diffusion Models (VDMs). It introduces a Soft Spatial-Aligned Injection mechanism and a minimally invasive fine-tuning strategy, along with a masked normal map-based data curation pipeline, claiming to enable decoupled control over camera trajectories and multi-entity motions while overcoming structural collapse, perspective ambiguities, and inconsistencies under large camera shifts and occlusions.
Significance. If the central claims hold with proper validation, the use of generative 3D caches to supply holistic spatial priors could meaningfully advance controllable video synthesis by reducing over-reliance on implicit diffusion priors. The decoupling of geometry from appearance and the preservation of pre-trained VDM weights via minimal fine-tuning are conceptually attractive strengths that could influence follow-on work in 3D-aware generation.
major comments (2)
- [Abstract and §3] Abstract and method description: The strongest claim—that the 3D cache 'captures the entire 3D volume of foreground entities rather than just their visible shells' and 'eradicates perspective ambiguities caused by structural holes'—is load-bearing yet rests on the untested assumption that SAM3D produces accurate, complete, and error-free geometry; no quantitative reconstruction metrics, error bounds, or failure-case analysis (e.g., under occlusion or depth ambiguity) are supplied to show that injection errors remain below the threshold that would corrupt VDM conditioning.
- [Experiments] Experiments section: The abstract states that 'extensive experiments demonstrate the benefits' and that the method 'overcomes typical breakdowns,' but the manuscript supplies no quantitative results, baselines, ablation studies, or metrics (e.g., FVD, temporal consistency scores, or user studies) to support superiority or to isolate the contribution of the 3D cache versus the injection mechanism.
minor comments (2)
- [Method] The 'Soft Spatial-Aligned Injection' mechanism is described at a high level without an explicit equation, algorithm, or diagram showing how the 3D cache is aligned and injected into the diffusion process.
- [Implementation] Implementation details (e.g., exact SAM3D variant, cache resolution, fine-tuning hyperparameters, and data perturbation pipeline) are referenced but not specified, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and method description: The strongest claim—that the 3D cache 'captures the entire 3D volume of foreground entities rather than just their visible shells' and 'eradicates perspective ambiguities caused by structural holes'—is load-bearing yet rests on the untested assumption that SAM3D produces accurate, complete, and error-free geometry; no quantitative reconstruction metrics, error bounds, or failure-case analysis (e.g., under occlusion or depth ambiguity) are supplied to show that injection errors remain below the threshold that would corrupt VDM conditioning.
Authors: We agree that the claims regarding the 3D cache's completeness would be strengthened by direct validation of SAM3D outputs. The manuscript relies on SAM3D as a recent off-the-shelf lifting model without additional quantitative reconstruction metrics or explicit failure-case analysis. We will add a dedicated subsection in §3 (or a new appendix) reporting reconstruction metrics such as depth accuracy and completeness on held-out sequences, along with qualitative failure cases under occlusion and depth ambiguity, to bound the injection error. revision: yes
-
Referee: [Experiments] Experiments section: The abstract states that 'extensive experiments demonstrate the benefits' and that the method 'overcomes typical breakdowns,' but the manuscript supplies no quantitative results, baselines, ablation studies, or metrics (e.g., FVD, temporal consistency scores, or user studies) to support superiority or to isolate the contribution of the 3D cache versus the injection mechanism.
Authors: The current version emphasizes qualitative results and visual comparisons to demonstrate the framework's behavior. We acknowledge that quantitative metrics would better isolate contributions and support the claims. We will expand the Experiments section to include FVD scores, temporal consistency metrics, ablation studies on the injection mechanism, and a user study comparing against baselines. revision: yes
Circularity Check
No circularity detected; method relies on external components
full rationale
The paper describes a framework that extracts 3D caches using independent external 3D lifting models (e.g., SAM3D) and injects them into pre-trained VDMs via a new Soft Spatial-Aligned Injection mechanism and fine-tuning strategy. No equations, predictions, or central claims reduce by construction to fitted parameters or self-citations defined within the paper. The derivation chain is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or self-citation load-bearing patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption 3D lifting models such as SAM3D can extract an explicitly editable 3D cache capturing the full volume of foreground entities from video input.
invented entities (1)
-
Soft Spatial-Aligned Injection mechanism
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)
2025
-
[2]
arXiv preprint arXiv:2412.07760 (2024)
Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., Hu, H., Wan, P., Zhang, D.: Syncam- master: Synchronizing multi-camera video generation from diverse viewpoints. arXiv preprint arXiv:2412.07760 (2024)
-
[3]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Bin, Y ., Hu, W., Wang, H., Chen, X., Wang, B.: Normalcrafter: Learning temporally consistent normals from video diffusion priors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8330–8339 (2025)
2025
-
[4]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y ., English, Z., V oleti, V ., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y ., Richter, S.R., Koltun, V .: Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers
Cao, C., Zhou, J., Li, S., Liang, J., Yu, C., Wang, F., Xue, X., Fu, Y .: Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025)
2025
-
[7]
SAM 3: Segment Anything with Concepts
Carion, N., Gustafson, L., Hu, Y .T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V ., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
SAM 3D: 3Dfy Anything in Images
Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Cheng, T., Song, L., Ge, Y ., Liu, W., Wang, X., Shan, Y .: Yolo-world: Real-time open- vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16901–16911 (2024)
2024
-
[10]
arXiv preprint arXiv:2412.07759 (2024)
Fu, X., Liu, X., Wang, X., Peng, S., Xia, M., Shi, X., Yuan, Z., Wan, P., Zhang, D., Lin, D.: 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. arXiv preprint arXiv:2412.07759 (2024)
-
[11]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y ., Rubinstein, M., Sun, C., et al.: Motion prompting: Controlling video generation with motion trajectories. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1–12 (2025)
2025
-
[12]
https://github
Grupp, M.: evo: Python package for the evaluation of odometry and slam. https://github. com/MichaelGrupp/evo(2017)
2017
-
[13]
In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers
Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., et al.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–12 (2025)
2025
-
[14]
LTX-Video: Realtime Video Latent Diffusion
HaCohen, Y ., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
He, H., Xu, Y ., Guo, Y ., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Iclr1(2), 3 (2022) 10
Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 10
2022
-
[17]
arXiv preprint arXiv:2506.05554 (2025)
Hu, T., Peng, H., Liu, X., Ma, Y .: Ex-4d: Extreme viewpoint 4d video synthesis via depth watertight mesh. arXiv preprint arXiv:2506.05554 (2025)
-
[18]
ViPE: Video Pose Engine for 3D Geometric Perception
Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev, D., Lin, C.H., Ren, J., Xie, K., Biswas, J., Leal-Taixe, L., Fidler, S.: Vipe: Video pose engine for 3d geometric perception. In: NVIDIA Research Whitepapers arXiv:2508.10934 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Z., He, Y ., Yu, J., Zhang, F., Si, C., Jiang, Y ., Zhang, Y ., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
2024
-
[20]
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Hunyuan3D, T., Yang, S., Yang, M., Feng, Y ., Huang, X., Zhang, S., He, Z., Luo, D., Liu, H., Zhao, Y ., et al.: Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Depth Anything 3: Recovering the Visual Space from Any Views
Lin, H., Chen, S., Liew, J.H., Chen, D.Y ., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
arXiv preprint arXiv:2411.14208 (2024)
Liu, K., Shao, L., Lu, S.: Novel view extrapolation with video diffusion priors. arXiv preprint arXiv:2411.14208 (2024)
-
[24]
In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers
Luo, Y ., Shi, X., Bai, J., Xia, M., Xue, T., Wang, X., Wan, P., Zhang, D., Gai, K.: Camclone- master: Enabling reference-based camera control for video generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–10 (2025)
2025
-
[25]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3d creation on pose-free videos at scale. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2016–2029 (2025)
2016
-
[26]
In: Proceedings of the AAAI conference on artificial intelligence
Mou, C., Wang, X., Xie, L., Wu, Y ., Zhang, J., Qi, Z., Shan, Y .: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 4296–4304 (2024)
2024
-
[27]
In: European Conference on Computer Vision
Pan, L., Baráth, D., Pollefeys, M., Schönberger, J.L.: Global structure-from-motion revisited. In: European Conference on Computer Vision. pp. 58–77. Springer (2024)
2024
-
[28]
Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations
Pi, H., Guo, R., Shen, Z., Shuai, Q., Hu, Z., Wang, Z., Dong, Y ., Hu, R., Komura, T., Peng, S., et al.: Motion-2-to-3: Leveraging 2d motion data to boost 3d motion generation. arXiv preprint arXiv:2412.13111 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
Piccinelli, L., Sakaridis, C., Yang, Y .H., Segu, M., Li, S., Abbeloos, W., Van Gool, L.: Unidepthv2: Universal monocular metric depth estimation made simpler. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
2025
-
[30]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Piccinelli, L., Yang, Y .H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., Yu, F.: Unidepth: Universal monocular metric depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10106–10116 (2024)
2024
-
[31]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y ., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6121–6132 (2025)
2025
-
[32]
arXiv preprint arXiv:2509.15130 (2025)
Song, C., Yang, Y ., Zhao, T., Li, R., Zhang, C.: Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance. arXiv preprint arXiv:2509.15130 (2025)
-
[33]
In: European Conference on Computer Vision
Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., V ondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024) 11
2024
-
[34]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geom- etry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)
2025
-
[36]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Wang, R., Xu, S., Dong, Y ., Deng, Y ., Xiang, J., Lv, Z., Sun, G., Tong, X., Yang, J.: Moge-2: Ac- curate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
In: ACM SIGGRAPH 2024 Conference Papers
Wang, Z., Yuan, Z., Wang, X., Li, Y ., Chen, T., Xia, M., Luo, P., Shan, Y .: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)
2024
-
[38]
Direct3D-S2: Gigascale 3D generation made easy with spatial sparse attention
Wu, S., Lin, Y ., Zhang, F., Zeng, Y ., Yang, Y ., Bao, Y ., Qian, J., Zhu, S., Cao, X., Torr, P., et al.: Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412 (2025)
-
[39]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Xiang, J., Lv, Z., Xu, S., Deng, Y ., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21469–21480 (2025)
2025
-
[40]
arXiv preprint arXiv:2411.19324 (2024)
Xiao, Z., Ouyang, W., Zhou, Y ., Yang, S., Yang, L., Si, J., Pan, X.: Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324 (2024)
-
[41]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y ., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Ye, C., Wu, Y ., Lu, Z., Chang, J., Guo, X., Zhou, J., Zhao, H., Han, X.: Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 25050–25061 (2025)
2025
-
[43]
arXiv preprint arXiv:2405.15364 (2024)
You, M., Zhu, Z., Liu, H., Hou, J.: Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364 (2024)
-
[44]
In: Proceedings of the IEEE/CVF international conference on computer vision
Yu, M., Hu, W., Xing, J., Shan, Y .: Trajectorycrafter: Redirecting camera trajectory for monoc- ular videos via diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 100–111 (2025)
2025
-
[45]
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y ., Tian, Y .: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Zhang, D.J., Paiss, R., Zada, S., Karnad, N., Jacobs, D.E., Pritch, Y ., Mosseri, I., Shou, M.Z., Wadhwa, N., Ruiz, N.: Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2050–2062 (2025)
2050
-
[47]
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)
2023
-
[48]
arXiv preprint arXiv:2601.05138 (2026) 12
Zheng, S., Yin, M., Hu, W., Li, X., Shan, Y ., Fu, Y .: Versecrafter: Dynamic realistic video world model with 4d geometric control. arXiv preprint arXiv:2601.05138 (2026) 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.