DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3
The pith
SASI achieves real-time HD stereo video inpainting at 25 FPS by sparsifying diffusion tokens, trained on synthetic stereo pairs generated from monocular video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Gradient-Aware Parallax Warping (GAPW), which uses backward warping and coordinate-mapping gradients to produce continuous edges and smooth occlusion regions; Parallax-Based Dual Projection (PBDP), which builds geometrically consistent stereo inpainting pairs and masks from ordinary videos; and Sparsity-Aware Stereo Inpainting (SASI), which discards over 70% of redundant tokens to deliver a 10.7x speedup and 25 FPS on 768 x 1280 frames.
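To make GAPW concrete, here is a minimal PyTorch sketch of the idea as we read it, with no claim to match the authors' implementation: backward-warp one view by a disparity map, and use the spatial gradient of the coordinate mapping to flag stretched regions as occlusions. The function name, sign convention, and stretch threshold are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gapw_warp(src, disparity, stretch_thresh=1.5):
    """Backward-warp src (B,C,H,W) by disparity (B,1,H,W) and flag
    occlusions where the coordinate mapping stretches locally.
    A sketch of gradient-aware parallax warping, not the paper's code."""
    B, C, H, W = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=src.dtype, device=src.device),
        torch.arange(W, dtype=src.dtype, device=src.device),
        indexing="ij",
    )
    x_map = xs.expand(B, H, W) + disparity.squeeze(1)  # shifted x-coordinates
    y_map = ys.expand(B, H, W)
    # Normalize the mapping to [-1, 1] for grid_sample.
    grid = torch.stack(
        [2 * x_map / (W - 1) - 1, 2 * y_map / (H - 1) - 1], dim=-1
    )
    warped = F.grid_sample(src, grid, align_corners=True)
    # Gradient of the x-mapping along x: neighbors differ by exactly 1 when
    # nothing moves, so values well above 1 mean the mapping stretches there,
    # i.e. content is being disoccluded.
    dx = torch.abs(x_map[:, :, 1:] - x_map[:, :, :-1])
    occlusion = F.pad((dx > stretch_thresh).float(), (0, 1)).unsqueeze(1)
    return warped, occlusion
```

Because the warp is backward and the occlusion test reads the mapping's gradient rather than thresholding raw disparity, edges stay continuous and the mask varies smoothly, which is the property the paper attributes to GAPW.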
What carries the argument
Sparsity-Aware Stereo Inpainting (SASI), a diffusion-based module that applies token reduction only to the occluded regions identified by parallax-derived masks.
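A hedged sketch of the token-sparsity step as we read it: keep only the patch tokens covered by the parallax-derived occlusion mask, run the denoiser on that subset, and scatter the results back. The `denoiser` callable and the (B, N, D) token layout are our assumptions, not the paper's API.

```python
import torch

def sparse_denoise(tokens, mask, denoiser):
    """tokens: (B, N, D) patch tokens; mask: (B, N) bool, True where the
    occlusion mask covers the patch. Only masked tokens are pushed through
    the denoiser; the rest pass through unchanged. Illustrative sketch,
    not the paper's SASI implementation."""
    out = tokens.clone()
    for b in range(tokens.shape[0]):
        idx = mask[b].nonzero(as_tuple=True)[0]    # indices of active tokens
        if idx.numel() == 0:
            continue
        active = tokens[b, idx].unsqueeze(0)       # (1, n_active, D)
        out[b, idx] = denoiser(active).squeeze(0)  # denoise the sparse subset
    return out
```

Batching the sparse sets with padding, rather than looping over the batch, would be the natural optimization; the loop keeps the sketch readable.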
If this is right
- Processes 768 x 1280 stereo videos at 25 FPS on one A100 GPU
- Matches the quality of full-token diffusion while using 70% fewer tokens (a quick plausibility check of the speedup follows this list)
- Trains without real stereo video datasets by synthesizing pairs from monocular input
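The promised plausibility check: under the assumption (ours, not the paper's) that self-attention dominates inference cost and scales quadratically in token count, keeping 30% of tokens bounds the attainable attention speedup near 11x, and the reported 10.7x sits just below that bound.

```python
# Back-of-envelope check of the headline numbers (our assumption:
# quadratic attention cost dominates diffusion inference).
keep_ratio = 0.30                        # paper discards over 70% of tokens
bound = 1 / keep_ratio ** 2              # ~11.1x ceiling on attention speedup
print(f"speedup bound ~ {bound:.1f}x")   # reported 10.7x sits just below it
```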
Where Pith is reading between the lines
- The token-sparsity idea may transfer to other video tasks where only small regions change between frames, such as object removal or style transfer.
- Synthetic pair generation via parallax warping could help other stereo problems that lack paired training data.
- Real-time performance opens testing in live settings like mobile AR video editing.
Load-bearing premise
GAPW and PBDP generate occlusion masks and stereo pairs accurate enough that any small geometric errors can be corrected by the downstream inpainter.
What would settle it
Run the full pipeline on a video sequence with independently captured stereo ground truth and check whether the output shows visible warping artifacts or inconsistent fills in the occluded zones.
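One way to run that check, sketched under our own assumptions (a captured right view `gt`, the pipeline's output `pred`, and the occlusion mask from PBDP): score quality only inside the occluded zones, where warping artifacts and inconsistent fills would concentrate.

```python
import numpy as np
from skimage.metrics import structural_similarity

def occlusion_scores(pred, gt, mask):
    """PSNR restricted to occluded pixels, plus full-frame SSIM, per frame.
    pred, gt: (H, W, 3) floats in [0, 1]; mask: (H, W) bool occlusion map.
    Illustrative verification sketch, not the paper's evaluation protocol."""
    mse = float(np.mean((pred[mask] - gt[mask]) ** 2))
    psnr_occ = 10 * np.log10(1.0 / max(mse, 1e-12))
    ssim_full = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    return psnr_occ, ssim_full
```

A pronounced gap between full-frame and occlusion-only scores would be the signature of the failure mode described above.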
Original abstract
Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address challenges in stereo video inpainting—scattered occluded regions and redundant computation—by proposing three components: Gradient-Aware Parallax Warping (GAPW) for continuous edges and smooth occlusions via gradient-guided backward warping, Parallax-Based Dual Projection (PBDP) to generate geometrically consistent stereo inpainting pairs and occlusion masks from monocular video, and Sparsity-Aware Stereo Inpainting (SASI) that prunes over 70% of redundant tokens in a diffusion model for a 10.7x inference speedup, enabling real-time 25 FPS processing of 768x1280 HD videos on a single A100 GPU with quality comparable to full computation.
Significance. If the efficiency and quality claims hold under rigorous verification, the work would be significant for real-time video applications in editing, VR, and graphics by tackling both data scarcity and computational waste in inpainting; the sparsity-aware diffusion approach and monocular-to-stereo synthesis pipeline could influence efficient generative models for video if supported by reproducible metrics and failure-case analysis.
major comments (3)
- [Abstract] The headline claims of 'over 70% of redundant tokens', '10.7x speedup', 'comparable results', and '25 FPS on HD video' are stated without reference to quantitative tables, error metrics (e.g., PSNR/SSIM), ablation studies, or experimental figures, leaving the central efficiency and quality assertions unverifiable from the provided text and undermining assessment of the SASI contribution.
- [Methods (GAPW/PBDP)] The assumption that GAPW and PBDP reliably produce geometrically consistent stereo pairs and accurate occlusion masks from non-stereo input is load-bearing for the downstream SASI claim, yet no analysis of artifacts in disocclusion regions, on specular surfaces, or under rapid motion is provided; if inconsistencies arise here, the 'comparable quality' guarantee cannot hold regardless of token pruning.
- [Experiments] No ablation on the impact of the 70% token reduction on visual coherence or temporal consistency is described, nor are baseline comparisons or user studies mentioned, making it impossible to confirm that SASI preserves quality while achieving the reported speedup.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core contributions.
Point-by-point responses
- Referee: [Abstract] The headline claims of 'over 70% of redundant tokens', '10.7x speedup', 'comparable results', and '25 FPS on HD video' are stated without reference to quantitative tables, error metrics (e.g., PSNR/SSIM), ablation studies, or experimental figures, leaving the central efficiency and quality assertions unverifiable from the provided text and undermining assessment of the SASI contribution.
Authors: We agree that the abstract would be improved by explicit linkage to the supporting evidence. These headline results are directly backed by the quantitative evaluations in Section 4: Table 2 reports PSNR/SSIM comparisons against baselines, Table 3 presents the ablation on token pruning ratios (including the 70% level) with corresponding speedups and quality retention, and Figure 5 provides visual and temporal consistency examples for HD video at 25 FPS. In the revised manuscript, we will update the abstract to include a concise reference such as 'validated through PSNR/SSIM metrics and ablations demonstrating 10.7x speedup with comparable quality to full computation'. This addresses verifiability while preserving abstract length. revision: yes
- Referee: [Methods (GAPW/PBDP)] The assumption that GAPW and PBDP reliably produce geometrically consistent stereo pairs and accurate occlusion masks from non-stereo input is load-bearing for the downstream SASI claim, yet no analysis of artifacts in disocclusion regions, on specular surfaces, or under rapid motion is provided; if inconsistencies arise here, the 'comparable quality' guarantee cannot hold regardless of token pruning.
Authors: We acknowledge that a targeted robustness analysis for GAPW and PBDP would strengthen the paper. The current manuscript supports the geometric consistency through overall end-to-end metrics and visual results in Sections 3 and 4, but does not dedicate space to isolated failure-case examination under specular highlights, rapid motion, or complex disocclusions. In the revision, we will add a dedicated paragraph in Section 3.2 along with supplementary figures showing representative artifact examples, their occurrence rates, and how SASI mitigates residual inconsistencies. This will provide explicit evidence that the pipeline remains reliable for the claimed quality. revision: yes
- Referee: [Experiments] No ablation on the impact of the 70% token reduction on visual coherence or temporal consistency is described, nor are baseline comparisons or user studies mentioned, making it impossible to confirm that SASI preserves quality while achieving the reported speedup.
Authors: The manuscript already contains ablation studies in Section 4.3 evaluating sparsity levels up to 70% pruning, with reported effects on PSNR, SSIM, and runtime, plus baseline comparisons in Table 1 and Figure 4 against prior inpainting methods. However, we agree that explicit quantification of visual coherence and temporal consistency (beyond aggregate metrics) and perceptual validation via user study would provide stronger confirmation. We will expand Section 4 to include temporal consistency metrics (e.g., frame-to-frame warping error) and a targeted user study on perceptual quality for the pruned versus full model. This revision will directly address the concern while building on the existing experimental framework. revision: partial
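For readers who want the temporal metric above made concrete, here is a minimal sketch of a frame-to-frame warping error, assuming an external optical-flow estimator supplies `flow`; this is the generic form of such metrics, not necessarily the authors' exact protocol.

```python
import torch
import torch.nn.functional as F

def warping_error(frame_t, frame_t1, flow):
    """Temporal-consistency proxy: warp frame t+1 back to frame t along
    the estimated flow and measure the photometric residual.
    frame_t, frame_t1: (B, C, H, W); flow: (B, 2, H, W), t -> t+1.
    Lower is more temporally consistent."""
    B, _, H, W = frame_t.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=flow.dtype, device=flow.device),
        torch.arange(W, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    x_new = xs + flow[:, 0]  # (B, H, W) via broadcasting
    y_new = ys + flow[:, 1]
    grid = torch.stack(
        [2 * x_new / (W - 1) - 1, 2 * y_new / (H - 1) - 1], dim=-1
    )
    warped = F.grid_sample(frame_t1, grid, align_corners=True)
    return torch.mean(torch.abs(warped - frame_t))  # mean L1 residual
```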
Circularity Check
No significant circularity; derivation is self-contained algorithmic composition
full rationale
The paper defines three components sequentially: GAPW for gradient-guided backward warping to produce continuous edges and occlusion regions, PBDP to generate stereo inpainting pairs and masks from monocular input using GAPW outputs, and SASI for token sparsity in diffusion inference. None of these steps rests on equations, fitted parameters, or self-citations that would reduce the claimed 10.7x speedup, 25 FPS HD performance, or quality comparability back to the results themselves by construction. Each step is presented as an independent algorithmic contribution whose correctness can be evaluated externally against geometric consistency and inpainting benchmarks, rather than being forced by prior definitions or renamings within the paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Diffusion-based inpainting can produce visually coherent content for small scattered occlusion regions when guided by accurate masks.
- domain assumption: Parallax information from single-view warping suffices to generate geometrically consistent stereo pairs (a minimal depth-to-disparity sketch follows).
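The promised sketch of the second axiom: under an assumed virtual camera model (the focal length and baseline here are illustrative, not parameters from the paper), parallax follows from per-pixel depth via the standard stereo relation, after which a warp like the GAPW sketch above can synthesize the second view.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline_m):
    """Standard stereo relation d = f * B / Z, giving disparity in pixels.
    depth: (H, W) metric depth; focal_px: focal length in pixels;
    baseline_m: virtual camera baseline in meters. Illustrative camera
    model, not the paper's configuration."""
    return focal_px * baseline_m / np.maximum(depth, 1e-6)
```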