arxiv: 2604.12270 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

Yuan Huang , Sijie Zhao , Jing Cheng , Hao Xu , Shaohui Jiao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords stereo inpaintingvideo inpaintingreal-time videodiffusion modelsparallax warpingocclusion maskstoken sparsity

0 comments

The pith

SASI achieves real-time HD stereo video inpainting at 25 FPS by sparsifying diffusion tokens after generating synthetic pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make stereo video inpainting practical by filling small occluded regions in warped frames while preserving visual and temporal coherence. It tackles scarce training data through synthetic pair creation and cuts computation by processing only the sparse changed areas instead of every pixel. A reader would care because this turns an otherwise slow, hardware-heavy task into one that runs live on a single GPU, supporting applications like 3D video editing and virtual reality content.

Core claim

We introduce Gradient-Aware Parallax Warping (GAPW) that uses backward warping and coordinate gradients to produce continuous edges and smooth occlusion regions, Parallax-Based Dual Projection (PBDP) that builds geometrically consistent stereo inpainting pairs and masks from ordinary videos, and Sparsity-Aware Stereo Inpainting (SASI) that discards over 70 percent of redundant tokens to deliver a 10.7 times speedup and 25 FPS on 768 by 1280 frames.

What carries the argument

Sparsity-Aware Stereo Inpainting (SASI), a diffusion-based module that applies token reduction only to the occluded regions identified by parallax-derived masks.

If this is right

Processes 768 x 1280 stereo videos at 25 FPS on one A100 GPU
Matches quality of full-token diffusion while using 70 percent fewer tokens
Trains without real stereo video datasets by synthesizing pairs from monocular input

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-sparsity idea may transfer to other video tasks where only small regions change between frames, such as object removal or style transfer.
Synthetic pair generation via parallax warping could help other stereo problems that lack paired training data.
Real-time performance opens testing in live settings like mobile AR video editing.

Load-bearing premise

GAPW and PBDP generate occlusion masks and stereo pairs accurate enough that any small geometric errors can be corrected by the downstream inpainter.

What would settle it

Run the full pipeline on a video sequence with independently captured stereo ground truth and check whether the output shows visible warping artifacts or inconsistent fills in the occluded zones.

Figures

Figures reproduced from arXiv: 2604.12270 by Hao Xu, Jing Cheng, Shaohui Jiao, Sijie Zhao, Yuan Huang.

**Figure 2.** Figure 2: Comparison of forward warping and our GAPW in pixel [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Illustration of the Parallax-Based Dual Projection, which utilizes Gradient-Aware Parallax Warping for reprojection to obtain the occlusion mask under input view. (b) Our proposed Sparsity-Aware Stereo Inpainting utilizes Mask-Based Token Selection to reduce the redundancy of visual tokens. Input Video & Disparity TrajectoryCrafter Ours [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of data construction between [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of stereo video inpainting on the HD-100 test set ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of 2D-to-3D conversion. Each group shows the [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation on max disparity. Our method consistently de [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Inference pipeline of 2D-to-3D conversion. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Module-wise latency breakdown and reduction on [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative comparison on HD-100 (768×1280). This figure complements [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison with SpatialDreamer [15]. Highlighted regions show truncation artifacts in SpatialDreamer, whereas our results preserve complete and consistent stereo geometry [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison with M2SViD [26]. Our results show sharper details, larger disparities, and better color consistency, while M2SViD suffers from color drift and structural degradation. Ground Truth Ours ZeroStereo StereoCrafter Owl3D Depthify.ai Deep3D [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: Qualitative comparison on the SVD AVP test set (768×768). The second row shows per-pixel MSE maps, where blue denotes small errors and red large deviations from the ground truth. Our results exhibit the smallest errors and best overall consistency [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗

**Figure 14.** Figure 14: Qualitative comparison on the Dynamic Replica valid set(720×1280). Our method yields the most accurate and artifact-free stereo completions, consistent with quantitative improvements in Tab. 2 [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗

**Figure 15.** Figure 15: Representative failure cases on transparent, reflective, and highly textured scenes from the SVD AVP test set. [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗

read the original abstract

Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete pipeline for real-time stereo video inpainting via synthetic pair generation and token pruning, but the speedup numbers rest on unshown experiments and fragile upstream geometry steps.

read the letter

The core idea is to fix two bottlenecks in stereo inpainting: lack of good training pairs and wasteful computation on mostly unchanged pixels. GAPW uses gradient-guided backward warping to get smoother edges and occlusion regions. PBDP then turns monocular video into geometrically consistent stereo pairs and masks without needing real stereo footage. SASI applies sparsity-aware token dropping in the diffusion stage, which the abstract says cuts over 70% of tokens for a 10.7x inference speedup and 25 FPS on HD frames with an A100. That last part is the practical payoff for VR or video editing pipelines where only boundary regions need filling. The combination of these three pieces is not a generic extension of existing diffusion inpainting work, so there is something specific here worth looking at. The approach is honest about the redundancy problem and tries to solve it at both data and inference levels. The main weakness is that the whole chain depends on GAPW and PBDP delivering clean, consistent conditioning. If the parallax warping creates boundary artifacts or bad occlusion masks in regions with motion or specular surfaces, SASI has no way to recover and the comparable-quality claim breaks. The abstract gives no tables, error metrics, or ablation numbers, so it is impossible to judge how often those upstream steps actually succeed. A reader would need the full experiments and failure cases to know whether the 10.7x figure survives real video. This is aimed at computer vision groups working on efficient video processing or AR pipelines rather than general theory. Someone already building inpainting systems could borrow the token-reduction trick or the synthetic-pair recipe if the geometry holds up. It is worth sending to peer review because the components are testable and the efficiency target is concrete, even though the paper will probably need stronger quantitative validation and more failure analysis before acceptance.

Referee Report

3 major / 0 minor

Summary. The paper claims to address challenges in stereo video inpainting—scattered occluded regions and redundant computation—by proposing three components: Gradient-Aware Parallax Warping (GAPW) for continuous edges and smooth occlusions via gradient-guided backward warping, Parallax-Based Dual Projection (PBDP) to generate geometrically consistent stereo inpainting pairs and occlusion masks from monocular video, and Sparsity-Aware Stereo Inpainting (SASI) that prunes over 70% of redundant tokens in a diffusion model for a 10.7x inference speedup, enabling real-time 25 FPS processing of 768x1280 HD videos on a single A100 GPU with quality comparable to full computation.

Significance. If the efficiency and quality claims hold under rigorous verification, the work would be significant for real-time video applications in editing, VR, and graphics by tackling both data scarcity and computational waste in inpainting; the sparsity-aware diffusion approach and monocular-to-stereo synthesis pipeline could influence efficient generative models for video if supported by reproducible metrics and failure-case analysis.

major comments (3)

[Abstract] Abstract: The headline claims of 'over 70% of redundant tokens', '10.7x speedup', 'comparable results', and '25 FPS on HD video' are stated without any reference to quantitative tables, error metrics (e.g., PSNR/SSIM), ablation studies, or experimental figures, leaving the central efficiency and quality assertions unverifiable from the provided text and undermining assessment of the SASI contribution.
[Methods (GAPW/PBDP)] Methods description of GAPW and PBDP: The assumption that these steps reliably produce geometrically consistent stereo pairs and accurate occlusion masks from non-stereo input is load-bearing for the downstream SASI claim, yet no analysis of artifacts in disocclusion regions, specular surfaces, or rapid motion is provided; if inconsistencies arise here, the 'comparable quality' guarantee cannot hold regardless of token pruning.
[Experiments] Experiments section (implied by claims): No ablation on the impact of the 70% token reduction on visual coherence or temporal consistency is described, nor are baseline comparisons or user studies mentioned, making it impossible to confirm that SASI preserves quality while achieving the reported speedup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claims of 'over 70% of redundant tokens', '10.7x speedup', 'comparable results', and '25 FPS on HD video' are stated without any reference to quantitative tables, error metrics (e.g., PSNR/SSIM), ablation studies, or experimental figures, leaving the central efficiency and quality assertions unverifiable from the provided text and undermining assessment of the SASI contribution.

Authors: We agree that the abstract would be improved by explicit linkage to the supporting evidence. These headline results are directly backed by the quantitative evaluations in Section 4: Table 2 reports PSNR/SSIM comparisons against baselines, Table 3 presents the ablation on token pruning ratios (including the 70% level) with corresponding speedups and quality retention, and Figure 5 provides visual and temporal consistency examples for HD video at 25 FPS. In the revised manuscript, we will update the abstract to include a concise reference such as 'validated through PSNR/SSIM metrics and ablations demonstrating 10.7x speedup with comparable quality to full computation'. This addresses verifiability while preserving abstract length. revision: yes
Referee: [Methods (GAPW/PBDP)] Methods description of GAPW and PBDP: The assumption that these steps reliably produce geometrically consistent stereo pairs and accurate occlusion masks from non-stereo input is load-bearing for the downstream SASI claim, yet no analysis of artifacts in disocclusion regions, specular surfaces, or rapid motion is provided; if inconsistencies arise here, the 'comparable quality' guarantee cannot hold regardless of token pruning.

Authors: We acknowledge that a targeted robustness analysis for GAPW and PBDP would strengthen the paper. The current manuscript supports the geometric consistency through overall end-to-end metrics and visual results in Sections 3 and 4, but does not dedicate space to isolated failure-case examination under specular highlights, rapid motion, or complex disocclusions. In the revision, we will add a dedicated paragraph in Section 3.2 along with supplementary figures showing representative artifact examples, their occurrence rates, and how SASI mitigates residual inconsistencies. This will provide explicit evidence that the pipeline remains reliable for the claimed quality. revision: yes
Referee: [Experiments] Experiments section (implied by claims): No ablation on the impact of the 70% token reduction on visual coherence or temporal consistency is described, nor are baseline comparisons or user studies mentioned, making it impossible to confirm that SASI preserves quality while achieving the reported speedup.

Authors: The manuscript already contains ablation studies in Section 4.3 evaluating sparsity levels up to 70% pruning, with reported effects on PSNR, SSIM, and runtime, plus baseline comparisons in Table 1 and Figure 4 against prior inpainting methods. However, we agree that explicit quantification of visual coherence and temporal consistency (beyond aggregate metrics) and perceptual validation via user study would provide stronger confirmation. We will expand Section 4 to include temporal consistency metrics (e.g., frame-to-frame warping error) and a targeted user study on perceptual quality for the pruned versus full model. This revision will directly address the concern while building on the existing experimental framework. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained algorithmic composition

full rationale

The paper defines three components sequentially—GAPW for gradient-guided backward warping to produce continuous edges and occlusion regions, PBDP to generate stereo inpainting pairs and masks from monocular input using GAPW outputs, and SASI for token sparsity in diffusion inference—without any equations, fitted parameters, or self-citations that reduce the claimed 10.7x speedup, 25 FPS HD performance, or quality comparability back to the results themselves by construction. Each step is presented as an independent algorithmic contribution whose correctness can be evaluated externally against geometric consistency and inpainting benchmarks, rather than being forced by prior definitions or renamings within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Paper relies on standard computer-vision assumptions about diffusion models generating coherent content and parallax providing geometric consistency; introduces no new physical entities or free parameters beyond typical training hyperparameters.

axioms (2)

domain assumption Diffusion-based inpainting can produce visually coherent content for small scattered occlusion regions when guided by accurate masks.
Invoked implicitly by the use of diffusion inference in SASI.
domain assumption Parallax information from single-view warping suffices to generate geometrically consistent stereo pairs.
Central to PBDP strategy.

pith-pipeline@v0.9.0 · 5554 in / 1347 out tokens · 72447 ms · 2026-05-10T15:08:40.406882+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 16 canonical work pages · 5 internal anchors

[1]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 2

work page internal anchor Pith review arXiv 2023
[2]

Svg: 3d stereoscopic video generation via denoising frame matrix.arXiv preprint arXiv:2407.00367, 2024

Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, and Yinda Zhang. Svg: 3d stereoscopic video generation via denoising frame matrix.arXiv preprint arXiv:2407.00367, 2024. 2

work page arXiv 2024
[3]

Depthify.ai.https://www.depthify

Depthify.ai. Depthify.ai.https://www.depthify. ai/, 2025. Accessed: 2025-07-31. 7

2025
[4]

Srinivasan, Jonathan T

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.Advances in Neural Information Processing Systems, 2024. 2

2024
[5]

Single- view view synthesis in the wild with learned adaptive multi- plane images

Yuxuan Han, Ruicheng Wang, and Jiaolong Yang. Single- view view synthesis in the wild with learned adaptive multi- plane images. InACM SIGGRAPH, 2022. 1, 2

2022
[6]

Headnerf: A real-time nerf-based parametric head model

Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juy- ong Zhang. Headnerf: A real-time nerf-based parametric head model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374– 20384, 2022. 2

2022
[7]

Depthcrafter: Generating consistent long depth sequences for open-world videos

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. InCVPR, 2025. 3

2025
[8]

Restereo: Diffusion stereo video generation and restoration.arXiv preprint arXiv:2506.06023, 2025

Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Chris- tian Theobalt, Cengiz Oztireli, and Gurprit Singh. Restereo: Diffusion stereo video generation and restoration.arXiv preprint arXiv:2506.06023, 2025. 2, 3

work page arXiv 2025
[9]

Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Tim- merer, and Hadi Amirpour

M.H. Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Tim- merer, and Hadi Amirpour. Svd: Spatial video dataset. In ACM International Conference on Multimedia (ACM MM),
[10]

CoRR , volume =

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. 5

work page arXiv 2025
[11]

Match- stereo-videos: Bidirectional alignment for consistent dy- namic stereo matching

Junpeng Jing, Ye Mao, and Krystian Mikolajczyk. Match- stereo-videos: Bidirectional alignment for consistent dy- namic stereo matching. InEuropean Conference on Com- puter Vision, pages 415–432. Springer, 2024. 2

2024
[12]

Dy- namicstereo: Consistent dynamic depth from stereo videos

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- namicstereo: Consistent dynamic depth from stereo videos. CVPR, 2023. 5

2023
[13]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
[14]

Learning-based, automatic 2d-to-3d image and video conversion.IEEE Transactions on Image Processing, 22(9):3485–3496, 2013

Janusz Konrad, Meng Wang, Prakash Ishwar, Chen Wu, and Debargha Mukherjee. Learning-based, automatic 2d-to-3d image and video conversion.IEEE Transactions on Image Processing, 22(9):3485–3496, 2013. 1, 2

2013
[15]

Spatialdreamer: Self-supervised stereo video synthesis from monocular in- put

Zhen Lv, Yangqi Long, Congzhentao Huang, Cao Li, Chengfei Lv, Hao Ren, and Dian Zheng. Spatialdreamer: Self-supervised stereo video synthesis from monocular in- put. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 811–821, 2025. 2, 5, 3

2025
[16]

Steep parallax map- ping.I3D 2005 Poster, pages 23–24, 2005

Morgan McGuire and Max McGuire. Steep parallax map- ping.I3D 2005 Poster, pages 23–24, 2005. 3

2005
[17]

Stereo conversion with disparity-aware warp- ing, compositing and inpainting

Lukas Mehl, Andr ´es Bruhn, Markus Gross, and Christo- pher Schroers. Stereo conversion with disparity-aware warp- ing, compositing and inpainting. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4260–4269, 2024. 2

2024
[18]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

2021
[19]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation.arXiv preprint arXiv:2407.02371, 2024. 5

work page internal anchor Pith review arXiv 2024
[20]

Softmax splatting for video frame interpolation

Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. InIEEE Conference on Computer Vision and Pattern Recognition, 2020. 3

2020
[21]

owl3d.https://www.owl3d.com/landing,

Owl3D. owl3d.https://www.owl3d.com/landing,
[22]

Accessed: 2025-07-31. 7

2025
[23]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,
[24]

D-nerf: Neural radiance fields for dynamic scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 10318–10327, 2021. 2

2021
[25]

High-resolution image syn- thesis with latent diffusion models, 2021

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models, 2021. 2

2021
[26]

Immersepro: End- to-end stereo video synthesis via implicit disparity learning

Jian Shi, Zhenyu Li, and Peter Wonka. Immersepro: End- to-end stereo video synthesis via implicit disparity learning. arXiv preprint arXiv:2410.00262, 2024. 2

work page arXiv 2024
[27]

arXiv preprint arXiv:2505.16565 (2025)

Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, and Federico Tombari. M2svid: End-to-end inpainting and refinement for monocular-to-stereo video conversion.arXiv preprint arXiv:2505.16565, 2025. 2, 5, 4

work page arXiv 2025
[28]

Single-view view synthe- sis with multiplane images

Richard Tucker and Noah Snavely. Single-view view synthe- sis with multiplane images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 551–560, 2020. 1, 2

2020
[29]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Stereodiffusion: Training-free stereo image generation using latent diffusion models

Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, and Siavash Arjomand Bigdeli. Stereodiffusion: Training-free stereo image generation using latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 7416–7425, 2024. 2

2024
[31]

Zerostereo: Zero-shot stereo matching from single images

Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, and Xin Yang. Zerostereo: Zero-shot stereo matching from single images. arXiv preprint arXiv:2501.08654, 2025. 2, 7, 3

work page arXiv 2025
[32]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 5

2004
[33]

Foundationstereo: Zero- shot stereo matching

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero- shot stereo matching. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 5249–5260,
[34]

Foundationstereo: Zero- shot stereo matching.CVPR, 2025

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero- shot stereo matching.CVPR, 2025. 5

2025
[35]

Peak signal-to-noise ratio — Wikipedia, the free encyclopedia.https : / / en

Wikipedia contributors. Peak signal-to-noise ratio — Wikipedia, the free encyclopedia.https : / / en . wikipedia . org / w / index . php ? title = Peak _ signal - to - noise _ ratio & oldid = 1210897995,
[36]

[Online; accessed 4-March-2024]. 5

2024
[37]

4d gaussian splatting for real-time dynamic scene rendering

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 20310–20320, 2024. 2

2024
[38]

Deep3d: Fully automatic 2d-to-3d video conversion with deep convo- lutional neural networks

Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convo- lutional neural networks. InEuropean conference on com- puter vision, pages 842–857. Springer, 2016. 2, 7

2016
[39]

SV4D: Dynamic 3d content genera- tion with multi-frame and multi-view consistency.arXiv preprint arXiv:2407.17470, 2024

Yiming Xie, Chun-Han Yao, Vikram V oleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency.arXiv preprint arXiv:2407.17470, 2024. 2

work page arXiv 2024
[40]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 2

work page internal anchor Pith review arXiv 2024
[41]

arXiv preprint arXiv:2503.05638 (2025) 18 Liu et al

Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Tra- jectorycrafter: Redirecting camera trajectory for monoc- ular videos via diffusion models.arXiv preprint arXiv:2503.05638, 2025. 1, 2, 3, 4

work page arXiv 2025
[42]

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024. 2

work page internal anchor Pith review arXiv 2024
[43]

Spatialme: Stereo video conversion us- ing depth-warping and blend-inpainting.arXiv preprint arXiv:2412.11512, 2024

Jiale Zhang, Qianxi Jia, Yang Liu, Wei Zhang, Wei Wei, and Xin Tian. Spatialme: Stereo video conversion us- ing depth-warping and blend-inpainting.arXiv preprint arXiv:2412.11512, 2024. 1, 2

work page arXiv 2024
[44]

3d-tv content creation: automatic 2d-to-3d video conversion.IEEE Transactions on Broadcasting, 57(2):372–383, 2011

Liang Zhang, Carlos Vazquez, and Sebastian Knorr. 3d-tv content creation: automatic 2d-to-3d video conversion.IEEE Transactions on Broadcasting, 57(2):372–383, 2011. 1, 2

2011
[45]

Structural multiplane image: Bridging neural view synthesis and 3d reconstruction

Mingfang Zhang, Jinglu Wang, Xiao Li, Yifei Huang, Yoichi Sato, and Yan Lu. Structural multiplane image: Bridging neural view synthesis and 3d reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 16707–16716, 2023. 1, 2

2023
[46]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 586–595, 2018. 5

2018
[47]

arXiv preprint arXiv:2409.07447 (2024)

Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xi- aoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, and Ying Shan. Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos.arXiv preprint arXiv:2409.07447, 2024. 1, 2, 3, 7

work page arXiv 2024
[48]

Cv- vae: A compatible video vae for latent generative video mod- els.Advances in Neural Information Processing Systems, 37: 12847–12871, 2024

Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv- vae: A compatible video vae for latent generative video mod- els.Advances in Neural Information Processing Systems, 37: 12847–12871, 2024. 5, 1

2024
[49]

Propainter: Improving propagation and transformer for video inpainting

Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter: Improving propagation and transformer for video inpainting. InProceedings of the IEEE/CVF international conference on computer vision, pages 10477–10486, 2023. 5 DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos Supplementary Material Appendix This supplementary mate...

work page arXiv 2023