arxiv: 2605.31466 · v1 · pith:7MCCMWDQnew · submitted 2026-05-29 · 💻 cs.CV

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

Pith reviewed 2026-06-28 23:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords single-view 3D reconstruction · amodal completion · volumetric generation · latent diffusion · 3D VAE · scene understanding · truncated distance functions

The pith

VolFill generates complete 3D scene geometry from one RGB image by denoising volumetric latents conditioned on foundation-model priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VolFill as a generative model that outputs the full 3D structure of a scene instead of regressing only the visible surfaces. A hybrid 3D VAE encodes sparse truncated unsigned distance function grids into a compact latent space, and a latent Diffusion Transformer then denoises that latent to recover the missing parts. Conditioning on geometry foundation models supplies spatial priors that guide inference where pixels give no direct evidence. The resulting structured representation supports direct surface extraction and occupancy queries at scale. The abstract reports that experiments on SCRREAM and NRGB-D significantly outperform prior single-view baselines.

Core claim

VolFill predicts the 3D structure of the complete scene from a single RGB image by first compressing truncated unsigned distance function grids into a latent space with a hybrid 3D VAE and then denoising that latent with a Diffusion Transformer conditioned on geometry foundation models, yielding a structured volumetric output that supports surface extraction and occupancy queries.
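
The title and the Figure 3 caption say the latent denoiser is trained with a flow-matching objective. A minimal sketch of what that usually means, assuming the common linear (rectified-flow style) path between noise and data; the paper's actual schedule, latent shape, and conditioning are not given in this text, so `flow_matching_pair`, `euler_sample`, the `(4, 4, 4, 8)` latent shape, and the oracle velocity are all illustrative stand-ins:

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training example for a linear-path flow-matching loss.

    x1 is a clean latent (hypothetical shape); x0 is Gaussian noise.
    Returns the interpolated latent x_t, the time t, and the velocity target.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()                  # t in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1      # straight line from noise to data
    v_target = x1 - x0                 # constant velocity along that line
    return x_t, t, v_target

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = np.ones((4, 4, 4, 8))             # stand-in for a clean VAE latent
x_t, t, v_target = flow_matching_pair(x1, rng)

# Oracle velocity for a single known target; a trained network
# (assumed, not shown) would replace this at inference time.
def oracle_velocity(x, t):
    return (x1 - x) / (1.0 - t)

x_gen = euler_sample(oracle_velocity, rng.standard_normal(x1.shape))
```

In training, a network would regress `v_target` from `(x_t, t)` plus the conditioning tokens; the oracle here only shows that integrating the correct velocity field carries noise to the target latent.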

What carries the argument

Hybrid 3D VAE plus latent Diffusion Transformer that compresses and denoises truncated unsigned distance function grids while conditioned on geometry foundation models.
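
The appendix excerpt attached to Figure 7 describes the representation as a grid in which each voxel stores its distance to the nearest surface, clipped to a maximum of τ voxels. A small sketch of that definition, assuming surface samples are available as points in voxel coordinates; the brute-force nearest-neighbor search, the function name `tudf_from_points`, and the toy plane are illustrative choices, not the authors' pipeline:

```python
import numpy as np

def tudf_from_points(surface_pts, grid_res, tau):
    """Truncated unsigned distance, in voxel units, on a grid_res^3 grid.

    surface_pts: (N, 3) surface samples in voxel coordinates.
    tau: truncation distance in voxels.
    """
    axis = np.arange(grid_res) + 0.5   # voxel centers along one axis
    centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    flat = centers.reshape(-1, 3)
    # Brute-force distance from every voxel center to its nearest sample.
    diff = flat[:, None, :] - surface_pts[None, :, :]
    d = np.linalg.norm(diff, axis=-1).min(axis=1)
    return np.minimum(d, tau).reshape(grid_res, grid_res, grid_res)

# Toy scene: a wall at x = 4.5 inside an 8^3 grid.
jk = np.arange(8) + 0.5
yy, zz = np.meshgrid(jk, jk, indexing="ij")
plane = np.stack([np.full(yy.size, 4.5), yy.ravel(), zz.ravel()], axis=1)

tudf = tudf_from_points(plane, grid_res=8, tau=3.0)
active = tudf < 3.0   # voxels inside the truncation band form the sparse set
```

Only voxels inside the truncation band carry information, which is one way to read the "sparse" in "sparse TUDF grids": everything farther than τ from a surface holds the same clipped value.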

If this is right

  • The output supports direct surface extraction and occupancy queries without additional per-ray or point-cloud processing.
  • The method scales to full scenes rather than being limited to object-centric reconstruction.
  • Performance gains on SCRREAM and NRGB-D indicate better handling of amodal completion than regression-based baselines.
  • The latent representation can be queried at arbitrary resolution after generation.
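
One way to picture the occupancy-query consequence is to interpolate a decoded distance grid at continuous points and threshold the result. This is a hedged sketch only: the text here does not say how VolFill answers queries, and `query_tudf`, the 0.6-voxel threshold, and the toy grid are assumptions made for illustration:

```python
import numpy as np

def query_tudf(grid, pts):
    """Trilinearly interpolate a dense distance grid at continuous points.

    grid: (R, R, R) distances sampled at voxel centers (i + 0.5), R >= 2.
    pts:  (N, 3) query points in voxel coordinates.
    """
    res = np.array(grid.shape)
    # Shift so integer coordinates coincide with voxel centers, then clamp.
    p = np.clip(pts - 0.5, 0.0, res - 1.0)
    i0 = np.minimum(np.floor(p).astype(int), res - 2)
    f = p - i0
    out = np.zeros(len(pts))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[:, 0] if dx else 1 - f[:, 0])
                     * (f[:, 1] if dy else 1 - f[:, 1])
                     * (f[:, 2] if dz else 1 - f[:, 2]))
                out += w * grid[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
    return out

# Toy grid: unsigned distance to a wall at x = 4.5, constant in y and z.
grid = np.abs(np.arange(8) - 4.0)[:, None, None] * np.ones((8, 8, 8))
pts = np.array([[4.0, 3.2, 1.7], [6.25, 3.2, 1.7]])

dist = query_tudf(grid, pts)   # interpolated distances at the two points
occupied = dist <= 0.6         # near-surface test with an assumed threshold
```

Because the grid is structured, a query costs a fixed eight lookups per point, in contrast to the nearest-neighbor search an unstructured point cloud would need.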

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent space could be conditioned on additional inputs such as depth or text descriptions without retraining the VAE.
  • Real-time variants might be obtained by replacing the iterative denoising with a single forward pass through a distilled model.
  • The approach could be tested on outdoor driving scenes where foundation-model priors are weaker than in indoor datasets.

Load-bearing premise

Geometry foundation models already contain sufficiently accurate spatial priors to let the model correctly infer hidden structures even when the input image supplies no visual evidence for them.

What would settle it

A test set of single-view scenes whose hidden geometry is known from ground-truth scans but whose layout contradicts the priors of current geometry foundation models; if VolFill still reconstructs the true hidden parts, the conditioning assumption holds, otherwise it fails.

Figures

Figures reproduced from arXiv: 2605.31466 by Chuang Gan, Evangelos Kalogerakis, Tuan Duc Ngo.

Figure 1
Figure 1: VolFill synthesizes structured amodal 3D geometry from (a) a single-view image, recovering holistic scene layouts from partial visibility. (b) Pixel-aligned methods are restricted to visible surfaces. (c) Amodal baselines produce sparse, noisy or artifact-heavy geometry, yielding fragmented meshes. Our approach delivers clean, sharp point clouds and smooth, consistent meshes.
Figure 2
Figure 2: 3D VAE architecture. The encoder compresses high-resolution sparse TUDF grids into a regularized dense latent via sparse convolutions. The decoder upsamples through dense layers, applies occupancy-guided sparsification, then restores the full-resolution TUDF via sparse convolutions.
• Spatial scope: To establish the precise boundaries of our volumetric grid, we compute an axis-aligned bounding box B based …
Figure 3
Figure 3: Latent DiT architecture. It operates in the compressed VAE latent space using a denoising transformer backbone with a flow-matching objective. We leverage a dual conditioning strategy, integrating high-level image tokens and explicit visible geometry, to guide the generative process and synthesize sharp, scene-consistent amodal structures.
• Structure Prediction: At the 64³ stage, a binary occupancy head p…
Figure 4
Figure 4: Qualitative comparison. VolFill synthesizes sharp, high-fidelity geometry, whereas LaRI produces layered artifacts (red circle) and holes, and NOVA3R yields noisy, unstructured point scatters (green circle). Panels: LaRI, NOVA3R, Ours.
Figure 5
Figure 5: Mesh reconstruction comparison. LaRI and NOVA3R produce fragmented and noisy meshes due to their unstructured outputs, whereas VolFill directly extracts clean, topologically consistent surfaces from the structured TUDF grid.
Figure 6
Figure 6: Qualitative conditioning ablation. ① Visible-only geometry fails to complete the scene; ②, ③ image-only tokens result in distorted results; ④ our dual-conditioning synthesizes sharp, high-fidelity amodal geometry.
… overhead but compromises accuracy. By integrating both, our hybrid design achieves the best reconstruction quality alongside the lowest latency and memory footprint, validating its efficiency for…
Figure 7
Figure 7: Illustration of the DiT block.
A Technical appendices and supplementary material. A.1 Volumetric Representation Design Choice. We represent amodal 3D geometry as a Truncated Unsigned Distance Field (TUDF), where each voxel stores its distance to the nearest surface, clipped to a maximum of τ voxels. This choice is motivated by the limitations of the two alternatives below. Binary occupancy marks each voxel a…
Figure 8
Figure 8: Qualitative comparison with pixel-aligned approaches. Unlike MoGe2 [79] and DepthAnything3, which are restricted to visible surfaces and leave significant holes, VolFill reconstructs complete, physically plausible amodal geometry.
C Broader societal impacts. Our work advances single-image 3D scene reconstruction, which may benefit applications in robotics, assistive navigation, AR/VR, digital twins, archi…
Figure 9
Figure 9: Point cloud and mesh comparison. Our method produces cleaner point clouds and significantly more coherent meshes than LaRI and NOVA3R.
Panel labels in extracted text: Ground truth, Input image, VAE output.
Figure 10
Figure 10: Qualitative results of the hybrid 3D VAE.
Figure 11
Figure 11: Evolution of amodal geometry. Step-wise visualization of the denoising process. Our model rapidly converges on the coarse scene layout by t = 3 and recovers sharp, detailed structures around t = 8, followed by continuous refinement.

Original abstract

Reconstructing the complete geometry of a scene from a single RGB image remains challenging - especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces VolFill, a generative framework for single-view amodal 3D scene reconstruction. It employs a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene, conditioned on geometry foundation models. The method claims to provide a structured representation supporting direct surface extraction and occupancy queries, with extensive experiments on the SCRREAM and NRGB-D datasets demonstrating significant outperformance over current baselines.

Significance. If the outperformance claims hold with rigorous quantitative support, the work could advance amodal reconstruction by moving beyond per-ray or point-cloud limitations to a volumetric generative approach that leverages external geometry priors for hidden structure inference.

major comments (1)
  1. [Abstract] The central claim that the method 'significantly outperforms current baselines' on SCRREAM and NRGB-D is asserted without quantitative metrics, error analysis, baseline details, or an experimental protocol, so the empirical contribution cannot be evaluated from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'significantly outperforms current baselines' on SCRREAM and NRGB-D is asserted without quantitative metrics, error analysis, baseline details, or an experimental protocol, so the empirical contribution cannot be evaluated from the provided text.

    Authors: We agree the abstract would be clearer with concrete numbers. In the revision we will add the key quantitative results (e.g., the reported improvements in surface and occupancy metrics on both datasets), name the primary baselines, and briefly note the evaluation protocol. Full tables, error analysis, and protocol details remain in the Experiments section.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

The abstract and available description present VolFill as a generative model using a hybrid 3D VAE for latent compression of truncated unsigned distance functions, a latent Diffusion Transformer for denoising, and conditioning on external geometry foundation models. Performance claims rest on empirical outperformance versus baselines on the independent public datasets SCRREAM and NRGB-D. No equations, fitted-parameter predictions, self-citations, or uniqueness theorems are supplied that would reduce any central result to its own inputs by construction. The approach is therefore evaluated against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method description mentions standard components (VAE, diffusion transformer, foundation models) without detailing any ad-hoc additions or fitted values.

pith-pipeline@v0.9.1-grok · 5687 in / 1106 out tokens · 40182 ms · 2026-06-28T23:08:54.310952+00:00 · methodology

Reference graph

Works this paper leans on

100 extracted references · 32 canonical work pages · 19 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023

  2. [2]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022

  3. [3]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021

  4. [4]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Muller. Zoedepth: Zero-shot transfer by combining relative and metric depth. ArXiv, abs/2302.12288, 2023

  5. [5]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

  6. [6]

    Monoscene: Monocular 3d semantic scene completion

    Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3981–3991, 2021

  7. [7]

    Occany: Generalized unconstrained urban 3d occupancy

    Anh-Quan Cao and Tuan-Hung Vu. Occany: Generalized unconstrained urban 3d occupancy. In CVPR, 2026

  8. [8]

    Reconviagen: Towards accurate multi-view 3d object reconstruction via generation

    Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation. arXiv preprint arXiv:2510.23306, 2025

  9. [9]

    Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Haoran Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2416–2425, 2023

  10. [10]

    Reconstruct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos

    Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  11. [11]

    Nova3r: Non-pixel-aligned visual transformer for amodal 3d reconstruction

    Weirong Chen, Chuanxia Zheng, Ganlin Zhang, Andrea Vedaldi, and Daniel Cremers. Nova3r: Non-pixel-aligned visual transformer for amodal 3d reconstruction. In The Fourteenth International Conference on Learning Representations, 2026

  12. [12]

    Sdfusion: Multimodal 3d shape completion, reconstruction, and generation

    Yen-Chi Cheng, Hsin-Ying Lee, S. Tulyakov, Alexander G. Schwing, and Liangyan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4456–4465, 2022

  13. [13]

    4d spatio-temporal convnets: Minkowski convolutional neural networks

    Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019

  14. [14]

    Spconv: Spatially sparse convolution library

    Spconv Contributors. Spconv: Spatially sparse convolution library. https://github.com/traveller59/spconv, 2022

  15. [15]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014

  16. [16]

    Dens3r: A foundation model for 3d geometry prediction

    Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lyu. Dens3r: A foundation model for 3d geometry prediction. ArXiv, abs/2507.16290, 2025

  17. [17]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu qiong Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Helen Zhang. 3d-front: 3d furnished rooms with layouts and semantics. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pag...

  18. [18]

    Deep ordinal regression network for monocular depth estimation

    Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018

  19. [19]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024

  20. [20]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. ArXiv, abs/2405.10314, 2024

  21. [21]

    Fine-tuning image-conditional diffusion models is easier than you think

    Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan De Geus, Alexander Hermans, and Bastian Leibe. Fine-tuning image-conditional diffusion models is easier than you think. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 753–762. IEEE, 2025

  22. [22]

    Submanifold Sparse Convolutional Networks

    Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. ArXiv, abs/1706.01307, 2017

  23. [23]

    Benoît Guillard, Federico Stella, and P. Fua. Meshudf: Fast and differentiable meshing of unsigned distance field networks. In European Conference on Computer Vision, 2021

  24. [24]

    Towards zero-shot scale-aware monocular depth estimation

    Vitor Campanholo Guizilini, Igor Vasiljevic, Di Chen, Rares Ambrus, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9199–9209, 2023

  25. [25]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. ArXiv, abs/2404.02101, 2024

  26. [26]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024

  27. [27]

    Gaussian error linear units (gelus)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv: Learning, 2016

  28. [28]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Vivek Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings, 2020

  29. [29]

    Recovering surface layout from an image

    Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007

  30. [30]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ArXiv, abs/2311.04400, 2023

  31. [31]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024

  32. [32]

    Repurposing geometric foundation models for multi-view diffusion

    Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, and Sainan Liu. Repurposing geometric foundation models for multi-view diffusion. 2026

  33. [33]

    Scrream: Scan, register, render and map: a framework for annotating accurate and dense 3d indoor scenes with a benchmark

    Hyunjun Jung, Weihang Li, Shun cheng Wu, William Bittner, Nikolas Brasch, Jifei Song, Eduardo Pérez-Pellitero, Zhensong Zhang, Arthur Moreau, Nassir Navab, and Benjamin Busam. Scrream: Scan, register, render and map: a framework for annotating accurate and dense 3d indoor scenes with a benchmark. ArXiv, abs/2410.22715, 2024

  34. [34]

    Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling

    Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling. IEEE Transactions on Pattern Analysis & Machine Intelligence, 36(11):2144–2158, November 2014

  35. [35]

    Poisson surface reconstruction

    Michael M. Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Eurographics Symposium on Geometry Processing, 2006

  36. [36]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9492–9502, 2024

  37. [37]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel López-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstruction. Ar...

  38. [38]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023

  39. [39]

    From big to small: Multi-scale local planar guidance for monocular depth estimation

    Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019

  40. [40]

    Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. ArXiv, abs/2311.06214, 2023

  41. [41]

    LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

    Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, and Peter Wonka. Lari: Layered ray intersections for single-view 3d geometric reasoning. ArXiv, abs/2504.18424, 2025

  42. [42]

    Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehui Wang, Yuanzhi Liang, Zhipeng Yu, Xingchao Liu, Yuanchen Guo, Ding Liang, Wanli Ouyang, and Yan-Pei Cao. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. IEEE transactions on pattern analysis and machine intelligence, PP, 2025

  43. [43]

    Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

    Yiming Li, Zhiding Yu, Christopher Bongsoo Choy, Chaowei Xiao, José Manuel Álvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9087–9098, 2023

  44. [44]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. ArXiv, abs/2511.10647, 2025

  45. [45]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ArXiv, abs/2210.02747, 2022

  46. [46]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9264–9275, 2023

  47. [47]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ArXiv, abs/2209.03003, 2022

  48. [48]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Yuan Liu, Chu-Hsing Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. ArXiv, abs/2309.03453, 2023

  49. [49]

    Neural volumes: Learning dynamic renderable volumes from images

    Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019

  50. [50]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuanchen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. Wonder3d: Single image to 3d using cross-domain diffusion. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9970–9980, 2023

  51. [51]

    Seen2scene: Completing realistic 3d scenes with visibility-guided flow

    Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, and Angela Dai. Seen2scene: Completing realistic 3d scenes with visibility-guided flow, 2026

  52. [52]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  53. [53]

    Diffrf: Rendering-guided 3d radiance field diffusion

    Norman Muller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4328–4338, 2022

  54. [54]

    Dage: Dual-stream architecture for efficient and fine-grained geometry estimation

    Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, and Joon-Young Lee. Dage: Dual-stream architecture for efficient and fine-grained geometry estimation. In CVPR, 2026

  55. [55]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. ArXiv, abs/2212.08751, 2022

  56. [56]

    Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision

    Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3501–3512, 2019

  57. [57]

    Scalable diffusion models with transformers

    William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022

  58. [58]

    Sharpdepth: Sharpening metric depth predictions using diffusion distillation

    Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, and Rang Nguyen. Sharpdepth: Sharpening metric depth predictions using diffusion distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17060–17069, 2025

  59. [59]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024

  60. [60]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  61. [61]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  62. [62]

    Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4209–4219, 2023

  63. [63]

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Muller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6121–6132, 2025

  64. [64]

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021

  65. [65]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  66. [66]

    Ashutosh Saxena, Sung Chung, and Andrew Ng. Learning depth from single monocular images. Advances in Neural Information Processing Systems, 18, 2005

  67. [67]

    Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008

  68. [68]

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and X. Yang. Mvdream: Multi-view diffusion for 3d generation. ArXiv, abs/2308.16512, 2023

  69. [69]

    Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020

  70. [70]

    Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190–198, 2016

  71. [71]

    Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, and Song Han. TorchSparse: Efficient Point Cloud Inference Engine. In Conference on Machine Learning and Systems (MLSys), 2022

  72. [72]

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, 2024

  73. [73]

    Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024

  74. [74]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017

  75. [75]

    Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018

  76. [76]

    Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17723–17732, 2024

  77. [77]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  78. [78]

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025

  79. [79]

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025

  80. [80]

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024
