VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

Chuang Gan; Evangelos Kalogerakis; Tuan Duc Ngo

arxiv: 2605.31466 · v1 · pith:7MCCMWDQnew · submitted 2026-05-29 · 💻 cs.CV


Pith reviewed 2026-06-28 23:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords single-view 3D reconstruction, amodal completion, volumetric generation, latent diffusion, 3D VAE, scene understanding, truncated distance functions


The pith

VolFill generates complete 3D scene geometry from one RGB image by denoising volumetric latents conditioned on foundation-model priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VolFill as a generative model that outputs the full 3D structure of a scene instead of regressing visible surfaces only. It encodes sparse truncated unsigned distance function grids with a hybrid 3D VAE into a compact latent space, then uses a latent Diffusion Transformer to recover the missing parts. Conditioning on geometry foundation models supplies spatial priors that guide inference where pixels give no direct evidence. The resulting representation allows direct surface extraction and occupancy queries at scale. The abstract reports that experiments on SCRREAM and NRGB-D show significant gains over prior single-view baselines.

Core claim

VolFill predicts the 3D structure of the complete scene from a single RGB image by first compressing truncated unsigned distance function grids into a latent space with a hybrid 3D VAE and then denoising that latent with a Diffusion Transformer conditioned on geometry foundation models, yielding a structured volumetric output that supports surface extraction and occupancy queries.
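As a rough illustration of the representation named in the core claim, the sketch below builds a small truncated unsigned distance grid in Python. It follows the description in the paper's appendix text (each voxel stores its distance to the nearest surface, clipped to a maximum of τ voxels). The function name `build_tudf`, the point-sampled surface, the brute-force nearest-neighbour search, and the toy sizes are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def build_tudf(surface_points, grid_size, voxel_size, tau):
    """Unsigned distance (in voxel units) from each voxel centre to the
    nearest surface point, clipped to tau. Hypothetical helper."""
    idx = np.arange(grid_size)
    # Voxel centres in world units, shape (G, G, G, 3).
    centers = (np.stack(np.meshgrid(idx, idx, idx, indexing="ij"), axis=-1) + 0.5) * voxel_size
    flat = centers.reshape(-1, 3)
    # Brute-force nearest surface point; acceptable only for tiny toy grids.
    diff = flat[:, None, :] - surface_points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    # Convert to voxel units and truncate at tau.
    tudf = np.minimum(dist / voxel_size, tau)
    return tudf.reshape(grid_size, grid_size, grid_size)

# Toy scene (assumed): one surface sample at the centre of voxel (2, 2, 2).
points = np.array([[2.5, 2.5, 2.5]])
tudf = build_tudf(points, grid_size=5, voxel_size=1.0, tau=2.0)

# Voxels below the truncation value are the only ones carrying surface information.
active = tudf < 2.0
```

Voxels that reach the truncation value carry no surface information, which is one plausible reason such grids can be stored sparsely. Whether VolFill defines its sparsity this way is not stated in the text available here.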

What carries the argument

Hybrid 3D VAE plus latent Diffusion Transformer that compresses and denoises truncated unsigned distance function grids while conditioned on geometry foundation models.

If this is right

The output supports direct surface extraction and occupancy queries without additional per-ray or point-cloud processing.
The method scales to full scenes rather than being limited to object-centric reconstruction.
Performance gains on SCRREAM and NRGB-D indicate better handling of amodal completion than regression-based baselines.
The latent representation can be queried at arbitrary resolution after generation.
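To make the occupancy-query point concrete, here is a minimal sketch of looking up occupancy at continuous 3D points in a decoded distance grid. The nearest-voxel lookup, the `threshold` of half a voxel, and the name `occupancy_at` are assumptions for illustration. The paper's actual query and surface-extraction procedure is not described in the text above.

```python
import numpy as np

def occupancy_at(tudf, points, voxel_size, threshold=0.5):
    """Nearest-voxel occupancy lookup (hypothetical helper).
    A point counts as occupied when the distance stored in its voxel
    is below `threshold` voxels; points outside the grid are empty."""
    idx = np.floor(points / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(tudf.shape)), axis=1)
    occ = np.zeros(len(points), dtype=bool)
    i, j, k = idx[inside].T
    occ[inside] = tudf[i, j, k] < threshold
    return occ

# Toy grid (assumed): everything truncated at 2.0 except one surface voxel.
grid = np.full((4, 4, 4), 2.0)
grid[1, 1, 1] = 0.0

queries = np.array([
    [1.5, 1.5, 1.5],     # inside the surface voxel
    [3.5, 3.5, 3.5],     # inside the grid, far from the surface
    [10.0, 10.0, 10.0],  # outside the grid
])
occ = occupancy_at(grid, queries, voxel_size=1.0)
```

A structured grid makes this kind of lookup a plain array index, which is the practical contrast with per-ray or unstructured point-cloud outputs that the abstract draws.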

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same latent space could be conditioned on additional inputs such as depth or text descriptions without retraining the VAE.
Real-time variants might be obtained by replacing the iterative denoising with a single forward pass through a distilled model.
The approach could be tested on outdoor driving scenes where foundation-model priors are weaker than in indoor datasets.

Load-bearing premise

Geometry foundation models already contain sufficiently accurate spatial priors to let the model correctly infer hidden structures even when the input image supplies no visual evidence for them.

What would settle it

A test set of single-view scenes whose hidden geometry is known from ground-truth scans but whose layout contradicts the priors of current geometry foundation models; if VolFill still reconstructs the true hidden parts, the conditioning assumption holds, otherwise it fails.
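One way such a test could be scored is occupancy IoU restricted to voxels that are not visible from the input view. The sketch below is hypothetical: the metric choice, the `visible_mask` input, and the function name are assumptions, and the paper's own evaluation protocol is not given in the text reviewed here.

```python
import numpy as np

def hidden_region_iou(pred_occ, gt_occ, visible_mask):
    """IoU between predicted and ground-truth occupancy, counted only
    over voxels that are NOT visible from the input view (assumed metric)."""
    hidden = ~visible_mask
    p = pred_occ & hidden
    g = gt_occ & hidden
    union = (p | g).sum()
    # With nothing occupied in the hidden region, treat the match as perfect.
    return float((p & g).sum() / union) if union > 0 else 1.0

# Toy 2x2x2 example (assumed data).
gt = np.zeros((2, 2, 2), dtype=bool)
gt[0, 0, 0] = gt[1, 1, 1] = gt[1, 0, 0] = True

pred = np.zeros((2, 2, 2), dtype=bool)
pred[0, 0, 0] = pred[1, 1, 1] = pred[0, 1, 1] = True

visible = np.zeros((2, 2, 2), dtype=bool)
visible[0, 0, 0] = True  # this voxel is seen in the image, so it is excluded

iou = hidden_region_iou(pred, gt, visible)
```

Restricting the score to hidden voxels keeps a model from earning credit for surfaces the image already shows, which is the part of the output the conditioning assumption is about.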

Figures

Figures reproduced from arXiv: 2605.31466 by Chuang Gan, Evangelos Kalogerakis, Tuan Duc Ngo.

**Figure 1.** VolFill synthesizes structured amodal 3D geometry from (a) a single-view image, recovering holistic scene layouts from partial visibility. (b) Pixel-aligned methods are restricted to visible surfaces. (c) Amodal baselines produce sparse, noisy or artifact-heavy geometry, yielding fragmented meshes. Our approach delivers clean, sharp point clouds and smooth, consistent meshes.

**Figure 2.** 3D VAE architecture. The encoder compresses high-resolution sparse TUDF grids into a regularized dense latent via sparse convolutions. The decoder upsamples through dense layers, applies occupancy-guided sparsification, then restores the full-resolution TUDF via sparse convolutions. • Spatial scope: To establish the precise boundaries of our volumetric grid, we compute an axis-aligned bounding box B based …

**Figure 3.** Latent DiT architecture. It operates in the compressed VAE latent space using a denoising transformer backbone with a flow-matching objective. We leverage a dual conditioning strategy, integrating high-level image tokens and explicit visible geometry, to guide the generative process and synthesize sharp, scene-consistent amodal structures. • Structure Prediction: At the 64³ stage, a binary occupancy head p…

**Figure 4.** Qualitative comparison. VolFill synthesizes sharp, high-fidelity geometry, whereas LaRI produces layered artifacts (red circle) and holes, and NOVA3R yields noisy, unstructured point scatters (green circle). Panels: LaRI, NOVA3R, Ours.

**Figure 5.** Mesh reconstruction comparison. LaRI and NOVA3R produce fragmented and noisy meshes due to their unstructured outputs, whereas VolFill directly extracts clean, topologically consistent surfaces from the structured TUDF grid.

**Figure 6.** Qualitative conditioning ablation. ① Visible-only geometry fails to complete the scene; ②, ③ image-only tokens result in distorted results; ④ our dual-conditioning synthesizes sharp, high-fidelity amodal geometry. … overhead but compromises accuracy. By integrating both, our hybrid design achieves the best reconstruction quality alongside the lowest latency and memory footprint, validating its efficiency for…

**Figure 7.** Illustration of the DiT block. A Technical appendices and supplementary material. A.1 Volumetric Representation Design Choice. We represent amodal 3D geometry as a Truncated Unsigned Distance Field (TUDF), where each voxel stores its distance to the nearest surface, clipped to a maximum of τ voxels. This choice is motivated by the limitations of the two alternatives below. Binary occupancy marks each voxel a…

**Figure 8.** Qualitative comparison with pixel-aligned approaches. Unlike MoGe2 [79] and DepthAnything3, which are restricted to visible surfaces and leave significant holes, VolFill reconstructs complete, physically plausible amodal geometry. C Broader societal impacts. Our work advances single-image 3D scene reconstruction, which may benefit applications in robotics, assistive navigation, AR/VR, digital twins, archi…

**Figure 9.** Point cloud and mesh comparison. Our method produces cleaner point clouds and significantly more coherent meshes than LaRI and NOVA3R. Extracted panel labels: Ground truth, Input image, VAE output.

**Figure 10.** Qualitative results of the hybrid 3D VAE.

**Figure 11.** Evolution of amodal geometry. Step-wise visualization of the denoising process. Our model rapidly converges on the coarse scene layout by t = 3 and recovers sharp, detailed structures around t = 8, followed by continuous refinement.
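The step-wise behaviour described for Figure 11 is what one would expect from integrating a learned velocity field over several solver steps. The following Python sketch uses fixed-step Euler integration with a toy velocity field standing in for the conditioned Diffusion Transformer. The solver, the step count of 10, the time convention (noise at t = 0, data at t = 1), and all names are assumptions, not details taken from the paper.

```python
import numpy as np

def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with
    fixed-step Euler and return every intermediate state."""
    z = z0.copy()
    states = [z.copy()]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity_fn(z, t)
        states.append(z.copy())
    return states

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 4, 4))  # stand-in for a clean scene latent
z0 = rng.normal(size=(4, 4, 4))      # initial noise latent

# Toy velocity for a straight-line path toward a known target.
# A trained network would replace this and would not see `target`.
def toy_velocity(z, t):
    return (target - z) / (1.0 - t)

states = euler_sample(toy_velocity, z0, num_steps=10)
errors = [float(np.abs(s - target).max()) for s in states]
```

In this toy setting the error to the target shrinks at every step, loosely mirroring the coarse-to-fine progression in the caption. A real sampler's trajectory depends on the trained network and the conditioning.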

Original abstract

Reconstructing the complete geometry of a scene from a single RGB image remains challenging - especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VolFill frames single-view amodal reconstruction as latent volumetric flow matching via a 3D VAE plus Diffusion Transformer conditioned on geometry priors, but the abstract supplies no metrics to support the outperformance claim.


The core of the paper is a generative setup that encodes sparse truncated unsigned distance function grids with a hybrid 3D VAE, then uses a latent Diffusion Transformer to perform volumetric flow matching and recover the full scene. Conditioning on geometry foundation models supplies the spatial priors needed for hidden structure. The output is a structured volume that supports direct surface extraction and occupancy queries, which avoids the per-ray limits of some earlier methods.
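For readers unfamiliar with the term, the linear-path form of flow matching that is common in the literature the paper cites (flow matching [45], rectified flow [47]) can be written as below. The symbols are notation introduced here: z_1 is a clean VAE latent, z_0 is Gaussian noise, c is the conditioning signal, and v_θ is the network. The text reviewed here does not state the paper's exact loss, so treat this as the textbook formulation rather than VolFill's objective.

```latex
z_t = (1 - t)\, z_0 + t\, z_1,
\qquad z_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) =
\mathbb{E}_{t,\, z_0,\, z_1}
\left\| v_\theta(z_t, t, c) - (z_1 - z_0) \right\|_2^2
```

At inference the learned velocity is integrated from t = 0 to t = 1, starting from noise, to produce a latent that the VAE decoder turns back into a distance grid.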

This framing is a reasonable way to handle amodal completion and the structured representation is a practical plus. The approach sits on public datasets and external foundation models rather than internal fitting, so there is no obvious circularity.

The main gap is the evidence. The abstract states significant outperformance on SCRREAM and NRGB-D yet gives no numbers, no baseline names, no error analysis, and no protocol details. Without those, it is impossible to judge whether the gains are real or how they vary with scene type. The reliance on foundation-model priors is also left untested in the visible text; if those priors are weak on certain occlusions, the method could degrade without the paper showing it.

The combination of VAE and latent diffusion is not entirely new, but the specific volumetric application to amodal scenes looks like a distinct engineering choice worth checking against the literature. The work is aimed at computer-vision groups that build 3D scene models for robotics or AR. Readers who need a complete, queryable volume from one image could extract useful ideas even if the experiments need tightening.

The paper is coherent on its own terms and addresses a real problem with a plausible pipeline, so it merits a serious referee who can demand the missing quantitative results and ablations.

Referee Report

1 major / 0 minor

Summary. The paper introduces VolFill, a generative framework for single-view amodal 3D scene reconstruction. It employs a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene, conditioned on geometry foundation models. The method claims to provide a structured representation supporting direct surface extraction and occupancy queries, with extensive experiments on the SCRREAM and NRGB-D datasets demonstrating significant outperformance over current baselines.

Significance. If the outperformance claims hold with rigorous quantitative support, the work could advance amodal reconstruction by moving beyond per-ray or point-cloud limitations to a volumetric generative approach that leverages external geometry priors for hidden structure inference.

major comments (1)

[Abstract] The central claim that the method 'significantly outperforms current baselines' on SCRREAM and NRGB-D is asserted without any quantitative metrics, error analysis, baseline details, or experimental protocol, rendering the empirical contribution impossible to evaluate from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.


Referee: [Abstract] The central claim that the method 'significantly outperforms current baselines' on SCRREAM and NRGB-D is asserted without any quantitative metrics, error analysis, baseline details, or experimental protocol, rendering the empirical contribution impossible to evaluate from the provided text.

Authors: We agree the abstract would be clearer with concrete numbers. In the revision we will add the key quantitative results (e.g., the reported improvements in surface and occupancy metrics on both datasets), name the primary baselines, and briefly note the evaluation protocol. Full tables, error analysis, and protocol details remain in the Experiments section.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained


The abstract and available description present VolFill as a generative model using a hybrid 3D VAE for latent compression of truncated unsigned distance functions, a latent Diffusion Transformer for denoising, and conditioning on external geometry foundation models. Performance claims rest on empirical outperformance versus baselines on the independent public datasets SCRREAM and NRGB-D. No equations, fitted-parameter predictions, self-citations, or uniqueness theorems are supplied that would reduce any central result to its own inputs by construction. The approach is therefore evaluated against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method description mentions standard components (VAE, diffusion transformer, foundation models) without detailing any ad-hoc additions or fitted values.



Reference graph

Works this paper leans on

100 extracted references · 32 canonical work pages · 19 internal anchors

[1]

Building normalizing flows with stochastic interpolants

Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023

2023
[2]

Neural rgb-d surface reconstruction

Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022

2022
[3]

Adabins: Depth estimation using adaptive bins

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021

2021
[4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Muller. Zoedepth: Zero-shot transfer by combining relative and metric depth. ArXiv, abs/2302.12288, 2023

2023
[5]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

2024
[6]

Monoscene: Monocular 3d semantic scene completion

Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3981–3991, 2021

2022
[7]

Occany: Generalized unconstrained urban 3d occupancy

Anh-Quan Cao and Tuan-Hung Vu. Occany: Generalized unconstrained urban 3d occupancy. In CVPR, 2026

2026
[8]

Reconviagen: Towards accurate multi-view 3d object reconstruction via generation

Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation. arXiv preprint arXiv:2510.23306, 2025

2025
[9]

Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Haoran Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2416–2425, 2023

2023
[10]

Reconstruct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos

Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos. In Advances in Neural Information Processing Systems (NeurIPS), 2025

2025
[11]

Nova3r: Non-pixel-aligned visual transformer for amodal 3d reconstruction

Weirong Chen, Chuanxia Zheng, Ganlin Zhang, Andrea Vedaldi, and Daniel Cremers. Nova3r: Non-pixel-aligned visual transformer for amodal 3d reconstruction. In The Fourteenth International Conference on Learning Representations, 2026

2026
[12]

Sdfusion: Multimodal 3d shape completion, reconstruction, and generation

Yen-Chi Cheng, Hsin-Ying Lee, S. Tulyakov, Alexander G. Schwing, and Liangyan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4456–4465, 2022

2023
[13]

4d spatio-temporal convnets: Minkowski convolutional neural networks

Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019

2019
[14]

Spconv: Spatially sparse convolution library

Spconv Contributors. Spconv: Spatially sparse convolution library. https://github.com/traveller59/spconv, 2022

2022
[15]

Depth map prediction from a single image using a multi-scale deep network

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014

2014
[16]

Dens3r: A foundation model for 3d geometry prediction

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lyu. Dens3r: A foundation model for 3d geometry prediction. ArXiv, abs/2507.16290, 2025

2025
[17]

3d-front: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu qiong Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Helen Zhang. 3d-front: 3d furnished rooms with layouts and semantics. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pag...

2021
[18]

Deep ordinal regression network for monocular depth estimation

Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018

2018
[19]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024

2024
[20]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. ArXiv, abs/2405.10314, 2024

2024
[21]

Fine-tuning image-conditional diffusion models is easier than you think

Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan De Geus, Alexander Hermans, and Bastian Leibe. Fine-tuning image-conditional diffusion models is easier than you think. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 753–762. IEEE, 2025

2025
[22]

Submanifold Sparse Convolutional Networks

Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. ArXiv, abs/1706.01307, 2017

2017
[23]

Benoît Guillard, Federico Stella, and P. Fua. Meshudf: Fast and differentiable meshing of unsigned distance field networks. In European Conference on Computer Vision, 2021

2021
[24]

Towards zero-shot scale-aware monocular depth estimation

Vitor Campanholo Guizilini, Igor Vasiljevic, Di Chen, Rares Ambrus, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9199–9209, 2023

2023
[25]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. ArXiv, abs/2404.02101, 2024

2024
[26]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction

Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024

2024
[27]

Gaussian error linear units (gelus)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv: Learning, 2016

2016
[28]

Query-key normalization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Vivek Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings, 2020

2020
[29]

Recovering surface layout from an image

Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007

2007
[30]

LRM: Large Reconstruction Model for Single Image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ArXiv, abs/2311.04400, 2023

2023
[31]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024

2024
[32]

Repurposing geometric foundation models for multi-view diffusion

Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, and Sainan Liu. Repurposing geometric foundation models for multi-view diffusion. 2026

2026
[33]

Scrream: Scan, register, render and map: a framework for annotating accurate and dense 3d indoor scenes with a benchmark

Hyunjun Jung, Weihang Li, Shun-Cheng Wu, William Bittner, Nikolas Brasch, Jifei Song, Eduardo Pérez-Pellitero, Zhensong Zhang, Arthur Moreau, Nassir Navab, and Benjamin Busam. Scrream: Scan, register, render and map: a framework for annotating accurate and dense 3d indoor scenes with a benchmark. ArXiv, abs/2410.22715, 2024

2024
[34]

Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling

Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling. IEEE Transactions on Pattern Analysis & Machine Intelligence, 36(11):2144–2158, November 2014

2014
[35]

Poisson surface reconstruction

Michael M. Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Eurographics Symposium on Geometry Processing, 2006

2006
[36]

Repurposing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9492–9502, 2024

2024
[37]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel López-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstruction. Ar...

2025
[38]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023

2023
[39]

From big to small: Multi-scale local planar guidance for monocular depth estimation

Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019

2019
[40]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. ArXiv, abs/2311.06214, 2023

2023
[41]

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, and Peter Wonka. Lari: Layered ray intersections for single-view 3d geometric reasoning. ArXiv, abs/2504.18424, 2025

2025
[42]

Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehui Wang, Yuanzhi Liang, Zhipeng Yu, Xingchao Liu, Yuanchen Guo, Ding Liang, Wanli Ouyang, and Yan-Pei Cao. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. IEEE transactions on pattern analysis and machine intelligence, PP, 2025

2025
[43]

Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

Yiming Li, Zhiding Yu, Christopher Bongsoo Choy, Chaowei Xiao, José Manuel Álvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9087–9098, 2023

2023
[44]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. ArXiv, abs/2511.10647, 2025

2025
[45]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ArXiv, abs/2210.02747, 2022

2022
[46]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9264–9275, 2023

2023
[47]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ArXiv, abs/2209.03003, 2022

2022
[48]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Chu-Hsing Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. ArXiv, abs/2309.03453, 2023

2023
[49]

Neural volumes: Learning dynamic renderable volumes from images

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019

2019
[50]

Wonder3d: Single image to 3d using cross-domain diffusion

Xiaoxiao Long, Yuanchen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. Wonder3d: Single image to 3d using cross-domain diffusion. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9970–9980, 2023

2024
[51]

Seen2scene: Completing realistic 3d scenes with visibility-guided flow

Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, and Angela Dai. Seen2scene: Completing realistic 3d scenes with visibility-guided flow, 2026

2026
[52]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

2021
[53]

Diffrf: Rendering-guided 3d radiance field diffusion

Norman Muller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4328–4338, 2022

2023
[54]

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, and Joon-Young Lee. Dage: Dual-stream architecture for efficient and fine-grained geometry estimation. In CVPR, 2026
[55]

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-E: A system for generating 3d point clouds from complex prompts. ArXiv, abs/2212.08751, 2022
[56]

Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3501–3512, 2019
[57]

William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022
[58]

Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, and Rang Nguyen. Sharpdepth: Sharpening metric depth predictions using diffusion distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17060–17069, 2025
[59]

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024
[60]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
[61]

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020
[62]

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4209–4219, 2023
[63]

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Muller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6121–6132, 2025
[64]

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021
[65]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021
[66]

Ashutosh Saxena, Sung Chung, and Andrew Ng. Learning depth from single monocular images. Advances in Neural Information Processing Systems, 18, 2005
[67]

Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008
[68]

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and X. Yang. Mvdream: Multi-view diffusion for 3d generation. ArXiv, abs/2308.16512, 2023
[69]

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020
[70]

Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190–198, 2016
[71]

Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, and Song Han. TorchSparse: Efficient Point Cloud Inference Engine. In Conference on Machine Learning and Systems (MLSys), 2022
[72]

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, 2024
[73]

Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024
[74]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017
[75]

Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018
[76]

Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17723–17732, 2024
[77]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025
[78]

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025
[79]

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025
[80]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024


[48] [48]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Chu-Hsing Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image.ArXiv, abs/2309.03453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Neural volumes: Learning dynamic renderable volumes from images.ACM Trans

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images.ACM Trans. Graph., 38(4):65:1–65:14, July 2019

2019

[50] [50]

Wonder3d: Single image to 3d using cross-domain diffusion.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9970–9980, 2023

Xiaoxiao Long, Yuanchen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. Wonder3d: Single image to 3d using cross-domain diffusion.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9970–9980, 2023

2024

[51] [51]

Seen2scene: Completing realistic 3d scenes with visibility-guided flow, 2026

Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, and Angela Dai. Seen2scene: Completing realistic 3d scenes with visibility-guided flow, 2026

2026

[52] [52]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

2021

[53] [53]

Diffrf: Rendering-guided 3d radiance field diffusion.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4328–4338, 2022

Norman Muller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4328–4338, 2022

2023

[54] [54]

Dage: Dual-stream architecture for efficient and fine-grained geometry estimation

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kaloger- akis, Chuang Gan, and Joon-Young Lee. Dage: Dual-stream architecture for efficient and fine-grained geometry estimation. InCVPR, 2026

2026

[55] [55]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts.ArXiv, abs/2212.08751, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[56] [56]

Mescheder, Michael Oechsle, and Andreas Geiger

Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3501– 3512, 2019

2020

[57] [57]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022

2023

[58] [58]

Sharpdepth: Sharpening metric depth predictions using diffusion distillation

Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, and Rang Nguyen. Sharpdepth: Sharpening metric depth predictions using diffusion distillation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17060–17069, 2025

2025

[59] [59]

Unidepth: Universal monocular metric depth estimation.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024

2024

[60] [60]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

2020

[62] [62]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4209–4219, 2023

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4209–4219, 2023

2024

[63] [63]

Gen3c: 3d-informed world- consistent video generation with precise camera control.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6121–6132, 2025

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Muller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6121–6132, 2025. 14

2025

[64] [64]

Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021

2022

[65] [65]

High-resolution image synthesis with latent diffusion models, 2021

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

2021

[66] [66]

Learning depth from single monocular images.Advances in neural information processing systems, 18, 2005

Ashutosh Saxena, Sung Chung, and Andrew Ng. Learning depth from single monocular images.Advances in neural information processing systems, 18, 2005

2005

[67] [67]

Make3d: Learning 3d scene structure from a single still image.IEEE transactions on pattern analysis and machine intelligence, 31(5):824– 840, 2008

Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image.IEEE transactions on pattern analysis and machine intelligence, 31(5):824– 840, 2008

2008

[68] [68]

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and X. Yang. Mvdream: Multi-view diffusion for 3d generation.ArXiv, abs/2308.16512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[69] [69]

Implicit neural representations with periodic activation functions.Advances in neural informa- tion processing systems, 33:7462–7473, 2020

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions.Advances in neural informa- tion processing systems, 33:7462–7473, 2020

2020

[70] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190–198, 2016

[71] Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, and Song Han. TorchSparse: Efficient Point Cloud Inference Engine. In Conference on Machine Learning and Systems (MLSys), 2022

[72] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, 2024

[73] Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024

[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017

[75] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2022–2030, 2018

[76] Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17723–17732, 2024

[77] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

[78] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025

[79] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025

[80] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024