pith. sign in

arxiv: 2605.16807 · v1 · pith:X43KUQV7new · submitted 2026-05-16 · 💻 cs.CV

DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

Pith reviewed 2026-05-19 21:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstructionsingle viewdiffusion modelsobject decompositionscene meshnovel view synthesisdifferentiable rendering
0
0 comments X

The pith

DecoRec reconstructs 3D scenes from single-view images by diffusing objects individually then refining their merge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DecoRec aims to create accurate 3D scene meshes from a single 2D photo by breaking the scene into separate objects. Each object is reconstructed using diffusion models trained for single-view object reconstruction. These pieces are then assembled and refined with differentiable rendering and diffusion guidance to fix geometry and appearance issues. This matters because direct scene reconstruction often fails due to poor datasets and coarse methods, while this object-wise approach promises better fidelity for tasks like designing virtual rooms.

Core claim

The central claim is that by reconstructing each object in the scene separately with diffusion-based single-view methods and then merging them through a refinement pipeline that uses differentiable rendering and diffusion guidance, high-quality 3D scene reconstruction and novel view synthesis become possible from just one image.

What carries the argument

The decomposition of the scene into individual objects reconstructed via diffusion models, followed by a refinement pipeline for merging using differentiable rendering and diffusion-guided adjustments.

Load-bearing premise

That the refinement pipeline can reliably resolve any geometric or appearance inconsistencies introduced when merging the separately reconstructed objects.

What would settle it

Observing a case where individually accurate object reconstructions lead to a merged scene with uncorrectable errors in surface alignment or texture continuity despite the refinement steps.

Figures

Figures reproduced from arXiv: 2605.16807 by Cheng Lin, Jianyi Zheng, Jia Pan, Junhui Hou, Peng Wang, Xiaoxiao Long, Xin Li, Yuan Liu, Yuhan Ping.

Figure 1
Figure 1. Figure 1: Given a single viewpoint photo of an indoor scene or an image of [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Refinement comparison. Single-view reconstruction method cannot [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our pipeline. Given an input image and object masks, our method first performs a coarse decomposition and reconstruction for both [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Background reconstruction. We reconstruct a complete background [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison with other methods on the real-world dataset. From top to bottom, we show one novel-view rendering and the corresponding [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison with other methods on the synthetic dataset. From top to bottom, we show two novel views and the corresponding 3D [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison with Gen3DSR on stylized inputs. [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation studies on different loss settings. Without any designed losses, it will cause issues like texture blurring, structure distortion, and black borders. [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Flexibility of scene editing. Our method enables decomposed 3D lifting of 2D images and thus, users can perform different editing operations [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional results with implementation of Hunyuan3D-2 and [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: More comparisons between coarse and refinement stages. [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Result gallery of qualitative comparison with other methods. [PITH_FULL_IMAGE:figures/full_fig_p009_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Result gallery of qualitative comparison with other methods. [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Result gallery of qualitative comparison with other methods. [PITH_FULL_IMAGE:figures/full_fig_p010_15.png] view at source ↗
Figure 18
Figure 18. Figure 18: Comparisons of background reconstruction results. The middle and [PITH_FULL_IMAGE:figures/full_fig_p010_18.png] view at source ↗
Figure 16
Figure 16. Figure 16: Result of our method on more types of scenes, including kitchen, [PITH_FULL_IMAGE:figures/full_fig_p010_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison with Gen3DSR by giving the same mask as our method. [PITH_FULL_IMAGE:figures/full_fig_p010_17.png] view at source ↗
read the original abstract

In this paper, we introduce \textit{DecoRec}, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DecoRec, a system for single-view 3D scene reconstruction that decomposes the input into individual objects, reconstructs each using existing diffusion-based single-view object methods, estimates poses to place them, and then applies a refinement pipeline with differentiable rendering and diffusion guidance to produce a coherent scene mesh. The central claim is that this yields high-quality geometry and novel-view synthesis superior to direct scene-level approaches, without requiring large scene datasets.

Significance. If validated, the decomposed approach would be a useful contribution to single-view scene reconstruction by leveraging strong object-level priors and a post-merging refinement stage. Credit is given for the modular design that reuses external diffusion models and for proposing an independent merging/refinement stage rather than end-to-end scene generation. The emphasis on both geometry and appearance consistency via differentiable rendering is a reasonable direction for addressing under-constrained single-view problems.

major comments (2)
  1. [Results / Experiments] The abstract states that results demonstrate high-quality geometry and novel-view synthesis, yet the supplied text contains no quantitative metrics (e.g., Chamfer distance, IoU, or PSNR), ablation studies, or error analysis. This is load-bearing for the central claim; the results section must include controlled comparisons to baselines and targeted failure-case analysis (occlusions, contacts) to substantiate superiority.
  2. [Method / Refinement pipeline] The refinement pipeline (described after object reconstruction) relies on differentiable rendering plus diffusion guidance to resolve inter-object inconsistencies. Without explicit 3D consistency terms (e.g., contact or penetration losses) or ablation on scenes with touching/occluding objects, it is unclear whether residual geometric drift or appearance seams are reliably eliminated; this directly affects the coherence guarantee.
minor comments (2)
  1. [Method] Notation for the merging stage (pose estimation, rigid placement) should be formalized with equations or a clear algorithmic outline to improve reproducibility.
  2. [Abstract] The abstract would benefit from naming the specific object-reconstruction diffusion models used and the scene categories evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and describe the changes planned for the revised manuscript.

read point-by-point responses
  1. Referee: [Results / Experiments] The abstract states that results demonstrate high-quality geometry and novel-view synthesis, yet the supplied text contains no quantitative metrics (e.g., Chamfer distance, IoU, or PSNR), ablation studies, or error analysis. This is load-bearing for the central claim; the results section must include controlled comparisons to baselines and targeted failure-case analysis (occlusions, contacts) to substantiate superiority.

    Authors: We agree that the absence of quantitative metrics weakens the central claim. The current manuscript emphasizes qualitative results partly because of the scarcity of standardized large-scale scene benchmarks for single-view reconstruction. In the revision we will add controlled quantitative comparisons using Chamfer distance and IoU for geometry as well as PSNR and SSIM for novel-view synthesis, together with ablations and a dedicated failure-case analysis focused on occlusions and object contacts. revision: yes

  2. Referee: [Method / Refinement pipeline] The refinement pipeline (described after object reconstruction) relies on differentiable rendering plus diffusion guidance to resolve inter-object inconsistencies. Without explicit 3D consistency terms (e.g., contact or penetration losses) or ablation on scenes with touching/occluding objects, it is unclear whether residual geometric drift or appearance seams are reliably eliminated; this directly affects the coherence guarantee.

    Authors: The refinement stage optimizes the merged scene via differentiable rendering under diffusion guidance from the input and generated novel views; the strong object-level priors embedded in the diffusion model are intended to reduce inter-object drift and seams without hand-crafted contact losses. We acknowledge that this mechanism would be clearer with targeted evidence. In the revision we will therefore include ablations on scenes containing touching and occluding objects, reporting both qualitative coherence and quantitative consistency metrics. revision: partial

Circularity Check

0 steps flagged

No significant circularity; method builds on external priors with independent refinement stage

full rationale

The paper describes a pipeline that applies existing diffusion-based single-view object reconstruction methods to individual objects, followed by a separate refinement stage using differentiable rendering and diffusion guidance to merge them. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs are present in the abstract or described approach. The derivation is self-contained against external benchmarks (prior diffusion models) and does not rely on self-definitional steps or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, mathematical axioms, or newly postulated entities; the approach relies on existing diffusion models for single objects and standard differentiable rendering without stating additional unproven assumptions beyond the effectiveness of the proposed merging stage.

pith-pipeline@v0.9.0 · 5718 in / 1162 out tokens · 62634 ms · 2026-05-19T21:06:34.650006+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 9 internal anchors

  1. [1]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR, 2022

  2. [2]

    Monoscene: Monocular 3d semantic scene completion,

    A.-Q. Cao and R. De Charette, “Monoscene: Monocular 3d semantic scene completion,” inCVPR, 2022

  3. [3]

    Corenet: Coherent 3d scene reconstruction from a single rgb image,

    S. Popov, P. Bauszat, and V . Ferrari, “Corenet: Coherent 3d scene reconstruction from a single rgb image,” inECCV, 2020. MANUSCRIPT SUBMITTED TO IEEE TVCG 11

  4. [4]

    To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image,

    Y . Nie, X. Han, S. Guo, Y . Zheng, J. Chang, and J. J. Zhang, “To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image,” inCVPR, 2020

  5. [5]

    Psdr-room: Single photo to scene using differentiable rendering,

    K. Yan, F. Luan, M. Ha ˇsan, T. Groueix, V . Deschaintre, and S. Zhao, “Psdr-room: Single photo to scene using differentiable rendering,” in SIGGRAPH Asia 2023, 2023

  6. [6]

    Roca: Robust cad model retrieval and alignment from a single image,

    C. G ¨umeli, A. Dai, and M. Nießner, “Roca: Robust cad model retrieval and alignment from a single image,” inCVPR, 2022

  7. [7]

    Patch2cad: Patchwise embedding learning for in-the-wild shape retrieval from a single image,

    W. Kuo, A. Angelova, T.-Y . Lin, and A. Dai, “Patch2cad: Patchwise embedding learning for in-the-wild shape retrieval from a single image,” inICCV, 2021

  8. [8]

    Generalizing single- view 3d shape retrieval to occlusions and unseen objects,

    Q. Wu, D. Ritchie, M. Savva, and A. X. Chang, “Generalizing single- view 3d shape retrieval to occlusions and unseen objects,”arXiv preprint arXiv:2401.00405, 2023

  9. [9]

    DreamFusion: Text-to-3D using 2D Diffusion

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,”arXiv preprint arXiv:2209.14988, 2022

  10. [10]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Y . Liu, C. Lin, Z. Zeng, X. Long, L. Liu, T. Komura, and W. Wang, “Syncdreamer: Generating multiview-consistent images from a single- view image,”arXiv preprint arXiv:2309.03453, 2023

  11. [11]

    Zero-1-to-3: Zero-shot one image to 3d object,

    R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” inCVPR, 2023

  12. [12]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,”arXiv preprint arXiv:2310.15110, 2023

  13. [13]

    Wonder3d: Single image to 3d using cross-domain diffusion.arXiv preprint arXiv:2310.15008, 2023

    X. Long, Y .-C. Guo, C. Lin, Y . Liu, Z. Dou, L. Liu, Y . Ma, S.-H. Zhang, M. Habermann, C. Theobaltet al., “Wonder3d: Single image to 3d using cross-domain diffusion,”arXiv preprint arXiv:2310.15008, 2023

  14. [14]

    Objaverse: A universe of annotated 3d objects,

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 142–13 153

  15. [15]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inCVPR, 2023

  16. [16]

    Crm: Single image to 3d textured mesh with convolutional reconstruction model,

    Z. Wang, Y . Wang, Y . Chen, C. Xiang, S. Chen, D. Yu, C. Li, H. Su, and J. Zhu, “Crm: Single image to 3d textured mesh with convolutional reconstruction model,”arXiv preprint arXiv:2403.05034, 2024

  17. [17]

    Modular primitives for high-performance differentiable rendering,

    S. Laine, J. Hellsten, T. Karras, Y . Seol, J. Lehtinen, and T. Aila, “Modular primitives for high-performance differentiable rendering,” ACM Transactions on Graphics, vol. 39, no. 6, 2020

  18. [18]

    Instructpix2pix: Learning to follow image editing instructions,

    T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” inCVPR, 2023

  19. [19]

    Pixel- wise view selection for unstructured multi-view stereo,

    J. L. Sch ¨onberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixel- wise view selection for unstructured multi-view stereo,” inEuropean Conference on Computer Vision (ECCV), 2016

  20. [20]

    Kinectfusion: real-time 3d reconstruction and interaction using a moving depth cam- era,

    S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davisonet al., “Kinectfusion: real-time 3d reconstruction and interaction using a moving depth cam- era,” inProceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 559–568

  21. [21]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

  22. [22]

    Superprimitive: Scene recon- struction at a primitive level,

    K. Mazur, G. Bae, and A. J. Davison, “Superprimitive: Scene recon- struction at a primitive level,”arXiv preprint arXiv:2312.05889, 2023

  23. [23]

    Panoptic 3d scene reconstruction from a single rgb image,

    M. Dahnert, J. Hou, M. Nießner, and A. Dai, “Panoptic 3d scene reconstruction from a single rgb image,”Advances in Neural Information Processing Systems, vol. 34, pp. 8282–8293, 2021

  24. [24]

    Know your neighbors: Improving single-view reconstruction via spatial vision-language reasoning,

    R. Li, T. Fischer, M. Segu, M. Pollefeys, L. Van Gool, and F. Tombari, “Know your neighbors: Improving single-view reconstruction via spatial vision-language reasoning,”arXiv preprint arXiv:2404.03658, 2024

  25. [25]

    Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion,

    J. Yao and J. Zhang, “Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion,”arXiv preprint arXiv:2311.17084, 2023

  26. [26]

    Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,

    J. Yao, C. Li, K. Sun, Y . Cai, H. Li, W. Ouyang, and H. Li, “Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2023, pp. 9421– 9431

  27. [27]

    Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields,

    A.-Q. Cao and R. de Charette, “Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9387–9398

  28. [28]

    Rico: Regularizing the unobservable for indoor compositional reconstruction,

    Z. Li, X. Lyu, Y . Ding, M. Wang, Y . Liao, and Y . Liu, “Rico: Regularizing the unobservable for indoor compositional reconstruction,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 761–17 771

  29. [29]

    Holistic 3d scene understanding from a single image with implicit representa- tion,

    C. Zhang, Z. Cui, Y . Zhang, B. Zeng, M. Pollefeys, and S. Liu, “Holistic 3d scene understanding from a single image with implicit representa- tion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8833–8842

  30. [30]

    Towards high-fidelity single-view holistic reconstruction of indoor scenes,

    H. Liu, Y . Zheng, G. Chen, S. Cui, and X. Han, “Towards high-fidelity single-view holistic reconstruction of indoor scenes,” inEuropean Con- ference on Computer Vision. Springer, 2022, pp. 429–446

  31. [31]

    Single- view 3d scene reconstruction with high-fidelity shape and texture,

    Y . Chen, J. Ni, N. Jiang, Y . Zhang, Y . Zhu, and S. Huang, “Single- view 3d scene reconstruction with high-fidelity shape and texture,” in 2024 International Conference on 3D Vision (3DV). IEEE, 2024, pp. 1456–1467

  32. [32]

    Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image,

    T. Chu, P. Zhang, Q. Liu, and J. Wang, “Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4937–4946

  33. [33]

    Uni-3d: A universal model for panoptic 3d scene reconstruction,

    X. Zhang, Z. Chen, F. Wei, and Z. Tu, “Uni-3d: A universal model for panoptic 3d scene reconstruction,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9256–9266

  34. [34]

    Open: Occlusion-invariant perception network for single image-based 3d shape retrieval,

    F. Chu, Y . Cong, and R. Chen, “Open: Occlusion-invariant perception network for single image-based 3d shape retrieval,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

  35. [35]

    Diffcad: Weakly- supervised probabilistic cad model retrieval and alignment from an rgb image,

    D. Gao, D. Rozenberszki, S. Leutenegger, and A. Dai, “Diffcad: Weakly- supervised probabilistic cad model retrieval and alignment from an rgb image,”arXiv preprint arXiv:2311.18610, 2023

  36. [36]

    Generalizable 3d scene recon- struction via divide and conquer from a single view,

    A. Dogaru, M. ¨Ozer, and B. Egger, “Generalizable 3d scene recon- struction via divide and conquer from a single view,”arXiv preprint arXiv:2404.03421, 2024

  37. [37]

    Comboverse: Compositional 3d assets creation using spatially-aware diffusion guid- ance,

    Y . Chen, T. Wang, T. Wu, X. Pan, K. Jia, and Z. Liu, “Comboverse: Compositional 3d assets creation using spatially-aware diffusion guid- ance,”arXiv preprint arXiv:2403.12409, 2024

  38. [38]

    3d cinemagraphy from a single image,

    X. Li, Z. Cao, H. Sun, J. Zhang, K. Xian, and G. Lin, “3d cinemagraphy from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4595–4605

  39. [39]

    Sine: Semantic-driven image-based nerf editing with prior- guided editing field,

    C. Bao, Y . Zhang, B. Yang, T. Fan, Z. Yang, H. Bao, G. Zhang, and Z. Cui, “Sine: Semantic-driven image-based nerf editing with prior- guided editing field,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 919–20 929

  40. [40]

    Lolep: Single-view view syn- thesis with locally-learned planes and self-attention occlusion inference,

    C. Wang, Y .-P. Wang, and D. Manocha, “Lolep: Single-view view syn- thesis with locally-learned planes and self-attention occlusion inference,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 841–10 851

  41. [41]

    Perf: Panoramic neural radiance field from a single panorama,

    G. Wang, P. Wang, Z. Chen, W. Wang, C. C. Loy, and Z. Liu, “Perf: Panoramic neural radiance field from a single panorama,”arXiv preprint arXiv:2310.16831, 2023

  42. [42]

    As-deformable-as- possible single-image-based view synthesis without depth prior,

    C. Zhang, C. Lin, K. Liao, L. Nie, and Y . Zhao, “As-deformable-as- possible single-image-based view synthesis without depth prior,”IEEE Transactions on Circuits and Systems for Video Technology, 2023

  43. [43]

    Sinmpi: Novel view synthesis from a single image with expanded multiplane images,

    G. Pu, P.-S. Wang, and Z. Lian, “Sinmpi: Novel view synthesis from a single image with expanded multiplane images,” inSIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10

  44. [44]

    Single-view view synthesis in the wild with learned adaptive multiplane images,

    Y . Han, R. Wang, and J. Yang, “Single-view view synthesis in the wild with learned adaptive multiplane images,” inACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–8

  45. [45]

    Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

    J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, “Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,”arXiv preprint arXiv:2311.13384, 2023

  46. [46]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srini- vasan, J. T. Barron, and B. Poole, “Cat3d: Create anything in 3d with multi-view diffusion models,”arXiv preprint arXiv:2405.10314, 2024

  47. [47]

    Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,

    J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” arXiv preprint arXiv:2404.07199, 2024

  48. [48]

    Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields,

    A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, and A. Levinshtein, “Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 669–20 679

  49. [49]

    Bolt3d: Generating 3d scenes in seconds,

    S. Szymanowicz, J. Y . Zhang, P. Srinivasan, R. Gao, A. Brussee, A. Holynski, R. Martin-Brualla, J. T. Barron, and P. Henzler, “Bolt3d: Generating 3d scenes in seconds,”arXiv preprint arXiv:2503.14445, 2025

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view MANUSCRIPT SUBMITTED TO IEEE TVCG 12 synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view MANUSCRIPT SUBMITTED TO IEEE TVCG 12 synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021

  51. [51]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

  52. [52]

    Diffuscene: Denoising diffusion models for gerative indoor scene synthesis,

    J. Tang, Y . Nie, L. Markhasin, A. Dai, J. Thies, and M. Nießner, “Diffuscene: Denoising diffusion models for gerative indoor scene synthesis,” inProceedings of the ieee/cvf conference on computer vision and pattern recognition, 2024

  53. [53]

    Commonscenes: Generating commonsense 3d indoor scenes with scene graphs,

    G. Zhai, E. P. ¨Ornek, S.-C. Wu, Y . Di, F. Tombari, N. Navab, and B. Busam, “Commonscenes: Generating commonsense 3d indoor scenes with scene graphs,”Advances in Neural Information Processing Systems, vol. 36, 2024

  54. [54]

    arXiv preprint arXiv:2402.04717 , year=

    C. Lin and Y . Mu, “Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior,”arXiv preprint arXiv:2402.04717, 2024

  55. [55]

    Furniscene: A large-scale 3d room dataset with intricate furnishing scenes,

    G. Zhang, Y . Wang, C. Luo, S. Xu, J. Peng, Z. Zhang, and M. Zhang, “Furniscene: A large-scale 3d room dataset with intricate furnishing scenes,”arXiv preprint arXiv:2401.03470, 2024

  56. [56]

    Echoscene: Indoor scene generation via information echo over scene graph diffusion,

    G. Zhai, E. P. ¨Ornek, D. Z. Chen, R. Liao, Y . Di, N. Navab, F. Tombari, and B. Busam, “Echoscene: Indoor scene generation via information echo over scene graph diffusion,”arXiv preprint arXiv:2405.00915, 2024

  57. [57]

    Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,

    Z. Wu, Y . Li, H. Yan, T. Shang, W. Sun, S. Wang, R. Cui, W. Liu, H. Sato, H. Liet al., “Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,”arXiv preprint arXiv:2401.17053, 2024

  58. [58]

    Scenewiz3d: Towards text-guided 3d scene composition,

    Q. Zhang, C. Wang, A. Siarohin, P. Zhuang, Y . Xu, C. Yang, D. Lin, B. Zhou, S. Tulyakov, and H.-Y . Lee, “Scenewiz3d: Towards text-guided 3d scene composition,”arXiv preprint arXiv:2312.08885, 2023

  59. [59]

    Graphdreamer: Compositional 3d scene synthesis from scene graphs,

    G. Gao, W. Liu, A. Chen, A. Geiger, and B. Sch ¨olkopf, “Graphdreamer: Compositional 3d scene synthesis from scene graphs,”arXiv preprint arXiv:2312.00093, 2023

  60. [60]

    Text2scene: Text-driven in- door scene stylization with part-aware details,

    I. Hwang, H. Kim, and Y . M. Kim, “Text2scene: Text-driven in- door scene stylization with part-aware details,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1890–1899

  61. [61]

    Text2nerf: Text-driven 3d scene generation with neural radiance fields,

    J. Zhang, X. Li, Z. Wan, C. Wang, and J. Liao, “Text2nerf: Text-driven 3d scene generation with neural radiance fields,”IEEE Transactions on Visualization and Computer Graphics, 2024

  62. [62]

    Dreamscene360: Unconstrained text-to- 3d scene generation with panoramic gaussian splatting,

    S. Zhou, Z. Fan, D. Xu, H. Chang, P. Chari, T. Bharadwaj, S. You, Z. Wang, and A. Kadambi, “Dreamscene360: Unconstrained text-to- 3d scene generation with panoramic gaussian splatting,”arXiv preprint arXiv:2404.06903, 2024

  63. [63]

    Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,

    H. Li, H. Shi, W. Zhang, W. Wu, Y . Liao, L. Wang, L.-h. Lee, and P. Zhou, “Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,”arXiv preprint arXiv:2404.03575, 2024

  64. [64]

    Text2immersion: Generative immersive scene with 3d gaussians,

    H. Ouyang, K. Heal, S. Lombardi, and T. Sun, “Text2immersion: Generative immersive scene with 3d gaussians,”arXiv preprint arXiv:2312.09242, 2023

  65. [65]

    Controlroom3d: Room generation using semantic proxy rooms,

    J. Schult, S. Tsai, L. H ¨ollein, B. Wu, J. Wang, C.-Y . Ma, K. Li, X. Wang, F. Wimbauer, Z. Heet al., “Controlroom3d: Room generation using semantic proxy rooms,”arXiv preprint arXiv:2312.05208, 2023

  66. [66]

    Ctrl-room: Controllable text- to-3d room meshes generation with layout constraints,

    C. Fang, X. Hu, K. Luo, and P. Tan, “Ctrl-room: Controllable text- to-3d room meshes generation with layout constraints,”arXiv preprint arXiv:2310.03602, 2023

  67. [67]

    3d-scenedreamer: Text-driven 3d-consistent scene generation,

    F. Zhang, Y . Zhang, Q. Zheng, R. Ma, W. Hua, H. Bao, W. Xu, and C. Zou, “3d-scenedreamer: Text-driven 3d-consistent scene generation,” arXiv preprint arXiv:2403.09439, 2024

  68. [68]

    Showroom3d: Text to high-quality 3d room generation using 3d priors,

    W. Mao, Y .-P. Cao, J.-W. Liu, Z. Xu, and M. Z. Shou, “Showroom3d: Text to high-quality 3d room generation using 3d priors,”arXiv preprint arXiv:2312.13324, 2023

  69. [69]

    Fastscene: Text-driven fast 3d indoor scene generation via panoramic gaussian splatting,

    Y . Ma, D. Zhan, and Z. Jin, “Fastscene: Text-driven fast 3d indoor scene generation via panoramic gaussian splatting,”arXiv preprint arXiv:2405.05768, 2024

  70. [70]

    360dvd: Controllable panorama video generation with 360-degree video diffusion model,

    Q. Wang, W. Li, C. Mou, X. Cheng, and J. Zhang, “360dvd: Controllable panorama video generation with 360-degree video diffusion model,” arXiv preprint arXiv:2401.06578, 2024

  71. [71]

    MVDream: Multi-view Diffusion for 3D Generation

    Y . Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi- view diffusion for 3d generation,”arXiv preprint arXiv:2308.16512, 2023

  72. [72]

    arXiv preprint arXiv:2312.02201 , year=

    P. Wang and Y . Shi, “Imagedream: Image-prompt multi-view diffusion for 3d generation,”arXiv preprint arXiv:2312.02201, 2023

  73. [73]

    Direct2. 5: Diverse text-to-3d generation via multi-view 2.5 d diffusion,

    Y . Lu, J. Zhang, S. Li, T. Fang, D. McKinnon, Y . Tsin, L. Quan, X. Cao, and Y . Yao, “Direct2. 5: Diverse text-to-3d generation via multi-view 2.5 d diffusion,”arXiv preprint arXiv:2311.15980, 2023

  74. [74]

    Shap-E: Generating Conditional 3D Implicit Functions

    H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,”arXiv preprint arXiv:2305.02463, 2023

  75. [75]

    Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers.arXiv preprint arXiv:2312.09147, 2023

    Z.-X. Zou, Z. Yu, Y .-C. Guo, Y . Li, D. Liang, Y .-P. Cao, and S.-H. Zhang, “Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers,”arXiv preprint arXiv:2312.09147, 2023

  76. [76]

    Instant-3d: Instant neural radiance field training towards on-device ar/vr 3d reconstruction,

    S. Li, C. Li, W. Zhu, B. Yu, Y . Zhao, C. Wan, H. You, H. Shi, and Y . Lin, “Instant-3d: Instant neural radiance field training towards on-device ar/vr 3d reconstruction,” inProceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

  77. [77]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,”arXiv preprint arXiv:2309.16653, 2023

  78. [78]

    Objaverse-xl: A uni- verse of 10m+ 3d objects,

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadreet al., “Objaverse-xl: A uni- verse of 10m+ 3d objects,”Advances in Neural Information Processing Systems, vol. 36, 2024

  79. [79]

    Mvdiffusion: Enabling holistic multi-view image generation with correspondence- aware diffusion,

    S. Tang, F. Zhang, J. Chen, P. Wang, and F. Yasutaka, “Mvdiffusion: Enabling holistic multi-view image generation with correspondence- aware diffusion,”arXiv preprint 2307.01097, 2023

  80. [80]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

    X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,”arXiv preprint arXiv:2403.12013, 2024

Showing first 80 references.