SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

Chuanyu Pan; Hieu Trung Nguyen; Jeonghwan Kim; Xingang Pan; Yongwei Chen; Yushi Lan

arXiv: 2606.08402 · v3 · pith:UKKPVCZPnew · submitted 2026-06-07 · 💻 cs.CV · cs.AI · cs.MA

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

Pith reviewed 2026-06-27 19:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MA

keywords 3D scene generation · single image · multi-agent orchestration · layout prediction · geometric priors · scene refinement · spatial consistency

The pith

A multi-agent orchestration framework decomposes single-image 3D scene generation into initialization, environment construction, and planner-driven refinement stages to achieve higher geometric accuracy and consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that breaking 3D scene creation from one photo into three coordinated stages produces more consistent results than prior methods that handle everything together. Initialization pulls object masks and an initial layout from the image, environment construction adds surfaces and lighting using point maps, and refinement uses one planner agent to flag problems while specialist agents fix them locally before reassembly. This structure matters because single images leave many spatial relationships ambiguous, and current pipelines require heavy scene-level labels that limit real-world use. A geometry-aware layout predictor trained only on sparse geometric clues from point maps further cuts the data needed. If the approach works as described, it would let systems generate usable 3D environments from ordinary photos with less manual annotation and fewer global errors.

Core claim

The paper claims that its multi-agent orchestration framework produces scenes with superior geometric accuracy, spatial consistency, and perceptual realism on benchmarks, while training the layout predictor from segmentation-level data rather than full scene annotations. The framework has three stages: (1) scene initialization from image-derived object masks and a geometry-aware layout predictor; (2) environment construction, which builds supporting surfaces and illumination from point-map geometry; and (3) refinement, in which a planner agent identifies inconsistencies and dispatches specialist agents for localized revisions that are reintegrated globally.
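The three stages named in the claim can be pictured as a plain data flow. The sketch below is illustrative only: every name in it (`Scene`, `initialize`, `build_environment`, `refine`, `generate_scene`) is hypothetical, and the stage bodies are stubs, because the abstract does not describe the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """Hypothetical container for the evolving scene state."""
    objects: list = field(default_factory=list)      # per-object placements
    environment: dict = field(default_factory=dict)  # surfaces, lighting, ...
    log: list = field(default_factory=list)          # stages applied so far


def initialize(masks):
    """Stage 1 (stub): one coarse object per image-derived mask."""
    scene = Scene(objects=[{"mask": m, "pose": None} for m in masks])
    scene.log.append("initialization")
    return scene


def build_environment(scene, point_map):
    """Stage 2 (stub): attach an environmental scaffold from point-map geometry."""
    scene.environment = {"floor": None, "walls": [], "lighting": None,
                         "n_points": len(point_map)}
    scene.log.append("environment construction")
    return scene


def refine(scene):
    """Stage 3 (stub): placeholder for planner-driven refinement."""
    scene.log.append("refinement")
    return scene


def generate_scene(masks, point_map):
    """Run the three stages in the order the abstract gives."""
    return refine(build_environment(initialize(masks), point_map))


scene = generate_scene(masks=["chair", "table"], point_map=[(0.0, 0.0, 1.0)])
print(scene.log)
```

The point of the structure is that each stage consumes the previous stage's output, so errors can be attributed to a stage rather than to one entangled model.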

What carries the argument

The multi-agent orchestration framework that separates initialization, environment construction, and refinement, with a planner agent identifying inconsistencies and specialist agents performing localized revisions reintegrated into the global scene, plus the geometry-aware layout predictor supervised by sparse geometric priors from point maps.

If this is right

The layout predictor generalizes to diverse real-world scenes when trained only on segmentation-level data and sparse geometric priors.
The staged pipeline reduces reliance on extensive scene-level annotations during training.
Benchmark results show consistent gains in geometric accuracy, spatial consistency, and perceptual realism over prior holistic methods.
Localized revisions by specialist agents can be reintegrated into the global scene without breaking overall coherence.
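One possible reading of "sparse geometric priors from point maps" is that each segmentation mask selects a subset of 3D points, from which a coarse per-object target (centroid and axis-aligned extent) is computed and compared with the predicted layout. The sketch below follows that reading. It is an assumption, not the paper's loss, and `object_prior` and `layout_l1` are hypothetical names.

```python
def object_prior(point_map, mask):
    """Centroid and axis-aligned extent of the 3D points under a mask.

    point_map: H x W nested list of (x, y, z) tuples.
    mask:      H x W nested list of booleans for one object.
    """
    pts = [p for row_p, row_m in zip(point_map, mask)
           for p, m in zip(row_p, row_m) if m]
    if not pts:
        raise ValueError("mask selects no points")
    axes = list(zip(*pts))                      # (xs, ys, zs)
    centroid = tuple(sum(a) / len(a) for a in axes)
    extent = tuple(max(a) - min(a) for a in axes)
    return centroid, extent


def layout_l1(pred_center, pred_size, prior):
    """Mean absolute error between a predicted box and the prior (illustrative)."""
    centroid, extent = prior
    diffs = [abs(a - b)
             for a, b in zip(pred_center + pred_size, centroid + extent)]
    return sum(diffs) / len(diffs)


# Tiny 2 x 2 example: the mask covers the top row only.
point_map = [[(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
             [(0.0, 1.0, 4.0), (1.0, 1.0, 4.0)]]
mask = [[True, True],
        [False, False]]
prior = object_prior(point_map, mask)
loss = layout_l1((0.5, 0.0, 2.0), (1.0, 0.0, 0.0), prior)
```

A target built this way needs only a mask and a point map, which is consistent with the claim that scene-level layout labels are not required. Note that visible-surface points cover only the camera-facing side of an object, so such a prior is partial by construction.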

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The staged orchestration could be extended to video input by adding a temporal-consistency specialist agent that enforces frame-to-frame stability.
Lower supervision requirements might allow the same decomposition to support casual photo-to-3D workflows in consumer applications such as interior design visualization.
The planner-plus-specialist pattern suggests that other complex generative tasks could benefit from explicit inconsistency detection before local fixes rather than end-to-end generation.

Load-bearing premise

The planner agent can reliably detect structural and visual inconsistencies while specialist agents can revise them locally and reintegrate the changes without creating new errors.
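The premise describes a detect, dispatch, reintegrate loop. A toy sketch of that control flow follows. The rule-based `plan` function and the two specialists are stand-ins invented for illustration (the abstract does not specify how the paper's agents are implemented), and the scene is reduced to object heights above a floor at 0.

```python
def plan(scene):
    """Hypothetical planner: return (object_id, issue) pairs it flags."""
    issues = []
    for obj_id, obj in scene.items():
        if obj["height"] < 0:                      # object sunk below the floor
            issues.append((obj_id, "below_floor"))
        elif obj["height"] > 0 and not obj["supported"]:
            issues.append((obj_id, "floating"))
    return issues


def fix_below_floor(obj):
    """Specialist (stub): lift the object back onto the floor plane."""
    return {**obj, "height": 0.0}


def fix_floating(obj):
    """Specialist (stub): drop an unsupported object to the floor."""
    return {**obj, "height": 0.0}


SPECIALISTS = {"below_floor": fix_below_floor, "floating": fix_floating}


def refine(scene, max_rounds=3):
    """Planner flags issues, specialists revise locally, results are merged back.

    Re-running the planner each round is what would catch a revision
    that introduces a new inconsistency.
    """
    scene = dict(scene)                            # leave the input untouched
    for _ in range(max_rounds):
        issues = plan(scene)
        if not issues:
            break
        for obj_id, issue in issues:
            scene[obj_id] = SPECIALISTS[issue](scene[obj_id])  # reintegrate
    return scene


coarse = {
    "chair": {"height": -0.2, "supported": False},
    "lamp": {"height": 0.5, "supported": False},
    "book": {"height": 0.7, "supported": True},
}
refined = refine(coarse)
```

In this toy setting the loop converges in one round. The premise is load-bearing precisely because a real planner has no such guarantee: a missed issue is never dispatched, and a bad local fix is only caught if the planner can see it on the next pass.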

What would settle it

A controlled test on scenes containing subtle geometric mismatches where the planner misses at least half the inconsistencies and the final outputs show no measurable gain or a drop in accuracy metrics compared with prior single-pipeline baselines.
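Such a test amounts to measuring the planner's detection recall on scenes with known, injected inconsistencies and comparing final error against a baseline. A minimal sketch of that bookkeeping, with hypothetical function names and made-up placeholder inputs (not results from the paper):

```python
def detection_recall(injected, flagged):
    """Fraction of known (injected) inconsistencies that the planner flagged."""
    injected, flagged = set(injected), set(flagged)
    if not injected:
        raise ValueError("need at least one injected inconsistency")
    return len(injected & flagged) / len(injected)


def premise_fails(recall, ours_error, baseline_error):
    """True when the planner misses at least half the inconsistencies
    and the final error is no better than the baseline's."""
    return recall <= 0.5 and ours_error >= baseline_error


# Illustrative placeholder values only.
injected = {"chair:floating", "table:intersects_wall", "lamp:scale", "rug:tilt"}
flagged = {"chair:floating", "sofa:scale"}
recall = detection_recall(injected, flagged)
```

Flags that do not correspond to an injected issue (here `sofa:scale`) do not count toward recall; a fuller evaluation would also track them as false positives.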

Figures

Figures reproduced from arXiv: 2606.08402 by Chuanyu Pan, Hieu Trung Nguyen, Jeonghwan Kim, Xingang Pan, Yongwei Chen, Yushi Lan.

**Figure 2.** Overall structure of the geometry-aware layout predictor. The model takes object instances …

**Figure 3.** Qualitative comparison of our geometry-aware layout prediction against SceneGen and …

**Figure 4.** Qualitative comparison of each stage in our SceneConductor and VIGA on MIT-Indoor-67.

**Figure 5.** Qualitative ablation results. Adding geometry loss, floor rotation, and segmentation data …

Abstract

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The multi-agent staging and sparse-prior layout predictor are the concrete moves here, but the abstract's performance claims lack any supporting numbers or setup.

The paper's main contribution is a three-stage pipeline that splits single-image 3D scene generation into initialization (masks plus layout), environment construction from point maps, and then a planner-plus-specialists refinement loop. The geometry-aware layout predictor trained only on sparse priors from point maps is the clearest practical step; it sidesteps the need for full scene-level labels and should generalize better to real scenes than fully supervised alternatives.

The decomposition itself is reasonable. Entangling layout, geometry, materials, and lighting in one model has been a recurring pain point, so handing inconsistencies to a planner that can call targeted agents makes sense on paper. If the reintegration step works without cascading errors, it could be a useful engineering pattern.

The obvious gap is that the abstract asserts consistent gains in geometric accuracy, spatial consistency, and perceptual realism with no metrics, baselines, datasets, or ablations attached. The planner's reliability at spotting structural problems and the specialists' ability to fix them locally are load-bearing assumptions that are simply stated rather than demonstrated in the provided text. Without those details it is difficult to tell whether the gains come from the orchestration or from other unmentioned factors.

This is for people already working on single-image scene synthesis or multi-agent vision pipelines. A reader looking for a new framing that reduces annotation burden would find the layout predictor worth examining. It is worth sending to referees because the supervision reduction is a real constraint in the area and the staged approach is clearly articulated, even though the current evidence is still at the level of a proposal.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SceneConductor, a multi-agent orchestration framework for 3D scene generation from a single image. It decomposes the task into scene initialization (extracting object masks, building 3D representations, and predicting layout via a geometry-aware predictor supervised by sparse point-map priors), environment construction (building scaffolds for surfaces, boundaries, materials, and illumination), and multi-agent refinement (planner identifies inconsistencies and dispatches specialist agents for localized revisions). The central claim is that this approach reduces reliance on scene-level supervision, generalizes robustly, and consistently outperforms prior methods in geometric accuracy, spatial consistency, and perceptual realism on benchmark datasets.

Significance. If the empirical claims hold, the work could advance single-image 3D scene generation by offering a modular alternative to holistic pipelines, with the geometry-aware layout predictor providing a concrete mechanism for training from segmentation-level data rather than full annotations. The multi-agent refinement stage addresses a recognized challenge in maintaining global consistency. These elements, if validated, would be a positive contribution to the field.

major comments (2)

[Abstract] Abstract: The central empirical claim that the method 'consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism' is presented without any metrics, baselines, dataset names, error bars, or ablation results. This assertion is load-bearing for the paper's contribution but cannot be evaluated from the provided text.
[Abstract] Abstract, refinement stage description: The assumption that the planner agent reliably identifies structural/visual inconsistencies and that specialist agents perform revisions that reintegrate without introducing new errors is stated at a high level with no discussion of validation, failure cases, or integration mechanism. This assumption underpins the claimed robustness of the multi-agent pipeline.

minor comments (1)

[Abstract] The phrase 'point-map geometry' is used without a brief definition or citation; adding one would improve clarity for readers unfamiliar with the term.
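For readers unfamiliar with the term: in recent monocular-geometry work a point map is usually an image-aligned grid that stores one 3D point per pixel. The abstract does not say how the paper obtains its point maps. The sketch below shows one common construction, unprojecting a depth map through a pinhole camera, purely as an illustration; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject an H x W depth map into an H x W grid of (x, y, z) points.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy,
    where (u, v) is the pixel column and row.
    """
    return [
        [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]


depth = [[2.0, 2.0],
         [4.0, 4.0]]
points = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Because the grid is aligned with the image, a 2D segmentation mask can index directly into it to pick out the 3D points belonging to one object.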

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. Both points highlight opportunities to strengthen the abstract, and we will revise it accordingly while preserving its concise nature.

Referee: [Abstract] Abstract: The central empirical claim that the method 'consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism' is presented without any metrics, baselines, dataset names, error bars, or ablation results. This assertion is load-bearing for the paper's contribution but cannot be evaluated from the provided text.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately assess the claims. The full manuscript contains quantitative results, baselines, datasets, and ablations in the Experiments section, but the abstract itself does not reference them. In the revision we will update the abstract to name the primary benchmark datasets, report key metric improvements (with error bars where applicable), and briefly note the main baselines, while keeping the text within length limits. This change will make the contribution more transparent.
Revision: yes

Referee: [Abstract] Abstract, refinement stage description: The assumption that the planner agent reliably identifies structural/visual inconsistencies and that specialist agents perform revisions that reintegrate without introducing new errors is stated at a high level with no discussion of validation, failure cases, or integration mechanism. This assumption underpins the claimed robustness of the multi-agent pipeline.

Authors: The abstract presents the multi-agent refinement at a summary level, as is conventional. The full manuscript elaborates the planner's inconsistency detection, agent dispatching, and reintegration procedure in the dedicated refinement section, supported by qualitative examples. However, the abstract does not mention validation or failure modes. We will revise the abstract to include a concise clause on empirical validation via consistency metrics and will add a short discussion of failure cases and integration safeguards to the main text or supplementary material to address this concern directly.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description outline a multi-agent pipeline (initialization, environment construction, refinement) and a geometry-aware layout predictor trained on sparse geometric priors from point maps. No equations, derivations, or first-principles results appear that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The predictor is explicitly distinguished from fully supervised alternatives via its training data source, with no evidence of renaming known results or smuggling ansatzes. Empirical claims of outperformance rest on benchmark experiments rather than internal reductions. This is a standard methodological contribution without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into parameters or axioms; the framework implicitly assumes multi-agent correction integrates without side effects and that point-map priors suffice for robust generalization.

axioms (2)

Domain assumption: Multi-agent refinement can identify inconsistencies and produce reintegrated corrections that improve global consistency.
Invoked in the refinement stage description as the mechanism for handling structural and visual issues.

Domain assumption: The geometry-aware layout predictor generalizes robustly to diverse real-world scenes when trained on segmentation-level data.
Stated directly in the abstract as a property of the predictor.

pith-pipeline@v0.9.1-grok · 5816 in / 1335 out tokens · 16800 ms · 2026-06-27T19:06:30.781839+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 8 linked inside Pith

[1]

Occupancy networks: Learning 3d reconstruction in function space

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019

2019
[2]

Pix2vox: Context- aware 3d reconstruction from single and multi-view images

Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, and Shengping Zhang. Pix2vox: Context- aware 3d reconstruction from single and multi-view images. InProceedings of the IEEE/CVF international conference on computer vision, pages 2690–2698, 2019

2019
[3]

Point-e: A system for generating 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022

Pith/arXiv arXiv 2022
[4]

Lion: Latent point diffusion models for 3d shape generation.Advances in Neural Information Processing Systems, 35:10021–10039, 2022

Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation.Advances in Neural Information Processing Systems, 35:10021–10039, 2022

2022
[5]

Get3d: A generative model of high quality 3d textured shapes learned from images

Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022

2022
[6]

Shap-e: Generating conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023

Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023

Pith/arXiv arXiv 2023
[7]

Midi: Multi-instance diffusion for single image to 3d scene generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

2025
[8]

Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.arXiv preprint arXiv:2506.05573, 2025

Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragki- adaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.arXiv preprint arXiv:2506.05573, 2025

arXiv 2025
[9]

Scenegen: Single-image 3d scene generation in one feedforward pass.arXiv preprint arXiv:2508.15769, 2025

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass.arXiv preprint arXiv:2508.15769, 2025

arXiv 2025
[10]

Scenetransporter: Optimal transport-guided compositional latent diffusion for single-image structured 3d scene generation.arXiv preprint arXiv:2602.22785, 2026

Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, et al. Scenetransporter: Optimal transport-guided compositional latent diffusion for single-image structured 3d scene generation.arXiv preprint arXiv:2602.22785, 2026

arXiv 2026
[11]

Sam 3d: 3dfy anything in images, 2025

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images, 2025

2025
[12]

3d-fixer: Coarse-to-fine in-place completion for 3d scenes from a single image.arXiv preprint arXiv:2604.04406, 2026

Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, and Jin Xie. 3d-fixer: Coarse-to-fine in-place completion for 3d scenes from a single image.arXiv preprint arXiv:2604.04406, 2026

Pith/arXiv arXiv 2026
[13]

Shaper: Robust conditional 3d shape generation from casual captures, 2026

Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, and Jakob Engel. Shaper: Robust conditional 3d shape generation from casual captures, 2026

2026
[14]

3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129:3313–3337, 2021

Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129:3313–3337, 2021

2021
[15]

3d-front: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the International Conference on Computer Vision, 2021

2021
[16]

Edit-as-act: Goal-regressive planning for open-vocabulary 3d indoor scene editing.arXiv preprint arXiv:2603.17583, 2026

Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, and HyeongYeop Kang. Edit-as-act: Goal-regressive planning for open-vocabulary 3d indoor scene editing.arXiv preprint arXiv:2603.17583, 2026

arXiv 2026
[17]

Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent, 2025

Yandan Yang, Baoxiong Jia, Shujie Zhang, and Siyuan Huang. Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent, 2025

2025
[18]

3d-generalist: Self-improving vision-language-action models for crafting 3d worlds, 2025

Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, and Nick Haber. 3d-generalist: Self-improving vision-language-action models for crafting 3d worlds, 2025. 10

2025
[19]

Vision-as-inverse-graphics agent via interleaved multimodal reasoning.arXiv preprint arXiv:2601.11109, 2026

Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J Black, Trevor Darrell, Angjoo Kanazawa, and Haiwen Feng. Vision-as-inverse-graphics agent via interleaved multimodal reasoning.arXiv preprint arXiv:2601.11109, 2026

Pith/arXiv arXiv 2026
[20]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

2016
[21]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[22]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[23]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In Proceedings of the European Conference on Computer Vision, 2024

2024
[24]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

2025
[25]

Easi3r: Estimating disentangled motion from dust3r without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. InProceedings of the International Conference on Computer Vision, 2025

2025
[26]

Aether: Geometric-aware unified world modeling

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, et al. Aether: Geometric-aware unified world modeling. InProceedings of the International Conference on Computer Vision, 2025

2025
[27]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025

2025
[28]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InProceedings of the International Conference on Learning Representations, 2024

2024
[29]

Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models.arXiv preprint arXiv:2404.07191, 2024

Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models.arXiv preprint arXiv:2404.07191, 2024

Pith/arXiv arXiv 2024
[30]

One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion

Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10072–10083, 2024

2024
[31]

Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 22246–22256, 2023

2023
[32]

Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation

Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. InEuropean Conference on Computer Vision, pages 112–130. Springer, 2024

2024
[33]

Gaussiananything: Interactive point cloud latent diffusion for 3d generation

Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, and Chen Change Loy. Gaussiananything: Interactive point cloud latent diffusion for 3d generation. In International Conference on Learning Representations, 2025

2025
[34]

Structured 3d latents for scalable and versatile 3d generation.arXiv preprint arXiv:2412.01506, 2024

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation.arXiv preprint arXiv:2412.01506, 2024

Pith/arXiv arXiv 2024
[35]

3dtopia-xl: High-quality 3d pbr asset generation via primitive diffusion

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, Liang Pan, Dahua Lin, and Ziwei Liu. 3dtopia-xl: High-quality 3d pbr asset generation via primitive diffusion. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 11

2025
[36]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

2025
[37]

Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025

Pith/arXiv arXiv 2025
[38]

Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae

Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, and Xingang Pan. Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

2025
[39]

Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation.arXiv preprint arXiv:2411.02293, 2024

Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, et al. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation.arXiv preprint arXiv:2411.02293, 2024

arXiv 2024
[40]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022

arXiv 2022
[41]

Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

Pith/arXiv arXiv 2023
[42]

Ultra3d: Efficient and high-fidelity 3d generation with part attention, 2025

Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high-fidelity 3d generation with part attention, 2025

2025
[43]

Fullpart: Generating each 3d part at full resolution

Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, Hong Zhang, Bo Huang, Zhanpeng Huang, Zibin Wang, Dan Xu, and Tianfan Xue. Fullpart: Generating each 3d part at full resolution. InThe F ourteenth International Conference on Learning Representations, 2026

2026
[44]

X-part: high fidelity and structure coherent shape decomposition, 2025

Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, and Chunchao Guo. X-part: high fidelity and structure coherent shape decomposition, 2025

2025
[45]

Efficient part-level 3d object generation via dual volume packing.arXiv preprint arXiv:2506.09980, 2025

Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, and Tsung-Yi Lin. Efficient part-level 3d object generation via dual volume packing.arXiv preprint arXiv:2506.09980, 2025

arXiv 2025
[46]

3d-re-gen: 3d reconstruction of indoor scenes with a generative framework.arXiv preprint arXiv:2512.17459, 2025

Tobias Sautter, Jan-Niklas Dihlmann, and Hendrik Lensch. 3d-re-gen: 3d reconstruction of indoor scenes with a generative framework.arXiv preprint arXiv:2512.17459, 2025

arXiv 2025
[47]

Scenemaker: Open-set 3d scene generation with decoupled de-occlusion and pose estimation model.arXiv preprint arXiv:2512.10957, 2025

Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, and Lei Zhang. Scenemaker: Open-set 3d scene generation with decoupled de-occlusion and pose estimation model.arXiv preprint arXiv:2512.10957, 2025

arXiv 2025
[48]

Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image

Tao Chu, Pan Zhang, Qiong Liu, and Jiaqi Wang. Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4937–4946, 2023

2023
[49]

Metascenes: Towards automated replica creation for real-world 3d scans

Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Zhu Song-Chun, Tengyu Liu, and Siyuan Huang. Metascenes: Towards automated replica creation for real-world 3d scans. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

2025
[50]

Cast: Component-aligned 3d scene reconstruction from an rgb image

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image. In ACM SIGGRAPH Conference, 2025

2025
[51]

Hiscene: creating hierarchical 3d scenes with isometric view generation. arXiv preprint arXiv:2504.13072, 2025

Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, and Zhaopeng Cui. Hiscene: creating hierarchical 3d scenes with isometric view generation. arXiv preprint arXiv:2504.13072, 2025

arXiv 2025
[52]

Towards geometric and textural consistency 3d scene generation via single image-guided model generation and layout optimization. arXiv preprint arXiv:2507.14841, 2025

Xiang Tang, Ruotong Li, and Xiaopeng Fan. Towards geometric and textural consistency 3d scene generation via single image-guided model generation and layout optimization. arXiv preprint arXiv:2507.14841, 2025

arXiv 2025
[53]

Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion

Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion. In Conference on Neural Information Processing Systems, 2023

2023
[54]

Artiscene: Language-driven artistic 3d scene generation through image intermediary

Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, and Yifan Ding. Artiscene: Language-driven artistic 3d scene generation through image intermediary. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

2025
[55]

Scenethesis: A language and vision agentic framework for 3d scene generation. arXiv preprint arXiv:2505.02836, 2025

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation. arXiv preprint arXiv:2505.02836, 2025

arXiv 2025
[56]

Layoutgpt: Compositional visual planning and generation with large language models

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. In Conference on Neural Information Processing Systems, 2023

2023
[57]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[58]

Agentic 3d scene generation with spatially contextualized vlms. arXiv preprint arXiv:2505.20129, 2025

Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Agentic 3d scene generation with spatially contextualized vlms. arXiv preprint arXiv:2505.20129, 2025

arXiv 2025
[59]

Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint, 2025

Team HunyuanWorld. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint, 2025

2025
[60]

Blenderalchemy: Editing 3d graphics with vision-language models

Ian Huang, Guandao Yang, and Leonidas Guibas. Blenderalchemy: Editing 3d graphics with vision-language models. In European Conference on Computer Vision, pages 297–314. Springer, 2024

2024
[61]

Grounded sam: Assembling open-world models for diverse visual tasks, 2024

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

2024
[62]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

2017
[63]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014

2014
[64]

Claude [large language model], 2025

Anthropic. Claude [large language model], 2025

2025
[65]

Codex [large language model], 2026

OpenAI. Codex [large language model], 2026

2026
[66]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

2026
[67]

Recognizing indoor scenes

Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In 2009 IEEE conference on computer vision and pattern recognition, pages 413–420. IEEE, 2009

2009
[68]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 2021

2021
