GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Dingkun Wei; Hongyu Zhou; Jiahao Sun; Liang Li; Yujun Shen; Zehong Shen

arxiv: 2606.03921 · v1 · pith:LCR7C7OMnew · submitted 2026-06-02 · 💻 cs.CV

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Jiahao Sun , Dingkun Wei , Zehong Shen , Hongyu Zhou , Yujun Shen , Liang Li This is my paper

Pith reviewed 2026-06-28 10:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords gravity alignmentscene factorizationrigid body reconstructionphysics simulationRGB to 3Ddisentangled environmentsmulti-view reconstruction

0 comments

The pith

Gravity alignment from RGB images produces disentangled 3D scenes with explicit rigid bodies for direct physics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to convert multi-view RGB images into 3D environments that support physical simulation. It addresses the problem of monolithic reconstructions that lack explicit physical structure and are rotated arbitrarily. By using gravity as a prior, the approach aligns the scene, extracts object meshes with 6-DoF poses, and separates them from the background. This avoids reliance on CAD model retrieval and maintains geometric accuracy. The result is a hybrid representation ready for both rendering and simulation.

Core claim

GARDEN reformulates reconstruction as physically-grounded scene factorization using gravity as a universal prior: it aligns to a Gravity-View frame to resolve gauge ambiguity, recovers object-centric rigid meshes with accurate placement, and removes duplicate geometry from the background via conditional 3D point classification, yielding explicit rigid bodies decoupled from background geometry.

What carries the argument

Gravity-View frame alignment combined with conditional 3D point classification to disentangle rigid objects from background.

If this is right

Object placement becomes more reliable in reconstructed scenes.
Disentanglement quality improves over retrieval-based methods.
Rendering and simulation efficiency increases due to the structured representation.
Direct physics simulation is possible without replacing objects with CAD assets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scenes with strong vertical structure would benefit most from this gravity prior.
The method could extend to videos by tracking gravity-aligned objects over time.
Integration with other physical priors like friction might further enhance simulation fidelity.

Load-bearing premise

Gravity direction can be reliably estimated from RGB images alone and acts as a universal prior that resolves scene rotation ambiguity.

What would settle it

A set of multi-view RGB images from a scene where gravity estimation from images fails to match the true direction, leading to misaligned objects that cannot be simulated stably.

Figures

Figures reproduced from arXiv: 2606.03921 by Dingkun Wei, Hongyu Zhou, Jiahao Sun, Liang Li, Yujun Shen, Zehong Shen.

**Figure 1.** Figure 1: GARDEN factorizes unstructured RGB images into a structured, gravityaligned representation. By decoupling scene-specific rigid objects from the background as meshes, the proposed method achieves high-fidelity reconstruction that enables interactive downstream simulation without the requirement of external CAD models. explicit gravity direction. Without this physically grounded global frame, rigidbody an… view at source ↗

**Figure 2.** Figure 2: Overview of GARDEN. Given multi-view RGB images, we first perform (1) multi-view reconstruction with gravity alignment to establish a physically grounded coordinate system. Next, we conduct (2) target-driven object generation and layout estimation to reconstruct standalone rigid objects from user- or VLM-provided boxes. This is followed by (3) point-conditional background disentanglement to remove duplic… view at source ↗

**Figure 3.** Figure 3: Motivation for Gravity-View (GV) alignment. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Layout Estimation and background disentanglement in GARDEN. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison with LiteReality [13]. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Cross-dataset qualitative comparison against SAM3D [15] and [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Architecture of the GV prediction branch. [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: VLM automatic target selection. VLM-proposed target boxes directly replace manual user-click boxes and are passed to the same downstream GARDEN pipeline. User Interface GARDEN Auto Pipeline Input RGB Images User Prompt Box Auto Reconstruction Box Selection Load/Save Box Run Configuration Load Configuration [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗

**Figure 9.** Figure 9: Interactive workflow and user interface of GARDEN. [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Additional Qualitative comparison against SAM-3D [15] and [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

**Figure 11.** Figure 11: Physics simulation in gravity-aligned reconstructions. [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗

read the original abstract

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GARDEN adds gravity alignment and conditional background cleaning to produce object-plus-background scene models from RGB, but the abstract gives no numbers or method details so the practical payoff remains unproven.

read the letter

The main thing to know is that this paper describes a three-stage RGB pipeline: align the scene to a gravity frame, recover object meshes with 6-DoF poses, then use conditional point classification to strip duplicate object geometry out of the background mesh. The output is meant to be directly usable for physics simulation while keeping the original geometry.

What is new is the explicit ordering that starts with gravity alignment to remove rotational gauge freedom, followed by the conditional classification step for disentanglement. Prior work either leaves scenes monolithic or swaps objects for CAD models; this tries to keep scene-specific shape without the retrieval step.

The approach is reasonable on paper and targets a real bottleneck in robotics and sim pipelines. Using an external physical prior like gravity instead of learned parameters is a clean move.

The soft spots are straightforward. The abstract claims better object placement, disentanglement quality, and efficiency than retrieval baselines on simulated and real scenes, yet supplies zero numbers, error bars, datasets, or ablations. Without those it is impossible to tell whether the gains are real or whether the disentanglement actually survives contact. The gravity estimation step is presented as reliable from RGB alone, but the stress-test concern holds: standard monocular estimators often fail in textureless or indoor scenes without strong vertical lines. If that first step is off, the rest of the pipeline becomes ill-posed. The full paper would need to show how gravity is computed and how often it succeeds across varied environments.

This is for 3D vision researchers who build reconstruction pipelines for downstream simulation. A reader who wants concrete ideas for adding physical structure to NeRF-style or mesh outputs would get something useful to try.

It deserves a serious referee because the problem is well-posed and the pipeline is concrete, even though the current write-up leaves the central claims unverified.

Referee Report

2 major / 0 minor

Summary. The paper proposes GARDEN, an RGB-only pipeline that factorizes multi-view scene reconstructions into explicit rigid foreground objects and a decoupled background by first aligning to a gravity-defined frame (to resolve gauge ambiguity), then recovering object-centric meshes with 6-DoF poses, and finally applying conditional 3D point classification to remove duplicate geometry. The resulting hybrid representation is intended to support direct physics simulation while preserving visual realism, with claimed gains over retrieval-based baselines on simulated and real multi-view data.

Significance. If the quantitative claims hold, the work would offer a meaningful step toward physically grounded, simulation-ready scene representations from images alone, avoiding the geometric compromises of CAD replacement while addressing rotational gauge freedom via an external physical prior. The gravity-alignment idea is a clean way to inject structure without learned parameters, and the disentanglement step could improve downstream robotics and AR applications if robust.

major comments (2)

[Abstract] Abstract (key idea paragraph): The pipeline's first step requires reliable gravity-direction estimation from RGB alone as a 'universal physical prior' independent of scene content. Standard monocular estimators (horizon or vertical-line detection) are known to degrade in textureless, indoor, or natural scenes lacking consistent vertical structure; if this alignment fails, the subsequent 6-DoF placement and conditional point classification become ill-posed, directly undermining the central claim of stable physics-ready output.
[Abstract] Abstract: The manuscript asserts 'improvements in object placement reliability, disentanglement quality, and rendering-simulation efficiency' yet supplies no quantitative metrics, error bars, dataset sizes, ablation studies, or baseline comparisons. Without these, the experimental support for the central claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the gravity alignment assumption and the presentation of experimental claims. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (key idea paragraph): The pipeline's first step requires reliable gravity-direction estimation from RGB alone as a 'universal physical prior' independent of scene content. Standard monocular estimators (horizon or vertical-line detection) are known to degrade in textureless, indoor, or natural scenes lacking consistent vertical structure; if this alignment fails, the subsequent 6-DoF placement and conditional point classification become ill-posed, directly undermining the central claim of stable physics-ready output.

Authors: We agree that gravity estimation from RGB is a critical prerequisite and that monocular methods can be unreliable in textureless or unstructured scenes. The manuscript presents gravity alignment as an external physical prior obtained via standard RGB-based estimators (e.g., vertical vanishing points). In the evaluated simulated and real multi-view scenes, these estimates were sufficiently accurate to enable stable 6-DoF placement and classification. We will add a limitations paragraph discussing estimator failure modes and their potential impact on the pipeline. revision: partial
Referee: [Abstract] Abstract: The manuscript asserts 'improvements in object placement reliability, disentanglement quality, and rendering-simulation efficiency' yet supplies no quantitative metrics, error bars, dataset sizes, ablation studies, or baseline comparisons. Without these, the experimental support for the central claim cannot be evaluated.

Authors: The abstract is intended as a concise summary. Full quantitative results—including metrics with error bars, dataset sizes, ablation studies, and comparisons to retrieval-based baselines on simulated and real data—are reported in the Experiments section. We will revise the abstract to incorporate one or two key quantitative highlights to better support the claims at the summary level. revision: yes

Circularity Check

0 steps flagged

No circularity; gravity prior is external and independent of outputs

full rationale

The paper presents gravity alignment as an external physical prior applied to RGB inputs to resolve gauge ambiguity before factorization into rigid objects and background. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result to the method's own outputs by construction. The derivation chain remains self-contained against external benchmarks because the key step invokes a scene-independent prior rather than a quantity defined from the reconstruction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that gravity direction is observable and consistent enough to serve as an alignment prior; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Gravity direction can be estimated from RGB images and used to resolve reconstruction gauge ambiguity
Invoked as the key idea to align the reconstruction to a unified Gravity-View frame.

pith-pipeline@v0.9.1-grok · 5749 in / 1218 out tokens · 28967 ms · 2026-06-28T10:38:09.219303+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 11 linked inside Pith

[1]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025
[2]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025
[3]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Permutation- Equivariant Visual Geometry Learning.arXiv preprint arXiv:2507.13347, 2025

Pith/arXiv arXiv 2025
[4]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

2021
[5]

Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

2022
[6]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023
[7]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

2022
[8]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

2012
[9]

Isaac gym: High performance gpu-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

Pith/arXiv arXiv 2021
[10]

Hamid Izadinia, Qi Shan, and Steven M Seitz. Im2cad. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5134–5143, 2017

2017
[11]

Mask2cad: 3d shape prediction by learning to segment and retrieve

Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. Mask2cad: 3d shape prediction by learning to segment and retrieve. InEuropean Conference on Computer Vision, pages 260–277. Springer, 2020. GARDEN17

2020
[12]

Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

arXiv 2022
[13]

Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[14]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

2024
[15]

Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

Pith/arXiv arXiv 2025
[16]

Foundationpose: Unified 6d pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024

2024
[17]

Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera

Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera. InProceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568, 2011

2011
[18]

Poisson surface recon- struction

Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface recon- struction. InProceedings of the fourth Eurographics symposium on Geometry pro- cessing, volume 7, 2006

2006
[19]

Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

2013
[20]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021

2021
[21]

Zip-nerf: Anti-aliased grid-based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023

2023
[22]

Mip- splatting: Alias-free 3d gaussian splatting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip- splatting: Alias-free 3d gaussian splatting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 19447–19456, 2024

2024
[23]

Neuralangelo: High-fidelity neural surface reconstruction

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8456–8465, 2023

2023
[24]

Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wen- ping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

Pith/arXiv arXiv 2021
[25]

Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021

LiorYariv,JiataoGu,YoniKasten,andYaronLipman. Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021. 18 J. Sun et al

2021
[26]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Re- vaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

2024
[27]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision, pages 71–91. Springer, 2024

2024
[28]

Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

Manuel Dahnert, Ji Hou, Matthias Nießner, and Angela Dai. Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

2021
[29]

Frodo: From detectionsto3dobjects

Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, et al. Frodo: From detectionsto3dobjects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14720–14729, 2020

2020
[30]

Revealnet: Seeing behind objects in rgb-d scans

Ji Hou, Angela Dai, and Matthias Nießner. Revealnet: Seeing behind objects in rgb-d scans. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2098–2107, 2020

2098
[31]

Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Pith/arXiv arXiv 2017
[32]

Habitat 3.0: A co-habitat for humans, avatars and robots

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023

arXiv 2023
[33]

Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

2023
[34]

Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

2022
[35]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024
[36]

Phone2proc: Bringing robust robots into our chaotic world

Matt Deitke, Rose Hendrix, Ali Farhadi, Kiana Ehsani, and Aniruddha Kembhavi. Phone2proc: Bringing robust robots into our chaotic world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9665– 9675, 2023

2023
[37]

Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

arXiv 2024
[38]

Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

Hongchi Xia, Chih-Hao Lin, Hao-Yu Hsu, Quentin Leboutet, Katelyn Gao, Michael Paulitsch, Benjamin Ummenhofer, and Shenlong Wang. Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

2026
[39]

Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In2025 Interna- tional Conference on 3D Vision (3DV), pages 616–626. IEEE, 2025. GARDEN19

2025
[40]

Midi: Multi-instance diffusion for single image to 3d scene generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi- Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

2025
[41]

Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

2025
[42]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

2016
[43]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, An- drewHuang,JieLei,TengyuMa,BaishanGuo,ArpitKalla,MarkusMarks,Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Ri...

2025
[44]

Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

Pith/arXiv arXiv 2025
[45]

Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding. InProceedings of the IEEE/CVF international conference on computer vision,pages10912–10922, 2021

2021
[46]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

2020
[47]

Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M Seitz. Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

Pith/arXiv arXiv 2018
[48]

Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, and Dahua Lin. Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

2024
[49]

Geocalib: Learning single-image calibration with geometric optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. InEuro- pean Conference on Computer Vision, pages 1–20. Springer, 2024

2024
[50]

Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

Pith/arXiv arXiv 2001
[51]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

2022
[52]

A multi-view stereo benchmark 20 J

ThomasSchops,JohannesLSchonberger,SilvanoGalliani,TorstenSattler,Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark 20 J. Sun et al. with high-resolution images and multi-camera videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

2017
[53]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017
[54]

Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

Pith/arXiv arXiv 2017
[55]

Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

Pith/arXiv arXiv 2021
[56]

Rio: 3d object instance re-localization in changing indoor environments

Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

2019
[57]

# "Context Transformer Decoder Rotation Head !

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on pattern analysis and machine intelli- gence, 13(4):376–380, 2002. GARDEN21 Supplementary Material Overview This supplementary material provides additional technical details and experi- mental results for GARDEN. Section A describes the arc...

2002

[1] [1]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025

[2] [2]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025

[3] [3]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Permutation- Equivariant Visual Geometry Learning.arXiv preprint arXiv:2507.13347, 2025

Pith/arXiv arXiv 2025

[4] [4]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

2021

[5] [5]

Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

2022

[6] [6]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023

[7] [7]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

2022

[8] [8]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

2012

[9] [9]

Isaac gym: High performance gpu-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

Pith/arXiv arXiv 2021

[10] [10]

Hamid Izadinia, Qi Shan, and Steven M Seitz. Im2cad. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5134–5143, 2017

2017

[11] [11]

Mask2cad: 3d shape prediction by learning to segment and retrieve

Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. Mask2cad: 3d shape prediction by learning to segment and retrieve. InEuropean Conference on Computer Vision, pages 260–277. Springer, 2020. GARDEN17

2020

[12] [12]

Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

arXiv 2022

[13] [13]

Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[14] [14]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

2024

[15] [15]

Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

Pith/arXiv arXiv 2025

[16] [16]

Foundationpose: Unified 6d pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024

2024

[17] [17]

Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera

Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera. InProceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568, 2011

2011

[18] [18]

Poisson surface recon- struction

Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface recon- struction. InProceedings of the fourth Eurographics symposium on Geometry pro- cessing, volume 7, 2006

2006

[19] [19]

Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

2013

[20] [20]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021

2021

[21] [21]

Zip-nerf: Anti-aliased grid-based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023

2023

[22] [22]

Mip- splatting: Alias-free 3d gaussian splatting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip- splatting: Alias-free 3d gaussian splatting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 19447–19456, 2024

2024

[23] [23]

Neuralangelo: High-fidelity neural surface reconstruction

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8456–8465, 2023

2023

[24] [24]

Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wen- ping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

Pith/arXiv arXiv 2021

[25] [25]

Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021

LiorYariv,JiataoGu,YoniKasten,andYaronLipman. Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021. 18 J. Sun et al

2021

[26] [26]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Re- vaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

2024

[27] [27]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision, pages 71–91. Springer, 2024

2024

[28] [28]

Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

Manuel Dahnert, Ji Hou, Matthias Nießner, and Angela Dai. Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

2021

[29] [29]

Frodo: From detectionsto3dobjects

Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, et al. Frodo: From detectionsto3dobjects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14720–14729, 2020

2020

[30] [30]

Revealnet: Seeing behind objects in rgb-d scans

Ji Hou, Angela Dai, and Matthias Nießner. Revealnet: Seeing behind objects in rgb-d scans. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2098–2107, 2020

2098

[31] [31]

Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Pith/arXiv arXiv 2017

[32] [32]

Habitat 3.0: A co-habitat for humans, avatars and robots

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023

arXiv 2023

[33] [33]

Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

2023

[34] [34]

Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

2022

[35] [35]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024

[36] [36]

Phone2proc: Bringing robust robots into our chaotic world

Matt Deitke, Rose Hendrix, Ali Farhadi, Kiana Ehsani, and Aniruddha Kembhavi. Phone2proc: Bringing robust robots into our chaotic world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9665– 9675, 2023

2023

[37] [37]

Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

arXiv 2024

[38] [38]

Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

Hongchi Xia, Chih-Hao Lin, Hao-Yu Hsu, Quentin Leboutet, Katelyn Gao, Michael Paulitsch, Benjamin Ummenhofer, and Shenlong Wang. Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

2026

[39] [39]

Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In2025 Interna- tional Conference on 3D Vision (3DV), pages 616–626. IEEE, 2025. GARDEN19

2025

[40] [40]

Midi: Multi-instance diffusion for single image to 3d scene generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi- Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

2025

[41] [41]

Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

2025

[42] [42]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

2016

[43] [43]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, An- drewHuang,JieLei,TengyuMa,BaishanGuo,ArpitKalla,MarkusMarks,Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Ri...

2025

[44] [44]

Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

Pith/arXiv arXiv 2025

[45] [45]

Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding. InProceedings of the IEEE/CVF international conference on computer vision,pages10912–10922, 2021

2021

[46] [46]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

2020

[47] [47]

Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M Seitz. Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

Pith/arXiv arXiv 2018

[48] [48]

Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, and Dahua Lin. Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

2024

[49] [49]

Geocalib: Learning single-image calibration with geometric optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. InEuro- pean Conference on Computer Vision, pages 1–20. Springer, 2024

2024

[50] [50]

Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

Pith/arXiv arXiv 2001

[51] [51]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

2022

[52] [52]

A multi-view stereo benchmark 20 J

ThomasSchops,JohannesLSchonberger,SilvanoGalliani,TorstenSattler,Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark 20 J. Sun et al. with high-resolution images and multi-camera videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

2017

[53] [53]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017

[54] [54]

Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

Pith/arXiv arXiv 2017

[55] [55]

Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

Pith/arXiv arXiv 2021

[56] [56]

Rio: 3d object instance re-localization in changing indoor environments

Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

2019

[57] [57]

# "Context Transformer Decoder Rotation Head !

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on pattern analysis and machine intelli- gence, 13(4):376–380, 2002. GARDEN21 Supplementary Material Overview This supplementary material provides additional technical details and experi- mental results for GARDEN. Section A describes the arc...

2002