pith. sign in

arxiv: 2606.03921 · v1 · pith:LCR7C7OMnew · submitted 2026-06-02 · 💻 cs.CV

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Pith reviewed 2026-06-28 10:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords gravity alignmentscene factorizationrigid body reconstructionphysics simulationRGB to 3Ddisentangled environmentsmulti-view reconstruction
0
0 comments X

The pith

Gravity alignment from RGB images produces disentangled 3D scenes with explicit rigid bodies for direct physics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to convert multi-view RGB images into 3D environments that support physical simulation. It addresses the problem of monolithic reconstructions that lack explicit physical structure and are rotated arbitrarily. By using gravity as a prior, the approach aligns the scene, extracts object meshes with 6-DoF poses, and separates them from the background. This avoids reliance on CAD model retrieval and maintains geometric accuracy. The result is a hybrid representation ready for both rendering and simulation.

Core claim

GARDEN reformulates reconstruction as physically-grounded scene factorization using gravity as a universal prior: it aligns to a Gravity-View frame to resolve gauge ambiguity, recovers object-centric rigid meshes with accurate placement, and removes duplicate geometry from the background via conditional 3D point classification, yielding explicit rigid bodies decoupled from background geometry.

What carries the argument

Gravity-View frame alignment combined with conditional 3D point classification to disentangle rigid objects from background.

If this is right

  • Object placement becomes more reliable in reconstructed scenes.
  • Disentanglement quality improves over retrieval-based methods.
  • Rendering and simulation efficiency increases due to the structured representation.
  • Direct physics simulation is possible without replacing objects with CAD assets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scenes with strong vertical structure would benefit most from this gravity prior.
  • The method could extend to videos by tracking gravity-aligned objects over time.
  • Integration with other physical priors like friction might further enhance simulation fidelity.

Load-bearing premise

Gravity direction can be reliably estimated from RGB images alone and acts as a universal prior that resolves scene rotation ambiguity.

What would settle it

A set of multi-view RGB images from a scene where gravity estimation from images fails to match the true direction, leading to misaligned objects that cannot be simulated stably.

Figures

Figures reproduced from arXiv: 2606.03921 by Dingkun Wei, Hongyu Zhou, Jiahao Sun, Liang Li, Yujun Shen, Zehong Shen.

Figure 1
Figure 1. Figure 1: GARDEN factorizes unstructured RGB images into a structured, gravity￾aligned representation. By decoupling scene-specific rigid objects from the background as meshes, the proposed method achieves high-fidelity reconstruction that enables in￾teractive downstream simulation without the requirement of external CAD models. explicit gravity direction. Without this physically grounded global frame, rigid￾body an… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of GARDEN. Given multi-view RGB images, we first perform (1) multi-view reconstruction with gravity alignment to establish a physically grounded coordinate system. Next, we conduct (2) target-driven object generation and layout es￾timation to reconstruct standalone rigid objects from user- or VLM-provided boxes. This is followed by (3) point-conditional background disentanglement to remove dupli￾c… view at source ↗
Figure 3
Figure 3. Figure 3: Motivation for Gravity-View (GV) alignment. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Layout Estimation and background disentanglement in GARDEN. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison with LiteReality [13]. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-dataset qualitative comparison against SAM3D [15] and [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Architecture of the GV prediction branch. [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: VLM automatic target selection. VLM-proposed target boxes directly replace manual user-click boxes and are passed to the same downstream GARDEN pipeline. User Interface GARDEN Auto Pipeline Input RGB Images User Prompt Box Auto Reconstruction Box Selection Load/Save Box Run Configuration Load Configuration [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interactive workflow and user interface of GARDEN. [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional Qualitative comparison against SAM-3D [15] and [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Physics simulation in gravity-aligned reconstructions. [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
read the original abstract

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes GARDEN, an RGB-only pipeline that factorizes multi-view scene reconstructions into explicit rigid foreground objects and a decoupled background by first aligning to a gravity-defined frame (to resolve gauge ambiguity), then recovering object-centric meshes with 6-DoF poses, and finally applying conditional 3D point classification to remove duplicate geometry. The resulting hybrid representation is intended to support direct physics simulation while preserving visual realism, with claimed gains over retrieval-based baselines on simulated and real multi-view data.

Significance. If the quantitative claims hold, the work would offer a meaningful step toward physically grounded, simulation-ready scene representations from images alone, avoiding the geometric compromises of CAD replacement while addressing rotational gauge freedom via an external physical prior. The gravity-alignment idea is a clean way to inject structure without learned parameters, and the disentanglement step could improve downstream robotics and AR applications if robust.

major comments (2)
  1. [Abstract] Abstract (key idea paragraph): The pipeline's first step requires reliable gravity-direction estimation from RGB alone as a 'universal physical prior' independent of scene content. Standard monocular estimators (horizon or vertical-line detection) are known to degrade in textureless, indoor, or natural scenes lacking consistent vertical structure; if this alignment fails, the subsequent 6-DoF placement and conditional point classification become ill-posed, directly undermining the central claim of stable physics-ready output.
  2. [Abstract] Abstract: The manuscript asserts 'improvements in object placement reliability, disentanglement quality, and rendering-simulation efficiency' yet supplies no quantitative metrics, error bars, dataset sizes, ablation studies, or baseline comparisons. Without these, the experimental support for the central claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the gravity alignment assumption and the presentation of experimental claims. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (key idea paragraph): The pipeline's first step requires reliable gravity-direction estimation from RGB alone as a 'universal physical prior' independent of scene content. Standard monocular estimators (horizon or vertical-line detection) are known to degrade in textureless, indoor, or natural scenes lacking consistent vertical structure; if this alignment fails, the subsequent 6-DoF placement and conditional point classification become ill-posed, directly undermining the central claim of stable physics-ready output.

    Authors: We agree that gravity estimation from RGB is a critical prerequisite and that monocular methods can be unreliable in textureless or unstructured scenes. The manuscript presents gravity alignment as an external physical prior obtained via standard RGB-based estimators (e.g., vertical vanishing points). In the evaluated simulated and real multi-view scenes, these estimates were sufficiently accurate to enable stable 6-DoF placement and classification. We will add a limitations paragraph discussing estimator failure modes and their potential impact on the pipeline. revision: partial

  2. Referee: [Abstract] Abstract: The manuscript asserts 'improvements in object placement reliability, disentanglement quality, and rendering-simulation efficiency' yet supplies no quantitative metrics, error bars, dataset sizes, ablation studies, or baseline comparisons. Without these, the experimental support for the central claim cannot be evaluated.

    Authors: The abstract is intended as a concise summary. Full quantitative results—including metrics with error bars, dataset sizes, ablation studies, and comparisons to retrieval-based baselines on simulated and real data—are reported in the Experiments section. We will revise the abstract to incorporate one or two key quantitative highlights to better support the claims at the summary level. revision: yes

Circularity Check

0 steps flagged

No circularity; gravity prior is external and independent of outputs

full rationale

The paper presents gravity alignment as an external physical prior applied to RGB inputs to resolve gauge ambiguity before factorization into rigid objects and background. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result to the method's own outputs by construction. The derivation chain remains self-contained against external benchmarks because the key step invokes a scene-independent prior rather than a quantity defined from the reconstruction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that gravity direction is observable and consistent enough to serve as an alignment prior; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Gravity direction can be estimated from RGB images and used to resolve reconstruction gauge ambiguity
    Invoked as the key idea to align the reconstruction to a unified Gravity-View frame.

pith-pipeline@v0.9.1-grok · 5749 in / 1218 out tokens · 28967 ms · 2026-06-28T10:38:09.219303+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 11 linked inside Pith

  1. [1]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  2. [2]

    Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  3. [3]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Permutation- Equivariant Visual Geometry Learning.arXiv preprint arXiv:2507.13347, 2025

  4. [4]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  5. [5]

    Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

  6. [6]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  7. [7]

    A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

    Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

  8. [8]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  9. [9]

    Isaac gym: High performance gpu-based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  10. [10]

    Hamid Izadinia, Qi Shan, and Steven M Seitz. Im2cad. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5134–5143, 2017

  11. [11]

    Mask2cad: 3d shape prediction by learning to segment and retrieve

    Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. Mask2cad: 3d shape prediction by learning to segment and retrieve. InEuropean Conference on Computer Vision, pages 260–277. Springer, 2020. GARDEN17

  12. [12]

    Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

    Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Sparc: Sparse render-and-compare for cad model alignment in a single rgb image.arXiv preprint arXiv:2210.01044, 2022

  13. [13]

    Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans

    Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. Litereality: Graphic-ready 3d scene reconstruction from rgb-d scans. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  14. [14]

    World-grounded human motion recovery via gravity-view coordinates

    Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  15. [15]

    Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

  16. [16]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024

  17. [17]

    Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera

    Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depthcamera. InProceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568, 2011

  18. [18]

    Poisson surface recon- struction

    Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface recon- struction. InProceedings of the fourth Eurographics symposium on Geometry pro- cessing, volume 7, 2006

  19. [19]

    Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

    Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real- time3dreconstructionatscaleusingvoxelhashing.ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

  20. [20]

    Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021

  21. [21]

    Zip-nerf: Anti-aliased grid-based neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023

  22. [22]

    Mip- splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip- splatting: Alias-free 3d gaussian splatting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 19447–19456, 2024

  23. [23]

    Neuralangelo: High-fidelity neural surface reconstruction

    Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8456–8465, 2023

  24. [24]

    Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wen- ping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi- view reconstruction.arXiv preprint arXiv:2106.10689, 2021

  25. [25]

    Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021

    LiorYariv,JiataoGu,YoniKasten,andYaronLipman. Volumerenderingofneural implicit surfaces.Advances in neural information processing systems, 34:4805– 4815, 2021. 18 J. Sun et al

  26. [26]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Re- vaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  27. [27]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision, pages 71–91. Springer, 2024

  28. [28]

    Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

    Manuel Dahnert, Ji Hou, Matthias Nießner, and Angela Dai. Panoptic 3d scene reconstruction from a single rgb image.Advances in Neural Information Processing Systems, 34:8282–8293, 2021

  29. [29]

    Frodo: From detectionsto3dobjects

    Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, et al. Frodo: From detectionsto3dobjects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14720–14729, 2020

  30. [30]

    Revealnet: Seeing behind objects in rgb-d scans

    Ji Hou, Angela Dai, and Matthias Nießner. Revealnet: Seeing behind objects in rgb-d scans. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2098–2107, 2020

  31. [31]

    Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  32. [32]

    Habitat 3.0: A co-habitat for humans, avatars and robots

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023

  33. [33]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday ac- tivities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

  34. [34]

    Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor:Large-scaleembodiedaiusingproceduralgeneration.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

  35. [35]

    Holodeck: Language guided generation of 3d embodied ai environments

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

  36. [36]

    Phone2proc: Bringing robust robots into our chaotic world

    Matt Deitke, Rose Hendrix, Ali Farhadi, Kiana Ehsani, and Aniruddha Kembhavi. Phone2proc: Bringing robust robots into our chaotic world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9665– 9675, 2023

  37. [37]

    Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

    Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Automated creation of digital cousins for robust policy learning.arXiv preprint arXiv:2410.07408, 2024

  38. [38]

    Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

    Hongchi Xia, Chih-Hao Lin, Hao-Yu Hsu, Quentin Leboutet, Katelyn Gao, Michael Paulitsch, Benjamin Ummenhofer, and Shenlong Wang. Holoscene: Simulation- ready interactive 3d worlds from a single video.Advances in Neural Information Processing Systems, 38:32501–32524, 2026

  39. [39]

    Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

    Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In2025 Interna- tional Conference on 3D Vision (3DV), pages 616–626. IEEE, 2025. GARDEN19

  40. [40]

    Midi: Multi-instance diffusion for single image to 3d scene generation

    Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi- Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

  41. [41]

    Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

    Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruc- tion from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

  42. [42]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

  43. [43]

    Sam 3: Segment anything with concepts, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, An- drewHuang,JieLei,TengyuMa,BaishanGuo,ArpitKalla,MarkusMarks,Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Ri...

  44. [44]

    Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

    Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulat- able indoor scene dataset with realistic layouts.arXiv preprint arXiv:2509.10813, 2025

  45. [45]

    Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A pho- torealistic synthetic dataset for holistic indoor scene understanding. InProceedings of the IEEE/CVF international conference on computer vision,pages10912–10922, 2021

  46. [46]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

  47. [47]

    Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

    Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M Seitz. Pho- toshape: Photorealistic materials for large-scale shape collections.arXiv preprint arXiv:1809.09761, 2018

  48. [48]

    Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

    Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, and Dahua Lin. Make-it-real: Unleashing large multimodal model for painting 3d objects with realistic materials.Advances in Neural Information Processing Systems, 37:99262– 99298, 2024

  49. [49]

    Geocalib: Learning single-image calibration with geometric optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. InEuro- pean Conference on Computer Vision, pages 1–20. Springer, 2024

  50. [50]

    Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

  51. [51]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  52. [52]

    A multi-view stereo benchmark 20 J

    ThomasSchops,JohannesLSchonberger,SilvanoGalliani,TorstenSattler,Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark 20 J. Sun et al. with high-resolution images and multi-camera videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

  53. [53]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  54. [54]

    Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learn- ing from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

  55. [55]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

  56. [56]

    Rio: 3d object instance re-localization in changing indoor environments

    Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re-localization in changing indoor environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019

  57. [57]

    # "Context Transformer Decoder Rotation Head !

    Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on pattern analysis and machine intelli- gence, 13(4):376–380, 2002. GARDEN21 Supplementary Material Overview This supplementary material provides additional technical details and experi- mental results for GARDEN. Section A describes the arc...