
arxiv: 2605.18451 · v1 · pith:K2T5QKQRnew · submitted 2026-05-18 · 💻 cs.CV · cs.GR

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords 3D room generation · Blender code synthesis · agentic framework · top-down image · indoor scene synthesis · MLLM agent · code-based modeling · virtual environment

The pith

A multi-stage MLLM agent parses top-down room images and outputs executable Blender code to build 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code-as-Room, a framework in which a multimodal large language model acts as an agent to turn a single top-down image into Blender Python scripts that define room geometry, materials, and lighting. It proceeds through a structured sequence of image parsing, relationship extraction, and code writing, with a cross-stage memory module that carries forward earlier decisions to avoid losing context. Existing image-conditioned agents often become unstable or loop indefinitely, while text prompts lose exact layout details, so this code-first route aims to produce functional, editable 3D rooms for design, VR, gaming, and embodied AI. A dedicated benchmark lets the authors compare the new pipeline against prior agent methods.

Core claim

Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline equipped with a cross-stage memory module and structured execution harness.
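One way to picture a structured execution harness is as a bounded retry loop around each stage, so that a stage which keeps producing invalid code is stopped rather than allowed to loop. The sketch below is an illustration under assumptions, not the authors' implementation: `run_with_budget`, `check_syntax`, `toy_stage`, and `max_attempts` are hypothetical names, and the "MLLM call" is a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    ok: bool
    output: Optional[str] = None
    error: Optional[str] = None
    attempts: int = 0


def run_with_budget(stage: Callable[[Optional[str]], str],
                    validate: Callable[[str], Optional[str]],
                    max_attempts: int = 3) -> StageResult:
    """Run `stage` until `validate` accepts its output or the budget is spent.

    `stage` receives the previous error message (or None) so it can repair
    its output; `validate` returns None on success or an error string.
    """
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = stage(error)
        error = validate(output)
        if error is None:
            return StageResult(ok=True, output=output, attempts=attempt)
    # Budget exhausted: report failure instead of looping forever.
    return StageResult(ok=False, error=error, attempts=max_attempts)


def check_syntax(code: str) -> Optional[str]:
    """Hypothetical validator: accept code that at least parses as Python."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"


def toy_stage(previous_error: Optional[str]) -> str:
    """Stand-in for an MLLM call: broken at first, fixed once it sees an error."""
    if previous_error is None:
        return "wall_height = (2.7"  # unbalanced parenthesis
    return "wall_height = 2.7"


result = run_with_budget(toy_stage, check_syntax, max_attempts=3)
stuck = run_with_budget(lambda err: "x = (", check_syntax, max_attempts=4)
```

Here `result` succeeds on the second attempt, while `stuck` ends with `ok=False` after exactly four attempts. A real harness would presumably validate by executing the script in Blender rather than by a syntax check alone.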

What carries the argument

The multi-stage agentic pipeline that maintains cross-stage memory while generating Blender code from parsed image elements and spatial relations.
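A minimal sketch of how a cross-stage memory could carry decisions forward, assuming a simple key-value store shared by all stages. The class `Memory`, the stage functions, and the emitted helper names (`make_floor`, `place`) are hypothetical and the parsed values are made up; the paper's actual memory interface is not specified in the material summarized here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical cross-stage store: written once, readable by later stages."""
    facts: Dict[str, Any] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def remember(self, stage: str, key: str, value: Any) -> None:
        self.facts[key] = value
        self.log.append(f"{stage}: {key}")


Stage = Callable[[Memory], None]


def parse_image(mem: Memory) -> None:
    # Stand-in for MLLM parsing of the top-down image (values are made up).
    mem.remember("parse", "room_size_m", (4.0, 3.0))
    mem.remember("parse", "objects", [{"name": "bed", "center": (1.0, 1.5)}])


def layout_code(mem: Memory) -> None:
    width, depth = mem.facts["room_size_m"]
    mem.remember("layout", "floor_code", f"make_floor({width}, {depth})")


def object_code(mem: Memory) -> None:
    # Reuses the parsed objects instead of re-deriving them from the image.
    lines = [f"place('{o['name']}', {o['center']})" for o in mem.facts["objects"]]
    mem.remember("objects", "object_code", lines)


def run_pipeline(stages: List[Stage]) -> Memory:
    mem = Memory()
    for stage in stages:
        stage(mem)
    return mem


memory = run_pipeline([parse_image, layout_code, object_code])
```

The point of the design is that later stages read `memory.facts` rather than a long conversation history, so an early decision such as the room size cannot silently drift between stages.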

If this is right

  • The generated code runs directly in Blender to produce complete, renderable 3D rooms.
  • The execution harness and memory module prevent the infinite loops and instability reported for earlier image-conditioned agents.
  • A new benchmark supplies standard evaluation protocols for code-based room synthesis methods.
  • Precise control over geometry, materials, and lighting becomes possible because the output is editable source code.
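To make "the output is editable source code" concrete, the sketch below turns a small scene description into a Blender Python script as plain text. It is an assumption-laden illustration, not output from Code-as-Room: the scene values and the helpers `emit_box` and `emit_scene` are hypothetical, and every object is approximated by a scaled cube. The emitted script uses `bpy.ops.mesh.primitive_cube_add` and would need to be run inside Blender; the generator itself runs in ordinary Python.

```python
from typing import Dict, List

HEADER = "import bpy\n"


def emit_box(name: str, center, size) -> str:
    """Return Blender Python lines that add one named, scaled unit cube."""
    cx, cy, cz = center
    sx, sy, sz = size
    return (
        f"bpy.ops.mesh.primitive_cube_add(size=1.0, location=({cx}, {cy}, {cz}))\n"
        f"obj = bpy.context.active_object\n"
        f"obj.name = {name!r}\n"
        f"obj.scale = ({sx}, {sy}, {sz})\n"
    )


def emit_scene(objects: List[Dict]) -> str:
    """Concatenate the header and one cube block per object."""
    body = "".join(emit_box(o["name"], o["center"], o["size"]) for o in objects)
    return HEADER + body


# Made-up scene: centers and sizes in metres.
scene = [
    {"name": "bed", "center": (1.0, 1.5, 0.25), "size": (2.0, 1.6, 0.5)},
    {"name": "desk", "center": (3.2, 0.4, 0.375), "size": (1.2, 0.6, 0.75)},
]
script = emit_scene(scene)
```

Editing a scene then amounts to changing a number in `script` (for example the desk's location) and re-running it, which is the kind of control the bullet above refers to.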

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The editable nature of the code could let users iteratively refine scenes by editing the script rather than restarting from the image.
  • The same staged parsing-plus-code approach might transfer to generating 3D models of outdoor scenes or non-room interiors from different reference views.
  • Because the code is executable, it could be combined with simulation engines to test embodied AI agents inside the synthesized rooms without additional modeling steps.

Load-bearing premise

The multimodal model can extract accurate object identities and spatial relationships from one top-down image without omissions or errors that would make the generated code fail to match the scene.

What would settle it

Execute the output Blender code on a held-out set of top-down images and measure whether the resulting 3D models, when viewed from matching angles, reproduce the original object counts, positions, and approximate sizes within a small tolerance.
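The test described above can be sketched as a tolerance-based match between annotated and generated objects. This is one possible scoring rule, not the paper's benchmark protocol: the `Box` type, the greedy matching, and the tolerance values are assumptions chosen for illustration, and the example scene is made up.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    category: str
    center: Tuple[float, float]  # metres, floor-plane coordinates
    size: Tuple[float, float]    # footprint width and depth in metres


def match_rate(reference: List[Box], generated: List[Box],
               pos_tol: float = 0.25, size_tol: float = 0.2) -> float:
    """Fraction of reference boxes with a same-category generated box whose
    center lies within `pos_tol` metres and whose size differs by at most
    `size_tol` (relative) on each axis. Matching is greedy and one-to-one."""
    unused = list(generated)
    matched = 0
    for ref in reference:
        for cand in unused:
            close = dist(ref.center, cand.center) <= pos_tol
            similar = all(abs(r - c) <= size_tol * r
                          for r, c in zip(ref.size, cand.size))
            if cand.category == ref.category and close and similar:
                unused.remove(cand)
                matched += 1
                break
    return matched / len(reference) if reference else 1.0


reference = [
    Box("bed", (1.0, 1.5), (2.0, 1.6)),
    Box("desk", (3.2, 0.4), (1.2, 0.6)),
    Box("chair", (3.2, 1.0), (0.5, 0.5)),
]
generated = [
    Box("bed", (1.1, 1.5), (2.0, 1.5)),   # within tolerance
    Box("desk", (3.2, 1.4), (1.2, 0.6)),  # one metre off, so not matched
]
rate = match_rate(reference, generated)
count_error = len(generated) - len(reference)
```

In this toy case only the bed matches, so `rate` is one third and `count_error` is -1 (the chair is missing). Reporting the match rate and the count error separately keeps omissions distinguishable from misplacements.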

Figures

Figures reproduced from arXiv: 2605.18451 by Jinghao Yan, Jinkun Hao, Junru Lu, Wanshui Gan, Xudong Xu, Yixuan Yang, Zhaoyang Lyu, Zhen Luo.

Figure 1: Code-as-Room brings diverse interactive 3D scenes from a single top-down view image. We design an agentic system with a structured execution harness and activate the MLLMs' ability to understand, design, and code the 3D rooms in Blender.

Figure 2: Overview of the Code-as-Room pipeline. A single top-down view image is progressively transformed into a fully renderable 3D scene through a sequence of specialized MLLM agent stages, organized into five phases: image-based scene structuring, layout code generation, layout-grounded object profiling, object-level code generation, and interior decoration code generation. Arrows denote data flow through the c…

Figure 3: Qualitative comparisons corresponding to the benchmark results in Table 1. With our …

Figure 4: Representative qualitative results corresponding to the benchmarks presented in Table 1.

Figure 5: Detailed qualitative analysis and performance comparisons relative to the benchmarks in …

Figure 6: Comparative performance analysis of Gemini 3.1-Pro integrated with CaR (ours) versus …

Figure 7: Visual Enhancement Comparison: From Base 3D Scenes (left) to Realistic Re-rendering …

Figure 8: Ablation study results evaluating the impact of the memory system and visual feedback …
Original abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Code-as-Room, an MLLM-based agentic framework that generates 3D indoor rooms from top-down view images by synthesizing executable Blender code. It parses the reference image to extract scene elements and spatial relationships, then uses a multi-stage pipeline for geometry, materials, and lighting synthesis, augmented by a cross-stage memory module to mitigate context forgetting. The work also introduces a dedicated benchmark for code-based 3D room synthesis and conducts comparisons against existing agent-based methods.

Significance. If the central claims hold, the framework could improve stability and precision in image-conditioned 3D room generation compared to prior MLLM agents, offering advantages in interpretability through code output and applicability to interior design, VR, gaming, and embodied AI. The introduction of a structured execution harness and a new benchmark provides concrete tools for future work in this area.

major comments (2)
  1. [Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.
  2. [Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.
minor comments (2)
  1. [Abstract] The abstract states that comparisons validate the harness but does not preview any key quantitative metrics or error rates, which would strengthen the summary for readers.
  2. [Method] Clarify the exact interface between the cross-stage memory module and the code execution environment to avoid ambiguity in how context is preserved across stages.
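The first major comment asks for explicit verification of the parsing stage. A minimal sketch of what such a check could look like is given below, assuming the parse yields axis-aligned footprints in room coordinates. This is an editorial illustration rather than anything the manuscript reports: `verify_layout` and its helpers are hypothetical, and the example parse is made up.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in metres


def footprint(center, size) -> Rect:
    """Axis-aligned footprint from a center point and (width, depth)."""
    (cx, cy), (w, d) = center, size
    return (cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)


def overlaps(a: Rect, b: Rect) -> bool:
    """True when two axis-aligned rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def verify_layout(room: Tuple[float, float], objects: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the parse passed."""
    width, depth = room
    problems = []
    rects = {o["name"]: footprint(o["center"], o["size"]) for o in objects}
    for name, (x0, y0, x1, y1) in rects.items():
        if x0 < 0 or y0 < 0 or x1 > width or y1 > depth:
            problems.append(f"{name} extends outside the room")
    for (n1, r1), (n2, r2) in combinations(rects.items(), 2):
        if overlaps(r1, r2):
            problems.append(f"{n1} overlaps {n2}")
    return problems


parsed = [
    {"name": "bed", "center": (1.0, 1.5), "size": (2.0, 1.6)},
    {"name": "desk", "center": (3.7, 0.4), "size": (1.2, 0.6)},
    {"name": "nightstand", "center": (1.8, 1.0), "size": (0.5, 0.5)},
]
problems = verify_layout((4.0, 3.0), parsed)
```

For this parse the check flags the desk as leaving the room and the nightstand as intersecting the bed. Overlap is only a heuristic in top-down views, since stacked items such as a rug under a table overlap legitimately, so a real check would need category-aware exceptions.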

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity and rigor.

point-by-point responses
  1. Referee: [Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.

    Authors: We agree that the image parsing stage is foundational and that top-down views present inherent challenges due to missing depth and occlusion information. The manuscript highlights the multi-stage pipeline and cross-stage memory module as mechanisms to maintain consistency and reduce error propagation across stages. However, we acknowledge that explicit verification, rollback procedures, and quantitative metrics for parsing accuracy are not reported in the current version. In the revised manuscript, we will add a new subsection under the method that includes human-evaluated parsing accuracy metrics (e.g., element detection precision and spatial relationship correctness on a subset of the benchmark) along with examples of how the agentic framework can detect and mitigate hallucinations through iterative code refinement.

    Revision: yes

  2. Referee: [Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.

    Authors: We thank the referee for this observation. The benchmark and evaluation protocols focus on end-to-end metrics such as code executability, visual fidelity, and functional correctness to enable fair comparisons with prior agent-based methods. We recognize that additional component-level analysis would strengthen the claims. In the revised manuscript, we will expand the experiments section with ablations that isolate the execution harness (comparing runs with and without it to quantify stability gains) and include proxy metrics for parsing fidelity where feasible, such as correlation between parsing quality and final room quality scores. This will clarify the sources of the reported improvements.

    Revision: yes
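The metrics promised in the two responses can be sketched in a few lines: set-based precision and recall for detected elements and relation triples, and a Pearson correlation between per-scene parsing quality and final room quality. The functions and all numbers below are hypothetical and for illustration only; they are not results from the paper or its benchmark.

```python
from math import sqrt
from typing import Iterable, Sequence, Tuple


def precision_recall(predicted: Iterable, annotated: Iterable) -> Tuple[float, float]:
    """Set-based precision and recall; works for labels or relation triples."""
    pred, gold = set(predicted), set(annotated)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up annotations for one scene.
obj_p, obj_r = precision_recall(
    predicted={"bed", "desk", "lamp", "sofa"},
    annotated={"bed", "desk", "chair"},
)
rel_p, rel_r = precision_recall(
    predicted={("lamp", "on", "desk"), ("bed", "against", "north_wall")},
    annotated={("lamp", "on", "desk"), ("chair", "facing", "desk")},
)

# Made-up per-scene scores: parsing quality vs. final room quality.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

A high correlation alone would not separate the harness's contribution from the parser's; that still requires the with-and-without-harness ablation the authors describe.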

Circularity Check

0 steps flagged

No circularity: engineering framework with no derivations or fitted predictions

full rationale

The paper presents Code-as-Room as an MLLM-based agentic framework that parses top-down images and synthesizes Blender code via a multi-stage pipeline with a cross-stage memory module. No equations, parameters, or first-principles derivations appear in the abstract or description. The approach is an explicit engineering construction whose claims are evaluated on a new benchmark rather than reducing to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are invoked. The central pipeline is externally falsifiable through execution and benchmark metrics, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an applied engineering system rather than a mathematical derivation; no free parameters, axioms, or invented physical entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1150 out tokens · 24712 ms · 2026-05-20T11:46:12.001471+00:00 · methodology

