Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Jinghao Yan; Jinkun Hao; Junru Lu; Wanshui Gan; Xudong Xu; Yixuan Yang; Zhaoyang Lyu; Zhen Luo

arxiv: 2605.18451 · v1 · pith:K2T5QKQR · submitted 2026-05-18 · 💻 cs.CV · cs.GR

Yixuan Yang, Zhen Luo, Wanshui Gan, Jinkun Hao, Junru Lu, Jinghao Yan, Zhaoyang Lyu, Xudong Xu

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords 3D room generation · Blender code synthesis · agentic framework · top-down image · indoor scene synthesis · MLLM agent · code-based modeling · virtual environment

The pith

A multi-stage MLLM agent parses top-down room images and outputs executable Blender code to build 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code-as-Room, a framework in which a multimodal large language model acts as an agent to turn a single top-down image into Blender Python scripts that define room geometry, materials, and lighting. It proceeds through a structured sequence of image parsing, relationship extraction, and code writing, with a cross-stage memory module that carries forward earlier decisions to avoid losing context. Existing image-conditioned agents often become unstable or loop indefinitely, while text prompts lose exact layout details, so this code-first route aims to produce functional, editable 3D rooms for design, VR, gaming, and embodied AI. A dedicated benchmark lets the authors compare the new pipeline against prior agent methods.

Core claim

Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline equipped with a cross-stage memory module and structured execution harness.
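To make the code-first idea concrete, here is a minimal sketch (not the authors' implementation) of how a parsed scene description could be turned into Blender Python text. The `ParsedObject` record, its fields, and the `emit_blender_script` helper are hypothetical names. The emitted script relies on `bpy.ops.mesh.primitive_cube_add`, `bpy.data.materials.new`, and `bpy.ops.object.light_add` from Blender's Python API, so it only runs inside Blender; outside Blender the sketch just checks that the text is valid Python.

```python
import ast
from dataclasses import dataclass


@dataclass
class ParsedObject:
    """Hypothetical record for one element parsed from the top-down image."""
    name: str
    center_xy: tuple   # (x, y) position on the floor plan, in meters
    size_xyz: tuple    # (width, depth, height), in meters
    color_rgba: tuple  # viewport color, a stand-in for a full material


def emit_blender_script(objects, light_height=2.5):
    """Return Blender Python source that builds boxes, materials, and a light."""
    lines = ["import bpy", ""]
    for obj in objects:
        x, y = obj.center_xy
        w, d, h = obj.size_xyz
        mat_name = obj.name + "_mat"
        # A unit cube scaled to the object's extents, resting on the floor.
        lines += [
            f"bpy.ops.mesh.primitive_cube_add(size=1.0, location=({x}, {y}, {h / 2}), scale=({w}, {d}, {h}))",
            "box = bpy.context.active_object",
            f"box.name = {obj.name!r}",
            f"mat = bpy.data.materials.new(name={mat_name!r})",
            f"mat.diffuse_color = {obj.color_rgba!r}",
            "box.data.materials.append(mat)",
            "",
        ]
    lines.append(
        f"bpy.ops.object.light_add(type='POINT', location=(0.0, 0.0, {light_height}))"
    )
    return "\n".join(lines)


scene = [
    ParsedObject("bed", (1.0, 2.0), (1.6, 2.0, 0.5), (0.8, 0.8, 0.9, 1.0)),
    ParsedObject("desk", (-1.5, 0.5), (1.2, 0.6, 0.75), (0.4, 0.3, 0.2, 1.0)),
]
script = emit_blender_script(scene)
ast.parse(script)  # syntax check only; running the script requires Blender
```

Because the output is plain source text, a later edit (say, moving the desk) is a one-line change to the script rather than a new generation pass.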

What carries the argument

The multi-stage agentic pipeline that maintains cross-stage memory while generating Blender code from parsed image elements and spatial relations.
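The abstract does not specify how the memory module is structured. As one possible reading, the sketch below treats it as a keyed store that every stage reads from and writes to, so later stages reuse earlier decisions instead of re-deriving them. All names (`StageMemory`, `parse_image`, `extract_relations`, `write_code`) are hypothetical, and the stage bodies are placeholders where an MLLM call would go.

```python
from dataclasses import dataclass, field


@dataclass
class StageMemory:
    """Hypothetical cross-stage memory: a keyed store of stage outputs."""
    records: dict = field(default_factory=dict)

    def write(self, stage, payload):
        self.records[stage] = payload

    def read(self, stage):
        return self.records[stage]


def parse_image(image_path, memory):
    # Placeholder for an MLLM call that lists scene elements.
    memory.write("elements", ["bed", "desk"])


def extract_relations(image_path, memory):
    # Reads the earlier stage's decision instead of re-parsing the image.
    elements = memory.read("elements")
    memory.write("relations", [(elements[1], "left_of", elements[0])])


def write_code(image_path, memory):
    # Code writing sees both earlier outputs through the shared memory.
    elements = memory.read("elements")
    relations = memory.read("relations")
    header = [f"# {a} {rel} {b}" for a, rel, b in relations]
    body = [f"# TODO: add geometry for {name}" for name in elements]
    memory.write("code", "\n".join(header + body))


def run_pipeline(image_path):
    memory = StageMemory()
    for stage in (parse_image, extract_relations, write_code):
        stage(image_path, memory)
    return memory


memory = run_pipeline("room_top_down.png")
```

The design point is that each stage receives the same memory object, so a decision made during parsing cannot silently drift by the time code is written.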

If this is right

The generated code runs directly in Blender to produce complete, renderable 3D rooms.
The execution harness and memory module prevent the infinite loops and instability reported for earlier image-conditioned agents.
A new benchmark supplies standard evaluation protocols for code-based room synthesis methods.
Precise control over geometry, materials, and lighting becomes possible because the output is editable source code.
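The abstract does not describe how the execution harness is built. One common way to rule out unbounded retry loops is a fixed attempt budget with error feedback, sketched below under that assumption. `run_with_budget`, `toy_generator`, and the feedback format are hypothetical, and a real harness would run the script inside Blender (for example in a subprocess with a timeout) rather than with `exec`.

```python
import traceback


def run_with_budget(generate, max_attempts=3):
    """Try generated code at most `max_attempts` times, feeding errors back."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = generate(feedback)
        try:
            # compile() is inside the try block so syntax errors also count.
            exec(compile(source, "<generated>", "exec"), {})
            return {"ok": True, "attempts": attempt, "source": source}
        except Exception:
            feedback = traceback.format_exc(limit=1)
    # Budget exhausted: stop instead of looping forever.
    return {"ok": False, "attempts": max_attempts, "source": None, "error": feedback}


def toy_generator(feedback):
    # Stand-in for an MLLM: the first draft has a bug, the retry fixes it.
    if feedback is None:
        return "width = 4.0\narea = width * depth"
    return "width, depth = 4.0, 3.0\narea = width * depth"


result = run_with_budget(toy_generator)
failed = run_with_budget(lambda feedback: "1 / 0", max_attempts=2)
```

With the toy generator the first draft raises a `NameError`, the error text is passed back, and the second draft runs; a generator that never produces working code stops after the budget is spent.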

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The editable nature of the code could let users iteratively refine scenes by editing the script rather than restarting from the image.
The same staged parsing-plus-code approach might transfer to generating 3D models of outdoor scenes or non-room interiors from different reference views.
Because the code is executable, it could be combined with simulation engines to test embodied AI agents inside the synthesized rooms without additional modeling steps.

Load-bearing premise

The multimodal model can extract accurate object identities and spatial relationships from one top-down image without omissions or errors that would make the generated code fail to match the scene.

What would settle it

Execute the output Blender code on a held-out set of top-down images and measure whether the resulting 3D models, when viewed from matching angles, reproduce the original object counts, positions, and approximate sizes within a small tolerance.
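A minimal sketch of such a check, assuming both the reference and the generated room have already been reduced to per-object footprints on the floor plane. The dictionary layout, the function name `layout_agreement`, and the tolerance values are illustrative assumptions, not the paper's benchmark protocol.

```python
import math


def layout_agreement(reference, generated, pos_tol=0.25, size_tol=0.2):
    """Compare two layouts given as {name: (x, y, width, depth)} in meters."""
    matched = 0
    for name, (rx, ry, rw, rd) in reference.items():
        if name not in generated:
            continue
        gx, gy, gw, gd = generated[name]
        # Absolute tolerance on the center position.
        close = math.hypot(gx - rx, gy - ry) <= pos_tol
        # Relative tolerance on each footprint axis.
        sized = abs(gw - rw) <= size_tol * rw and abs(gd - rd) <= size_tol * rd
        if close and sized:
            matched += 1
    return {
        "count_match": len(reference) == len(generated),
        "matched_fraction": matched / len(reference) if reference else 0.0,
    }


reference = {"bed": (1.0, 2.0, 1.6, 2.0), "desk": (-1.5, 0.5, 1.2, 0.6)}
generated = {"bed": (1.1, 2.0, 1.5, 2.1), "desk": (-0.5, 0.5, 1.2, 0.6)}
report = layout_agreement(reference, generated)
```

In this toy case the object counts agree, the bed falls within both tolerances, and the desk is one meter off, so half of the reference objects are matched.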

Figures

Figures reproduced from arXiv: 2605.18451 by Jinghao Yan, Jinkun Hao, Junru Lu, Wanshui Gan, Xudong Xu, Yixuan Yang, Zhaoyang Lyu, Zhen Luo.

**Figure 1.** Code-as-Room brings diverse interactive 3D scenes from a single top-down view image. We design an agentic system with a structured execution harness and activate the MLLMs’ ability to understand, design, and code the 3D rooms in Blender.

**Figure 2.** Overview of the Code-as-Room pipeline. A single top-down view image is progressively transformed into a fully renderable 3D scene through a sequence of specialized MLLM agent stages, organized into five phases: image-based scene structuring, layout code generation, layout-grounded object profiling, object-level code generation, and interior decoration code generation. Arrows denote data flow through the c…

**Figure 3.** Qualitative comparisons corresponding to the benchmark results in Table 1. With our …

**Figure 4.** Representative qualitative results corresponding to the benchmarks presented in Table 1.

**Figure 5.** Detailed qualitative analysis and performance comparisons relative to the benchmarks in …

**Figure 6.** Comparative performance analysis of Gemini 3.1-Pro integrated with CaR (ours) versus …

**Figure 7.** Visual Enhancement Comparison: From Base 3D Scenes (left) to Realistic Re-rendering …

**Figure 8.** Ablation study results evaluating the impact of the memory system and visual feedback …

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gives a structured agentic pipeline for turning top-down images into Blender code for rooms, with a harness and memory to cut looping, plus a new benchmark, but the abstract shows no numbers to back the gains.

The main point is that Code-as-Room uses an MLLM to parse a top-down room image and then write Blender code across stages for geometry, materials, and lighting, with an execution harness and cross-stage memory to keep things stable and avoid infinite loops or context loss. This is presented as a fix for instability in earlier image-conditioned agents.

The specific combination of code synthesis, harness, memory module, and a dedicated benchmark for top-down to Blender tasks looks like the fresh part relative to the cited agent literature. The pipeline itself is a reasonable engineering response to the problem of generating controllable 3D scenes for VR or training data. The benchmark is a concrete addition that could let others run systematic comparisons on code-based outputs. If the full experiments show clear improvements in success rate or reduced errors, that would be the useful output.

The soft spot is the missing quantitative evidence. The abstract describes the stages and claims the harness helps, but there are no results, ablations, or error breakdowns visible here. Top-down images already drop depth and occlusion cues, so the first parsing step by the MLLM is where mistakes in object identity or layout are most likely to start. The memory module keeps context but does not appear to add verification or rollback for those upstream errors, and the benchmark is end-to-end rather than isolating parsing accuracy. That leaves open whether the claimed stability comes from the harness or from other factors.

This is for researchers working on LLM agents for graphics, embodied AI scene synthesis, or synthetic data pipelines. A reader who wants practical code-generation setups or a new evaluation set for this task would get value from the benchmark and pipeline description. It deserves a serious referee to examine the actual numbers, the benchmark details, and whether the harness delivers measurable robustness. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Code-as-Room, an MLLM-based agentic framework that generates 3D indoor rooms from top-down view images by synthesizing executable Blender code. It parses the reference image to extract scene elements and spatial relationships, then uses a multi-stage pipeline for geometry, materials, and lighting synthesis, augmented by a cross-stage memory module to mitigate context forgetting. The work also introduces a dedicated benchmark for code-based 3D room synthesis and conducts comparisons against existing agent-based methods.

Significance. If the central claims hold, the framework could improve stability and precision in image-conditioned 3D room generation compared to prior MLLM agents, offering advantages in interpretability through code output and applicability to interior design, VR, gaming, and embodied AI. The introduction of a structured execution harness and a new benchmark provides concrete tools for future work in this area.

major comments (2)

[Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.
[Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.
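As an illustration of the kind of stage-level metric these comments ask for, the sketch below computes set-based precision and recall of parsed object labels against a human annotation for one image. The function name and the label lists are hypothetical, and the labels are placeholders rather than data from the paper.

```python
def detection_precision_recall(predicted, annotated):
    """Set-based precision and recall over object labels for one image."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Placeholder labels: the parser invents a lamp and misses two objects.
predicted_labels = ["bed", "desk", "lamp", "rug"]
annotated_labels = ["bed", "desk", "rug", "chair", "wardrobe"]
precision, recall = detection_precision_recall(predicted_labels, annotated_labels)
```

Reporting this per image, separately from the end-to-end scores, would show how much of the final error originates in the parsing stage.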

minor comments (2)

[Abstract] The abstract states that comparisons validate the harness but does not preview any key quantitative metrics or error rates, which would strengthen the summary for readers.
[Method] Clarify the exact interface between the cross-stage memory module and the code execution environment to avoid ambiguity in how context is preserved across stages.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity and rigor.

Referee: [Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.

Authors: We agree that the image parsing stage is foundational and that top-down views present inherent challenges due to missing depth and occlusion information. The manuscript highlights the multi-stage pipeline and cross-stage memory module as mechanisms to maintain consistency and reduce error propagation across stages. However, we acknowledge that explicit verification, rollback procedures, and quantitative metrics for parsing accuracy are not reported in the current version. In the revised manuscript, we will add a new subsection under the method that includes human-evaluated parsing accuracy metrics (e.g., element detection precision and spatial relationship correctness on a subset of the benchmark) along with examples of how the agentic framework can detect and mitigate hallucinations through iterative code refinement.

revision: yes

Referee: [Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.

Authors: We thank the referee for this observation. The benchmark and evaluation protocols focus on end-to-end metrics such as code executability, visual fidelity, and functional correctness to enable fair comparisons with prior agent-based methods. We recognize that additional component-level analysis would strengthen the claims. In the revised manuscript, we will expand the experiments section with ablations that isolate the execution harness (comparing runs with and without it to quantify stability gains) and include proxy metrics for parsing fidelity where feasible, such as correlation between parsing quality and final room quality scores. This will clarify the sources of the reported improvements.

revision: yes
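The proxy analysis mentioned in this response could be as simple as a Pearson correlation between per-scene parsing scores and final room-quality scores. The sketch below implements the textbook formula; the function name is an assumption and the score lists are made-up placeholders that only show the call, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four scenes (not results from the paper).
parsing_scores = [0.55, 0.70, 0.80, 0.95]
room_scores = [2.1, 3.0, 3.4, 4.6]
r = pearson(parsing_scores, room_scores)
```

A high correlation would suggest that end-to-end quality is largely driven by parsing, while a low one would point to the harness or later stages as the main source of gains.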

Circularity Check

0 steps flagged

No circularity: engineering framework with no derivations or fitted predictions

The paper presents Code-as-Room as an MLLM-based agentic framework that parses top-down images and synthesizes Blender code via a multi-stage pipeline with a cross-stage memory module. No equations, parameters, or first-principles derivations appear in the abstract or description. The approach is an explicit engineering construction whose claims are evaluated on a new benchmark rather than reducing to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are invoked. The central pipeline is externally falsifiable through execution and benchmark metrics, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an applied engineering system rather than a mathematical derivation; no free parameters, axioms, or invented physical entities are stated in the abstract.

[1] [1]

System card:claude opus 4.6.Anthropic Technical Report, 2026

Anthropic. System card:claude opus 4.6.Anthropic Technical Report, 2026. URLhttps://www-cdn. anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf. Accessed: 2026-05-12

work page 2026

[2] [2]

3d semantic parsing of large-scale indoor spaces

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016

work page 2016

[3] [3]

A generalized semantic representation for procedural generation of rooms

J Timothy Balint and Rafael Bidarra. A generalized semantic representation for procedural generation of rooms. InProceedings of the 14th International Conference on the F oundations of Digital Games, pages 1–8, 2019

work page 2019

[4] [4]

I- design: Personalized llm interior designer

Ata C ¸ elen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I- design: Personalized llm interior designer. InEuropean Conference on Computer Vision, pages 217–234. Springer, 2024

work page 2024

[5] [5]

Procthor: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

work page 2022

[6] [6]

Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems, 36:18225–18250, 2023

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems, 36:18225–18250, 2023

work page 2023

[7] [7]

Anyhome: Open-vocabulary generation of struc- tured and textured 3d homes

Rao Fu, Zehao Wen, Zichen Liu, and Srinath Sridhar. Anyhome: Open-vocabulary generation of struc- tured and textured 3d homes. InEuropean Conference on Computer Vision, pages 52–70. Springer, 2024

work page 2024

[8] [8]

Gemini 3.1 flash-lite model card.Google DeepMind Techni- cal Report, 2026

Google Gemini Team. Gemini 3.1 flash-lite model card.Google DeepMind Techni- cal Report, 2026. URLhttps://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Flash-Lite-Model-Card.pdf. Accessed: 2026-05-12

work page 2026

[9] [9]

Gemini 3.1 pro model card.Google DeepMind Technical Re- port, 2026

Google Gemini Team. Gemini 3.1 pro model card.Google DeepMind Technical Re- port, 2026. URLhttps://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-05-12

work page 2026

[10] [10]

Shapeassembly: Learning to generate programs for 3d shape structure synthesis.ACM Transactions on Graphics (TOG), 39(6):1–20, 2020

R Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J Mitra, and Daniel Ritchie. Shapeassembly: Learning to generate programs for 3d shape structure synthesis.ACM Transactions on Graphics (TOG), 39(6):1–20, 2020

work page 2020

[11] [11]

InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset

Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset.arXiv preprint arXiv:1809.00716, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[12] [12]

Openrooms: An end-to-end open framework for photorealistic indoor scene datasets.arXiv preprint arXiv:2007.12868, 2020

Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Meng Song, Yuhan Liu, Yu-Ying Yeh, Rui Zhu, Nitesh Gundavarapu, Jia Shi, et al. Openrooms: An end-to-end open framework for photorealistic indoor scene datasets.arXiv preprint arXiv:2007.12868, 2020

work page arXiv 2007

[13] [13]

Zero- 1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero- 1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

work page 2023

[14] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. In International conference on learning representations, volume 2024, pages 27676–27697, 2024.

[15] Sining Lu, Guan Chen, Nam Anh Dinh, Itai Lang, Ari Holtzman, and Rana Hanocka. Ll3m: Large language 3d modelers. arXiv preprint arXiv:2508.08228, 2025.

[16] Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, and Yanwei Fu. Stable: Simulation-ready tabletop layout generation via a semantics-physics dual system. URL https://arxiv.org/abs/2605.16137.

[18] Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala, and Vladlen Koltun. Interactive furniture layout using interior design guidelines. ACM transactions on graphics (TOG), 30(4):1–10, 2011.

[19] OpenAI. Gpt-5.5 model card and system card. OpenAI Technical Report, 2026. URL https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf. Accessed: 2026-05-12.

[20] Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, and Russ Tedrake. Scenesmith: Agentic generation of simulation-ready indoor scenes. arXiv preprint arXiv:2602.09153, 2026.

[21] Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026. URL https://qwen.ai/blog?id=qwen3.6-27b.

[22] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012.

[23] Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, and Stephen Gould. 3d-gpt: Procedural 3d modeling with large language models. In 2025 International Conference on 3D Vision (3DV), pages 1253–1263. IEEE, 2025.

[24] Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025.

[25] Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, et al. 3d-generalist: Self-improving vision-language-action models for crafting 3d worlds. arXiv preprint arXiv:2507.06484, 2025.

[26] Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, et al. Sage: Scalable agentic 3d scene generation for embodied ai. arXiv preprint arXiv:2602.10116, 2026.

[27] Ken Xu, James Stewart, and Eugene Fiume. Constraint-based automatic placement for scene composition. In Graphics Interface, volume 2, pages 25–34, 2002.

[28] Yandan Yang, Baoxiong Jia, Shujie Zhang, and Siyuan Huang. Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent. Advances in neural information processing systems, 38:140319–140351, 2026.

[29] Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model. arXiv preprint arXiv:2406.03866, 2024.

[30] Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, and Feng Zheng. Optiscene: Llm-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization. Advances in Neural Information Processing Systems, 38:42499–42529, 2026.

[31] Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Kiana Ehsani, and Eric Kolve. Holodeck: Language guided generation of 3d embodied AI environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[32] Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image. ACM Transactions on Graphics (TOG), 44(4):1–19, 2025.

[33] Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J Black, Trevor Darrell, Angjoo Kanazawa, and Haiwen Feng. Vision-as-inverse-graphics agent via interleaved multimodal reasoning. arXiv preprint arXiv:2601.11109, 2026.

[34] Lap-Fai Yu, Sai Kit Yeung, Chi-Keung Tang, Demetri Terzopoulos, Tony F Chan, and Stanley J Osher. Make it home: Automatic optimization of furniture arrangement. ACM Trans. Graph., 30(4):86, 2011.

[35] Lap-Fai Yu, Sai-Kit Yeung, and Demetri Terzopoulos. The clutterpalette: An interactive tool for detailing indoor scenes. IEEE transactions on visualization and computer graphics, 22(2):1138–1148, 2015.