SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation
Pith reviewed 2026-05-07 07:54 UTC · model grok-4.3
The pith
SpatialGrammar encodes 3D indoor scenes as grid placements that compile deterministically to valid geometry, letting LLMs produce layouts with fewer spatial errors and collisions than raw coordinates or code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpatialGrammar represents gravity-aligned indoor layouts as bird's-eye-view grid placements with deterministic compilation to 3D geometry. This representation supports verifiable constraint checking, which in turn powers SG-Agent's closed-loop refinement that enforces collision-free results and SG-Mini's training on synthetic validated data. Across 159 test scenes in five scenarios, the approach yields higher spatial fidelity and physical plausibility than prior LLM baselines, with the compact model remaining competitive in single-pass generation.
What carries the argument
SpatialGrammar, the domain-specific language that encodes indoor scenes as gravity-aligned bird's-eye-view grid placements with automatic deterministic compilation to 3D geometry for constraint verification.
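The representation can be pictured with a small sketch. The paper's concrete grammar is not reproduced in this review, so the placement syntax, cell size, and category table below are illustrative assumptions; only the shape of the idea, discrete BEV placements expanded deterministically into gravity-aligned boxes, is taken from the paper.

```python
# Minimal sketch of the grid-to-geometry idea (hypothetical syntax and
# category table; the paper's concrete grammar is not reproduced here).
# Each placement names a category, a BEV grid cell, and an orientation;
# the "compiler" deterministically expands it into an axis-aligned 3D box.

CELL = 0.5  # metres per grid cell (assumed)
# Footprint (in cells) and height (in metres) come from category metadata,
# not from the generator -- this is what makes compilation deterministic.
CATALOG = {
    "bed":   {"footprint": (4, 3), "height": 0.6},
    "table": {"footprint": (2, 2), "height": 0.75},
    "chair": {"footprint": (1, 1), "height": 0.9},
}

def compile_placement(category, cell_x, cell_y, rotated=False):
    """Deterministically turn one grid placement into a 3D AABB."""
    spec = CATALOG[category]
    w, d = spec["footprint"]
    if rotated:              # a 90-degree rotation swaps the footprint
        w, d = d, w
    x0, y0 = cell_x * CELL, cell_y * CELL
    # (min_x, min_y, min_z, max_x, max_y, max_z): gravity-aligned, z from 0
    return (x0, y0, 0.0, x0 + w * CELL, y0 + d * CELL, spec["height"])

scene = [("bed", 0, 0, False), ("table", 5, 1, False), ("chair", 5, 4, True)]
boxes = [compile_placement(*p) for p in scene]
print(boxes[0])  # (0.0, 0.0, 0.0, 2.0, 1.5, 0.6)
```

Because every box is derived from the category table rather than generated heights, any constraint checked on the compiled boxes is checked against geometry the generator cannot corrupt.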
If this is right
- SG-Agent can iteratively correct scenes using direct compiler feedback on collisions and constraints.
- SG-Mini trained solely on compiler-validated data matches larger models on single-shot generation tasks.
- The grid representation directly reduces spatial errors and physical violations across scenes of varying complexity.
- Deterministic compilation from the DSL guarantees that output geometry satisfies the checked constraints.
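The generate-compile-refine loop in the first bullet can be sketched as follows. Only the closed-loop pattern itself comes from the paper; `toy_llm`, the BEV box format `(min_x, min_y, max_x, max_y)`, and the retry budget are illustrative assumptions standing in for the actual SG-Agent machinery.

```python
# Sketch of the closed-loop pattern attributed to SG-Agent: propose a
# layout, compile it, and feed concrete violations back until the scene
# is collision-free. `toy_llm` stands in for the language-model call.

def overlaps(a, b):
    """Strict 2D overlap between two BEV axis-aligned boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def check_collisions(boxes):
    """Return every colliding pair of indices -- the compiler feedback."""
    return [(i, j) for i in range(len(boxes))
            for j in range(i + 1, len(boxes)) if overlaps(boxes[i], boxes[j])]

def refine(propose_layout, max_rounds=5):
    feedback = []
    for _ in range(max_rounds):
        boxes = propose_layout(feedback)    # LLM proposes; stubbed here
        feedback = check_collisions(boxes)  # verifiable and deterministic
        if not feedback:
            return boxes                    # constraints satisfied
    raise RuntimeError("no valid layout within budget")

# Toy generator: the first draft collides, the "revision" resolves it.
def toy_llm(feedback):
    if not feedback:
        return [(0, 0, 2, 2), (1, 1, 3, 3)]  # overlapping first draft
    return [(0, 0, 2, 2), (2, 0, 4, 2)]      # moved apart after feedback

print(refine(toy_llm))  # [(0, 0, 2, 2), (2, 0, 4, 2)]
```

The design point is that feedback names specific violating pairs, so the generator is corrected against an external check rather than against its own self-assessment.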
Where Pith is reading between the lines
- The same grid-plus-compiler pattern could be applied to other structured generation problems such as 2D floor-plan layout or robotic task planning.
- Small specialized models trained on validated synthetic data may lower the cost of creating large collections of virtual environments for downstream AI training.
- If the grid assumption proves broad enough, the method could simplify integration between language-based scene generators and physics simulators used in embodied AI.
Load-bearing premise
Typical indoor scenes can be fully captured by placing objects on a flat gravity-aligned grid without losing important spatial relationships or physical constraints.
What would settle it
A benchmark containing scenes with non-grid-aligned furniture, multi-level stacking, or objects that require precise 3D overhangs or rotations would show whether the generated outputs degrade sharply in fidelity or validity compared with coordinate-based baselines.
original abstract
Automatically generating interactive 3D indoor scenes from natural language is crucial for virtual reality, gaming, and embodied AI. However, existing LLM-based approaches often suffer from spatial errors and collisions, in part because common scene representations (raw coordinates or verbose code) make it difficult for models to reason about 3D spatial relationships and physical constraints. We propose SpatialGrammar, a domain-specific language that represents gravity-aligned indoor layouts as BEV grid placements with deterministic compilation to valid 3D geometry, enabling verifiable constraint checking. Building on this representation, we develop (1) SG-Agent, a closed-loop system that uses compiler feedback to iteratively refine scenes and enforce collision constraints, and (2) SG-Mini, a 104M-parameter model trained entirely on compiler-validated synthetic data. Across 159 test scenes spanning five scenarios of different complexity, SG-Agent improves spatial fidelity and physical plausibility over prior methods, while SG-Mini performs competitively against larger LLM-based baselines on single-shot generation scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SpatialGrammar, a domain-specific language that represents gravity-aligned indoor layouts as BEV grid placements with deterministic compilation to valid 3D geometry, enabling verifiable constraint checking. It develops SG-Agent, a closed-loop LLM system that uses compiler feedback to iteratively refine scenes and enforce collision constraints, and SG-Mini, a 104M-parameter model trained entirely on compiler-validated synthetic data. Across 159 test scenes spanning five scenarios of varying complexity, the paper claims SG-Agent improves spatial fidelity and physical plausibility over prior methods, while SG-Mini performs competitively against larger LLM-based baselines in single-shot generation.
Significance. If the results hold, the work provides a structured approach to mitigating spatial errors in LLM-based 3D scene generation through a verifiable DSL and feedback mechanism, which is relevant for VR, gaming, and embodied AI. The closed-loop refinement and training of a compact model on synthetic data demonstrate efficiency advantages. However, significance is conditional on whether the BEV representation adequately captures the full range of indoor spatial constraints; limitations here would restrict generalizability of the reported gains.
major comments (2)
- [§3 (SpatialGrammar DSL definition)] The central claim that the representation enables enforcement of all relevant physical constraints rests on gravity-aligned BEV grid placements with deterministic compilation. This may omit stacking and surface-support relations (e.g., objects on tables or shelves), as only floor-level (x,y) positions and orientations with fixed heights appear to be encoded. If so, the compiler feedback loop cannot detect or correct these constraints, limiting the physical plausibility improvements to a subset of the 159 scenes. The paper must clarify the grammar's expressiveness limits and confirm whether test scenes include hierarchical placements.
- [§5 (Experiments and evaluation)] The abstract reports quantitative gains on 159 scenes, but without details on exact metrics for spatial fidelity and physical plausibility, baseline implementations, error bars, statistical significance tests, or data exclusion criteria, the robustness of the improvements cannot be verified. This directly affects the soundness assessment; please add per-scenario tables, ablation results on the feedback loop, and full metric definitions.
minor comments (2)
- [Abstract] The abstract refers to 'five scenarios of different complexity' without naming or briefly describing them; this should be clarified in the abstract or introduction for immediate context.
- [§3] Notation and examples for the SpatialGrammar DSL could be expanded with additional concrete scene examples in the main text to improve readability for readers unfamiliar with the representation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us strengthen the clarity and rigor of the manuscript. We address each major comment below, providing point-by-point responses and indicating the revisions made to the paper.
point-by-point responses
- Referee: [§3 (SpatialGrammar DSL definition)] The central claim that the representation enables enforcement of all relevant physical constraints rests on gravity-aligned BEV grid placements with deterministic compilation. This may omit stacking and surface-support relations (e.g., objects on tables or shelves), as only floor-level (x,y) positions and orientations with fixed heights appear to be encoded. If so, the compiler feedback loop cannot detect or correct these constraints, limiting the physical plausibility improvements to a subset of the 159 scenes. The paper must clarify the grammar's expressiveness limits and confirm whether test scenes include hierarchical placements.
Authors: We appreciate the referee's observation on the scope of the DSL. SpatialGrammar is explicitly designed for gravity-aligned, floor-level indoor layouts represented as BEV grid placements, where each object receives an (x, y) grid coordinate and orientation, with height and vertical extent determined deterministically by the compiler from object category metadata. This enables verifiable floor-plane collision and boundary constraint checking. The grammar does not encode stacking or surface-support relations (e.g., objects resting on tables or shelves), as these would require a hierarchical scene representation beyond the current flat grid model. We have revised §3 to include an explicit subsection stating the grammar's expressiveness limits and assumptions. All 159 test scenes across the five scenarios consist exclusively of floor-level placements with no hierarchical stacking or surface supports, as defined in the scenario specifications in §5. The reported gains in physical plausibility therefore apply fully within this floor-level scope. We have also added a limitations paragraph discussing extensions to hierarchical placements as future work. Revision: yes.
- Referee: [§5 (Experiments and evaluation)] The abstract reports quantitative gains on 159 scenes, but without details on exact metrics for spatial fidelity and physical plausibility, baseline implementations, error bars, statistical significance tests, or data exclusion criteria, the robustness of the improvements cannot be verified. This directly affects the soundness assessment; please add per-scenario tables, ablation results on the feedback loop, and full metric definitions.
Authors: We agree that expanded experimental reporting is required for full verification. In the revised manuscript we have augmented §5 with the following: (1) complete mathematical definitions of spatial fidelity (collision rate, overlap ratio, boundary adherence) and physical plausibility (stability score, support validity) metrics in a new §5.1 subsection; (2) per-scenario tables (new Table 3) reporting means and standard deviations for each metric across all five scenarios, with corresponding error bars added to Figure 4; (3) results of paired t-tests with p-values to establish statistical significance; (4) detailed descriptions of baseline implementations in §5.2, including prompting strategies and model versions for prior methods; (5) ablation studies on the compiler feedback loop (new Table 4) comparing SG-Agent performance with and without iterative refinement; and (6) an explicit statement that no scenes were excluded: all 159 test scenes were evaluated. Evaluation code and metric implementations have been added to the supplementary material to support reproducibility. Revision: yes.
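The rebuttal lists metric names without formulas, so for readers unfamiliar with such quantities the sketch below shows one plausible reading of two of them. The definitions here (pairwise collision rate over BEV boxes, fraction of objects fully inside the room rectangle) are illustrative stand-ins, not the paper's own.

```python
# Hedged sketch of two fidelity metrics named in the rebuttal. Boxes are
# BEV AABBs (min_x, min_y, max_x, max_y); the room is one rectangle.
# Both formulas are assumed, since the paper's definitions are not given.

def overlap_area(a, b):
    """Area of intersection between two BEV boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def collision_rate(boxes):
    """Fraction of object pairs with positive overlap (assumed definition)."""
    pairs = [(i, j) for i in range(len(boxes))
             for j in range(i + 1, len(boxes))]
    if not pairs:
        return 0.0
    bad = sum(1 for i, j in pairs if overlap_area(boxes[i], boxes[j]) > 0)
    return bad / len(pairs)

def boundary_adherence(boxes, room):
    """Fraction of objects fully inside the room (assumed definition)."""
    inside = sum(1 for b in boxes
                 if b[0] >= room[0] and b[1] >= room[1]
                 and b[2] <= room[2] and b[3] <= room[3])
    return inside / len(boxes) if boxes else 1.0

room = (0, 0, 10, 10)
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (9, 9, 11, 11)]
print(collision_rate(boxes), boundary_adherence(boxes, room))
```

Under these definitions one overlapping pair out of three gives a collision rate of 1/3, and the box poking past the room wall drops boundary adherence to 2/3; the referee's request is precisely for the paper to pin such formulas down.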
Circularity Check
No load-bearing circularity; compiler feedback and held-out evaluation remain independent of model outputs.
full rationale
The paper defines SpatialGrammar as an external DSL whose grammar rules and deterministic compiler are specified independently of any LLM outputs or fitted parameters. SG-Agent's iterative refinement loop receives constraint violations from this compiler rather than from quantities defined in terms of the agent's own generations. SG-Mini is trained on synthetic scenes that have already passed compiler validation, but the test set of 159 scenes is held out and the reported metrics (spatial fidelity, physical plausibility) are computed against ground-truth layouts using the same external compiler. No equation or claim reduces a prediction to a fitted input by construction, and no uniqueness theorem or ansatz is smuggled via self-citation. The only minor self-reference is the authors' own prior DSL work, which is not load-bearing for the central empirical claim. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Indoor scenes can be adequately represented as gravity-aligned layouts on a BEV grid without loss of critical spatial information or physical constraints.
invented entities (1)
- SpatialGrammar DSL (no independent evidence)
Reference graph
Works this paper leans on
- [1] Rio Aguina-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, and Daniel Ritchie. 2024. Open-universe indoor scene generation using LLM program synthesis and uncurated object databases. arXiv preprint arXiv:2403.09675 (2024).
- [2] Anthropic. 2024. Introducing the Model Context Protocol.
- [3]
- [4] Wei Deng, Mengshi Qi, and Huadong Ma. 2025. Global-local tree search in VLMs for 3D indoor scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 8975–8984.
- [5] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2023. LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems 36 (2023), 18225–18250.
- [6] Rao Fu, Zehao Wen, Zichen Liu, and Srinath Sridhar. 2024. AnyHome: Open-vocabulary generation of structured and textured 3D homes. In European Conference on Computer Vision. Springer, 52–70.
- [7] Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, and Yifan Ding. 2025. ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary. In Proceedings of the Computer Vision and Pattern Recognition Conference. 2891–2901.
- [8] Maxim Gumin, Do Heon Han, Seung Jean Yoo, Aditya Ganeshan, R Kenny Jones, Kailiang Fu, Rio Aguina-Kang, Stewart Morris, and Daniel Ritchie. 2025. Procedural Scene Programs for Open-Universe Scene Generation: LLM-Free Error Correction via Program Search. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11.
- [9] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. 2023. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In Proc. of ICCV. 7909–7920.
- [11] Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A Ross, Cordelia Schmid, and Alireza Fathi. 2024. SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code. In Proc. of ICML. 19252–19282.
- [12] Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, and Alireza Fathi. 2025. FirePlace: Geometric refinements of LLM common sense reasoning for 3D object placement. In Proceedings of the Computer Vision and Pattern Recognition Conference. 13466–13476.
- [13] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. 2023. BEVStereo: Enhancing depth estimation in multi-view 3D object detection with temporal stereo. In Proc. of the AAAI. 1486–1494.
- [14]
- [15] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2023. Latent-NeRF for shape-guided generation of 3D shapes and textures. In Proc. of CVPR. 12663–12673.
- [17] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65 (2021), 99–106.
- [18] Başak Melis Öcal, Maxim Tatarchenko, Sezer Karaoğlu, and Theo Gevers. 2024. SceneTeller: Language-to-3D Scene Generation. In European Conference on Computer Vision. Springer, 362–378.
- [19] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv:2209.14988 (2022).
- [20]
- [21]
- [22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proc. of ICML. 8748–8763.
- [24] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
- [25]
- [26] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. 2019. Habitat: A platform for embodied AI research. In Proc. of ICCV. 9339–9347.
- [27] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. 2023. SurroundOcc: Multi-camera 3D occupancy prediction for autonomous driving. In Proc. of ICCV. 21729–21740.
- [28] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. 2018. Gibson Env: Real-world perception for embodied agents. In Proc. of CVPR. 9068–9079.
- [29] Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. 2023. BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision. In Proc. of the CVPR. 17830–17839.
- [30]
- [31] Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, and Feng Zheng. 2025. OptiScene: LLM-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization. arXiv preprint arXiv:2506.07570 (2025).
- [32] Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. 2024. Holodeck: Language guided generation of 3D embodied AI environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16227–16237.
- [33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.