pith. sign in

arxiv: 2605.20837 · v1 · pith:MO66UAR7new · submitted 2026-05-20 · 💻 cs.CV · cs.AI

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Pith reviewed 2026-05-21 04:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords architectural spatial intelligencevision-language modelsspatial benchmarklayout understandingspatial reasoningVLM evaluation3D scene understandingrobot navigation
0
0 comments X

The pith

Vision-Language Models show marked gaps from trained humans in understanding architectural spaces like layouts and transformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ArchSIBench to test higher-level spatial skills in vision-language models, moving past simple tasks like counting objects to cover layout understanding, circulation patterns, and functional zoning. The benchmark organizes evaluation into five dimensions—perception, reasoning, navigation, transformation, and configuration—through seventeen subtasks and three thousand expert-annotated question-answer pairs. Results indicate that most models differ substantially from human baselines and vary widely across dimensions, although a few leading models nearly match humans who lack architectural training. A sympathetic reader would care because closing these gaps could improve robot navigation, embodied interaction, and 3D scene generation. The work highlights a remaining shortfall against trained human evaluators, especially on transformation and configuration tasks.

Core claim

ArchSIBench is a benchmark for architectural spatial intelligence drawn from architecture, cognitive science, and psychology perspectives. It spans five core dimensions with seventeen fine-grained subtasks and three thousand question-answer pairs created through careful manual annotation by experts with architectural backgrounds. Evaluations of various vision-language models show that their architectural spatial intelligence exhibits significant differences from human baselines, along with substantial variability across capability dimensions. Some state-of-the-art models approach the performance of human evaluators without architectural training, yet a clear gap remains relative to human eva

What carries the argument

ArchSIBench, a benchmark with five dimensions and seventeen subtasks that measures higher-level architectural spatial cognition through three thousand expert-annotated QA pairs.

If this is right

  • Most vision-language models exhibit substantial variability in performance across the five capability dimensions of perception, reasoning, navigation, transformation, and configuration.
  • Some state-of-the-art models can approach the level of human evaluators who lack architectural training.
  • A clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning.
  • The benchmark supplies systematic resources for measuring and advancing vision-language model capabilities in robot navigation, embodied interaction, and 3D scene understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might prioritize fine-tuning on transformation and configuration examples to reduce the observed gaps in embodied AI applications.
  • Combining ArchSIBench results with existing basic spatial benchmarks could produce a fuller map of vision-language model strengths and weaknesses.
  • Architectural firms or simulation platforms could adopt similar task sets to evaluate AI tools for virtual building walkthroughs or automated design review.

Load-bearing premise

The seventeen subtasks and three thousand question-answer pairs created by experts with architectural backgrounds provide a valid and comprehensive measure of higher-level architectural spatial cognition including layout understanding, circulation patterns, and functional zoning.

What would settle it

A new test showing that current vision-language models match or exceed the scores of human evaluators with architectural training on the spatial transformation and configuration subtasks would falsify the reported performance gaps.

Figures

Figures reproduced from arXiv: 2605.20837 by Hongxuan Chen, Jiachen Lu, Jin Bai, Lei He, Qirui Shen, Weixin Huang, Wenda Wang, Zilong Huang.

Figure 1
Figure 1. Figure 1: Overview of ArchSIBench. which critically rely on architectural spatial intelligence. Despite the rapid progress of Vision-Language Models (VLMs) [19, 20] in these domains, it remains unclear whether they possess architectural spatial intelligence comparable to humans, or more stringently, to professional architects. Recently, significant progress has been made in benchmarking the spatial intelligence of V… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of data in ArchSIBench [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Dataset construction process. cleaning work. For example, for captions that conflict with the answer options, we obscure or eliminate them to avoid ambiguity. For questions related to embodied scale perception, we deliberately select images without human presence to avoid VLMs directly deriving answers based on human figures in the images, thereby forcing VLMs to engage in embodied imagination. For explici… view at source ↗
Figure 5
Figure 5. Figure 5: Performance of Proprietary VLMs on ArchSI [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Overview of ArchSIBench Tasks. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance of Proprietary VLMs. B Detailed Results of Different Series VLMs We present the performance of different VLM series, as shown in [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance of Open-Source VLMs. advances beyond scaling, such as structured spatial representations, geometric priors, and training for globally consistent reasoning. C Evaluation Details C.1 Question Templates ArchSIBench contains 28 different types of question. We construct a corresponding question template for each type of question. Depending on the requirements of each task, the text enclosed in angle… view at source ↗
Figure 10
Figure 10. Figure 10: Merged image example 1 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Human evaluation interfaces. • Architectural Element Understanding Error. This refers to the model’s failure to correctly identify or interpret the meanings of various elements in architectural drawings, such as stair orientation, room functions, furniture categories, spatial enclosure relationships, or plan symbols, thereby leading to subsequent errors in spatial understanding. • Viewpoint Transformation… view at source ↗
Figure 13
Figure 13. Figure 13: Performance of two human baselines with the best-performing VLM on ArchSIBench. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
read the original abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ArchSIBench, a benchmark for architectural spatial intelligence in VLMs. It defines five dimensions (perception, reasoning, navigation, transformation, configuration) with 17 subtasks and 3,000 expert-annotated QA pairs drawn from architecture and cognitive science perspectives. Evaluation of multiple VLMs shows most models differ significantly from human baselines, with some SOTA models approaching performance of humans without architectural training but clear gaps remaining versus trained humans, especially in transformation and configuration reasoning.

Significance. If the subtasks validly isolate higher-level architectural cognition (layout, circulation, zoning) beyond generic visual or linguistic skills, the benchmark supplies a useful public resource for tracking progress in embodied AI and 3D understanding. The release of the dataset and code supports reproducibility.

major comments (2)
  1. [Benchmark Construction] Benchmark Construction section: No inter-annotator agreement statistics or external validation are reported for the 3,000 QA pairs. This is load-bearing for the central claim, because the reported gaps between VLMs and trained humans are only interpretable if the items genuinely require architectural expertise rather than surface-level spatial or linguistic cues.
  2. [Experiments] Experiments / Human Baseline subsection: Details on human evaluator recruitment, exact training levels, number of participants per subtask, and statistical tests for performance differences are not provided. Without these, the claim that models approach untrained humans but lag trained humans in transformation and configuration cannot be fully assessed.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'significant differences' is used without reference to the specific metric (accuracy, normalized score) or effect size; this should be clarified for precision.
  2. [Results] Figure captions and tables: Ensure all subtasks are explicitly mapped to the five dimensions so readers can trace which capabilities drive the reported variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper accordingly to improve transparency and rigor.

read point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: No inter-annotator agreement statistics or external validation are reported for the 3,000 QA pairs. This is load-bearing for the central claim, because the reported gaps between VLMs and trained humans are only interpretable if the items genuinely require architectural expertise rather than surface-level spatial or linguistic cues.

    Authors: We agree that quantitative validation strengthens the benchmark. In the revised Benchmark Construction section we now report inter-annotator agreement (Fleiss' kappa of 0.83 on a 20% overlap sample annotated independently by three experts) and describe an external validation step in which two additional licensed architects reviewed 300 randomly sampled QA pairs for architectural relevance, confirming that items target layout, circulation, and zoning rather than generic visual cues. These additions support the interpretability of the reported VLM-human gaps. revision: yes

  2. Referee: [Experiments] Experiments / Human Baseline subsection: Details on human evaluator recruitment, exact training levels, number of participants per subtask, and statistical tests for performance differences are not provided. Without these, the claim that models approach untrained humans but lag trained humans in transformation and configuration cannot be fully assessed.

    Authors: We accept that more granular reporting is required. The revised Human Baseline subsection now specifies recruitment (targeted outreach to architecture departments plus general-population crowdsourcing), training definitions (trained: minimum two years of formal architectural education or equivalent professional experience; untrained: none), participant counts (18 trained and 30 untrained evaluators, with 4–6 per subtask), and statistical tests (two-sample t-tests with Bonferroni correction showing p < 0.01 differences concentrated in transformation and configuration). These details allow fuller assessment of the performance claims. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark creation with no derivations or self-referential predictions

full rationale

This paper presents ArchSIBench as a new dataset of 17 subtasks and 3000 expert-annotated QA pairs for evaluating VLMs on architectural spatial intelligence dimensions. All reported results consist of direct performance comparisons between models and external human baselines (with and without architectural training). No equations, fitted parameters, first-principles derivations, or predictions are claimed; the work contains no self-citation chains that justify core claims, no ansatzes, and no renaming of known results as novel derivations. The evaluation is therefore self-contained against external benchmarks and exhibits no circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about what counts as architectural spatial intelligence rather than on new mathematical constructs or fitted parameters.

axioms (1)
  • domain assumption The five core dimensions (perception, reasoning, navigation, transformation, configuration) and their 17 subtasks adequately represent higher-level architectural spatial cognition.
    This premise underpins the benchmark design and is stated in the abstract when describing the perspectives from architecture, cognitive science, and psychology.

pith-pipeline@v0.9.0 · 5838 in / 1341 out tokens · 30364 ms · 2026-05-21T04:52:59.321953+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 5 internal anchors

  1. [1]

    Basic books, 2011

    Howard Gardner.Frames of mind: The theory of multiple intelligences. Basic books, 2011. 1

  2. [2]

    Spatial intelligence: New futures for architecture.Places Journal, 2010

    William L Fox. Spatial intelligence: New futures for architecture.Places Journal, 2010. 1

  3. [3]

    Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014

    Daniel R Montello. Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014. 1

  4. [4]

    The space for culture and cognition.Poetics, 38(2):185–204, 2010

    Daina Cheyenne Harvey. The space for culture and cognition.Poetics, 38(2):185–204, 2010. 1

  5. [5]

    Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004

    Nora S Newcombe. Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004. 1, 3, 5

  6. [6]

    Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022

    Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Miola. Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022. 1

  7. [7]

    Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018

    Nora S Newcombe. Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018. 1, 3, 5

  8. [8]

    Three spaces of spatial cognition

    Barbara Tversky, Julie Bauer Morrison, Nancy Franklin, and David J Bryant. Three spaces of spatial cognition. The Professional Geographer, 51(4):516–524, 1999. 1, 3, 5

  9. [9]

    Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021

    Michal Berkowitz, Andri Gerber, Christian M Thurn, Beatrix Emo, Christoph Hoelscher, and Elsbeth Stern. Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021. 1

  10. [10]

    Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013

    Daniel R Montello and Martin Raubal. Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013. 1

  11. [11]

    Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007

    Ken J Sutton and Anthony P Williams. Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007. 1, 5

  12. [12]

    Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

    Lu Yue, Yue Fan, Shiwei Lian, Yu Zhao, Jiaxin Yu, Liang Xie, and Feitian Zhang. Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

  13. [13]

    Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025

    Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, and Qingquan Li. Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025. 1

  14. [14]

    Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 1

  15. [15]

    Scenethesis: A language and vision agentic framework for 3d scene generation,

    Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation.arXiv preprint arXiv:2505.02836, 2025. 1

  16. [16]

    Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025

    Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, and Shuai Lu. Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025. 1

  17. [17]

    Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025

    Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, and Ping Tan. Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025. 1

  18. [18]

    Spa- tiallm: Training large language models for structured in- door modeling.arXiv preprint arXiv:2506.07491, 2025

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor modeling.arXiv preprint arXiv:2506.07491, 2025. 1

  19. [19]

    Li, Adrien Bardes, Suzanne Petryk, Oscar Ma ˜nas, et al

    Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024. 2

  20. [20]

    Benchmark evaluations, applications, and challenges of large vision language models: A survey,

    Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey.arXiv preprint arXiv:2501.02189, 1:1, 2025. 2

  21. [21]

    Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025

    Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025. 2

  22. [22]

    Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

    Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods.arXiv preprint arXiv:2511.15722,

  23. [23]

    John Wiley & Sons, 2023

    Francis DK Ching.Architecture: Form, space, and order. John Wiley & Sons, 2023. 2 10

  24. [24]

    Space Syntax, 2007

    Bill Hillier.Space is the machine: a configurational theory of architecture. Space Syntax, 2007. 2, 3, 5

  25. [25]

    Cambridge university press, 1989

    Bill Hillier and Julienne Hanson.The social logic of space. Cambridge university press, 1989. 2, 3, 5

  26. [26]

    Mental representation of three-dimensional objects in visual problem solving and recognition

    Lynn A Cooper. Mental representation of three-dimensional objects in visual problem solving and recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(6):1097, 1990. 2

  27. [27]

    A visualization and orthographic drawing test using the macintosh computer

    Gary R Bertoline and Daniel C Miller. A visualization and orthographic drawing test using the macintosh computer. Engineering Design Graphics Journal, 54(1):1–7, 1990. 2

  28. [28]

    Measuring 3-d understanding on the web and in the laboratory

    Ken Sutton, Andrew Heathcote, and Miles Bore. Measuring 3-d understanding on the web and in the laboratory. Behavior Research Methods, 39(4):926–939, 2007. 2

  29. [29]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 3, 6

  30. [30]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

  31. [31]

    Introducing claude opus 4.5

    Anthropic. Introducing claude opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025. 3, 6

  32. [32]

    Introducing claude opus 4.6

    Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. 3, 6

  33. [33]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. 3, 6

  34. [34]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3, 6

  35. [35]

    Gemini 3: Our most intelligent ai model that brings any idea to life

    Google DeepMind. Gemini 3: Our most intelligent ai model that brings any idea to life. https://deepmind.google/models/gemini/, 2026. 3, 6

  36. [36]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 3, 6

  37. [37]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. 3, 6

  38. [38]

    Gemma: Our most capable open models

    Google DeepMind. Gemma: Our most capable open models. https://deepmind.google/models/gemma/, 2026. 3, 6

  39. [39]

    Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012

    Marina Vasilyeva and Stella F Lourenco. Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012. 3, 4

  40. [40]

    Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

    Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948. 3

  41. [41]

    The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017

    Russell A Epstein, Eva Zita Patai, Joshua B Julian, and Hugo J Spiers. The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017. 3

  42. [42]

    MIT press, 1964

    Kevin Lynch.The image of the city. MIT press, 1964. 3

  43. [43]

    Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976

    Bill Hillier, Adrian Leaman, Paul Stansall, and Michael Bedford. Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976. 3, 5

  44. [44]

    Space3d-bench: Spatial 3d question answering benchmark

    Emilia Szyma´nska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. Space3d-bench: Spatial 3d question answering benchmark. InEuropean Conference on Computer Vision, pages 68–85. Springer,

  45. [45]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16488–16498, 2024. 3

  46. [46]

    Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025

    Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025. 3

  47. [47]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024. 3

  48. [48]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022. 3 11

  49. [49]

    Sqa3d: Situated question answering in 3d scenes,

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 3

  50. [50]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 3

  51. [51]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 3

  52. [52]

    Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025

    Lukas Petersson, Axel Backlund, Axel Wennstöm, Hanna Petersson, Callum Sharrock, and Arash Dabiri. Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025. 4

  53. [53]

    Waffle: Multimodal floorplan understanding in the wild

    Keren Ganon, Morris Alper, Rachel Mikulinsky, and Hadar Averbuch-Elor. Waffle: Multimodal floorplan understanding in the wild. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1488–1497. IEEE, 2025. 4

  54. [54]

    Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

    Aleksei Kondratenko, Mussie Birhane, Houssame E Hsain, and Guido Maciocci. Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

  55. [55]

    Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012

    Luca Tommasi and Bruno Laeng. Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012. 4, 5

  56. [56]

    Spatial perception in virtual environments: Evaluating an architectural application

    Daniel Henry and Tom Furness. Spatial perception in virtual environments: Evaluating an architectural application. InProceedings of IEEE Virtual Reality Annual International Symposium, pages 33–40. IEEE, 1993. 4

  57. [57]

    Levels and structure of spatial knowledge

    Barbara Tverksy. Levels and structure of spatial knowledge. InCognitive mapping, pages 24–43. Routledge, 2018. 5

  58. [58]

    Using orientation information for qualitative spatial reasoning

    Christian Freksa. Using orientation information for qualitative spatial reasoning. InTheories and Methods of Spatio-Temporal Reasoning in Geographic Space: International Conference GIS—From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning Pisa, Italy, September 21–23, 1992 Proceedings, pages 162–178. Springer, 2005. 5

  59. [59]

    Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1

    Steffen Werner, Bernd Krieg-Brückner, Hanspeter A Mallot, Karin Schweizer, and Christian Freksa. Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1. InInformatik’97 Informatik als Innovationsmotor: 27. Jahrestagung der Gesellschaft für Informatik Aachen, 24.–26. September 1997, pages 41–50. Springer, 1997. 5

  60. [60]

    From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012

    Edgar Chan, Oliver Baumann, Mark A Bellgrove, and Jason B Mattingley. From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012. 5

  61. [61]

    Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000

    Jeffrey M Zacks, JON Mires, Barbara Tversky, and Eliot Hazeltine. Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000. 5

  62. [62]

    A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002

    Jeffrey M Zacks, John M Ollinger, Margaret A Sheridan, and Barbara Tversky. A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002. 5

  63. [63]

    Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

    Esin Hasgül. Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

  64. [64]

    Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020

    Wiem Zerouati and Tahar Bellal. Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020. 5

  65. [65]

    https://www.archdaily.com/

    archdaily. https://www.archdaily.com/. 5

  66. [66]

    https://www.gooood.cn/

    gooood. https://www.gooood.cn/. 5

  67. [67]

    https://www.archiposition.com/

    archiposition. https://www.archiposition.com/. 5

  68. [68]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 6

  69. [69]

    Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002

    Ruth A Childs and Andrew P Jaciw. Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002. 7

  70. [70]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991. 8 12

  71. [71]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

  72. [72]

    A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667,

    William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667, 2022. 8

  73. [73]

    Infinite photorealistic worlds using procedural generation

    Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, et al. Infinite photorealistic worlds using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12630–12641, 2023. 9

  74. [74]

    world model

    Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024. 9 13 A Detai...