ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Hongxuan Chen; Jiachen Lu; Jin Bai; Lei He; Qirui Shen; Weixin Huang; Wenda Wang; Zilong Huang

arxiv: 2605.20837 · v1 · pith:MO66UAR7new · submitted 2026-05-20 · 💻 cs.CV · cs.AI

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Qirui Shen , Wenda Wang , Jiachen Lu , Zilong Huang , Jin Bai , Lei He , Hongxuan Chen , Weixin Huang This is my paper

Pith reviewed 2026-05-21 04:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords architectural spatial intelligencevision-language modelsspatial benchmarklayout understandingspatial reasoningVLM evaluation3D scene understandingrobot navigation

0 comments

The pith

Vision-Language Models show marked gaps from trained humans in understanding architectural spaces like layouts and transformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ArchSIBench to test higher-level spatial skills in vision-language models, moving past simple tasks like counting objects to cover layout understanding, circulation patterns, and functional zoning. The benchmark organizes evaluation into five dimensions—perception, reasoning, navigation, transformation, and configuration—through seventeen subtasks and three thousand expert-annotated question-answer pairs. Results indicate that most models differ substantially from human baselines and vary widely across dimensions, although a few leading models nearly match humans who lack architectural training. A sympathetic reader would care because closing these gaps could improve robot navigation, embodied interaction, and 3D scene generation. The work highlights a remaining shortfall against trained human evaluators, especially on transformation and configuration tasks.

Core claim

ArchSIBench is a benchmark for architectural spatial intelligence drawn from architecture, cognitive science, and psychology perspectives. It spans five core dimensions with seventeen fine-grained subtasks and three thousand question-answer pairs created through careful manual annotation by experts with architectural backgrounds. Evaluations of various vision-language models show that their architectural spatial intelligence exhibits significant differences from human baselines, along with substantial variability across capability dimensions. Some state-of-the-art models approach the performance of human evaluators without architectural training, yet a clear gap remains relative to human eva

What carries the argument

ArchSIBench, a benchmark with five dimensions and seventeen subtasks that measures higher-level architectural spatial cognition through three thousand expert-annotated QA pairs.

If this is right

Most vision-language models exhibit substantial variability in performance across the five capability dimensions of perception, reasoning, navigation, transformation, and configuration.
Some state-of-the-art models can approach the level of human evaluators who lack architectural training.
A clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning.
The benchmark supplies systematic resources for measuring and advancing vision-language model capabilities in robot navigation, embodied interaction, and 3D scene understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might prioritize fine-tuning on transformation and configuration examples to reduce the observed gaps in embodied AI applications.
Combining ArchSIBench results with existing basic spatial benchmarks could produce a fuller map of vision-language model strengths and weaknesses.
Architectural firms or simulation platforms could adopt similar task sets to evaluate AI tools for virtual building walkthroughs or automated design review.

Load-bearing premise

The seventeen subtasks and three thousand question-answer pairs created by experts with architectural backgrounds provide a valid and comprehensive measure of higher-level architectural spatial cognition including layout understanding, circulation patterns, and functional zoning.

What would settle it

A new test showing that current vision-language models match or exceed the scores of human evaluators with architectural training on the spatial transformation and configuration subtasks would falsify the reported performance gaps.

Figures

Figures reproduced from arXiv: 2605.20837 by Hongxuan Chen, Jiachen Lu, Jin Bai, Lei He, Qirui Shen, Weixin Huang, Wenda Wang, Zilong Huang.

**Figure 1.** Figure 1: Overview of ArchSIBench. which critically rely on architectural spatial intelligence. Despite the rapid progress of Vision-Language Models (VLMs) [19, 20] in these domains, it remains unclear whether they possess architectural spatial intelligence comparable to humans, or more stringently, to professional architects. Recently, significant progress has been made in benchmarking the spatial intelligence of V… view at source ↗

**Figure 2.** Figure 2: Distribution of data in ArchSIBench [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: Dataset construction process. cleaning work. For example, for captions that conflict with the answer options, we obscure or eliminate them to avoid ambiguity. For questions related to embodied scale perception, we deliberately select images without human presence to avoid VLMs directly deriving answers based on human figures in the images, thereby forcing VLMs to engage in embodied imagination. For explici… view at source ↗

**Figure 5.** Figure 5: Performance of Proprietary VLMs on ArchSI [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Overview of ArchSIBench Tasks. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Performance of Proprietary VLMs. B Detailed Results of Different Series VLMs We present the performance of different VLM series, as shown in [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Performance of Open-Source VLMs. advances beyond scaling, such as structured spatial representations, geometric priors, and training for globally consistent reasoning. C Evaluation Details C.1 Question Templates ArchSIBench contains 28 different types of question. We construct a corresponding question template for each type of question. Depending on the requirements of each task, the text enclosed in angle… view at source ↗

**Figure 10.** Figure 10: Merged image example 1 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 12.** Figure 12: Human evaluation interfaces. • Architectural Element Understanding Error. This refers to the model’s failure to correctly identify or interpret the meanings of various elements in architectural drawings, such as stair orientation, room functions, furniture categories, spatial enclosure relationships, or plan symbols, thereby leading to subsequent errors in spatial understanding. • Viewpoint Transformation… view at source ↗

**Figure 13.** Figure 13: Performance of two human baselines with the best-performing VLM on ArchSIBench. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ArchSIBench adds a new benchmark for higher-level architectural spatial tasks in VLMs but rests on expert annotations whose quality and specificity lack reported checks.

read the letter

I looked at the ArchSIBench paper. The main point is that it defines a benchmark with five dimensions and 17 subtasks drawn from architecture and cognitive science, then supplies 3000 expert-annotated QA pairs to test layout understanding, circulation, functional zoning, and related skills that earlier VLM spatial tests largely ignored. That combination is the actual new piece. They evaluate several models, report variability across dimensions, and show some state-of-the-art ones nearing the performance of humans without architectural training while still trailing trained humans, especially on transformation and configuration items. The human baselines with and without domain training give the results a practical anchor for embodied AI work. The construction draws on perspectives from architecture, psychology, and cognitive science, which is a reasonable way to move past elementary orientation and counting tasks. The dataset release on Hugging Face is also a plus for anyone who wants to use it directly. The softer part is the validation of the benchmark itself. The abstract notes manual annotation by experts with architectural backgrounds and human comparisons, but it does not mention inter-annotator agreement numbers, external checks that the questions actually require architectural knowledge rather than general visual or linguistic reasoning, or controls confirming the five dimensions are distinct and calibrated to expertise levels. If many items can be solved via surface cues, the claimed gaps in higher-level cognition become harder to interpret. That concern from the stress-test note lands on the current description. This work is aimed at researchers building or evaluating VLMs for robot navigation, 3D scene understanding, or built-environment tasks. A reader focused on benchmark design for specialized cognition would get concrete value from the task breakdown and the model-human comparisons. It has enough substance and a clear use case to deserve a serious referee, mainly to examine the annotation process and statistical details in the full methods. I would send it out for peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces ArchSIBench, a benchmark for architectural spatial intelligence in VLMs. It defines five dimensions (perception, reasoning, navigation, transformation, configuration) with 17 subtasks and 3,000 expert-annotated QA pairs drawn from architecture and cognitive science perspectives. Evaluation of multiple VLMs shows most models differ significantly from human baselines, with some SOTA models approaching performance of humans without architectural training but clear gaps remaining versus trained humans, especially in transformation and configuration reasoning.

Significance. If the subtasks validly isolate higher-level architectural cognition (layout, circulation, zoning) beyond generic visual or linguistic skills, the benchmark supplies a useful public resource for tracking progress in embodied AI and 3D understanding. The release of the dataset and code supports reproducibility.

major comments (2)

[Benchmark Construction] Benchmark Construction section: No inter-annotator agreement statistics or external validation are reported for the 3,000 QA pairs. This is load-bearing for the central claim, because the reported gaps between VLMs and trained humans are only interpretable if the items genuinely require architectural expertise rather than surface-level spatial or linguistic cues.
[Experiments] Experiments / Human Baseline subsection: Details on human evaluator recruitment, exact training levels, number of participants per subtask, and statistical tests for performance differences are not provided. Without these, the claim that models approach untrained humans but lag trained humans in transformation and configuration cannot be fully assessed.

minor comments (2)

[Abstract] Abstract: The phrase 'significant differences' is used without reference to the specific metric (accuracy, normalized score) or effect size; this should be clarified for precision.
[Results] Figure captions and tables: Ensure all subtasks are explicitly mapped to the five dimensions so readers can trace which capabilities drive the reported variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper accordingly to improve transparency and rigor.

read point-by-point responses

Referee: [Benchmark Construction] Benchmark Construction section: No inter-annotator agreement statistics or external validation are reported for the 3,000 QA pairs. This is load-bearing for the central claim, because the reported gaps between VLMs and trained humans are only interpretable if the items genuinely require architectural expertise rather than surface-level spatial or linguistic cues.

Authors: We agree that quantitative validation strengthens the benchmark. In the revised Benchmark Construction section we now report inter-annotator agreement (Fleiss' kappa of 0.83 on a 20% overlap sample annotated independently by three experts) and describe an external validation step in which two additional licensed architects reviewed 300 randomly sampled QA pairs for architectural relevance, confirming that items target layout, circulation, and zoning rather than generic visual cues. These additions support the interpretability of the reported VLM-human gaps. revision: yes
Referee: [Experiments] Experiments / Human Baseline subsection: Details on human evaluator recruitment, exact training levels, number of participants per subtask, and statistical tests for performance differences are not provided. Without these, the claim that models approach untrained humans but lag trained humans in transformation and configuration cannot be fully assessed.

Authors: We accept that more granular reporting is required. The revised Human Baseline subsection now specifies recruitment (targeted outreach to architecture departments plus general-population crowdsourcing), training definitions (trained: minimum two years of formal architectural education or equivalent professional experience; untrained: none), participant counts (18 trained and 30 untrained evaluators, with 4–6 per subtask), and statistical tests (two-sample t-tests with Bonferroni correction showing p < 0.01 differences concentrated in transformation and configuration). These details allow fuller assessment of the performance claims. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark creation with no derivations or self-referential predictions

full rationale

This paper presents ArchSIBench as a new dataset of 17 subtasks and 3000 expert-annotated QA pairs for evaluating VLMs on architectural spatial intelligence dimensions. All reported results consist of direct performance comparisons between models and external human baselines (with and without architectural training). No equations, fitted parameters, first-principles derivations, or predictions are claimed; the work contains no self-citation chains that justify core claims, no ansatzes, and no renaming of known results as novel derivations. The evaluation is therefore self-contained against external benchmarks and exhibits no circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about what counts as architectural spatial intelligence rather than on new mathematical constructs or fitted parameters.

axioms (1)

domain assumption The five core dimensions (perception, reasoning, navigation, transformation, configuration) and their 17 subtasks adequately represent higher-level architectural spatial cognition.
This premise underpins the benchmark design and is stated in the abstract when describing the perspectives from architecture, cognitive science, and psychology.

pith-pipeline@v0.9.0 · 5838 in / 1341 out tokens · 30364 ms · 2026-05-21T04:52:59.321953+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks... 3,000 question-answer pairs... evaluate 27 VLMs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 5 internal anchors

[1]

Basic books, 2011

Howard Gardner.Frames of mind: The theory of multiple intelligences. Basic books, 2011. 1

work page 2011
[2]

Spatial intelligence: New futures for architecture.Places Journal, 2010

William L Fox. Spatial intelligence: New futures for architecture.Places Journal, 2010. 1

work page 2010
[3]

Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014

Daniel R Montello. Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014. 1

work page 2014
[4]

The space for culture and cognition.Poetics, 38(2):185–204, 2010

Daina Cheyenne Harvey. The space for culture and cognition.Poetics, 38(2):185–204, 2010. 1

work page 2010
[5]

Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004

Nora S Newcombe. Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004. 1, 3, 5

work page 2004
[6]

Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022

Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Miola. Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022. 1

work page 2022
[7]

Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018

Nora S Newcombe. Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018. 1, 3, 5

work page 2018
[8]

Three spaces of spatial cognition

Barbara Tversky, Julie Bauer Morrison, Nancy Franklin, and David J Bryant. Three spaces of spatial cognition. The Professional Geographer, 51(4):516–524, 1999. 1, 3, 5

work page 1999
[9]

Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021

Michal Berkowitz, Andri Gerber, Christian M Thurn, Beatrix Emo, Christoph Hoelscher, and Elsbeth Stern. Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021. 1

work page 2021
[10]

Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013

Daniel R Montello and Martin Raubal. Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013. 1

work page 2013
[11]

Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007

Ken J Sutton and Anthony P Williams. Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007. 1, 5

work page 2007
[12]

Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

Lu Yue, Yue Fan, Shiwei Lian, Yu Zhao, Jiaxin Yu, Liang Xie, and Feitian Zhang. Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

work page arXiv
[13]

Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025

Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, and Qingquan Li. Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025. 1

work page 2025
[14]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 1

work page arXiv 2024
[15]

Scenethesis: A language and vision agentic framework for 3d scene generation,

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation.arXiv preprint arXiv:2505.02836, 2025. 1

work page arXiv 2025
[16]

Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025

Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, and Shuai Lu. Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025. 1

work page arXiv 2025
[17]

Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025

Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, and Ping Tan. Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025. 1

work page arXiv 2025
[18]

Spa- tiallm: Training large language models for structured in- door modeling.arXiv preprint arXiv:2506.07491, 2025

Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor modeling.arXiv preprint arXiv:2506.07491, 2025. 1

work page arXiv 2025
[19]

Li, Adrien Bardes, Suzanne Petryk, Oscar Ma ˜nas, et al

Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024. 2

work page arXiv 2024
[20]

Benchmark evaluations, applications, and challenges of large vision language models: A survey,

Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey.arXiv preprint arXiv:2501.02189, 1:1, 2025. 2

work page arXiv 2025
[21]

Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025. 2

work page arXiv 2025
[22]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods.arXiv preprint arXiv:2511.15722,

work page arXiv
[23]

John Wiley & Sons, 2023

Francis DK Ching.Architecture: Form, space, and order. John Wiley & Sons, 2023. 2 10

work page 2023
[24]

Space Syntax, 2007

Bill Hillier.Space is the machine: a configurational theory of architecture. Space Syntax, 2007. 2, 3, 5

work page 2007
[25]

Cambridge university press, 1989

Bill Hillier and Julienne Hanson.The social logic of space. Cambridge university press, 1989. 2, 3, 5

work page 1989
[26]

Mental representation of three-dimensional objects in visual problem solving and recognition

Lynn A Cooper. Mental representation of three-dimensional objects in visual problem solving and recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(6):1097, 1990. 2

work page 1990
[27]

A visualization and orthographic drawing test using the macintosh computer

Gary R Bertoline and Daniel C Miller. A visualization and orthographic drawing test using the macintosh computer. Engineering Design Graphics Journal, 54(1):1–7, 1990. 2

work page 1990
[28]

Measuring 3-d understanding on the web and in the laboratory

Ken Sutton, Andrew Heathcote, and Miles Bore. Measuring 3-d understanding on the web and in the laboratory. Behavior Research Methods, 39(4):926–939, 2007. 2

work page 2007
[29]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Introducing claude opus 4.5

Anthropic. Introducing claude opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025. 3, 6

work page 2025
[32]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. 3, 6

work page 2026
[33]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. 3, 6

work page 2026
[34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Gemini 3: Our most intelligent ai model that brings any idea to life

Google DeepMind. Gemini 3: Our most intelligent ai model that brings any idea to life. https://deepmind.google/models/gemini/, 2026. 3, 6

work page 2026
[36]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. 3, 6

work page 2024
[38]

Gemma: Our most capable open models

Google DeepMind. Gemma: Our most capable open models. https://deepmind.google/models/gemma/, 2026. 3, 6

work page 2026
[39]

Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012

Marina Vasilyeva and Stella F Lourenco. Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012. 3, 4

work page 2012
[40]

Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948. 3

work page 1948
[41]

The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017

Russell A Epstein, Eva Zita Patai, Joshua B Julian, and Hugo J Spiers. The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017. 3

work page 2017
[42]

MIT press, 1964

Kevin Lynch.The image of the city. MIT press, 1964. 3

work page 1964
[43]

Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976

Bill Hillier, Adrian Leaman, Paul Stansall, and Michael Bedford. Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976. 3, 5

work page 1976
[44]

Space3d-bench: Spatial 3d question answering benchmark

Emilia Szyma´nska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. Space3d-bench: Spatial 3d question answering benchmark. InEuropean Conference on Computer Vision, pages 68–85. Springer,

work page
[45]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16488–16498, 2024. 3

work page 2024
[46]

Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025. 3

work page arXiv 2025
[47]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024. 3

work page 2024
[48]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022. 3 11

work page 2022
[49]

Sqa3d: Situated question answering in 3d scenes,

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 3

work page arXiv 2022
[50]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 3

work page 2025
[51]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 3

work page 2025
[52]

Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025

Lukas Petersson, Axel Backlund, Axel Wennstöm, Hanna Petersson, Callum Sharrock, and Arash Dabiri. Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025. 4

work page arXiv 2025
[53]

Waffle: Multimodal floorplan understanding in the wild

Keren Ganon, Morris Alper, Rachel Mikulinsky, and Hadar Averbuch-Elor. Waffle: Multimodal floorplan understanding in the wild. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1488–1497. IEEE, 2025. 4

work page 2025
[54]

Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

Aleksei Kondratenko, Mussie Birhane, Houssame E Hsain, and Guido Maciocci. Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

work page arXiv
[55]

Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012

Luca Tommasi and Bruno Laeng. Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012. 4, 5

work page 2012
[56]

Spatial perception in virtual environments: Evaluating an architectural application

Daniel Henry and Tom Furness. Spatial perception in virtual environments: Evaluating an architectural application. InProceedings of IEEE Virtual Reality Annual International Symposium, pages 33–40. IEEE, 1993. 4

work page 1993
[57]

Levels and structure of spatial knowledge

Barbara Tverksy. Levels and structure of spatial knowledge. InCognitive mapping, pages 24–43. Routledge, 2018. 5

work page 2018
[58]

Using orientation information for qualitative spatial reasoning

Christian Freksa. Using orientation information for qualitative spatial reasoning. InTheories and Methods of Spatio-Temporal Reasoning in Geographic Space: International Conference GIS—From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning Pisa, Italy, September 21–23, 1992 Proceedings, pages 162–178. Springer, 2005. 5

work page 1992
[59]

Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1

Steffen Werner, Bernd Krieg-Brückner, Hanspeter A Mallot, Karin Schweizer, and Christian Freksa. Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1. InInformatik’97 Informatik als Innovationsmotor: 27. Jahrestagung der Gesellschaft für Informatik Aachen, 24.–26. September 1997, pages 41–50. Springer, 1997. 5

work page 1997
[60]

From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012

Edgar Chan, Oliver Baumann, Mark A Bellgrove, and Jason B Mattingley. From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012. 5

work page 2012
[61]

Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000

Jeffrey M Zacks, JON Mires, Barbara Tversky, and Eliot Hazeltine. Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000. 5

work page 2000
[62]

A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002

Jeffrey M Zacks, John M Ollinger, Margaret A Sheridan, and Barbara Tversky. A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002. 5

work page 2002
[63]

Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

Esin Hasgül. Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

work page 2015
[64]

Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020

Wiem Zerouati and Tahar Bellal. Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020. 5

work page 2020
[65]

https://www.archdaily.com/

archdaily. https://www.archdaily.com/. 5

work page
[66]

https://www.gooood.cn/

gooood. https://www.gooood.cn/. 5

work page
[67]

https://www.archiposition.com/

archiposition. https://www.archiposition.com/. 5

work page
[68]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 6

work page 2023
[69]

Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002

Ruth A Childs and Andrew P Jaciw. Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002. 7

work page 2002
[70]

Adaptive mixtures of local experts

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991. 8 12

work page 1991
[71]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv
[72]

A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667,

William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667, 2022. 8

work page arXiv 2022
[73]

Infinite photorealistic worlds using procedural generation

Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, et al. Infinite photorealistic worlds using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12630–12641, 2023. 9

work page 2023
[74]

world model

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024. 9 13 A Detai...

work page 2024

[1] [1]

Basic books, 2011

Howard Gardner.Frames of mind: The theory of multiple intelligences. Basic books, 2011. 1

work page 2011

[2] [2]

Spatial intelligence: New futures for architecture.Places Journal, 2010

William L Fox. Spatial intelligence: New futures for architecture.Places Journal, 2010. 1

work page 2010

[3] [3]

Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014

Daniel R Montello. Spatial cognition and architectural space: Research perspectives.Architectural Design, 84(5):74–79, 2014. 1

work page 2014

[4] [4]

The space for culture and cognition.Poetics, 38(2):185–204, 2010

Daina Cheyenne Harvey. The space for culture and cognition.Poetics, 38(2):185–204, 2010. 1

work page 2010

[5] [5]

Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004

Nora S Newcombe. Spatial cognition.Memory and Cognitive Processes, 3:113–163, 2004. 1, 3, 5

work page 2004

[6] [6]

Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022

Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Miola. Individual differences in navigation: an introductory overview.Prime archives in psychology, 2:3, 2022. 1

work page 2022

[7] [7]

Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018

Nora S Newcombe. Three kinds of spatial cognition.Stevens’ handbook of experimental psychology and cognitive neuroscience, 3:1–31, 2018. 1, 3, 5

work page 2018

[8] [8]

Three spaces of spatial cognition

Barbara Tversky, Julie Bauer Morrison, Nancy Franklin, and David J Bryant. Three spaces of spatial cognition. The Professional Geographer, 51(4):516–524, 1999. 1, 3, 5

work page 1999

[9] [9]

Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021

Michal Berkowitz, Andri Gerber, Christian M Thurn, Beatrix Emo, Christoph Hoelscher, and Elsbeth Stern. Spatial abilities for architecture: Cross sectional and longitudinal assessment with novel and existing spatial ability tests.Frontiers in psychology, 11:609363, 2021. 1

work page 2021

[10] [10]

Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013

Daniel R Montello and Martin Raubal. Functions and applications of spatial cognition.Handbook of Spatial Cognition, 2013. 1

work page 2013

[11] [11]

Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007

Ken J Sutton and Anthony P Williams. Spatial cognition and its implications for design.International Association of Societies of Design Research, Hong Kong, China, 2007. 1, 5

work page 2007

[12] [12]

Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

Lu Yue, Yue Fan, Shiwei Lian, Yu Zhao, Jiaxin Yu, Liang Xie, and Feitian Zhang. Spatial-vln: Zero-shot vision-and-language navigation with explicit spatial perception and exploration.arXiv preprint arXiv:2601.12766,

work page arXiv

[13] [13]

Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025

Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, and Qingquan Li. Exploring embodied multimodal large models: Development, datasets, and future directions.Information Fusion, 122:103198, 2025. 1

work page 2025

[14] [14]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 1

work page arXiv 2024

[15] [15]

Scenethesis: A language and vision agentic framework for 3d scene generation,

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation.arXiv preprint arXiv:2505.02836, 2025. 1

work page arXiv 2025

[16] [16]

Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025

Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, and Shuai Lu. Floorplan-deepseek (fpds): A multimodal approach to floorplan generation using vector-based next room prediction.arXiv preprint arXiv:2506.21562, 2025. 1

work page arXiv 2025

[17] [17]

Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025

Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, and Ping Tan. Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025. 1

work page arXiv 2025

[18] [18]

Spa- tiallm: Training large language models for structured in- door modeling.arXiv preprint arXiv:2506.07491, 2025

Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor modeling.arXiv preprint arXiv:2506.07491, 2025. 1

work page arXiv 2025

[19] [19]

Li, Adrien Bardes, Suzanne Petryk, Oscar Ma ˜nas, et al

Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024. 2

work page arXiv 2024

[20] [20]

Benchmark evaluations, applications, and challenges of large vision language models: A survey,

Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey.arXiv preprint arXiv:2501.02189, 1:1, 2025. 2

work page arXiv 2025

[21] [21]

Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks.arXiv preprint arXiv:2510.25760, 2025. 2

work page arXiv 2025

[22] [22]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods.arXiv preprint arXiv:2511.15722,

work page arXiv

[23] [23]

John Wiley & Sons, 2023

Francis DK Ching.Architecture: Form, space, and order. John Wiley & Sons, 2023. 2 10

work page 2023

[24] [24]

Space Syntax, 2007

Bill Hillier.Space is the machine: a configurational theory of architecture. Space Syntax, 2007. 2, 3, 5

work page 2007

[25] [25]

Cambridge university press, 1989

Bill Hillier and Julienne Hanson.The social logic of space. Cambridge university press, 1989. 2, 3, 5

work page 1989

[26] [26]

Mental representation of three-dimensional objects in visual problem solving and recognition

Lynn A Cooper. Mental representation of three-dimensional objects in visual problem solving and recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(6):1097, 1990. 2

work page 1990

[27] [27]

A visualization and orthographic drawing test using the macintosh computer

Gary R Bertoline and Daniel C Miller. A visualization and orthographic drawing test using the macintosh computer. Engineering Design Graphics Journal, 54(1):1–7, 1990. 2

work page 1990

[28] [28]

Measuring 3-d understanding on the web and in the laboratory

Ken Sutton, Andrew Heathcote, and Miles Bore. Measuring 3-d understanding on the web and in the laboratory. Behavior Research Methods, 39(4):926–939, 2007. 2

work page 2007

[29] [29]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Introducing claude opus 4.5

Anthropic. Introducing claude opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025. 3, 6

work page 2025

[32] [32]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. 3, 6

work page 2026

[33] [33]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. 3, 6

work page 2026

[34] [34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Gemini 3: Our most intelligent ai model that brings any idea to life

Google DeepMind. Gemini 3: Our most intelligent ai model that brings any idea to life. https://deepmind.google/models/gemini/, 2026. 3, 6

work page 2026

[36] [36]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. 3, 6

work page 2024

[38] [38]

Gemma: Our most capable open models

Google DeepMind. Gemma: Our most capable open models. https://deepmind.google/models/gemma/, 2026. 3, 6

work page 2026

[39] [39]

Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012

Marina Vasilyeva and Stella F Lourenco. Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012. 3, 4

work page 2012

[40] [40]

Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948. 3

work page 1948

[41] [41]

The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017

Russell A Epstein, Eva Zita Patai, Joshua B Julian, and Hugo J Spiers. The cognitive map in humans: spatial navigation and beyond.Nature neuroscience, 20(11):1504–1513, 2017. 3

work page 2017

[42] [42]

MIT press, 1964

Kevin Lynch.The image of the city. MIT press, 1964. 3

work page 1964

[43] [43]

Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976

Bill Hillier, Adrian Leaman, Paul Stansall, and Michael Bedford. Space syntax.Environment and Planning B: Planning and design, 3(2):147–185, 1976. 3, 5

work page 1976

[44] [44]

Space3d-bench: Spatial 3d question answering benchmark

Emilia Szyma´nska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. Space3d-bench: Spatial 3d question answering benchmark. InEuropean Conference on Computer Vision, pages 68–85. Springer,

work page

[45] [45]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16488–16498, 2024. 3

work page 2024

[46] [46]

Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space.arXiv preprint arXiv:2503.11094, 2025. 3

work page arXiv 2025

[47] [47]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024. 3

work page 2024

[48] [48]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022. 3 11

work page 2022

[49] [49]

Sqa3d: Situated question answering in 3d scenes,

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 3

work page arXiv 2022

[50] [50]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 3

work page 2025

[51] [51]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 3

work page 2025

[52] [52]

Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025

Lukas Petersson, Axel Backlund, Axel Wennstöm, Hanna Petersson, Callum Sharrock, and Arash Dabiri. Blueprint- bench: Comparing spatial intelligence of llms, agents and image models.arXiv preprint arXiv:2509.25229, 2025. 4

work page arXiv 2025

[53] [53]

Waffle: Multimodal floorplan understanding in the wild

Keren Ganon, Morris Alper, Rachel Mikulinsky, and Hadar Averbuch-Elor. Waffle: Multimodal floorplan understanding in the wild. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1488–1497. IEEE, 2025. 4

work page 2025

[54] [54]

Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

Aleksei Kondratenko, Mussie Birhane, Houssame E Hsain, and Guido Maciocci. Aecv-bench: Benchmarking multimodal models on architectural and engineering drawings understanding.arXiv preprint arXiv:2601.04819,

work page arXiv

[55] [55]

Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012

Luca Tommasi and Bruno Laeng. Psychology of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(6):565–580, 2012. 4, 5

work page 2012

[56] [56]

Spatial perception in virtual environments: Evaluating an architectural application

Daniel Henry and Tom Furness. Spatial perception in virtual environments: Evaluating an architectural application. InProceedings of IEEE Virtual Reality Annual International Symposium, pages 33–40. IEEE, 1993. 4

work page 1993

[57] [57]

Levels and structure of spatial knowledge

Barbara Tverksy. Levels and structure of spatial knowledge. InCognitive mapping, pages 24–43. Routledge, 2018. 5

work page 2018

[58] [58]

Using orientation information for qualitative spatial reasoning

Christian Freksa. Using orientation information for qualitative spatial reasoning. InTheories and Methods of Spatio-Temporal Reasoning in Geographic Space: International Conference GIS—From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning Pisa, Italy, September 21–23, 1992 Proceedings, pages 162–178. Springer, 2005. 5

work page 1992

[59] [59]

Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1

Steffen Werner, Bernd Krieg-Brückner, Hanspeter A Mallot, Karin Schweizer, and Christian Freksa. Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation1. InInformatik’97 Informatik als Innovationsmotor: 27. Jahrestagung der Gesellschaft für Informatik Aachen, 24.–26. September 1997, pages 41–50. Springer, 1997. 5

work page 1997

[60] [60]

From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012

Edgar Chan, Oliver Baumann, Mark A Bellgrove, and Jason B Mattingley. From objects to landmarks: the function of visual location information in spatial navigation.Frontiers in psychology, 3:304, 2012. 5

work page 2012

[61] [61]

Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000

Jeffrey M Zacks, JON Mires, Barbara Tversky, and Eliot Hazeltine. Mental spatial transformations of objects and perspective.Spatial Cognition and Computation, 2(4):315–332, 2000. 5

work page 2000

[62] [62]

A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002

Jeffrey M Zacks, John M Ollinger, Margaret A Sheridan, and Barbara Tversky. A parametric study of mental spatial transformations of bodies.Neuroimage, 16(4):857–872, 2002. 5

work page 2002

[63] [63]

Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

Esin Hasgül. Space as configuration: Patterns of space and culture.Proceedings of the ARCHTHEO, 2015:9th,

work page 2015

[64] [64]

Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020

Wiem Zerouati and Tahar Bellal. Evaluating the impact of mass housings’ in-between spaces’ spatial configuration on users’ social interaction.Frontiers of Architectural Research, 9(1):34–53, 2020. 5

work page 2020

[65] [65]

https://www.archdaily.com/

archdaily. https://www.archdaily.com/. 5

work page

[66] [66]

https://www.gooood.cn/

gooood. https://www.gooood.cn/. 5

work page

[67] [67]

https://www.archiposition.com/

archiposition. https://www.archiposition.com/. 5

work page

[68] [68]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 6

work page 2023

[69] [69]

Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002

Ruth A Childs and Andrew P Jaciw. Matrix sampling of items in large-scale assessments.Practical Assessment, Research, and Evaluation, 8(1), 2002. 7

work page 2002

[70] [70]

Adaptive mixtures of local experts

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991. 8 12

work page 1991

[71] [71]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv

[72] [72]

A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667,

William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning.arXiv preprint arXiv:2209.01667, 2022. 8

work page arXiv 2022

[73] [73]

Infinite photorealistic worlds using procedural generation

Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, et al. Infinite photorealistic worlds using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12630–12641, 2023. 9

work page 2023

[74] [74]

world model

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024. 9 13 A Detai...

work page 2024