pith. machine review for the scientific record.

arxiv: 2511.21471 · v4 · submitted 2025-11-26 · 💻 cs.AI

Recognition: unknown

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Gege Qi, Jianing Li, Peiran Xu, Sudong Wang, Yao Zhu, Yunjian Zhang

Authors on Pith: no claims yet
classification 💻 cs.AI
keywords: spatial cognition, MLLMs, models, hierarchical, levels, multimodal, across
Original abstract

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric that fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels, from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level, capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments across a broad range of MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.
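The abstract does not specify how the capability-oriented metric aggregates the 15 tasks into one score. Below is a minimal sketch of one plausible reading, assuming each of the five levels averages its member tasks and levels are weighted by cognitive complexity; the level names, task names, and weights are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a hierarchical capability score. The five-level
# taxonomy comes from the abstract; the specific level names, the 15 task
# names, and the weights are invented for illustration.
from statistics import mean

# Assumed mapping of levels to tasks (3 + 3 + 2 + 3 + 4 = 15 tasks).
LEVELS = {
    "observation": ["object_recognition", "attribute_grounding", "counting"],
    "relation":    ["relative_position", "distance_estimation", "viewpoint"],
    "mapping":     ["layout_reconstruction", "landmark_localization"],
    "reasoning":   ["symbolic_relations", "causal_inference", "occlusion"],
    "planning":    ["route_planning", "goal_decomposition", "multi_step",
                    "replanning"],
}

# Weights grow with cognitive complexity, so high-level failures cost
# more than low-level ones. Purely an assumption.
WEIGHTS = {"observation": 1, "relation": 2, "mapping": 3,
           "reasoning": 4, "planning": 5}

def capability_score(task_acc: dict[str, float]) -> float:
    """Weighted mean of per-level accuracies (each level averages its tasks)."""
    level_scores = {
        lvl: mean(task_acc.get(t, 0.0) for t in tasks)
        for lvl, tasks in LEVELS.items()
    }
    total_w = sum(WEIGHTS.values())
    return sum(WEIGHTS[lvl] * s for lvl, s in level_scores.items()) / total_w

# Example: a model strong at perception but scoring zero on higher levels.
scores = {t: 0.9 for lvl in ("observation", "relation") for t in LEVELS[lvl]}
print(round(capability_score(scores), 3))  # (1*0.9 + 2*0.9) / 15 = 0.18
```

Under this kind of weighting, the stratification the authors report (strong perceptual grounding but weak reasoning and planning) would surface directly: high scores on the early levels are discounted by the heavier weights on the levels where models fail.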

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D Primitives are a Spatial Language for VLMs

    cs.CV 2026-05 conditional novelty 7.0

    3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.

  2. ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

    cs.CV 2026-04 unverdicted novelty 7.0

    ARGOS is the first benchmark reformulating multi-camera person search as an agentic interactive reasoning task grounded in a spatio-temporal topology graph, with 2691 tasks across three tracks where current LLMs achie...

  3. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows that explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering a 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  4. Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

    cs.CV 2026-05 unverdicted novelty 6.0

    Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...

  5. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  6. SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    SocialGrid benchmark shows that even top LLMs score below 60% on embodied planning and task completion, with deception detection near random chance regardless of model scale.