
arxiv: 2410.06468 · v2 · pith:YO3YJBVEnew · submitted 2024-10-09 · cs.AI · cs.CV · cs.LG

Does Spatial Cognition Emerge in Frontier Models?

keywords: models, spatial, benchmark, cognition, frontier, cognitive, evaluates, large

Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition. Code and data are available: https://github.com/apple/ml-space-benchmark



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

    physics.soc-ph 2026-06 unverdicted novelty 7.0

    A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.

  2. SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.

  3. CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

    cs.CV 2026-06 unverdicted novelty 6.0

    CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.

  4. Spatio-Temporal Grounding of Large Language Models from Perception Streams

    cs.RO 2026-04 unverdicted novelty 6.0

    FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...

  5. Artificial Phantasia: Emergent Mental Imagery in Large Language Models

    cs.AI 2025-09 unverdicted novelty 6.0

    LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

  6. Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.

  7. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    cs.CV 2024-12 unverdicted novelty 6.0

    MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distan...

  8. AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

    cs.AI 2026-06 unverdicted novelty 5.0

    AlloSpatial adds structured allocentric priors and a harness for tool-use and arbitration to improve spatial reasoning in foundation models, with 5-18% gains on VSI-Bench and MindCube in training-free settings and fur...