MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.
Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.AI (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
SCBench shows frontier models lose accuracy as spatial tasks require more global consistency, with gains from extra tokens saturating quickly.
citing papers explorer
- Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
  MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.
- Spatial Competence Benchmark
  SCBench shows frontier models lose accuracy as spatial tasks require more global consistency, with gains from extra tokens saturating quickly.