MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.
arXiv preprint arXiv:2508.13141 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
citing papers explorer
-
Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.