pith. machine review for the scientific record.

arxiv: 2604.22409 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · embodied environments · vision-language models · spatial memory · benchmark · dynamic belief update · perception integration

The pith

Vision-language models maintain spatial beliefs only when given text histories and collapse without them in changing environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpaMEM, a benchmark built from millions of images across 25,000 action sequences in procedurally generated houses, to test how multimodal models update spatial beliefs when objects are spawned, placed, or removed. It defines a three-level progression: basic perception from single views, temporal reasoning supplied with oracle text, and full end-to-end maintenance from raw visual streams alone. Benchmark results on representative open-source models show consistent failure at coordinate grounding and a steep drop when text scaffolding is withheld, revealing that models can follow symbolic state updates but cannot sustain visual memory across long horizons. If this diagnosis holds, progress in embodied AI will require new mechanisms that keep spatial relations coherent without external bookkeeping. A reader should care because real robots and agents must track changing layouts from egocentric camera streams, exactly the setting the benchmark isolates.

Core claim

SpaMEM formalizes embodied spatial reasoning as a three-level hierarchy over action-conditioned transformations. Level 1 measures atomic perception, Level 2 adds oracle textual histories to remove perceptual noise, and Level 3 demands belief maintenance from raw RGB, depth, and segmentation streams under identical task dimensions. Evaluations across open-source VLM families identify a stacked bottleneck: coordinate-consistent grounding remains difficult, and performance collapses from Level 2 to Level 3, indicating that models succeed via text-based bookkeeping but cannot sustain robust visual memory.
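
To make the level contrast concrete, the following is a minimal sketch, not the authors' code, of how an action-conditioned event log induces the ground-truth scene state a model must track. The event fields, object names, and the oracle_history helper are hypothetical, modeled only on the spawn-place-remove vocabulary the paper describes.

    from dataclasses import dataclass, field

    @dataclass
    class SceneState:
        # object name -> (receptacle, (x, y, z)) for every object currently in the scene
        objects: dict = field(default_factory=dict)

        def apply(self, event):
            kind, obj = event["action"], event["object"]
            if kind in ("spawn", "place"):
                self.objects[obj] = (event["receptacle"], event["position"])
            elif kind == "remove":
                self.objects.pop(obj, None)
            return self

    def oracle_history(events):
        # Level-2 input: a symbolic text log that spares the model any perception
        return [f"{e['action']} {e['object']} -> {e.get('receptacle', 'gone')}" for e in events]

    events = [
        {"action": "spawn", "object": "mug", "receptacle": "DiningTable", "position": (1.2, 0.8, 0.9)},
        {"action": "place", "object": "mug", "receptacle": "Drawer_3", "position": (0.4, 0.3, 0.5)},
        {"action": "remove", "object": "mug"},
    ]

    state = SceneState()
    for e in events:
        state.apply(e)                # ground-truth belief the model is asked to track

    print(oracle_history(events))     # Level 2 sees this text; Level 3 sees only raw visual streams
    print(state.objects)              # {} -- the mug was removed

Level 2 effectively hands the model the output of oracle_history, so the task reduces to replaying a symbolic log; Level 3 withholds that log, which is why the collapse between the two levels is read as a failure of visual memory rather than of symbolic reasoning.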

What carries the argument

The three-level task hierarchy (atomic perception, oracle-text temporal reasoning, and raw-visual end-to-end belief maintenance) that isolates perception-memory integration across spawn-place-remove action sequences.

Load-bearing premise

The procedurally generated houses, action sequences, and task levels isolate genuine spatial belief evolution without introducing simulation artifacts that would not appear in physical settings.

What would settle it

Deploy the identical Level-3 tasks on a physical robot that performs the same spawn-place-remove actions in a real room and measure whether spatial reconstruction accuracy matches or exceeds the simulated Level-3 scores.

Figures

Figures reproduced from arXiv: 2604.22409 by Chih-Ting Liao, Chunlei Meng, Tianyang Wang, Weilin Zhou, Xin Cao, Xi Xiao, Xu Zheng, Yitong Qiao, Zhangquan Chen.

Figure 1: Overview of the SpaMEM benchmark. SpaMEM evaluates spatial reasoning under dynamic scene evolution. Scenes evolve through action-conditioned transformations (spawn, place, remove) over long temporal horizons. The benchmark organizes evaluation into three hierarchical levels: L1 atomic spatial perception from single observations, L2 symbolic temporal reasoning with textual state descriptions, and L3 full…
Figure 2: SpaMEM evaluation framework (Update and Answer Modes).
Figure 3: High-level diagnostic syntheses under SpaMEM.
Figure 4: Fine-grained analysis of Semantic Recognition Performance (T1_F1) conditioned on receptacle types. The results highlight a persistent performance gap between salient open surfaces and occluding containers.
Figure 5: Object-wise Semantic Recognition Performance (F1) for InternVL and Qwen families. Both families show consistent improvement in grounding mid-sized objects across generations.
Figure 6: Comparison across different VLM architectures and SOTA leaders. The results highlight the persistent resolution bottleneck for thin objects like pencils and forks across all leading models.
Figure 7: Temporal stability analysis of the InternVL family. While SOR-M (Perception) remains consistent due to textual grounding, CSR (Integration) exhibits a sharp decay as the event sequence length increases. This "memory entropy" phenomenon suggests that as the history grows, the cumulative logic required to maintain a consistent world model exceeds the model's coherent reasoning capacity. …
Figure 8: Correlation analysis between Perception (F1) and Integration (CSR). The results show that while perception anchors addition events, removal events are entirely decoupled from perceptual grounding. – Background Bias: large interactive objects like DiningTable suffer from high failure rates; models often misclassify these as static environmental geometry rather than dynamic interactable entities. – Resoluti…
Figure 9: Diagnostic analysis of text-aided episodic memory (Level 2).
Figure 10: Temporal stability decay in visual-only episodic memory.
Figure 11: Causal analysis of the perception-integration link.
Figure 12: Fragility and Grounding Death in Level 3.
Figure 13: Cross-level diagnostic comparison demonstrating symbolic depen…
Figure 14: SpaMEM Dataset Statistics. (Left) Action distribution showing the balance between scene population and manipulation. (Right) Top-8 receptacle interaction frequency, highlighting the dominance of occluding containers like Drawers. (Bottom) Top-15 object frequency distribution across the 103 unique categories, demonstrating semantic and scale diversity.
Original abstract

Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces SpaMEM, a large-scale benchmark for dynamic spatial reasoning in embodied environments. It consists of 10,601,392 images across RGB, depth, instance, and semantic modalities collected from 1,000 procedurally generated houses and over 25,000 interaction sequences using spawn, place, and remove actions. The benchmark defines a three-level task hierarchy with 15 diagnostic tasks: Level 1 for atomic spatial perception from single observations, Level 2 for temporal reasoning using oracle textual state histories, and Level 3 for end-to-end belief maintenance from raw visual streams. Evaluations of open-source VLM families show consistent performance collapse from Level 2 to Level 3, which the authors attribute to a symbolic scaffolding dependency where models rely on text-based bookkeeping but fail to sustain visual memory.

Significance. If the central claims hold after addressing potential confounds, the work is significant as a granular diagnostic benchmark that quantifies limitations in current VLMs for long-horizon spatial coherence and belief revision under environmental change. The dataset scale (over 10 million images from 1,000 houses) and structured hierarchy provide a reproducible standard that can drive progress on state representation, belief revision, and episodic integration mechanisms. This is a clear strength for the embodied AI and multimodal reasoning community.

major comments (3)
  1. [Abstract] The abstract's claim that the sharp collapse from Level 2 to Level 3 'exposes a pronounced symbolic scaffolding dependency' is load-bearing for the main conclusion. However, this attribution assumes the only relevant difference between oracle text and raw visuals is perceptual noise, without controls or analysis to rule out that procedural simulation properties (perfectly consistent lighting, discrete object placements, absence of naturalistic sensor noise) systematically increase visual state-tracking difficulty independently of memory mechanisms.
  2. [§4 (Benchmarking Results)] The performance claims of a 'consistent stacked bottleneck' and 'hard ceiling' on coordinate-consistent grounding (abstract and §4) are not accompanied by error bars, statistical significance tests, or variance analysis across the 1,000 houses and 25,000+ sequences. This weakens support for the cross-level and cross-model generalizations.
  3. [§3 (Benchmark Construction)] The three-level hierarchy is presented as cleanly isolating spatial belief evolution, but §3 does not provide explicit verification that the fixed action set and procedural house generation do not introduce task-specific biases or simulation artifacts that affect Level 3 more than Level 2 beyond the intended perceptual noise factor.
minor comments (3)
  1. [Abstract] The abstract reports '25,000+' sequences but gives an exact image count; ensure numerical consistency and precise reporting of sequence counts in the main text and tables.
  2. A summary table listing all 15 diagnostic tasks with their level, input type (text vs. visual), and evaluation metric would improve clarity of the task hierarchy.
  3. [Figures] Figure captions describing the modalities and example action sequences could be expanded for readers unfamiliar with the simulation setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract] The abstract's claim that the sharp collapse from Level 2 to Level 3 'exposes a pronounced symbolic scaffolding dependency' is load-bearing for the main conclusion. However, this attribution assumes the only relevant difference between oracle text and raw visuals is perceptual noise, without controls or analysis to rule out that procedural simulation properties (perfectly consistent lighting, discrete object placements, absence of naturalistic sensor noise) systematically increase visual state-tracking difficulty independently of memory mechanisms.

    Authors: We appreciate this observation on the attribution. The benchmark design holds the environment, actions, house generation, and task definitions fixed across levels, with the sole controlled difference being the input (oracle text histories in Level 2 versus raw visual streams in Level 3). This isolates the visual memory component. The idealized simulation is a deliberate choice to diagnose reasoning failures without sensor confounds, consistent with standard embodied AI benchmarks. In revision we will expand the abstract and add a dedicated paragraph in §4 (and a limitations subsection) explicitly discussing this design rationale and noting that future extensions could incorporate naturalistic noise. revision: partial

  2. Referee: [§4 (Benchmarking Results)] The performance claims of a 'consistent stacked bottleneck' and 'hard ceiling' on coordinate-consistent grounding (abstract and §4) are not accompanied by error bars, statistical significance tests, or variance analysis across the 1,000 houses and 25,000+ sequences. This weakens support for the cross-level and cross-model generalizations.

    Authors: We agree that quantitative rigor requires statistical support. In the revised manuscript we will augment all tables and figures in §4 with error bars (standard deviations computed across the 1,000 houses and 25,000+ sequences) and include paired t-test results (with p-values) comparing performance across levels and models to substantiate the reported bottlenecks and generalizations; a minimal sketch of this analysis protocol appears after these responses. revision: yes

  3. Referee: [§3 (Benchmark Construction)] The three-level hierarchy is presented as cleanly isolating spatial belief evolution, but §3 does not provide explicit verification that the fixed action set and procedural house generation do not introduce task-specific biases or simulation artifacts that affect Level 3 more than Level 2 beyond the intended perceptual noise factor.

    Authors: We thank the referee for this point on verification. Section 3 already specifies that the action vocabulary, procedural house generator, and 15 task dimensions are identical across levels. To make this explicit, we will insert a new verification subsection in §3 that (a) confirms instance-level matching of tasks between levels, (b) reports per-house variance statistics demonstrating consistency, and (c) includes qualitative examples illustrating that any simulation artifacts are shared and do not differentially impact Level 3. These additions will be supported by supplementary per-house breakdowns. revision: partial
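
As a companion to responses 2 and 3, the following is a hedged sketch of the promised per-house analysis: per-house score aggregation, standard deviations, and a paired t-test between Level-2 and Level-3 scores for the same houses. The array names and effect sizes are placeholders, not numbers from the paper.

    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(0)
    n_houses = 1000  # SpaMEM's procedurally generated houses

    # Placeholder per-house mean scores for one model; in the revision these would
    # come from the actual SpaMEM evaluation logs, not from a random number generator.
    level2_scores = rng.normal(loc=0.6, scale=0.08, size=n_houses)
    level3_scores = rng.normal(loc=0.3, scale=0.10, size=n_houses)

    print(f"Level 2: {level2_scores.mean():.3f} +/- {level2_scores.std(ddof=1):.3f}")
    print(f"Level 3: {level3_scores.mean():.3f} +/- {level3_scores.std(ddof=1):.3f}")

    # Paired test: each house contributes a Level-2 and a Level-3 score for the same
    # underlying action sequences, so the comparison is made within houses.
    t_stat, p_value = ttest_rel(level2_scores, level3_scores)
    print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.2e}")

In the committed revision the placeholder arrays would be replaced by per-house means computed from the SpaMEM evaluation logs, with the same aggregation reported in every table of §4.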

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct evaluations

Full rationale

The paper introduces a new benchmark (SpaMEM) with procedurally generated data, defines a three-level task hierarchy, and reports model performance on direct evaluations across modalities and horizons. No derivations, equations, fitted parameters, or predictions appear; claims about bottlenecks and scaffolding dependency are interpretive summaries of observed results rather than reductions to self-defined inputs or self-citations. The work is self-contained as an empirical diagnostic standard without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the defined task hierarchy and procedural environments measure intended spatial belief capabilities without major confounds.

axioms (1)
  • domain assumption Embodied spatial reasoning can be formalized as a three-level hierarchy that isolates atomic perception, temporal reasoning with oracle textual histories, and end-to-end belief maintenance from raw visual streams.
    This formalization is presented directly in the abstract as the basis for the 15 diagnostic tasks.

pith-pipeline@v0.9.0 · 5608 in / 1245 out tokens · 71138 ms · 2026-05-08T12:29:34.022871+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 38 canonical work pages · 8 internal anchors

  1. [1] Antol, S., et al.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)
  2. [2] Cai, Z., Du, Y., Wang, C., Kong, Y.: Vision to geometry: 3D spatial memory for sequential embodied MLLM reasoning and exploration. arXiv preprint arXiv:2512.02458 (2025), https://arxiv.org/abs/2512.02458
  3. [3] Chaplot, D.S., Lee, L., Salakhutdinov, R., Parikh, D., Batra, D.: Embodied multimodal multitask learning. arXiv preprint arXiv:1902.01385 (2020), https://arxiv.org/abs/1902.01385 (IJCAI 2020)
  4. [4] Chen, Z., Wang, J., Cao, Z., Liu, M., et al.: How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821 (2024), https://arxiv.org/abs/2404.16821
  5. [5] Chen, Z., Wang, J., Cao, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024), https://arxiv.org/abs/2412.05271
  6. [6] Chen, Z., Wang, J., Cao, Z., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025), https://arxiv.org/abs/2504.10479
  7. [7] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  8. [8] Das, A., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Neural modular control for embodied question answering. arXiv preprint arXiv:1810.11181 (2018), https://arxiv.org/abs/1810.11181
  9. [9] Deitke, M., et al.: RoboTHOR: An open simulation-to-real embodied AI platform. arXiv preprint arXiv:2004.06799 (2020), https://arxiv.org/abs/2004.06799
  10. [10] Deitke, M., et al.: ProcTHOR: Large-scale embodied AI using procedural generation. In: Advances in Neural Information Processing Systems (NeurIPS) (2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/27c546ab1e4f1d7d638e6a8dfbad9a07-Abstract-Conference.html
  11. [11] Gao, S., Yao, J., Wen, H., Guo, Y., Liu, Z., Huang, H.: HomeSafeBench: A benchmark for embodied vision-language models in free-exploration home safety inspection. arXiv preprint arXiv:2509.23690 (2025), https://arxiv.org/abs/2509.23690
  12. [12] Gireesh, N., Agrawal, A., Datta, A., Banerjee, S., Sridharan, M., Bhowmick, B., Krishna, M.: Sequence-agnostic multi-object navigation. arXiv preprint arXiv:2305.06178 (2023), https://arxiv.org/abs/2305.06178
  13. [13] Kolve, E., et al.: AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474 (2017), https://arxiv.org/abs/1712.05474
  14. [14] Lei, J., et al.: TVQA: Localized, compositional video question answering. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2018), https://arxiv.org/abs/1809.01696
  15. [15] Li, B., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024), https://arxiv.org/abs/2408.03326
  16. [16] Li, C., et al.: iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272 (2021), https://arxiv.org/abs/2108.03272
  17. [17] Li, Y., Cao, Z., Liang, A., Liang, B., Chen, L., Zhao, H., Feng, C.: Egocentric prediction of action target in 3D. arXiv preprint arXiv:2203.13116 (2022), https://arxiv.org/abs/2203.13116
  18. [18] Li, Y., et al.: ViewSpatial-Bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500 (2025), https://arxiv.org/abs/2505.21500
  19. [19] Liu, H., Li, C., Li, Y., et al.: LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild. https://llava-vl.github.io/blog/2024-01-30-llava-next/ (2024)
  20. [20] Liu, J., et al.: Explore with long-term memory: Brain-inspired multi-agent memory for embodied exploration. arXiv preprint arXiv:2410.15589 (2024), https://arxiv.org/abs/2410.15589
  21. [21] Liu, N.F., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (2024)
  22. [22] Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., Yi, L.: HOI4D: A 4D egocentric dataset for category-level human-object interaction. arXiv preprint arXiv:2203.01577 (2022), https://arxiv.org/abs/2203.01577
  23. [23] Liu, Z., Chen, Z., Pan, L., Liu, Z.: OnlineSI: Taming large language model for online 3D understanding and grounding. arXiv preprint arXiv:2601.16538 (2026), https://arxiv.org/abs/2601.16538
  24. [24] Mai, J., Hamdi, A., Giancola, S., Zhao, C., Ghanem, B.: EgoLoc: Revisiting 3D object localization from egocentric videos with visual queries. arXiv preprint arXiv:2212.06969 (2022), https://arxiv.org/abs/2212.06969
  25. [25] Ravi, N., et al.: Out of sight, not out of context? Egocentric spatial reasoning in VLMs across disjoint frames. arXiv preprint arXiv:2505.24257 (2025), https://arxiv.org/abs/2505.24257
  26. [26] Raychaudhuri, S., Campari, T., Jain, U., Savva, M., Chang, A.X.: Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696 (2023), https://arxiv.org/abs/2304.03696
  27. [27] Savva, M., et al.: Habitat: A platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019), https://arxiv.org/abs/1904.01201
  28. [28] Shen, B., et al.: iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924 (2020), https://arxiv.org/abs/2012.02924
  29. [29] Shukla, A., Tao, S., Su, H.: ManiSkill-HAB: A benchmark for low-level manipulation in home rearrangement tasks. arXiv preprint arXiv:2412.13211 (2024), https://arxiv.org/abs/2412.13211
  30. [30] Team, Q.: Qwen2.5-VL technical report. https://qwenlm.github.io/blog/qwen2.5-vl/ (2025)
  31. [31] Team, Q.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025), https://arxiv.org/abs/2511.21631
  32. [32] Thompson, J., Garcia-Lopez, E., Bisk, Y.: REM: Evaluating LLM embodied spatial reasoning through multi-frame trajectories. arXiv preprint arXiv:2512.00736 (2025), https://arxiv.org/abs/2512.00736
  33. [33] Trabucco, B., Sigurdsson, G., Piramuthu, R., Sukhatme, G.S., Salakhutdinov, R.: A simple approach for visual rearrangement: 3D mapping and semantic search. arXiv preprint arXiv:2206.13396 (2022), https://arxiv.org/abs/2206.13396
  34. [34] Wang, H., Xue, Q., Liu, W., Gao, W.: MosaicThinker: On-device visual spatial reasoning for embodied AI via iterative construction of space representation. arXiv preprint arXiv:2602.07082 (2026), https://arxiv.org/abs/2602.07082
  35. [35] Wang, P., Bai, S., Tan, S., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024), https://arxiv.org/abs/2409.12191
  36. [36] Wani, S., Patel, S., Jain, U., Chang, A.X., Savva, M.: MultiON: Benchmarking semantic map memory using multi-object navigation. In: Advances in Neural Information Processing Systems (NeurIPS) (2020), https://arxiv.org/abs/2012.03912
  37. [37] Weihs, L., Deitke, M., Kembhavi, A., Mottaghi, R.: Visual room rearrangement. arXiv preprint arXiv:2103.16544 (2021), https://arxiv.org/abs/2103.16544
  38. [38] Wu, H., Huang, X., Chen, Y., Zhang, Y., Wang, Y., Xie, W.: SpatialScore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012 (2025), https://arxiv.org/abs/2505.17012
  39. [39] Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA (1997)
  40. [40] Ye, G., et al.: Thinking in space: A visual-spatial intelligence benchmark for multimodal LLMs. arXiv preprint arXiv:2411.01859 (2024), https://arxiv.org/abs/2411.01859
  41. [41] Yu, L., Zhu, Y., Yang, J., Kiros, J., Batra, D., Parikh, D., Berg, A.C., et al.: Multi-target embodied question answering. arXiv preprint arXiv:1904.04686 (2019), https://arxiv.org/abs/1904.04686
  42. [42] Zhang, P., et al.: Theory of space: Can foundation models construct spatial beliefs through active exploration? In: International Conference on Learning Representations (ICLR) (2026), https://arxiv.org/abs/2602.07055
  43. [43] Zhang, W., Zhou, Z., Zheng, Z., Gao, C., Cui, J., Li, Y., Chen, X., Zhang, X.P.: Open3DVQA: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space. arXiv preprint arXiv:2503.11094 (2025), https://arxiv.org/abs/2503.11094
  44. [44] Zhang, Y., Zhang, Q., Ju, X., Liu, Z., Mao, J., Sun, J., Wu, J., Gao, S., Cai, S., Qin, Z., Liang, L., Wang, J., Duan, Y., Cao, J., Xu, R., Tang, J.: EmbodiedVSR: Dynamic scene graph-guided chain-of-thought reasoning for visual spatial tasks. arXiv preprint arXiv:2503.11089 (2025), https://arxiv.org/abs/2503.11089
  45. [45] Zhao, Y., et al.: Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722 (2025), https://arxiv.org/abs/2511.15722
  46. [46] Zhu, Y., et al.: 3DLLM-Mem: Long-term spatial-temporal memory for embodied 3D large language model. In: Advances in Neural Information Processing Systems (NeurIPS) (2025)