SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Chengnuo Sun; Chenjia Bai; Hua Yang; Pingrui Lai; Xiaoheng Deng; Xinhai Li; Xuelong Li; Yucheng Deng

arxiv: 2606.08992 · v1 · pith:RU2TUC6Qnew · submitted 2026-06-08 · 💻 cs.RO · cs.AI· cs.CV

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Yucheng Deng , Pingrui Lai , Xinhai Li , Chenjia Bai , Xiaoheng Deng , Chengnuo Sun , Xuelong Li , Hua Yang This is my paper

Pith reviewed 2026-06-27 16:47 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CV

keywords vision-and-language navigationzero-shot navigationspatial cognitive memoryembodied reasoningcontinuous environmentsobject-goal navigationtask-guided reasoning

0 comments

The pith

SpaceVLN builds an online spatial cognitive memory from waypoints and landmarks to enable zero-shot navigation without task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpaceVLN, an agent that organizes navigation around a stagewise closed-loop process of planning and execution. It progressively abstracts explored regions into Spatial Waypoints while maintaining subtask-grounded landmark evidence to form a hierarchical Spatial Cognitive Memory. This memory supports Spatial-CoT, which combines task-progress reasoning with spatial perception, analysis, and prediction. The result is Task-Guided Spatial Reasoning that works for both vision-and-language navigation and object-goal navigation in a unified zero-shot setting. The approach reports state-of-the-art zero-shot results on several continuous-environment benchmarks plus real-robot validation.

Core claim

SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation under a unified zero-shot setting without task-specific policy training.

What carries the argument

The Spatial Cognitive Memory, formed by abstracting explored regions into verifiable Spatial Waypoints and subtask-grounded landmark evidence, which enables progress localization and spatial-relation understanding through Task-Guided Spatial Reasoning.

If this is right

The same stage interface handles both vision-and-language navigation and object-goal navigation without separate training.
The memory structure replaces linear history-based reasoning with hierarchical spatial relations.
State-of-the-art zero-shot results hold across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON.
Real-robot deployment confirms the framework transfers beyond simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The waypoint-and-landmark memory could support longer-horizon tasks where agents must return to earlier locations.
Extending the same abstraction process to dynamic obstacles might improve robustness in changing scenes.
The stagewise interface suggests a route to combine navigation with manipulation by treating object interactions as additional subtasks.

Load-bearing premise

Foundation models can reliably turn explored regions into verifiable Spatial Waypoints and keep enough subtask-grounded landmark evidence to support spatial reasoning.

What would settle it

An environment where the agent cannot correctly update progress or spatial relations despite having built the waypoint-and-landmark memory structure, such as repeated failures to distinguish similar landmarks across subtasks.

Figures

Figures reproduced from arXiv: 2606.08992 by Chengnuo Sun, Chenjia Bai, Hua Yang, Pingrui Lai, Xiaoheng Deng, Xinhai Li, Xuelong Li, Yucheng Deng.

**Figure 1.** Figure 1: Comparison between prior navigation strategies and SpaceVLN. (a) Prior strategies rely on task-specific training, direct VLM action prediction, or pre-built maps, limiting generalization and weakening spatial progress grounding. (b) SpaceVLN is a zero-shot, training-free navigation agent. It leverages online Spatial Cognitive Memory and Task-Guided Spatial Reasoning to strengthen the agent’s spatial cogn… view at source ↗

**Figure 2.** Figure 2: Framework of SpaceVLN. SpaceVLN realizes navigation through a stagewise closedloop framework, augmented by Spatial Cognitive Memory and Task-Guided Spatial Reasoning. The planner infers progress from 12-direction look-around and spatial memory, then generates a verifiable space–landmark stage as an executable subtask. The executor follows this subtask using FPV views and landmark memory, while closed-loo… view at source ↗

**Figure 3.** Figure 3: Spatial Cognitive Memory. (a) Hierarchical Spatial Memory abstracts views and traversed paths into a Spatial Waypoint graph and executed Spatial Waypoint chain, then converts them into enhanced egocentric spatial representation for planner. (b) Local Landmark Memory maintains a subtask-specific landmark pool, providing top-K cues for execution and next-stage planning. environments queryable for language r… view at source ↗

**Figure 4.** Figure 4: Task-Guided Spatial Reasoning Example. Given an instruction and a 12-view lookaround, the planner follows Spatial-CoT for environment analysis, localization, and planning, then outputs a structured stage-level subtask. The executor combines the enhanced FPV view, landmark detection, and obstacle cues to complete the subtask and return feedback for next-stage planning. 3.3 Task-Guided Spatial Reasoning Rec… view at source ↗

**Figure 5.** Figure 5: VLM Context Architecture of SpaceVLN. (a) The planner context is assembled from the task goal, 12-view panoramic observations, the enhanced egocentric spatial representation, and previous-stage feedback, providing the input to planner-side Spatial-CoT for progress localization and next-stage generation. (b) The executor context combines the current subtask, FPV observation, Local Landmark Memory, and obsta… view at source ↗

**Figure 6.** Figure 6: Real-robot platform. TX-Q1 mobile base with RealSense D435i RGB-D sensing, Livox Mid-360 LiDAR pose input, and Jetson AGX Orin onboard computation. Hardware Details. The real-robot platform is a TX-Q1 differential-drive mobile base developed by Linghou Robotics. It is equipped with an Intel RealSense D435i RGB-D camera for visual observation, a Livox Mid-360 LiDAR for pose estimation, and an NVIDIA Je… view at source ↗

**Figure 7.** Figure 7: Successful simulator episode. The route starts from a hallway, passes around the kitchen area, and enters the living room. The planner row shows surrounding views, semantic maps, stagewise closed-loop planning decisions, and the top-down global trajectory with evaluation markers. The executor row shows one stage subtask with FPV observations, Local Landmark Memory, primitive action outputs, and the top-d… view at source ↗

**Figure 8.** Figure 8: Real-robot deployment episode. This real-robot case shows the robot navigating from the hallway into the exhibition room and stopping near the cabinet. The planner row contains surrounding views, semantic maps, and third-person views; the executor panels contain FPV observations, semantic maps, and executed actions for entering the exhibition room and approaching the cabinet. D.2 Real-World Episode Visual… view at source ↗

**Figure 9.** Figure 9: Failure cases. Four representative failure modes are shown: object-detection misrecognition, obstacle-avoidance difficulty, ambiguous-instruction misinterpretation, and progress-tracking failure. Each panel includes the task instruction, visual observations, map information, the failure reason, and the corresponding agent reasoning excerpt for localizing the source of the error. These cases are consistent… view at source ↗

read the original abstract

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpaceVLN adds a stagewise spatial memory structure to zero-shot navigation but its claims rest on unproven reliability of foundation-model abstraction.

read the letter

The main thing here is that SpaceVLN organizes zero-shot navigation around a stagewise closed-loop framework that builds hierarchical Spatial Cognitive Memory from waypoints and subtask landmarks, then applies Spatial-CoT for task-guided reasoning.

This setup is new in its explicit focus on verifiable space-landmark stages and the unified interface that covers both VLN and OVON without task-specific training. The description of how the memory enables progress localization and spatial-relation understanding is clear, and the real-robot deployment note gives it some practical grounding beyond simulation.

The soft spots are in the execution details. The approach assumes foundation models can reliably turn explored regions into accurate waypoints and keep landmark evidence consistent across long trajectories without hallucinating relations. That step is known to be fragile in current VLMs, and if it fails the memory cannot support the claimed localization or performance. The abstract states SOTA zero-shot results on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON but supplies no numbers, baselines, ablations, or error analysis, so it is not possible to tell whether the new memory components drive any gains.

This paper is for people working on memory mechanisms in embodied agents. A reader interested in zero-shot robotics methods would find the architecture description useful. It deserves a serious referee because the problem is concrete and the proposal is specific enough to test, even though the evidence needs substantial strengthening in the full paper.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SpaceVLN, a zero-shot vision-and-language navigation agent for continuous environments. It introduces a stagewise closed-loop framework organized around verifiable space-landmark stages, where the agent abstracts explored regions into Spatial Waypoints and maintains subtask-grounded landmark evidence to form a hierarchical Spatial Cognitive Memory. This memory supports Spatial-CoT for task-progress reasoning, spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning. The unified interface addresses both VLN and OVON without task-specific policy training. The paper claims state-of-the-art zero-shot performance on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, with real-robot deployment validation.

Significance. If the empirical results hold and the spatial abstraction step proves reliable, the work would represent a meaningful advance in zero-shot embodied navigation. The introduction of an online hierarchical Spatial Cognitive Memory that unifies VLN and OVON under a single stagewise interface, without any task-specific training, addresses a clear limitation of local-cue and linear-history methods. The real-robot validation is a positive indicator of practical applicability.

major comments (2)

[Abstract] Abstract: The central claim of state-of-the-art zero-shot performance on four benchmarks plus real-robot validation is asserted without any reported metrics, baselines, error bars, ablation studies, or statistical details. This absence is load-bearing because the soundness of the entire contribution rests on these unspecified empirical results.
[Method (stagewise closed-loop framework)] Stagewise closed-loop framework (method description): The SOTA claim depends on foundation models reliably abstracting explored regions into verifiable Spatial Waypoints and maintaining consistent subtask-grounded landmark evidence for progress localization without hallucination or inconsistency over long trajectories. No robustness analysis, failure-case examination, or quantitative test of abstraction accuracy is provided, which constitutes a correctness-risk concern given known VLM limitations; a concrete test would be an ablation measuring waypoint/relation error rates on held-out trajectories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with targeted revisions to strengthen the presentation of our results and method validation.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of state-of-the-art zero-shot performance on four benchmarks plus real-robot validation is asserted without any reported metrics, baselines, error bars, ablation studies, or statistical details. This absence is load-bearing because the soundness of the entire contribution rests on these unspecified empirical results.

Authors: We agree that the abstract would be strengthened by including representative quantitative results. While the full manuscript contains detailed tables reporting success rates, SPL, and comparisons against baselines on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, we will revise the abstract to incorporate key metrics (e.g., success rate improvements) and note the presence of error bars and ablations in the experimental section. This change directly addresses the concern without altering the underlying claims. revision: yes
Referee: [Method (stagewise closed-loop framework)] Stagewise closed-loop framework (method description): The SOTA claim depends on foundation models reliably abstracting explored regions into verifiable Spatial Waypoints and maintaining consistent subtask-grounded landmark evidence for progress localization without hallucination or inconsistency over long trajectories. No robustness analysis, failure-case examination, or quantitative test of abstraction accuracy is provided, which constitutes a correctness-risk concern given known VLM limitations; a concrete test would be an ablation measuring waypoint/relation error rates on held-out trajectories.

Authors: The referee correctly highlights a gap in explicit validation of the abstraction step. The manuscript relies on end-to-end navigation metrics as indirect evidence of reliability but does not provide a dedicated quantitative analysis of waypoint accuracy or hallucination rates. We will add an ablation measuring waypoint/relation error rates on held-out trajectories together with a failure-case discussion in the revised version. This addition directly mitigates the identified correctness-risk concern. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims are purely empirical

full rationale

The paper presents a descriptive framework for a navigation agent using foundation models to build Spatial Cognitive Memory and perform Task-Guided Spatial Reasoning. No equations, mathematical derivations, fitted parameters, or first-principles predictions appear in the provided text. All performance claims (SOTA zero-shot results on R2R-CE, RxR-CE, etc.) are stated as outcomes of empirical evaluation and real-robot deployment rather than any logical reduction to inputs. The stagewise closed-loop framework is introduced as a design choice without self-referential definitions or self-citation load-bearing steps. This is a standard case of an applied systems paper whose validity rests on external benchmarks, not internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or background assumptions; free_parameters, axioms, and invented_entities cannot be identified.

pith-pipeline@v0.9.1-grok · 5841 in / 1237 out tokens · 26063 ms · 2026-06-27T16:47:53.552756+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 17 canonical work pages · 4 internal anchors

[1]

Anderson, Q

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded naviga- tion instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018
[2]

Krantz, E

J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InProceedings of the European Conference on Computer Vision, pages 104–120, 2020

2020
[3]

A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal grounding. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4392–4412, 2020

2020
[4]

N. P. Bhatt, Y . Yang, R. Siva, P. Samineni, D. Milan, Z. Wang, and U. Topcu. VLN-Zero: Rapid exploration and cache-enabled neurosymbolic vision-language planning for zero-shot transfer in robot navigation.arXiv preprint arXiv:2509.18592, 2025

arXiv 2025
[5]

Zhang, Z

J. Zhang, Z. Li, S. Wang, X. Shi, Z. Wei, and Q. Wu. SpatialNav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation.arXiv preprint arXiv:2601.06806, 2026

arXiv 2026
[6]

Zhang, X

J. Zhang, X. Shi, S. Wang, Z. Li, Z. Wei, and Q. Wu. SpatialAnt: Autonomous zero- shot robot navigation via active scene reconstruction and visual anticipation.arXiv preprint arXiv:2603.26837, 2026. doi:10.48550/arXiv.2603.26837

work page doi:10.48550/arxiv.2603.26837 2026
[7]

H. An, W. Hu, S. Huang, S. Huang, R. Li, Y . Liang, J. Shao, Y . Song, Z. Wang, C. Yuan, C. Zhang, H. Zhang, W. Zhuang, and X. Li. AI Flow: Perspectives, scenarios, and approaches. arXiv preprint arXiv:2506.12479, 2025. doi:10.48550/arXiv.2506.12479. URLhttps:// arxiv.org/abs/2506.12479

work page doi:10.48550/arxiv.2506.12479 2025
[8]

G. Zhou, Y . Hong, and Q. Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024. doi:10.1609/aaai.v38i7.28597

work page doi:10.1609/aaai.v38i7.28597 2024
[9]

Y . Qiao, W. Lyu, H. Wang, Z. Wang, Z. Li, Y . Zhang, M. Tan, and Q. Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025. doi:10.1109/ ICRA55743.2025.11127584

arXiv 2025
[10]

K. Chen, D. An, Y . Huang, R. Xu, Y . Su, Y . Ling, I. Reid, and L. Wang. Constraint-aware zero- shot vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. doi:10.1109/TPAMI.2025.3594204

work page doi:10.1109/tpami.2025.3594204 2025
[11]

H. Yin, H. Wei, X. Xu, W. Guo, J. Zhou, and J. Lu. GC-VLN: Instruction as graph constraints for training-free vision-and-language navigation. InProceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 1809–
[12]

URLhttps://proceedings.mlr.press/v305/yin25a.html

PMLR, 2025. URLhttps://proceedings.mlr.press/v305/yin25a.html. 9

2025
[13]

G. Dai, S. Wang, Z. Wang, G. Xie, Y . Yang, J. Pan, Q. Sun, and X. Shu. HISTORY TO FUTURE: Evolving agent with experience and thought for zero-shot vision-and-language nav- igation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2026

2026
[14]

Fried, R

D. Fried, R. Hu, V . Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navi- gation. InAdvances in Neural Information Processing Systems, 2018

2018
[15]

W. Hao, C. Li, X. Li, L. Carin, and J. Gao. Towards learning a generic agent for vision- and-language navigation via pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020

2020
[16]

Chen, P.-L

S. Chen, P.-L. Guhur, C. Schmid, and I. Laptev. History aware multimodal transformer for vision-and-language navigation. InAdvances in Neural Information Processing Systems, vol- ume 34, pages 5834–5847, 2021

2021
[17]

Chen, P.-L

S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev. Think global, act local: Dual- scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022

2022
[18]

Y . Hong, Z. Wang, Q. Wu, and S. Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022

2022
[19]

D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao. BEVBert: Multimodal map pre-training for language-guided navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2023
[20]

D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang. ETPNav: Evolv- ing topological planning for vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5130–5145, 2025. doi: 10.1109/TPAMI.2024.3386695

work page doi:10.1109/tpami.2024.3386695 2025
[21]

T. Yu, Y . Wu, Q. Cui, Q. Huang, and J. Yu. MossVLN: Memory-observation synergistic system for continuous vision-language navigation.IEEE Transactions on Multimedia, 27, 2025. doi: 10.1109/TMM.2025.3586105

work page doi:10.1109/tmm.2025.3586105 2025
[22]

Zhang, X

L. Zhang, X. Hao, Q. Xu, Q. Zhang, X. Zhang, P. Wang, J. Zhang, Z. Wang, S. Zhang, and R. Xu. MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2025. arXiv:2502.13451

Pith/arXiv arXiv 2025
[23]

Zhang, K

J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and H. Wang. NaVid: Video-based VLM plans the next step for vision-and-language navigation. InRobotics: Science and Systems, 2024

2024
[24]

Zhang, K

J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang. Uni- NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. InRobotics: Science and Systems, 2025

2025
[25]

X. Zhou, T. Xiao, L. Liu, Y . Wang, M. Chen, X. Meng, X. Wang, W. Feng, W. Sui, and Z. Su. FSR-VLN: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph.arXiv preprint arXiv:2509.13733, 2025

arXiv 2025
[26]

J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . K. Wong. MapGPT: Map-guided prompt- ing with adaptive path planning for vision-and-language navigation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, pages 9796–9810, 2024. 10

2024
[27]

Z. Li, H. Zheng, F. Zhao, A. Chan, J. Zhou, S. Lin, S. Li, and Q. Wu. One agent to guide them all: Empowering MLLMs for vision-and-language navigation via explicit world representation. arXiv preprint arXiv:2602.15400, 2026

arXiv 2026
[28]

S. Zhou, Y . Wu, T. Wang, X. Li, G. Chen, L. Liu, C. Bai, and X. Li. DeCoNav: Dialog enhanced long-horizon collaborative vision-language navigation.arXiv preprint arXiv:2604.12486, 2026. doi:10.48550/arXiv.2604.12486

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.12486 2026
[29]

Huang, P

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InProceedings of the International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

2022
[30]

Ichter, A

B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Ir- pan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. T. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Lu...

2023
[31]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Represen- tations, 2023

2023
[32]

K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language instructions to feasible plans.Autonomous Robots, 47(8):1345–1365, 2023

2023
[33]

Huang, C

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxPoser: Composable 3D value maps for robotic manipulation with language models. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 540–562. PMLR, 2023. URLhttps://proceedings.mlr.press/v229/huang23b.html

2023
[34]

K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf. SayPlan: Ground- ing large language models using 3D scene graphs for scalable robot task planning. InPro- ceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 23–72. PMLR, 2023. URLhttps://proceedings.mlr.press/ v229/rana23a.html

2023
[35]

Y . Chen, J. Arkin, C. Dawson, Y . Zhang, N. Roy, and C. Fan. Autotamp: Autoregressive task and motion planning with llms as translators and checkers. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6695–6702. IEEE, 2024

2024
[36]

Rajvanshi, K

A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez. Saynav: Grounding large language models for dynamic planning to navigation in new environments.Proceedings of the International Conference on Automated Planning and Scheduling, 34(1):464–474, 2024. doi:10.1609/icaps.v34i1.31506

work page doi:10.1609/icaps.v34i1.31506 2024
[37]

Majumdar, A

A. Majumdar, A. Ajay, X. Zhang, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. Mcvay, O. Maksymets, S. Arnaud, K. Yadav, Q. Li, B. Newman, M. Sharma, V . Berges, S. Zhang, P. Agrawal, Y . Bisk, D. Batra, M. Kalakrishnan, F. Meier, C. Paxton, S. Sax, and A. Ra- jeswaran. OpenEQA: Embodied question answering in the era of foundation models. InPro- ceedin...

2024
[38]

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 11

2025
[39]

H. Yang, Y . Long, Z. Yu, Z. Yang, M. Wang, J. Xu, Y . Wang, Z. Yu, W. Cai, L. Kang, and H. Dong. NavSpace: How navigation agents follow spatial intelligence instructions.arXiv preprint arXiv:2510.08173, 2025

arXiv 2025
[40]

H. Pan, S. Huang, J. Yang, et al. Robot navigation via foundation language models: A review. ACM Computing Surveys, 2026. doi:10.1145/3802539

work page doi:10.1145/3802539 2026
[41]

Werby, C

A. Werby, C. Huang, M. B ¨uchner, A. Valada, and W. Burgard. Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation. InRobotics: Science and Systems,
[42]

URLhttps://www.roboticsproceedings.org/rss20/p077.html
[43]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

2022
[44]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa. Large language models are zero- shot reasoners. InAdvances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022

2022
[45]

J. Li, G. Li, Y . Li, and Z. Jin. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599, 2023

arXiv 2023
[46]

Ang, and Francis E

N. Yokoyama, R. Ramrakhya, A. Das, D. Batra, and S. Ha. HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation. InProceedings of the IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems, 2024. doi:10.1109/IROS58592.2024. 10802368

work page doi:10.1109/iros58592.2024 2024
[47]

Savva, A

M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A platform for embodied AI research. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

2019
[48]

A. X. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3D: Learning from RGB-D data in indoor environments. InProceedings of the International Conference on 3D Vision, pages 667–676, 2017

2017
[49]

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Un- dersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y . Zhao, and D. Batra. Habitat- Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URLhttps:...

2021
[50]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision, 2024

2024
[51]

A. Wang, H. Chen, Z. Lin, H. Pu, and G. Ding. RepViT-SAM: Towards real-time segmenting anything.arXiv preprint arXiv:2312.05760, 2023

arXiv 2023
[52]

M. Wei, C. Wan, J. Peng, X. Yu, Y . Yang, D. Feng, W. Cai, C. Zhu, T. Wang, J. Pang, and X. Liu. Ground slow, move fast: A dual-system foundation model for generalizable vision- language navigation. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GK4rznYwhn

2026
[53]

X. Li, X. Zhang, Y . Huang, J. Dong, T. Wang, S. Zhou, Y . Wu, C. Sun, Y . Ge, Q. Weng, C. Zhang, C. Bai, and X. Li. GN0: Toward a unified paradigm for generation, evaluation, and policy learning in visual-language navigation.arXiv preprint arXiv:2606.03682, 2026. doi:10.48550/arXiv.2606.03682. URLhttps://arxiv.org/abs/2606.03682. 12

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.03682 2026
[54]

Zhang, A

J. Zhang, A. Li, Y . Qi, M. Li, J. Liu, S. Wang, H. Liu, G. Zhou, Y . Wu, X. Li, Y . Fan, W. Li, Z. Chen, F. Gao, Q. Wu, Z. Zhang, and H. Wang. Embodied navigation foundation model. In International Conference on Learning Representations, 2026. URLhttps://openreview. net/forum?id=kkBOIsrCXh

2026
[55]

Cheng, Y

N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher. VLFM: Vision-language frontier maps for zero-shot semantic navigation. InProceedings of the IEEE International Conference on Robotics and Automation, pages 42–48, 2024. doi:10.1109/ICRA57147.2024.10610712

work page doi:10.1109/icra57147.2024.10610712 2024
[56]

Ziliotto, T

F. Ziliotto, T. Campari, L. Serafini, and L. Ballan. TANGO: Training-free embodied ai agents for open-world tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[57]

X. Li, F. Lyu, H. Wu, M. Liu, J.-N. Liu, and G. Liu. Stop wandering: Efficient vision- language navigation via metacognitive reasoning.arXiv preprint arXiv:2604.02318, 2026. doi:10.48550/arXiv.2604.02318

work page doi:10.48550/arxiv.2604.02318 2026
[58]

Huang, S

X. Huang, S. Zhao, Y . Wang, X. Lu, W. Zhang, R. Qu, W. Li, Y . Wang, and C. Wen. MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2026. URLhttps://media.eventhosts.cc/Conferences/CVPR2026/ CVPR_main_conf_2026_15.pdf. CVPR 20...

2026
[59]

M. Gao, Z. Zhu, Z. Sun, Z. Ma, L. Yuan, Z. Ma, Z. Gao, J. Zhang, and S. Zou. DRIVE-Nav: Directional reasoning, inspection, and verification for efficient open-vocabulary navigation. arXiv preprint arXiv:2603.28691, 2026. doi:10.48550/arXiv.2603.28691

work page doi:10.48550/arxiv.2603.28691 2026
[60]

A. Elfes. Using occupancy grids for mobile robot perception and navigation.Computer, 22 (6):46–57, 1989. doi:10.1109/2.30720

work page doi:10.1109/2.30720 1989
[61]

D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov. Object goal naviga- tion using goal-oriented semantic exploration. InAdvances in Neural Information Processing Systems, volume 33, 2020

2020
[62]

B. Lin, Y . Nie, Z. Wei, J. Chen, S. Ma, J. Han, H. Xu, X. Chang, and X. Liang. NavCoT: Boost- ing LLM-based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. arXiv:2403.07376

arXiv 2025
[63]

Xiaomi MiMo API Open Platform: Pricing and rate limits

Xiaomi MiMo API Open Platform. Xiaomi MiMo API Open Platform: Pricing and rate limits. https://platform.xiaomimimo.com/docs/pricing, 2026. Accessed: 2026-05-29

2026
[64]

Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

Kimi Team et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

Pith/arXiv arXiv
[65]

doi:10.48550/arXiv.2602.02276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02276
[66]

Xiaomi MiMo-V2.5 series open-sourced & orbit 100 trillion token plan launched.https://platform.xiaomimimo.com/docs/en-US/news/ v2.5-open-sourced, May 2026

Xiaomi MiMo API Open Platform. Xiaomi MiMo-V2.5 series open-sourced & orbit 100 trillion token plan launched.https://platform.xiaomimimo.com/docs/en-US/news/ v2.5-open-sourced, May 2026. Updated: 2026-05-28; accessed: 2026-05-29

2026
[67]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Li, J. Lin, X. Lin, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631 2025
[68]

reasoning

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id= qwen3.5, Feb. 2026. 13 A Supplementary Overview This supplementary material is organized as follows: • Appendix B.1 provides the SpaceVLN agent overview and runtime pipeline. • Appendix B.2 specifies the stage-level subtask interface. • Appendix B.3 details Spatial Cognitive Mem...

2026

[1] [1]

Anderson, Q

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded naviga- tion instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018

[2] [2]

Krantz, E

J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InProceedings of the European Conference on Computer Vision, pages 104–120, 2020

2020

[3] [3]

A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal grounding. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4392–4412, 2020

2020

[4] [4]

N. P. Bhatt, Y . Yang, R. Siva, P. Samineni, D. Milan, Z. Wang, and U. Topcu. VLN-Zero: Rapid exploration and cache-enabled neurosymbolic vision-language planning for zero-shot transfer in robot navigation.arXiv preprint arXiv:2509.18592, 2025

arXiv 2025

[5] [5]

Zhang, Z

J. Zhang, Z. Li, S. Wang, X. Shi, Z. Wei, and Q. Wu. SpatialNav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation.arXiv preprint arXiv:2601.06806, 2026

arXiv 2026

[6] [6]

Zhang, X

J. Zhang, X. Shi, S. Wang, Z. Li, Z. Wei, and Q. Wu. SpatialAnt: Autonomous zero- shot robot navigation via active scene reconstruction and visual anticipation.arXiv preprint arXiv:2603.26837, 2026. doi:10.48550/arXiv.2603.26837

work page doi:10.48550/arxiv.2603.26837 2026

[7] [7]

H. An, W. Hu, S. Huang, S. Huang, R. Li, Y . Liang, J. Shao, Y . Song, Z. Wang, C. Yuan, C. Zhang, H. Zhang, W. Zhuang, and X. Li. AI Flow: Perspectives, scenarios, and approaches. arXiv preprint arXiv:2506.12479, 2025. doi:10.48550/arXiv.2506.12479. URLhttps:// arxiv.org/abs/2506.12479

work page doi:10.48550/arxiv.2506.12479 2025

[8] [8]

G. Zhou, Y . Hong, and Q. Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024. doi:10.1609/aaai.v38i7.28597

work page doi:10.1609/aaai.v38i7.28597 2024

[9] [9]

Y . Qiao, W. Lyu, H. Wang, Z. Wang, Z. Li, Y . Zhang, M. Tan, and Q. Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025. doi:10.1109/ ICRA55743.2025.11127584

arXiv 2025

[10] [10]

K. Chen, D. An, Y . Huang, R. Xu, Y . Su, Y . Ling, I. Reid, and L. Wang. Constraint-aware zero- shot vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. doi:10.1109/TPAMI.2025.3594204

work page doi:10.1109/tpami.2025.3594204 2025

[11] [11]

H. Yin, H. Wei, X. Xu, W. Guo, J. Zhou, and J. Lu. GC-VLN: Instruction as graph constraints for training-free vision-and-language navigation. InProceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 1809–

[12] [12]

URLhttps://proceedings.mlr.press/v305/yin25a.html

PMLR, 2025. URLhttps://proceedings.mlr.press/v305/yin25a.html. 9

2025

[13] [13]

G. Dai, S. Wang, Z. Wang, G. Xie, Y . Yang, J. Pan, Q. Sun, and X. Shu. HISTORY TO FUTURE: Evolving agent with experience and thought for zero-shot vision-and-language nav- igation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2026

2026

[14] [14]

Fried, R

D. Fried, R. Hu, V . Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navi- gation. InAdvances in Neural Information Processing Systems, 2018

2018

[15] [15]

W. Hao, C. Li, X. Li, L. Carin, and J. Gao. Towards learning a generic agent for vision- and-language navigation via pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020

2020

[16] [16]

Chen, P.-L

S. Chen, P.-L. Guhur, C. Schmid, and I. Laptev. History aware multimodal transformer for vision-and-language navigation. InAdvances in Neural Information Processing Systems, vol- ume 34, pages 5834–5847, 2021

2021

[17] [17]

Chen, P.-L

S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev. Think global, act local: Dual- scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022

2022

[18] [18]

Y . Hong, Z. Wang, Q. Wu, and S. Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022

2022

[19] [19]

D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao. BEVBert: Multimodal map pre-training for language-guided navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2023

[20] [20]

D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang. ETPNav: Evolv- ing topological planning for vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5130–5145, 2025. doi: 10.1109/TPAMI.2024.3386695

work page doi:10.1109/tpami.2024.3386695 2025

[21] [21]

T. Yu, Y . Wu, Q. Cui, Q. Huang, and J. Yu. MossVLN: Memory-observation synergistic system for continuous vision-language navigation.IEEE Transactions on Multimedia, 27, 2025. doi: 10.1109/TMM.2025.3586105

work page doi:10.1109/tmm.2025.3586105 2025

[22] [22]

Zhang, X

L. Zhang, X. Hao, Q. Xu, Q. Zhang, X. Zhang, P. Wang, J. Zhang, Z. Wang, S. Zhang, and R. Xu. MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2025. arXiv:2502.13451

Pith/arXiv arXiv 2025

[23] [23]

Zhang, K

J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and H. Wang. NaVid: Video-based VLM plans the next step for vision-and-language navigation. InRobotics: Science and Systems, 2024

2024

[24] [24]

Zhang, K

J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang. Uni- NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. InRobotics: Science and Systems, 2025

2025

[25] [25]

X. Zhou, T. Xiao, L. Liu, Y . Wang, M. Chen, X. Meng, X. Wang, W. Feng, W. Sui, and Z. Su. FSR-VLN: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph.arXiv preprint arXiv:2509.13733, 2025

arXiv 2025

[26] [26]

J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . K. Wong. MapGPT: Map-guided prompt- ing with adaptive path planning for vision-and-language navigation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, pages 9796–9810, 2024. 10

2024

[27] [27]

Z. Li, H. Zheng, F. Zhao, A. Chan, J. Zhou, S. Lin, S. Li, and Q. Wu. One agent to guide them all: Empowering MLLMs for vision-and-language navigation via explicit world representation. arXiv preprint arXiv:2602.15400, 2026

arXiv 2026

[28] [28]

S. Zhou, Y . Wu, T. Wang, X. Li, G. Chen, L. Liu, C. Bai, and X. Li. DeCoNav: Dialog enhanced long-horizon collaborative vision-language navigation.arXiv preprint arXiv:2604.12486, 2026. doi:10.48550/arXiv.2604.12486

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.12486 2026

[29] [29]

Huang, P

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InProceedings of the International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

2022

[30] [30]

Ichter, A

B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Ir- pan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. T. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Lu...

2023

[31] [31]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Represen- tations, 2023

2023

[32] [32]

K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language instructions to feasible plans.Autonomous Robots, 47(8):1345–1365, 2023

2023

[33] [33]

Huang, C

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxPoser: Composable 3D value maps for robotic manipulation with language models. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 540–562. PMLR, 2023. URLhttps://proceedings.mlr.press/v229/huang23b.html

2023

[34] [34]

K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf. SayPlan: Ground- ing large language models using 3D scene graphs for scalable robot task planning. InPro- ceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 23–72. PMLR, 2023. URLhttps://proceedings.mlr.press/ v229/rana23a.html

2023

[35] [35]

Y . Chen, J. Arkin, C. Dawson, Y . Zhang, N. Roy, and C. Fan. Autotamp: Autoregressive task and motion planning with llms as translators and checkers. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6695–6702. IEEE, 2024

2024

[36] [36]

Rajvanshi, K

A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez. Saynav: Grounding large language models for dynamic planning to navigation in new environments.Proceedings of the International Conference on Automated Planning and Scheduling, 34(1):464–474, 2024. doi:10.1609/icaps.v34i1.31506

work page doi:10.1609/icaps.v34i1.31506 2024

[37] [37]

Majumdar, A

A. Majumdar, A. Ajay, X. Zhang, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. Mcvay, O. Maksymets, S. Arnaud, K. Yadav, Q. Li, B. Newman, M. Sharma, V . Berges, S. Zhang, P. Agrawal, Y . Bisk, D. Batra, M. Kalakrishnan, F. Meier, C. Paxton, S. Sax, and A. Ra- jeswaran. OpenEQA: Embodied question answering in the era of foundation models. InPro- ceedin...

2024

[38] [38]

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 11

2025

[39] [39]

H. Yang, Y . Long, Z. Yu, Z. Yang, M. Wang, J. Xu, Y . Wang, Z. Yu, W. Cai, L. Kang, and H. Dong. NavSpace: How navigation agents follow spatial intelligence instructions.arXiv preprint arXiv:2510.08173, 2025

arXiv 2025

[40] [40]

H. Pan, S. Huang, J. Yang, et al. Robot navigation via foundation language models: A review. ACM Computing Surveys, 2026. doi:10.1145/3802539

work page doi:10.1145/3802539 2026

[41] [41]

Werby, C

A. Werby, C. Huang, M. B ¨uchner, A. Valada, and W. Burgard. Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation. InRobotics: Science and Systems,

[42] [42]

URLhttps://www.roboticsproceedings.org/rss20/p077.html

[43] [43]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

2022

[44] [44]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa. Large language models are zero- shot reasoners. InAdvances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022

2022

[45] [45]

J. Li, G. Li, Y . Li, and Z. Jin. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599, 2023

arXiv 2023

[46] [46]

Ang, and Francis E

N. Yokoyama, R. Ramrakhya, A. Das, D. Batra, and S. Ha. HM3D-OVON: A dataset and benchmark for open-vocabulary object goal navigation. InProceedings of the IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems, 2024. doi:10.1109/IROS58592.2024. 10802368

work page doi:10.1109/iros58592.2024 2024

[47] [47]

Savva, A

M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A platform for embodied AI research. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

2019

[48] [48]

A. X. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3D: Learning from RGB-D data in indoor environments. InProceedings of the International Conference on 3D Vision, pages 667–676, 2017

2017

[49] [49]

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Un- dersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y . Zhao, and D. Batra. Habitat- Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URLhttps:...

2021

[50] [50]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision, 2024

2024

[51] [51]

A. Wang, H. Chen, Z. Lin, H. Pu, and G. Ding. RepViT-SAM: Towards real-time segmenting anything.arXiv preprint arXiv:2312.05760, 2023

arXiv 2023

[52] [52]

M. Wei, C. Wan, J. Peng, X. Yu, Y . Yang, D. Feng, W. Cai, C. Zhu, T. Wang, J. Pang, and X. Liu. Ground slow, move fast: A dual-system foundation model for generalizable vision- language navigation. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GK4rznYwhn

2026

[53] [53]

X. Li, X. Zhang, Y . Huang, J. Dong, T. Wang, S. Zhou, Y . Wu, C. Sun, Y . Ge, Q. Weng, C. Zhang, C. Bai, and X. Li. GN0: Toward a unified paradigm for generation, evaluation, and policy learning in visual-language navigation.arXiv preprint arXiv:2606.03682, 2026. doi:10.48550/arXiv.2606.03682. URLhttps://arxiv.org/abs/2606.03682. 12

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.03682 2026

[54] [54]

Zhang, A

J. Zhang, A. Li, Y . Qi, M. Li, J. Liu, S. Wang, H. Liu, G. Zhou, Y . Wu, X. Li, Y . Fan, W. Li, Z. Chen, F. Gao, Q. Wu, Z. Zhang, and H. Wang. Embodied navigation foundation model. In International Conference on Learning Representations, 2026. URLhttps://openreview. net/forum?id=kkBOIsrCXh

2026

[55] [55]

Cheng, Y

N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher. VLFM: Vision-language frontier maps for zero-shot semantic navigation. InProceedings of the IEEE International Conference on Robotics and Automation, pages 42–48, 2024. doi:10.1109/ICRA57147.2024.10610712

work page doi:10.1109/icra57147.2024.10610712 2024

[56] [56]

Ziliotto, T

F. Ziliotto, T. Campari, L. Serafini, and L. Ballan. TANGO: Training-free embodied ai agents for open-world tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[57] [57]

X. Li, F. Lyu, H. Wu, M. Liu, J.-N. Liu, and G. Liu. Stop wandering: Efficient vision- language navigation via metacognitive reasoning.arXiv preprint arXiv:2604.02318, 2026. doi:10.48550/arXiv.2604.02318

work page doi:10.48550/arxiv.2604.02318 2026

[58] [58]

Huang, S

X. Huang, S. Zhao, Y . Wang, X. Lu, W. Zhang, R. Qu, W. Li, Y . Wang, and C. Wen. MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2026. URLhttps://media.eventhosts.cc/Conferences/CVPR2026/ CVPR_main_conf_2026_15.pdf. CVPR 20...

2026

[59] [59]

M. Gao, Z. Zhu, Z. Sun, Z. Ma, L. Yuan, Z. Ma, Z. Gao, J. Zhang, and S. Zou. DRIVE-Nav: Directional reasoning, inspection, and verification for efficient open-vocabulary navigation. arXiv preprint arXiv:2603.28691, 2026. doi:10.48550/arXiv.2603.28691

work page doi:10.48550/arxiv.2603.28691 2026

[60] [60]

A. Elfes. Using occupancy grids for mobile robot perception and navigation.Computer, 22 (6):46–57, 1989. doi:10.1109/2.30720

work page doi:10.1109/2.30720 1989

[61] [61]

D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov. Object goal naviga- tion using goal-oriented semantic exploration. InAdvances in Neural Information Processing Systems, volume 33, 2020

2020

[62] [62]

B. Lin, Y . Nie, Z. Wei, J. Chen, S. Ma, J. Han, H. Xu, X. Chang, and X. Liang. NavCoT: Boost- ing LLM-based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. arXiv:2403.07376

arXiv 2025

[63] [63]

Xiaomi MiMo API Open Platform: Pricing and rate limits

Xiaomi MiMo API Open Platform. Xiaomi MiMo API Open Platform: Pricing and rate limits. https://platform.xiaomimimo.com/docs/pricing, 2026. Accessed: 2026-05-29

2026

[64] [64]

Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

Kimi Team et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

Pith/arXiv arXiv

[65] [65]

doi:10.48550/arXiv.2602.02276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02276

[66] [66]

Xiaomi MiMo-V2.5 series open-sourced & orbit 100 trillion token plan launched.https://platform.xiaomimimo.com/docs/en-US/news/ v2.5-open-sourced, May 2026

Xiaomi MiMo API Open Platform. Xiaomi MiMo-V2.5 series open-sourced & orbit 100 trillion token plan launched.https://platform.xiaomimimo.com/docs/en-US/news/ v2.5-open-sourced, May 2026. Updated: 2026-05-28; accessed: 2026-05-29

2026

[67] [67]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Li, J. Lin, X. Lin, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631 2025

[68] [68]

reasoning

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id= qwen3.5, Feb. 2026. 13 A Supplementary Overview This supplementary material is organized as follows: • Appendix B.1 provides the SpaceVLN agent overview and runtime pipeline. • Appendix B.2 specifies the stage-level subtask interface. • Appendix B.3 details Spatial Cognitive Mem...

2026