MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

arxiv: 2605.19594 · v1 · pith:KL7BTPEU · submitted 2026-05-19 · 💻 cs.RO

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

Pith reviewed 2026-05-20 04:59 UTC · model grok-4.3

classification 💻 cs.RO

keywords zero-shot goal-oriented navigation · dynamic cognitive map · memory-aware exploration · goal re-validation · missed goal re-exploration · instance-level navigation · HM3D dataset

The pith

Dynamic cognitive map enables re-validation and re-exploration to fix missed targets in zero-shot navigation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MCNav builds a dynamic cognitive map to store queryable details on objects encountered in explored parts of an environment. This structure supports two key strategies: re-validating previously seen objects to fix identification mistakes and estimating the chance a target exists in an already visited area based on surrounding context. Supported by mechanisms to avoid repeating mistakes and to double-check promising leads, the approach shifts focus from solely exploring new space to also making better use of what has already been seen. If effective, this leads to fewer navigation failures when targets are specific instances that might have been overlooked initially.

Core claim

We propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this, we introduce goal re-validation to re-assess previously seen objects to correct matching failures, and missed goal re-exploration to estimate the likelihood that a target is present in an explored region from contextual cues. These are stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. Evaluations on HM3Dv1 and HM3Dv2 datasets show state-of-the-art performance, especially on instance-level goal navigation.
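The interplay between goal re-validation, the blacklist, and the double-check can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `Candidate`, `revalidate`, the `verify` callback (a stand-in for whatever VLM check the paper uses), the `threshold`, and the number of `checks` are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass(frozen=True)
class Candidate:
    """A previously seen object retrieved from memory (hypothetical record)."""
    object_id: int
    label: str


def revalidate(
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate], float],
    blacklist: Set[int],
    threshold: float = 0.8,  # assumed confidence cut-off
    checks: int = 2,         # assumed number of verification passes
) -> Optional[Candidate]:
    """Return the first candidate that passes every verification pass.

    Candidates that fail are added to `blacklist` so later calls skip them.
    """
    for cand in candidates:
        if cand.object_id in blacklist:
            continue  # already rejected once; do not retry
        # Double-check: all passes must clear the threshold.
        if all(verify(cand) >= threshold for _ in range(checks)):
            return cand
        blacklist.add(cand.object_id)
    return None
```

The blacklist is passed in and mutated so that rejections persist across calls, which is one simple way to avoid re-verifying the same mismatched object later in an episode.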

What carries the argument

Dynamic cognitive map storing efficiently queryable information about relevant objects in explored areas, which supports memory-aware strategies for re-validation and likelihood estimation.
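One way to read "efficiently queryable" and "dynamic" is a store of object nodes indexed by label, where repeated observations of the same object are merged rather than duplicated. The sketch below is a simplified guess at such a structure. The class and method names, the 2D positions, and the `merge_radius` and neighbor `radius` values are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class ObjectNode:
    node_id: int
    label: str
    position: Point
    observations: int = 1


class CognitiveMap:
    """Hypothetical object memory with a per-label index for fast lookup."""

    def __init__(self, merge_radius: float = 0.5):
        self.merge_radius = merge_radius  # assumed merge distance in meters
        self.nodes: Dict[int, ObjectNode] = {}
        self._by_label: Dict[str, List[int]] = defaultdict(list)

    def observe(self, label: str, position: Point) -> int:
        """Insert a detection, or merge it into a nearby node with the same label."""
        for nid in self._by_label[label]:
            node = self.nodes[nid]
            if math.dist(node.position, position) <= self.merge_radius:
                n = node.observations
                # Dynamic update: running mean of the observed positions.
                node.position = tuple(
                    (p * n + q) / (n + 1) for p, q in zip(node.position, position)
                )
                node.observations = n + 1
                return nid
        nid = len(self.nodes)
        self.nodes[nid] = ObjectNode(nid, label, tuple(position))
        self._by_label[label].append(nid)
        return nid

    def query(self, label: str) -> List[ObjectNode]:
        """Return all stored nodes with the given label."""
        return [self.nodes[n] for n in self._by_label.get(label, [])]

    def neighbors(self, node_id: int, radius: float = 2.0) -> List[ObjectNode]:
        """Return other nodes within `radius` of the given node (spatial context)."""
        center = self.nodes[node_id].position
        return [
            n for n in self.nodes.values()
            if n.node_id != node_id and math.dist(n.position, center) <= radius
        ]
```

A label index keeps lookups proportional to the number of objects of that class rather than to the size of the whole map, and `neighbors` gives a crude form of the spatial relationships the map is said to model.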

Load-bearing premise

That the contextual cues stored in the cognitive map allow reliable estimation of whether a target is likely to be in an explored region, and that re-validating objects will fix errors without creating new mistakes.
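To make the premise concrete, here is a toy Python sketch of turning contextual cues into a region-level likelihood. It uses a noisy-OR over per-object co-occurrence priors, which is a stand-in chosen for illustration; the abstract does not say how MCNav computes this estimate. The prior table, the function names, and the 0.5 threshold are hypothetical.

```python
from typing import Dict, Iterable


def region_likelihood(
    context_labels: Iterable[str],
    cooccurrence: Dict[str, float],
) -> float:
    """Noisy-OR estimate that the target is in a region, given objects seen there.

    `cooccurrence[label]` is an assumed prior P(target nearby | label seen).
    """
    miss = 1.0
    for label in set(context_labels):  # count each cue type once
        miss *= 1.0 - cooccurrence.get(label, 0.0)
    return 1.0 - miss


def should_reexplore(
    context_labels: Iterable[str],
    cooccurrence: Dict[str, float],
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Decide whether a previously explored region is worth revisiting."""
    return region_likelihood(context_labels, cooccurrence) >= threshold
```

With priors of 0.5 each for "bed" and "nightstand", a region containing both scores 0.75 and would be revisited, while a region containing only unrelated objects scores 0. If the priors are noisy, the same rule sends the agent back to unpromising regions, which is exactly the failure this premise has to rule out.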

What would settle it

Observe a scenario where a target object is in an explored region with matching contextual cues, yet the system neither re-validates it correctly nor decides to re-explore, resulting in continued failure to reach the goal.

Figures

Figures reproduced from arXiv: 2605.19594 by Jingyu Li, Li Zhang, Wenxiao Wu, Zhe Liu.

**Figure 1.** Different scene representation paradigms adopted in zero-shot goal-oriented navigation: (a) the raw scene; (b) graph-based methods that construct node–edge structures; (c) map-based methods that build semantic occupancy maps; and (d) our cognitive map representation, which models explicit spatial relationships among objects and supports memory accumulation and dynamic updates. Nonetheless, current zero-s…

**Figure 2.** An overview of the MCNav framework. Our model first processes different task goals using an LLM/VLM to extract objects of interest and goal properties. During navigation, the agent iteratively performs a memory-aware exploration strategy using the cognitive map. At each step, potential targets retrieved from the map are treated as temporary goals and verified by a VLM-based double-check mechanism. We furth…

**Figure 3.** In our cognitive map construction process, the agent observes multiple frames and, after performing detection and segmentation fusion, stores the results in the cognitive map. … navigation, the agent operates in a continuous perception-exploration loop. It periodically sets the nearest frontier as a long-term target every 20 time steps, while simultaneously employing Mask R-CNN [12] to detect candidate obj…

**Figure 4.** Demonstration of the decision process of MCNav. The agent makes a new exploration decision upon reaching each temporary goal. Red dots on the occupancy map represent boundary points, while green dots indicate the projection of object nodes from the cognitive map onto the corresponding positions in the occupancy map.

**Figure 5.** Comparison between UniGoal and MCNav on TN. The double-check mechanism improves navigation reliability. This is likely due to the stronger scene understanding and reasoning capabilities of Qwen2.5-VL-7B. Ideally, our framework is model-agnostic and can naturally benefit from future advances in MLLM capabilities. 5.4 Qualitative results: We visualize MCNav to illustrate its memory-aware exploration strateg…
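The Figure 3 excerpt mentions that the agent sets the nearest frontier as a long-term target every 20 time steps. A minimal Python sketch of that scheduling rule follows. The 20-step interval comes from the excerpt; the function names, the 2D points, and the use of straight-line (Euclidean) distance are simplifying assumptions, since the excerpt does not say how "nearest" is measured.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
REPLAN_INTERVAL = 20  # time steps between long-term target updates (from the excerpt)


def nearest_frontier(agent: Point, frontiers: Sequence[Point]) -> Optional[Point]:
    """Return the frontier closest to the agent, or None if no frontier remains.

    Euclidean distance is an assumption made for this sketch.
    """
    return min(frontiers, key=lambda f: math.dist(agent, f), default=None)


def update_target(
    step: int,
    agent: Point,
    frontiers: Sequence[Point],
    current: Optional[Point],
) -> Optional[Point]:
    """Refresh the long-term target on replanning steps; otherwise keep it."""
    if step % REPLAN_INTERVAL == 0:
        return nearest_frontier(agent, frontiers)
    return current
```

Between replanning steps the target is held fixed, so object detection and map updates can run every step without the agent changing direction each time a new frontier appears.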

Original abstract

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MCNav adds a dynamic cognitive map plus re-validation and re-exploration tactics to fix missed goals in zero-shot navigation, but the SOTA claims lack the experimental backing needed to judge if they hold up.

The main point is that this paper targets a practical gap: most zero-shot navigation methods keep pushing into new space and ignore what they already saw, so missed or mis-matched instance targets cause repeated failures. MCNav tries to fix that with a queryable dynamic cognitive map that stores object details from explored areas, then layers on goal re-validation to double-check past objects and missed-goal re-exploration that pulls contextual cues to decide whether to revisit a region. Blacklist and double-check mechanisms are added to limit repeated mistakes. That combination is the concrete new piece, and it makes sense as an engineering response to a frequent failure mode in indoor robot navigation.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MCNav, a memory-aware navigation framework for zero-shot goal-oriented navigation that maintains a dynamic cognitive map storing efficiently queryable information about relevant objects in explored areas. It introduces two memory-aware exploration strategies—goal re-validation to re-assess previously seen objects and correct matching failures, and missed goal re-exploration that estimates target presence likelihood in explored regions from contextual cues—stabilized by blacklist and double-check mechanisms. The method is evaluated on HM3Dv1 and HM3Dv2 datasets across three tasks and claims state-of-the-art performance, particularly on instance-level goal navigation.

Significance. If the proposed memory strategies deliver robust gains by better exploiting explored regions without introducing new errors or excessive path length overhead, the work could advance zero-shot navigation in embodied AI by addressing a key limitation of prior methods that emphasize new-region exploration. The integration of cognitive maps with scene understanding models is a timely engineering contribution, though its impact depends on generalizability beyond the HM3D datasets.

major comments (2)

[§5] (Experimental results): The SOTA claim on instance-level goal navigation rests on the two memory-aware strategies functioning as intended, yet the manuscript provides no quantitative analysis of how often the blacklist and double-check mechanisms trigger, nor their net effect on success rate versus path length. This leaves open whether the reported gains are robust or sensitive to dataset-specific choices in HM3Dv1/HM3Dv2.
[§4.3] (Missed goal re-exploration): The strategy assumes contextual cues from the cognitive map can reliably estimate the likelihood a target is present in an already-explored region; if stored object attributes are incomplete or the cue-to-likelihood mapping is noisy, re-exploration risks wasting steps on low-probability areas. The paper lacks failure-case analysis or overhead measurements to confirm the net benefit.
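The analysis requested in both major comments could be tabulated along the lines of the Python sketch below. It is only a sketch under assumptions: the `Episode` fields and the `summarize` function are hypothetical, and the choice of SPL (success weighted by path length, a standard embodied-navigation metric) is ours, since the report only says "success rate versus path length".

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Episode:
    """Hypothetical per-episode log."""
    success: bool
    shortest_path: float     # geodesic start-to-goal distance
    agent_path: float        # distance actually travelled
    blacklist_hits: int = 0  # times a blacklisted object was skipped
    double_checks: int = 0   # times the double-check was invoked


def summarize(episodes: Sequence[Episode]) -> Dict[str, float]:
    """Compute success rate, SPL, and how often each mechanism fired."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    spl_sum = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        if e.success and denom > 0:
            spl_sum += e.shortest_path / denom
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "spl": spl_sum / n,
        "blacklist_episode_rate": sum(e.blacklist_hits > 0 for e in episodes) / n,
        "mean_double_checks": sum(e.double_checks for e in episodes) / n,
    }
```

Running `summarize` once on logs with a mechanism enabled and once with it disabled would give the with/without comparison the referee asks for, with SPL capturing any path-length overhead from re-exploration.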

minor comments (1)

[Abstract] The abstract states evaluation across three tasks but does not name them explicitly; adding this detail would improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each of the major comments in detail below, outlining the revisions we intend to make to the manuscript.

Referee: [§5] (Experimental results): The SOTA claim on instance-level goal navigation rests on the two memory-aware strategies functioning as intended, yet the manuscript provides no quantitative analysis of how often the blacklist and double-check mechanisms trigger, nor their net effect on success rate versus path length. This leaves open whether the reported gains are robust or sensitive to dataset-specific choices in HM3Dv1/HM3Dv2.

Authors: We concur that providing quantitative analysis of the blacklist and double-check mechanisms would enhance the understanding of their role in achieving the reported performance. In the revised version, we will incorporate new experimental results detailing the activation frequency of these mechanisms and their effects on success rate and path length. This addition will help substantiate the robustness of the SOTA claims across the evaluated datasets.

Revision: yes

Referee: [§4.3] (Missed goal re-exploration): The strategy assumes contextual cues from the cognitive map can reliably estimate the likelihood a target is present in an already-explored region; if stored object attributes are incomplete or the cue-to-likelihood mapping is noisy, re-exploration risks wasting steps on low-probability areas. The paper lacks failure-case analysis or overhead measurements to confirm the net benefit.

Authors: The concern regarding potential inefficiencies in the missed goal re-exploration strategy is valid. We will revise the manuscript to include a dedicated analysis of failure cases, along with measurements of the overhead in terms of additional steps taken. This will demonstrate the net benefit by comparing scenarios with and without the re-exploration strategy.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in MCNav engineering framework

The paper describes an applied navigation system that combines a dynamic cognitive map with two memory-aware strategies (goal re-validation and missed-goal re-exploration) plus stabilization mechanisms. No equations, fitted parameters, or derivation chains appear in the provided text. The central claims rest on empirical SOTA results on HM3D datasets rather than any reduction of outputs to inputs by construction, self-citation load-bearing premises, or ansatz smuggling. This is the expected non-circular outcome for a methods paper in robotics that does not attempt a first-principles derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or newly postulated entities are stated. The framework appears to rest on standard assumptions about object detection, map building, and LLM scene understanding that are common in the field.

pith-pipeline@v0.9.0 · 5737 in / 1099 out tokens · 33322 ms · 2026-05-20T04:59:07.493755+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose MCNav, a memory-aware navigation framework with a dynamic cognitive map... goal re-validation... missed goal re-exploration... blacklist mechanism and a double-check mechanism"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
[3] Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)
[4] Busch, F.L., Homberger, T., Ortega-Peimbert, J., Yang, Q., Andersson, O.: One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation. In: ICRA (2025)
[5] Cai, W., Huang, S., Cheng, G., Long, Y., Gao, P., Sun, C., Dong, H.: Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. In: ICRA (2024)
[6] Cao, Y., Zhang, J., Yu, Z., Liu, S., Qin, Z., Zou, Q., Du, B., Xu, K.: Cognav: Cognitive process modeling for object goal navigation with llms. In: ICCV (2025)
[7] Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S.Y., Shah, K., Paxton, C., Gupta, S., Batra, D., et al.: Goat: Go to any thing. arXiv preprint arXiv:2311.06430 (2023)
[8] Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navigation using goal-oriented semantic exploration. NeurIPS (2020)
[9] Chen, J., Lin, B., Xu, R., Chai, Z., Liang, X., Wong, K.Y.: Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In: ACL (2024)
[10] Chen, J., Li, G., Kumar, S., Ghanem, B., Yu, F.: How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. RSS (2023)
[11] Gadre, S.Y., Wortsman, M., Ilharco, G., Schmidt, L., Song, S.: Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In: CVPR (2023)
[12] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: CVPR (2017)
[13] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: CVPR (2023)
[14] Krantz, J., Gervet, T., Yadav, K., Wang, A., Paxton, C., Mottaghi, R., Batra, D., Malik, J., Lee, S., Chaplot, D.S.: Navigating to objects specified by images. In: CVPR (2023)
[15] Krantz, J., Lee, S., Malik, J., Batra, D., Chaplot, D.S.: Instance-specific image goal navigation: Training embodied agents to find object instances. arXiv preprint arXiv:2211.15876 (2022)
[16] Kuang, Y., Lin, H., Jiang, M.: Openfmnav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670 (2024)
[17] Kwon, O., Park, J., Oh, S.: Renderable neural radiance map for visual navigation. In: CVPR (2023)
[18] Lei, X., Wang, M., Zhou, W., Li, L., Li, H.: Instance-aware exploration-verification-exploitation for instance imagegoal navigation. In: CVPR (2024)
[19] Li, J., Wu, J., Hu, D., Huang, X., Sun, B., Hao, Z., Lang, X., Zhu, X., Zhang, L.: Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. In: CVPR (2026)
[20] Li, J., Zhang, B., Jin, X., Deng, J., Zhu, X., Zhang, L.: Imagidrive: A unified imagination-and-planning framework for autonomous driving (2025)
[21] Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y., Xu, Y.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026)
[22] Lindenberger, P., Sarlin, P.E., Pollefeys, M.: Lightglue: Local feature matching at light speed. In: CVPR (2023)
[23] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS (2023)
[24] Liu, P., Zhang, Q., Peng, D., Zhang, L., Qin, Y., Zhou, H., Ma, J., Xu, R., Ji, Y.: Toponav: Topological graphs as a key enabler for advanced object navigation. arXiv preprint arXiv:2509.01364 (2025)
[25] Liu, R., Wang, X., Wang, W., Yang, Y.: Bird’s-eye-view scene graph for vision-language navigation. In: CVPR (2023)
[26] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: ECCV (2024)
[27] Liu, Z., Huang, R., Yang, R., Yan, S., Wang, Z., Hou, L., Lin, D., Bai, X., Zhao, H.: Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning. CVPR (2026)
[28] Long, Y., Cai, W., Wang, H., Zhan, G., Dong, H.: Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. CoRL (2024)
[29] Long, Y., Li, X., Cai, W., Dong, H.: Discuss before moving: Visual language navigation via multi-expert discussions. In: ICRA (2024)
[30] Majumdar, A., Aggarwal, G., Devnani, B., Hoffman, J., Batra, D.: Zson: Zero-shot object-goal navigation using multimodal goal embeddings. NeurIPS (2022)
[31] Meta, A.: Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December (2024)
[32] Nie, D., Guo, X., Duan, Y., Zhang, R., Chen, L.: Wmnav: Integrating vision-language models into world models for object goal navigation. IROS (2025)
[33] Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new environments. In: Proceedings of the International Conference on Automated Planning and Scheduling (2024)
[34] Ramakrishnan, S.K., Chaplot, D.S., Al-Halah, Z., Malik, J., Grauman, K.: Poni: Potential functions for objectgoal navigation with interaction-free learning. In: CVPR (2022)
[35] Sethian, J.A.: Fast marching methods. SIAM review (1999)
[36] Sun, X., Liu, L., Zhi, H., Qiu, R., Liang, J.: Prioritized semantic learning for zero-shot instance navigation. In: ECCV (2024)
[37] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D., Maksymets, O., Gokaslan, A., Vondrus, V., Dharur, S., Meier, F., Galuba, W., Chang, A., Kira, Z., Koltun, V., Malik, J., Savva, M., Batra, D.: Habitat 2.0: Training home assistants to rearrange their habitat. In: NeurIPS (2021)
[38] Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
[39] Wu, P., Mu, Y., Wu, B., Hou, Y., Ma, J., Zhang, S., Liu, C.: Voronav: Voronoi-based zero-shot object navigation with large language model. In: ICML (2024)
[40] Yadav, K., Majumdar, A., Ramrakhya, R., Yokoyama, N., Baevski, A., Kira, Z., Maksymets, O., Batra, D.: Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav. arXiv preprint arXiv:2303.07798 (2023)
[41] Yin, H., Xu, X., Wu, Z., Zhou, J., Lu, J.: Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. NeurIPS (2024)
[42] Yin, H., Xu, X., Zhao, L., Wang, Z., Zhou, J., Lu, J.: Unigoal: Towards universal zero-shot goal-oriented navigation. In: CVPR (2025)
[43] Yokoyama, N., Ha, S., Batra, D., Wang, J., Bucher, B.: Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In: ICRA (2024)
[44] Yu, B., Kasaei, H., Cao, M.: L3mvn: Leveraging large language models for visual target navigation. In: IROS (2023)
[45] Yuan, T., Dong, Z., Liu, Y., Zhao, H.: Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666 (2026)
[46] Zhang, L., Wang, H., Xiao, E., Zhang, X., Zhang, Q., Jiang, Z., Xu, R.: Multi-floor zero-shot object navigation policy. In: ICRA (2025)
[47] Zhang, L., Zhang, Q., Wang, H., Xiao, E., Jiang, Z., Chen, H., Xu, R.: Trihelper: Zero-shot object navigation with dynamic assistance. In: IROS (2024)
[48] Zhang, M., Du, Y., Wu, C., Zhou, J., Qi, Z., Ma, J., Zhou, B.: Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. IEEE RA-L (2025)
[49] Zhong, L., Gao, C., Ding, Z., Liao, Y., Ma, H., Zhang, S., Zhou, X., Liu, S.: Topv-nav: Unlocking the top-view spatial reasoning potential of mllm for zero-shot object navigation. arXiv preprint arXiv:2411.16425 (2024)
[50] Zhou, K., Zheng, K., Pryor, C., Shen, Y., Jin, H., Getoor, L., Wang, X.E.: Esc: Exploration with soft commonsense constraints for zero-shot object navigation. In: ICML (2023)

Appendix A Overview

This supplementary material is organized as follows:
– Section B provides the details of the three studied tasks.
– Section C provides details on the real-d...

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Motus: A Unified Latent Action World Model

Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

In: ICRA (2025)

Busch, F.L., Homberger, T., Ortega-Peimbert, J., Yang, Q., Andersson, O.: One map to find them all: Real-time open-vocabulary mapping for zero-shot multi- object navigation. In: ICRA (2025)

work page 2025

[5] [5]

In: ICRA (2024)

Cai, W., Huang, S., Cheng, G., Long, Y., Gao, P., Sun, C., Dong, H.: Bridging zero-shotobjectnavigationandfoundationmodelsthroughpixel-guidednavigation skill. In: ICRA (2024)

work page 2024

[6] [6]

In: ICCV (2025)

Cao, Y., Zhang, J., Yu, Z., Liu, S., Qin, Z., Zou, Q., Du, B., Xu, K.: Cognav: Cognitive process modeling for object goal navigation with llms. In: ICCV (2025)

work page 2025

[7] [7]

arXiv preprint arXiv:2311.06430 (2023)

Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S.Y., Shah, K., Paxton, C., Gupta, S., Batra, D., et al.: Goat: Go to any thing. arXiv preprint arXiv:2311.06430 (2023)

work page arXiv 2023

[8] [8]

NeurIPS (2020)

Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal naviga- tion using goal-oriented semantic exploration. NeurIPS (2020)

work page 2020

[9] [9]

In: ACL (2024)

Chen, J., Lin, B., Xu, R., Chai, Z., Liang, X., Wong, K.Y.: Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In: ACL (2024)

work page 2024

[10] [10]

RSS (2023)

Chen, J., Li, G., Kumar, S., Ghanem, B., Yu, F.: How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. RSS (2023)

work page 2023

[11] [11]

In: CVPR (2023)

Gadre, S.Y., Wortsman, M., Ilharco, G., Schmidt, L., Song, S.: Cows on pas- ture: Baselines and benchmarks for language-driven zero-shot object navigation. In: CVPR (2023)

work page 2023

[12] [12]

In: CVPR (2017)

He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: CVPR (2017)

work page 2017

[13] [13]

In: CVPR (2023)

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: CVPR (2023)

work page 2023

[14] [14]

In: CVPR (2023)

Krantz, J., Gervet, T., Yadav, K., Wang, A., Paxton, C., Mottaghi, R., Batra, D., Malik, J., Lee, S., Chaplot, D.S.: Navigating to objects specified by images. In: CVPR (2023)

work page 2023

[15] [15]

arXiv preprint arXiv:2211.15876 (2022)

Krantz, J., Lee, S., Malik, J., Batra, D., Chaplot, D.S.: Instance-specific image goal navigation: Training embodied agents to find object instances. arXiv preprint arXiv:2211.15876 (2022)

work page arXiv 2022

[16] [16]

OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models,

Kuang, Y., Lin, H., Jiang, M.: Openfmnav: Towards open-set zero-shot object navigation via vision-language foundation models. arXiv preprint arXiv:2402.10670 (2024)

work page arXiv 2024

[17] [17]

In: CVPR (2023)

Kwon, O., Park, J., Oh, S.: Renderable neural radiance map for visual navigation. In: CVPR (2023)

work page 2023

[18] [18]

In: CVPR (2024)

Lei, X., Wang, M., Zhou, W., Li, L., Li, H.: Instance-aware exploration-verification- exploitation for instance imagegoal navigation. In: CVPR (2024)

work page 2024

[19] [19]

In: CVPR (2026) 16 J

Li, J., Wu, J., Hu, D., Huang, X., Sun, B., Hao, Z., Lang, X., Zhu, X., Zhang, L.: Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. In: CVPR (2026) 16 J. Li et al

work page 2026

[20] [20]

Li, J., Zhang, B., Jin, X., Deng, J., Zhu, X., Zhang, L.: Imagidrive: A unified imagination-and-planning framework for autonomous driving (2025)

work page 2025

[21] [21]

Causal World Modeling for Robot Control

Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y., Xu, Y.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

In: CVPR (2023)

Lindenberger, P., Sarlin, P.E., Pollefeys, M.: Lightglue: Local feature matching at light speed. In: CVPR (2023)

work page 2023

[23] [23]

NeurIPS (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS (2023)

work page 2023

[24] [24]

arXiv preprint arXiv:2509.01364 (2025)

Liu, P., Zhang, Q., Peng, D., Zhang, L., Qin, Y., Zhou, H., Ma, J., Xu, R., Ji, Y.: Toponav: Topological graphs as a key enabler for advanced object navigation. arXiv preprint arXiv:2509.01364 (2025)

work page arXiv 2025

[25] [25]

In: CVPR (2023)

Liu, R., Wang, X., Wang, W., Yang, Y.: Bird’s-eye-view scene graph for vision- language navigation. In: CVPR (2023)

work page 2023

[26] [26]

In: ECCV (2024)

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: ECCV (2024)

work page 2024

[27] [27]

CVPR (2026)

Liu, Z., Huang, R., Yang, R., Yan, S., Wang, Z., Hou, L., Lin, D., Bai, X., Zhao, H.: Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning. CVPR (2026)

work page 2026

[28] [28]

CoRL (2024)

Long, Y., Cai, W., Wang, H., Zhan, G., Dong, H.: Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. CoRL (2024)

[29]

Long, Y., Li, X., Cai, W., Dong, H.: Discuss before moving: Visual language navigation via multi-expert discussions. In: ICRA (2024)

[30]

Majumdar, A., Aggarwal, G., Devnani, B., Hoffman, J., Batra, D.: Zson: Zero-shot object-goal navigation using multimodal goal embeddings. NeurIPS (2022)

[31]

Meta, A.: Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December (2024)

[32]

Nie, D., Guo, X., Duan, Y., Zhang, R., Chen, L.: Wmnav: Integrating vision-language models into world models for object goal navigation. IROS (2025)

[33]

Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new environments. In: Proceedings of the International Conference on Automated Planning and Scheduling (2024)

[34]

Ramakrishnan, S.K., Chaplot, D.S., Al-Halah, Z., Malik, J., Grauman, K.: Poni: Potential functions for objectgoal navigation with interaction-free learning. In: CVPR (2022)

[35]

Sethian, J.A.: Fast marching methods. SIAM review (1999)

[36]

Sun, X., Liu, L., Zhi, H., Qiu, R., Liang, J.: Prioritized semantic learning for zero-shot instance navigation. In: ECCV (2024)

[37]

Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D., Maksymets, O., Gokaslan, A., Vondrus, V., Dharur, S., Meier, F., Galuba, W., Chang, A., Kira, Z., Koltun, V., Malik, J., Savva, M., Batra, D.: Habitat 2.0: Training home assistants to rearrange their habitat. In: NeurIPS (2021)

[38]

Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

[39]

Wu, P., Mu, Y., Wu, B., Hou, Y., Ma, J., Zhang, S., Liu, C.: Voronav: Voronoi-based zero-shot object navigation with large language model. In: ICML (2024)

[40]

Yadav, K., Majumdar, A., Ramrakhya, R., Yokoyama, N., Baevski, A., Kira, Z., Maksymets, O., Batra, D.: Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav. arXiv preprint arXiv:2303.07798 (2023)

[41]

Yin, H., Xu, X., Wu, Z., Zhou, J., Lu, J.: Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. NeurIPS (2024)

[42]

Yin, H., Xu, X., Zhao, L., Wang, Z., Zhou, J., Lu, J.: Unigoal: Towards universal zero-shot goal-oriented navigation. In: CVPR (2025)

[43]

Yokoyama, N., Ha, S., Batra, D., Wang, J., Bucher, B.: Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In: ICRA (2024)

[44]

Yu, B., Kasaei, H., Cao, M.: L3mvn: Leveraging large language models for visual target navigation. In: IROS (2023)

[45]

Yuan, T., Dong, Z., Liu, Y., Zhao, H.: Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666 (2026)

[46]

Zhang, L., Wang, H., Xiao, E., Zhang, X., Zhang, Q., Jiang, Z., Xu, R.: Multi-floor zero-shot object navigation policy. In: ICRA (2025)

[47]

Zhang, L., Zhang, Q., Wang, H., Xiao, E., Jiang, Z., Chen, H., Xu, R.: Trihelper: Zero-shot object navigation with dynamic assistance. In: IROS (2024)

[48]

Zhang, M., Du, Y., Wu, C., Zhou, J., Qi, Z., Ma, J., Zhou, B.: Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. IEEE RA-L (2025)

[49]

Zhong, L., Gao, C., Ding, Z., Liao, Y., Ma, H., Zhang, S., Zhou, X., Liu, S.: Topv-nav: Unlocking the top-view spatial reasoning potential of mllm for zero-shot object navigation. arXiv preprint arXiv:2411.16425 (2024)

[50]

Zhou, K., Zheng, K., Pryor, C., Shen, Y., Jin, H., Getoor, L., Wang, X.E.: Esc: Exploration with soft commonsense constraints for zero-shot object navigation. In: ICML (2023)

Appendix A Overview

This supplementary material is organized as follows:

– Section B provides the details of the three studied tasks.
– Section C provides details on the real-d...
