arxiv: 2502.13451 · v5 · submitted 2025-02-19 · 💻 cs.RO

MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Pith reviewed 2026-05-23 02:56 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-and-language navigation · semantic maps · memory representation · vision-language models · embodied AI · annotated maps · navigation agents · top-down maps

The pith

MapNav replaces historical observation frames with an Annotated Semantic Map to guide vision-and-language navigation agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that traditional VLN methods incur high storage and compute costs by retaining sequences of past visual frames as context. MapNav instead builds a single top-down semantic map at episode start, updates it each step with new observations, and adds explicit text labels to important regions to create an Annotated Semantic Map. This ASM becomes the sole memory input to a VLM-based agent. Experiments show the approach reaches state-of-the-art success rates in both simulation and real-world settings while eliminating the need to store frame histories. The authors position the ASM as a reusable new memory representation for the VLN task.

Core claim

MapNav constructs a top-down semantic map at the beginning of each episode and updates it at every timestep; key regions receive explicit textual labels that convert abstract semantics into navigation cues, producing the Annotated Semantic Map; the resulting ASM is supplied directly to a VLM-powered agent as its only memory representation, replacing all historical observation frames.
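The per-timestep update can be pictured as projecting labelled 3D points onto a 2D grid, which is how the paper's Figure 3 caption describes it. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `make_map` and `update_map`, the `UNKNOWN` cell code, the grid size, and the default 5 cm cell size are hypothetical, and the input points are assumed to be already segmented and expressed in the agent's own frame.

```python
import math
from typing import List, Sequence, Tuple

UNKNOWN = -1  # hypothetical code for a cell that has not been observed yet


def make_map(size: int) -> List[List[int]]:
    """Create an empty square top-down grid (called once per episode)."""
    return [[UNKNOWN] * size for _ in range(size)]


def update_map(
    grid: List[List[int]],
    points: Sequence[Tuple[float, float, int]],
    pose: Tuple[float, float, float],
    cell_size: float = 0.05,
) -> List[List[int]]:
    """Write semantic class ids into the grid for one timestep.

    points: (forward_m, left_m, class_id) triples in the agent frame.
    pose:   (x_m, y_m, heading_rad) of the agent in the map frame.
    """
    px, py, heading = pose
    half = len(grid) // 2  # the map origin sits at the grid centre
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    for forward, left, class_id in points:
        # Rotate by the heading, then translate by the agent position.
        wx = px + cos_h * forward - sin_h * left
        wy = py + sin_h * forward + cos_h * left
        row = half + round(wy / cell_size)
        col = half + round(wx / cell_size)
        # Points that fall outside the grid are silently dropped.
        if 0 <= row < len(grid) and 0 <= col < len(grid):
            grid[row][col] = class_id
    return grid
```

Because the grid is overwritten in place, memory stays fixed in size however many timesteps the episode runs. A real pipeline would also need the depth back-projection and segmentation steps that this sketch takes as given.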

What carries the argument

Annotated Semantic Map (ASM): a top-down semantic map that is initialized once per episode, updated each timestep, and augmented with textual labels on key regions to supply structured navigation cues to the agent.
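As a rough illustration of the annotation step, the sketch below attaches a text label to each connected region of a semantic grid at the region's centroid. It is an assumption-laden sketch, not the paper's procedure: the function name `annotate_regions`, the `min_cells` threshold, the 4-connectivity rule, and the class-id-to-name dictionary are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Dict, List, Tuple


def annotate_regions(
    grid: List[List[int]],
    class_names: Dict[int, str],
    min_cells: int = 2,
) -> List[Tuple[str, float, float]]:
    """Return (label, centroid_row, centroid_col) for each 4-connected region.

    Only cells whose class id appears in class_names are considered, and
    regions smaller than min_cells are skipped as likely noise.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    labels: List[Tuple[str, float, float]] = []
    for r in range(rows):
        for c in range(cols):
            class_id = grid[r][c]
            if seen[r][c] or class_id not in class_names:
                continue
            # Breadth-first flood fill over cells of the same class.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] == class_id):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(cells) >= min_cells:
                mean_r = sum(p[0] for p in cells) / len(cells)
                mean_c = sum(p[1] for p in cells) / len(cells)
                labels.append((class_names[class_id], mean_r, mean_c))
    return labels
```

The returned tuples would then be drawn as text onto the rendered map image. Placing one label per region, rather than per cell, keeps the map legible for the VLM.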

If this is right

  • Storage and compute overhead from maintaining observation histories is eliminated.
  • The same ASM construction process yields state-of-the-art navigation success in both simulated and physical environments.
  • VLMs can be applied directly to the compact annotated map rather than to raw image sequences.
  • The released ASM generation code and dataset enable other researchers to adopt the representation without re-implementing map construction.
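The storage argument in the first bullet can be made concrete with a back-of-the-envelope comparison. The sketch below only counts raw uncompressed bytes. The frame resolution, map size, and function names are hypothetical illustrations, not figures from the paper, and the sketch says nothing about the VLM token cost, which is a separate quantity.

```python
def frame_history_bytes(steps: int, height: int, width: int, channels: int = 3) -> int:
    """Raw bytes needed to keep every RGB frame seen so far (grows with steps)."""
    return steps * height * width * channels


def map_memory_bytes(map_size: int, bytes_per_cell: int = 1) -> int:
    """Raw bytes for one square top-down map (independent of episode length)."""
    return map_size * map_size * bytes_per_cell


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the scaling behaviour.
    for steps in (10, 100, 500):
        history = frame_history_bytes(steps, height=224, width=224)
        single_map = map_memory_bytes(map_size=480)
        print(f"steps={steps:4d}  history={history:>12,d} B  map={single_map:>8,d} B")
```

The point of the comparison is the scaling: the frame history grows linearly with the number of steps, while the map footprint is constant for a fixed grid size.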

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the ASM proves robust across longer trajectories, similar map-based memory could reduce context length requirements in other embodied tasks such as object manipulation or multi-agent coordination.
  • Explicit text labels on the map may allow human operators to inspect or correct the agent's internal state more easily than inspecting raw image histories.
  • The approach implicitly assumes reliable semantic segmentation and mapping; any degradation in those upstream modules would directly limit the ASM's usefulness.

Load-bearing premise

The combination of a top-down semantic map and its textual annotations contains enough structured information to substitute for stored historical frames without losing decision-critical detail.

What would settle it

A controlled ablation in which the textual annotations are removed from the ASM while keeping the geometric map intact, followed by measurement of whether success rate or path efficiency falls in the same environments where the full ASM previously achieved SOTA.
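To make the proposed comparison concrete, the sketch below computes success rate and a path-efficiency score for two sets of episodes. It assumes "path efficiency" means Success weighted by Path Length (SPL), a common VLN metric, which this page does not confirm the paper uses. The `Episode` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool            # did the agent stop at the goal?
    shortest_path_m: float   # geodesic start-to-goal distance
    path_taken_m: float      # length of the path the agent actually walked


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length: mean of S * l / max(p, l)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path_m / max(e.path_taken_m, e.shortest_path_m)
    return total / len(episodes)


def ablation_delta(full: Sequence[Episode], ablated: Sequence[Episode]) -> dict:
    """Metric drop when moving from the full ASM to the label-free map."""
    return {
        "success_rate_drop": success_rate(full) - success_rate(ablated),
        "spl_drop": spl(full) - spl(ablated),
    }
```

For the ablation to be informative, both episode sets would need to come from the same environments and instructions, with only the text labels toggled. Positive drops would indicate that the annotations carry decision-relevant information beyond the geometric map.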

Figures

Figures reproduced from arXiv: 2502.13451 by Jing Zhang, Lingfeng Zhang, Pengwei Wang, Qiang Zhang, Qinwen Xu, Renjing Xu, Shanghang Zhang, Xiaoshuai Hao, Xinyao Zhang, Zhongyuan Wang.

Figure 1: Illustration of our Annotated Semantic Map (ASM). At each timestep, MapNav agent leverages egocentric observations to capture semantic objects and assign explicit textual labels to key regions, creating the ASM for the current moment. ASM provides information such as physical obstacles, explored regions, the agent’s current position, trajectory and semantic objects.
Figure 2: An overview of MapNav framework. We present a top-down Annotated Semantic Map (ASM), updated at each timestep for precise object mapping and structured navigation. It features explicit textual labels for key regions, providing clear navigation cues. The current RGB observation, ASM, and instruction are used as inputs to an end-to-end VLM framework, which generates navigation actions in natural language.
Figure 3: ASM Generation Process. Semantic map generation starts with episode initialization. At each timestep, the RGB image is processed by a semantic segmentation module to create a semantic mask aligned with the depth-converted 3D point cloud. By combining this with the previous pose transformation, we project the 3D point cloud onto a 2D plane to update the semantic map. Finally, we convert the semantic map int…
Figure 4: Comparison of different VLM’s understanding…
Figure 5: Comparison of MapNav using different num…
Figure 6: The real-world MapNav robot setup. … to 32,768 tokens and incorporates sliding window attention with a window size of 131,072 tokens. Training Setting. We conducted our training on 8 NVIDIA A100 GPUs for approximately 30 hours, totaling 240 GPU hours (≈500k step-wise samples). During the fine-tuning process, we froze the vision encoder and only fine-tuned the multimodal projector and language model compone…
Figure 7: Visualization results of MapNav in the simulator. Panel labels recovered from the image: rows Third Perspective, Egocentric View, ASMs; columns Timestep = 0, Timestep = 17, Timestep = 23. Simple Instruction: “Walk forward, turn right and go straight, stop at the door.” Semantic Instruction: “Walk forward, turn right at the refrigerator, go straight, stop at the wall.”
Figure 8: Visualization results of MapNav in the real-world. … successfully identifies the shortest path while following complex instructions involving multiple waypoints. In contrast, without ASM, the agent struggles to find the correct path, underscoring ASM’s importance in semantic understanding and path planning. In real-world tests, the agent effectively executes simple navigation instructions and excels at com…
Figure 9: Visualization of VLM Attention Across Different Map Representations. A comparison of different map representations showing that while Semantic Map exhibits sparse attention patterns without convergence on semantic objects, our ASM successfully leverages textual labels to guide attention towards semantic objects, as evidenced by concentrated attention distributions and the VLM’s responses.
Figure 10: Additional Visualizations of VLM Attention Across Different Map Representations
Figure 11: (1/6) Simulator demo results visualization.
Figure 12: (2/6) Simulator demo results visualization.
Figure 13: (3/6) Simulator demo results visualization.
Figure 14: (4/6) Simulator demo results visualization.
Figure 15: (5/6) Simulator demo results visualization.
Figure 16: (6/6) Simulator demo results visualization.
Figure 17: (1/2) Real-world demo results visualization.
Figure 18: (2/2) Real-world demo results visualization.
Original abstract

Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and update it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generate our ASM. MapNav agent using the constructed ASM as input, and use the powerful end-to-end capabilities of VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MapNav, an end-to-end vision-and-language navigation (VLN) model that constructs a top-down semantic map at episode start, updates it each timestep, and adds explicit textual labels to key regions to form an Annotated Semantic Map (ASM). The ASM is fed to a VLM agent in place of historical observation frames. The authors claim this yields SOTA performance in both simulated and real-world settings while reducing storage and compute overhead.

Significance. If the central performance claims hold after proper controls, the work would supply a concrete alternative memory representation for VLN that trades egocentric history for an explicitly annotated top-down semantic map, potentially lowering the cost of maintaining long-horizon context and offering a reusable resource via the promised code and dataset release.

major comments (2)
  1. [Abstract / method description] Abstract and method overview: the central claim that the ASM fully substitutes for historical egocentric frames without loss of decision-critical detail (viewpoint-dependent appearance, texture, partial occlusions, metric depth referenced by instructions) is not isolated by any described ablation that holds the VLM backbone, training regime, and map-construction oracle fixed while toggling the presence of past RGB observations.
  2. [Abstract] Abstract: the assertion of SOTA results in simulated and real-world environments is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis, preventing evaluation of whether gains arise from the ASM substitution itself.
minor comments (1)
  1. [Abstract] Abstract contains several grammatical issues (e.g., 'update it at each timestep' should be 'updates'; the clause 'transforming abstract semantics into clear navigation cues and generate our ASM' is incomplete).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on MapNav. The comments highlight opportunities to strengthen the presentation of our core claims regarding the Annotated Semantic Map (ASM) as a memory representation. We address each point below and will revise the manuscript to improve clarity and provide additional supporting evidence.

Point-by-point responses
  1. Referee: [Abstract / method description] Abstract and method overview: the central claim that the ASM fully substitutes for historical egocentric frames without loss of decision-critical detail (viewpoint-dependent appearance, texture, partial occlusions, metric depth referenced by instructions) is not isolated by any described ablation that holds the VLM backbone, training regime, and map-construction oracle fixed while toggling the presence of past RGB observations.

    Authors: We agree that an explicit ablation isolating the ASM substitution—while holding the VLM backbone, training regime, and map-construction process fixed—is necessary to rigorously support the claim. The current manuscript focuses on end-to-end performance comparisons but does not include this controlled toggle of historical RGB frames. In the revised version, we will add such an ablation study, reporting navigation success rates and other metrics with and without past RGB observations under otherwise identical conditions. This will directly address whether decision-critical details are preserved by the ASM alone. revision: yes

  2. Referee: [Abstract] Abstract: the assertion of SOTA results in simulated and real-world environments is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis, preventing evaluation of whether gains arise from the ASM substitution itself.

    Authors: The abstract was written concisely and therefore omits specific numbers. The full manuscript contains quantitative results, baseline comparisons, and error analyses in the experiments section. To improve accessibility, we will revise the abstract to include key metrics (e.g., success rate improvements over baselines in simulation and real-world settings) while maintaining brevity. We will also ensure the abstract references the relevant tables for full details. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical construction with no derivations or self-referential reductions.

Full rationale

The paper introduces MapNav as an empirical method for VLN that constructs and annotates a top-down semantic map to replace historical frames, then evaluates it experimentally. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central claim rests on experimental SOTA results rather than any step that reduces by construction to its inputs. No self-citations are invoked as load-bearing uniqueness theorems. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces ASM as a new construct; no explicit free parameters, mathematical axioms, or invented physical entities are described in the abstract. Standard VLN assumptions (e.g., availability of depth or semantic segmentation) are implicit but not enumerated.

invented entities (1)
  • Annotated Semantic Map (ASM) · no independent evidence
    purpose: Compact memory representation that replaces historical observation frames for VLN decision making
    Introduced in the abstract as the core novel component; no independent falsifiable prediction outside the paper is stated.

pith-pipeline@v0.9.0 · 5788 in / 1106 out tokens · 29373 ms · 2026-05-23T02:56:49.176727+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.

  2. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

    cs.RO 2026-03 conditional novelty 7.0

    VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

  3. GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

    cs.CV 2026-05 unverdicted novelty 6.0

    GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.

  4. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
