arXiv preprint arXiv:2502.13451 , year=

A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation , author= · 2025 · cs.RO · arXiv 2502.13451

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and update it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generate our ASM. MapNav agent using the constructed ASM as input, and use the powerful end-to-end capabilities of VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.

representative citing papers

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

cs.RO · 2026-03-07 · conditional · novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.

citing papers explorer

Showing 3 of 3 citing papers.

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation cs.CV · 2026-04-19 · unverdicted · none · ref 117 · internal anchor
Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness cs.RO · 2026-03-07 · conditional · none · ref 35 · internal anchor
VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation cs.CV · 2026-05-21 · unverdicted · none · ref 44 · internal anchor
GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.

arXiv preprint arXiv:2502.13451 , year=

fields

years

verdicts

representative citing papers

citing papers explorer