Efficient-VLN: A Simple yet Strong Baseline for Efficient Vision-Language Navigation

Duo Zheng; Liwei Wang; Shijia Huang; Yanyang Li

arxiv: 2512.10310 · v2 · pith:Q5KELL7Nnew · submitted 2025-12-11 · 💻 cs.CV

classification 💻 cs.CV

keywords training · collection · data · efficient-vln · inference · navigation · across · action

While Multimodal Large Language Models (MLLMs) have demonstrated significant promise in Vision-Language Navigation (VLN), existing agents remain heavily constrained by systemic bottlenecks across inference, training, and data collection. Specifically, they suffer from prohibitive latency due to visual history reprocessing, action leakage during sequence-packed training, and suboptimal exploration in self-correction data collection. To overcome these intertwined challenges, we present Efficient-VLN, a highly efficient and robust baseline that systematically resolves these issues through three simple-yet-effective mechanisms. (1) Inference: We introduce KV-cache reuse with contiguous RoPE, enabling the model to process only the newly observed frame at each step for real-time inference. (2) Training: We propose packed training with an action-isolating mask to accelerate throughput while effectively bridging the training-inference gap by preventing action leakage. (3) Data Collection: We employ an Adaptive DAgger to dynamically balance autonomous exploration and oracle guidance, enhancing error-recovery capability without escalating computational costs. Extensive evaluations show that Efficient-VLN significantly advances the state-of-the-art across the R2R-CE (73.2% SR) and RxR-CE (75.6% SR) benchmarks. Meanwhile, it yields a 28% latency reduction compared to the previous state-of-the-art StreamVLN, establishing a new paradigm for streaming MLLM-based navigation.
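The action-isolating mask in (2) can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the masking rule, so the rule used here (causal attention, with each step's action tokens hidden from every other step) is only one plausible reading of "preventing action leakage". The names `action_isolating_mask`, `OBS`, and `ACT` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical token kinds for one packed training sequence.
OBS, ACT = "obs", "act"


def action_isolating_mask(tokens: List[Tuple[int, str]]) -> List[List[bool]]:
    """Build mask[q][k], True when query token q may attend to key token k.

    tokens: (step_index, kind) pairs in sequence order.
    Assumed rule (not taken from the paper): attention is causal, and an
    action token is visible only to tokens that belong to the same step.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q, (q_step, _) in enumerate(tokens):
        for k in range(q + 1):  # causal: only keys at or before the query
            k_step, k_kind = tokens[k]
            if k_kind == ACT and k_step != q_step:
                continue  # hide action tokens that belong to another step
            mask[q][k] = True
    return mask


# Two navigation steps packed into one sequence: observation, then action.
tokens = [(0, OBS), (0, ACT), (1, OBS), (1, ACT)]
mask = action_isolating_mask(tokens)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this toy layout, the step-1 tokens attend to both observations but not to the step-0 action, so a packed sequence would expose each step to the same kind of context it might see when decoding one step at a time.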

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
cs.RO 2026-03 conditional novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation
cs.RO 2026-05 unverdicted novelty 6.0

IDEA is a TTA framework for VLN that builds a dynamic asset library from Fisher-weighted soft prompts and domain coordinates, then uses convex-hull projection for cross-domain bridging and training-free adaptation.

PanoWorld: Towards Spatial Supersensing in 360° Panorama World
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

PanoWorld: Towards Spatial Supersensing in 360° Panorama World
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO 2026-04 unverdicted novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

What Limits Vision-and-Language Navigation?
cs.RO 2026-05 unverdicted novelty 5.0

StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.