arxiv: 2510.03142 · v2 · pith:3LTPZCOSnew · submitted 2025-10-03 · 💻 cs.RO · cs.CV

MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

classification 💻 cs.RO cs.CV
keywords model, navigation, data, visual, capabilities, models, observations, depth

Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, the optical information in visual observations is difficult to model explicitly, unlike LiDAR point clouds or depth maps, and therefore requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts, each trained with privileged depth information in one of three challenging, tailor-made environments that target different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from the RL experts, where the training ratio is dynamically balanced based on performance on the individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
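
The abstract does not state the rule used to balance the training ratio, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes each capability (reaching, squeezing, avoiding) has a measured success rate in [0, 1] and gives more training data to the capabilities that currently perform worse. The function name `balance_ratios` and the `floor` parameter are hypothetical.

```python
def balance_ratios(success_rates, floor=0.1):
    """Return per-capability sampling ratios that sum to 1.

    Assumption (not from the paper): the ratio is proportional to the
    failure rate (1 - success) plus a small floor, so that no capability
    is ever dropped from training entirely.
    """
    if not success_rates:
        raise ValueError("success_rates must not be empty")
    weights = {}
    for name, rate in success_rates.items():
        clamped = min(max(rate, 0.0), 1.0)  # keep the rate inside [0, 1]
        weights[name] = (1.0 - clamped) + floor
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical success rates of the student on the three capabilities.
rates = {"reaching": 0.9, "squeezing": 0.5, "avoiding": 0.7}
ratios = balance_ratios(rates)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

In this sketch the weakest capability (squeezing) receives the largest share of the next round of expert data, and the ratios would be recomputed after each training iteration.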

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

    cs.RO 2026-06 unverdicted novelty 4.0

    AgenticDiffusion proposes a multi-view UAV navigation framework using language-guided reasoning, open-vocabulary grounding, vision-based diffusion planning, and NMPC, reporting 80% mission success across 40 real-world trials.

  2. GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation

    cs.RO 2026-06 unverdicted novelty 4.0

    GN0 curates the GN-Matrix dataset, builds a 3DGS simulator and GN-Bench, and trains the BAE model via supervised learning plus DAgger and RL to unify VLN tasks, outperforming prior methods on GN-Bench and VLN-CE.