Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

Giovanni Beltrame; Jana Pavlasek; Karthik Soma; Maeva Guerrier

arxiv: 2603.25937 · v2 · pith:XBPWCNGGnew · submitted 2026-03-26 · 💻 cs.RO · cs.LG

Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

classification: cs.RO, cs.LG

keywords: models, evaluation, real-world, robot, vnms, environments, five, goal

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar even though some semantic differences are present, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

World Models as Group Actions
cs.CV 2026-05 unverdicted novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

VISTA: Scale-Aware Visual Navigation via Action History Conditioning
cs.RO 2026-06 unverdicted novelty 4.0

VISTA conditions visual navigation policies on action history and DINOv3 features to achieve scale-aware zero-shot deployment, reporting 100% goal accuracy and 95% checkpoint success in real-world outdoor, forest, and...

SAFER-Nav: Enhancing Safety for Visual Robot Navigation via Segmentation-Aware Fine-Tuning
cs.RO 2026-06 unverdicted novelty 4.0

SAFER-Nav fine-tunes visual navigation models with segmentation to reduce collisions versus ViNT and NoMaD baselines while preserving goal-reaching across robot platforms and obstacle scenarios.