ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

2025 · cs.RO · arXiv 2512.17435

2 Pith papers cite this work.

abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, both critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves state-of-the-art performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
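The control flow the abstract describes (propose candidate views, pick the best one, then move to it as a point goal) can be illustrated with a minimal sketch. It is not the authors' implementation. Every name here (`CandidateView`, `propose_views`, `select_best_view`, `run_episode`) is hypothetical, the imagination module is replaced by evenly spaced headings, the VLM is replaced by an arbitrary scoring function, and the point-goal controller is replaced by a direct position update.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CandidateView:
    """Hypothetical candidate viewpoint: a relative point goal plus a
    placeholder standing in for the imagined future observation."""
    dx: float
    dy: float
    imagined_image: str  # stand-in for an imagined RGB image


def propose_views(num_views: int = 4, radius: float = 1.5) -> List[CandidateView]:
    """Stand-in for the future-view imagination module (assumption:
    evenly spaced headings at a fixed distance from the robot)."""
    views = []
    for k in range(num_views):
        theta = 2 * math.pi * k / num_views
        views.append(
            CandidateView(radius * math.cos(theta), radius * math.sin(theta), f"view_{k}")
        )
    return views


def select_best_view(
    views: List[CandidateView], score: Callable[[CandidateView], float]
) -> int:
    """Stand-in for the VLM's best-view selection: return the index of
    the highest-scoring candidate."""
    return max(range(len(views)), key=lambda i: score(views[i]))


def run_episode(
    score: Callable[[CandidateView], float], max_steps: int = 3
) -> List[Tuple[float, float]]:
    """Repeat propose -> select -> move, treating each selected view as
    a point goal that is reached exactly (no real controller or memory)."""
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(max_steps):
        views = propose_views()
        best = views[select_best_view(views, score)]
        x, y = x + best.dx, y + best.dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Toy scorer that always prefers moving in the +x direction.
    print(run_episode(lambda v: v.dx))
```

In this toy setup the scorer is a plain function of the candidate's offset. In the framework the abstract describes, that role is played by a VLM prompted with the imagined images and the memory of past keyframes, neither of which is modelled here.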

representative citing papers

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

Hierarchical 3D Scene Graph Construction and Belief-based Planning for Semantic Navigation

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

Proposes online hierarchical 3D scene graph construction paired with belief-based planning to improve zero-shot semantic navigation performance in unseen environments.

