Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments
3 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
  WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while making small VLMs competitive with large ones.
- ToLL: Topological Layout Learning with Asymmetric Cross-View Structural Distillation for 3D Scene Graph Generation Pretraining
  ToLL pretrains 3D scene graph generators via anchor-conditioned topological layout recovery and asymmetric structural distillation to learn predicate constraints rather than geometric interpolation shortcuts.
- Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
  A monocular RGB-only aerial VLN framework outperforms baselines via prompt-guided multi-task learning, keyframe selection, and label reweighting on AerialVLN and OpenFly benchmarks.