AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
baseline 2polarities
baseline 2representative citing papers
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
citing papers explorer
-
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
-
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
-
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.