PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.
NavGPT: Explicit reasoning in vision-and-language navigation with large language models, 2023
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
Floorplan2Guide uses LLMs to parse floor plans into navigable graphs, reporting up to 92% accuracy on short routes with 5-shot prompting and 15% gains from graph structure over direct visual reasoning for BLV indoor navigation.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
citing papers explorer
-
PhotoFlow: Agentic 3D Virtual Photography Missions
PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
-
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
-
Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation
Floorplan2Guide uses LLMs to parse floor plans into navigable graphs, reporting up to 92% accuracy on short routes with 5-shot prompting and 15% gains from graph structure over direct visual reasoning for BLV indoor navigation.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.