Airnav: A large-scale real-world uav vision-and- language navigation dataset with natural and diverse instructions

· 2026 · cs.CL · arXiv 2601.03707

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose \textbf{AirNav}, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human--LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose \textbf{AirVLN-R1}, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82\% success rate on the test-unseen split. Real-world experiments on a physical UAV platform provide preliminary evidence of sim-to-real transferability, and our dataset and code are publicly available.

citation-role summary

dataset 1

citation-polarity summary

background 1

representative citing papers

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

cs.RO · 2026-04-09 · unverdicted · novelty 4.0

This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

citing papers explorer

Showing 2 of 2 citing papers.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration cs.CV · 2026-06-26 · unverdicted · none · ref 10 · internal anchor
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models cs.RO · 2026-04-09 · unverdicted · none · ref 112 · internal anchor
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

Airnav: A large-scale real-world uav vision-and- language navigation dataset with natural and diverse instructions

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer