Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
Pith reviewed 2026-05-10 13:44 UTC · model grok-4.3
The pith
UAV vision-and-language navigation has progressed from modular and deep learning methods to agentic systems using large foundation models, with a proposed roadmap for addressing deployment barriers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a methodological taxonomy that charts the field's technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including VLMs, VLA models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. It reviews the ecosystem of simulators, datasets, and evaluation metrics, and critically analyzes the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning under linguistic ambiguity, and efficient deployment of large models on resource-constrained hardware. It concludes by proposing a forward-looking research roadmap toward frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.
What carries the argument
The methodological taxonomy that organizes the evolution of UAV-VLN approaches from modular and deep learning to agentic foundation model systems.
Load-bearing premise
That the taxonomy comprehensively captures all important developments in the field and that the four challenges are the main barriers to real-world UAV deployment.
What would settle it
A review of the latest literature identifying multiple UAV-VLN approaches that fall outside the proposed taxonomy categories or uncovering additional primary challenges not listed in the survey.
Original abstract
Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.
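The abstract invokes standard VLN benchmarks and evaluation metrics without defining them. As a point of reference only (not taken from the survey itself), below is a minimal sketch of Success weighted by Path Length (SPL), one metric commonly reported alongside success rate and navigation error in the VLN literature.

```python
# Minimal sketch (illustrative, not from the reviewed survey): Success weighted
# by Path Length (SPL), a standard embodied-navigation metric.
def spl(successes, shortest_path_lengths, actual_path_lengths):
    """Average of s_i * (l_i / max(p_i, l_i)) over episodes.

    successes: 0/1 flags, 1 if the agent stopped within the goal radius.
    shortest_path_lengths: geodesic distance l_i from start to goal.
    actual_path_lengths: length p_i of the path the agent actually flew.
    """
    assert len(successes) == len(shortest_path_lengths) == len(actual_path_lengths)
    total = 0.0
    for s, l, p in zip(successes, shortest_path_lengths, actual_path_lengths):
        total += s * (l / max(p, l))  # detours are penalized even on successes
    return total / len(successes)

# Example: one success with a 20% detour, one failure.
print(spl([1, 0], [50.0, 80.0], [60.0, 40.0]))  # ~0.417
```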
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys Vision-and-Language Navigation for UAVs (UAV-VLN), defining the task and establishing a methodological taxonomy that traces evolution from early modular and deep-learning approaches to contemporary agentic systems based on VLMs, VLA models, and generative world models. It reviews the ecosystem of simulators, datasets, and evaluation metrics, critically analyzes four primary challenges (simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and efficient deployment of large models), and concludes with a forward-looking research roadmap covering multi-agent swarm coordination and air-ground collaboration.
Significance. If the taxonomy and challenge prioritization hold, the survey would organize a rapidly evolving subfield of embodied AI, synthesize benchmarks, and provide a useful roadmap for UAV navigation research. The structured progression from modular to foundation-model approaches and the explicit focus on real-world deployment barriers could help standardize evaluation and direct effort toward high-impact areas.
major comments (1)
- [Taxonomy and Challenges sections] The central claim that the survey establishes a comprehensive methodological taxonomy and identifies the four primary challenges rests on an undocumented literature selection process. No section (including the taxonomy presentation or challenges analysis) specifies search databases, keywords, date ranges, inclusion/exclusion criteria, total papers screened, or quantitative breakdown of coverage per category. This absence makes it impossible to assess whether the reviewed works are representative or whether omitted issues (e.g., safety certification or regulatory constraints) are equally load-bearing.
minor comments (2)
- [Abstract] The abstract and introduction could include a brief quantitative overview (e.g., number of papers reviewed or distribution across taxonomy categories) to give readers immediate context on scope.
- [Resources and Evaluation Metrics sections] A summary table mapping taxonomy categories to representative papers, simulators, and metrics would improve readability and allow quick cross-referencing.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We address the major comment below and will update the manuscript accordingly.
Point-by-point responses
Referee: [Taxonomy and Challenges sections] The central claim that the survey establishes a comprehensive methodological taxonomy and identifies the four primary challenges rests on an undocumented literature selection process. No section (including the taxonomy presentation or challenges analysis) specifies search databases, keywords, date ranges, inclusion/exclusion criteria, total papers screened, or quantitative breakdown of coverage per category. This absence makes it impossible to assess whether the reviewed works are representative or whether omitted issues (e.g., safety certification or regulatory constraints) are equally load-bearing.
Authors: We agree that documenting the literature selection process would enhance the transparency and reproducibility of the survey. Our review was comprehensive, drawing from key publications in the field, but it was not conducted as a formal systematic review with predefined protocols. To address this, we will revise the manuscript to include a dedicated 'Literature Review Methodology' subsection. This will specify the primary sources (arXiv, major robotics and AI conferences such as ICRA, IROS, CVPR, NeurIPS), search keywords (e.g., 'UAV VLN', 'drone vision language navigation', 'aerial embodied AI'), date range (papers published from 2015 onwards), and a quantitative breakdown of coverage per category. Regarding potential omitted issues such as safety certification and regulatory constraints, we acknowledge their importance for real-world UAV deployment. We will incorporate a brief discussion of these in the challenges section and the research roadmap, noting them as complementary barriers alongside the four primary technical challenges we identified.
Revision: yes
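The promised 'Literature Review Methodology' subsection is straightforward to make reproducible. Below is a minimal sketch of a scripted arXiv keyword count using the public arXiv Atom API; the keywords and the 2015 cutoff mirror the rebuttal, but the script and its particulars are illustrative assumptions, not the authors' actual protocol.

```python
# Hedged sketch of a reproducible literature search of the kind the rebuttal
# promises to document. Keywords, cutoff year, and reliance on the public
# arXiv Atom API are illustrative assumptions.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
KEYWORDS = ['"UAV vision-and-language navigation"',
            '"aerial vision-language navigation"']
SINCE_YEAR = 2015  # rebuttal states papers from 2015 onwards

def arxiv_count(query: str, max_results: int = 200) -> int:
    """Count arXiv entries matching `query` published since SINCE_YEAR."""
    url = ("http://export.arxiv.org/api/query?"
           + urllib.parse.urlencode({"search_query": f"all:{query}",
                                     "start": 0, "max_results": max_results}))
    with urllib.request.urlopen(url, timeout=30) as resp:
        feed = ET.fromstring(resp.read())
    count = 0
    for entry in feed.findall(f"{ATOM}entry"):
        published = entry.findtext(f"{ATOM}published", default="")
        if published[:4].isdigit() and int(published[:4]) >= SINCE_YEAR:
            count += 1
    return count

if __name__ == "__main__":
    for kw in KEYWORDS:
        print(kw, arxiv_count(kw))
```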
Circularity Check
No circularity: survey synthesizes external literature without self-referential derivations
Full rationale
This paper is a literature survey that defines a taxonomy of UAV-VLN approaches and lists challenges by reviewing prior external work. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The central claims rest on synthesis of cited literature rather than any reduction to the paper's own inputs, self-citations as load-bearing premises, or renamed ansatzes. Absence of documented search methods affects representativeness but does not create circularity under the specified patterns.
Axiom & Free-Parameter Ledger
Empty: as a literature survey, the paper introduces no axioms, fitted parameters, or derivations of its own.
Reference graph
Works this paper leans on
- [1] T. Feng et al. Embodied AI: From LLMs to world models. IEEE Circuits and Systems Magazine, 2025.
- [2] W. Liang et al. Large model empowered embodied AI: A survey on decision-making and embodied learning. arXiv preprint arXiv:2508.10399, 2025.
- [3] Y. Liu et al. Aligning cyber space with physical world: A comprehensive survey on embodied AI. arXiv preprint arXiv:2407.06886, 2024.
- [4] K. Choutri et al. Leveraging large language models for real-time UAV control. Electronics, 14(21):4312, 2025.
- [5] S. A. Salunkhe et al. Intuitive human-drone collaborative navigation in unknown environments through mixed reality. In 2025 International Conference on Unmanned Aircraft Systems (ICUAS), 2025.
- [6]
- [7]
- [8] H. Cai et al. FlightGPT: Towards generalizable and interpretable UAV vision-and-language navigation with vision-language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 6659–6676, Suzhou, China, November 2025. Association for Computational Linguistics.
- [9] L. Feng. Invited speaker 1: Navigation without GPS for unmanned aerial vehicles. In 2019 International Conference on Computer and Drone Applications (IConDA), p. 1, 2019.
- [10] L. Seidel et al. Advancing early wildfire detection: Integration of vision language models with unmanned aerial vehicle remote sensing for enhanced situational awareness. Drones, 9(5):347, 2025.
- [11] E. Salahat et al. Waypoint planning for autonomous aerial inspection of large-scale solar farms. In IECON 2019 - 45th Annual Conference of the IEEE Industrial Electronics Society, pp. 763–769, 2019.
- [12] Z. Zhou et al. A lightweight drone vision system for autonomous inspection with real-time processing. Drones, 10(2):126, 2026.
- [13] X. Zhang et al. LogisticsVLN: Vision-language navigation for low-altitude terminal delivery based on agentic UAVs. In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), 2025.
- [14] D. H. Lee et al. A review on recent deep learning-based semantic segmentation for urban greenness measurement. Sensors, 24(7):2245, 2024.
- [15] R. Sapkota et al. UAVs meet agentic AI: A multidomain survey of autonomous aerial intelligence and agentic UAVs. arXiv preprint arXiv:2506.08045, 2025.
- [16] L. Morando and G. Loianno. Spatial assisted human-drone collaborative navigation and interaction through immersive mixed reality. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
- [17] J. Feng et al. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents, smart cities, and earth science. arXiv preprint arXiv:2504.09848, 2025.
- [18] Y. Zhang et al. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. Transactions on Machine Learning Research, 2024. Survey Certification.
- [19] R. Firoozi et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, 2024.
- [20] S. Chen et al. Exploring embodied multimodal large models: Development, datasets, and future directions. Information Fusion, 122:103198, 2025.
- [21] D. Zhang et al. Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025.
- [22] R. Sapkota et al. Vision-language-action (VLA) models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025.
- [23] J. Zhang et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129, 2025.
- [24] K. Black et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [25] J. Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [26] NVIDIA. Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025.
- [27] P. Anderson et al. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683, 2018.
- [28] W. Wu et al. Vision-language navigation: A survey and taxonomy. Neural Computing and Applications, 36(7):3291–3316, 2024.
- [29] S. Liu et al. AerialVLN: Vision-and-language navigation for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15384–15394, 2023.
- [30] Y. Gao et al. OpenFly: A versatile toolchain and large-scale benchmark for aerial vision-language navigation. In International Conference on Learning Representations, 2026.
- [31] R. Wu et al. AeroDuo: Aerial duo for UAV-based vision and language navigation. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
- [32] W. Zhang et al. CityNavAgent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31292–31309, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [33] J. Lee et al. CityNav: A large-scale dataset for real-world aerial navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5912–5922, October 2025.
- [34]
- [35] Y. Gao et al. OpenFly: A comprehensive platform for aerial vision-language navigation. In International Conference on Learning Representations, 2026.
- [36] X. Liu et al. IndoorUAV: Benchmarking vision-language UAV navigation in continuous indoor environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 23864–23872, 2026.
- [37] Y. Tian et al. UAVs meet LLMs: Overviews and perspectives toward agentic low-altitude mobility. Information Fusion, 122:103158, 2025.
- [38] S. Javaid et al. Large language models for UAVs: Current state and pathways to the future. arXiv preprint arXiv:2405.01745, 2024.
- [39] X. Zhao et al. AgriVLN: Vision-and-language navigation for agricultural robots. arXiv preprint arXiv:2508.07406, 2025.
- [40] A. Torneiro et al. Towards general urban monitoring with vision-language models: A review, evaluation, and a research agenda. arXiv preprint arXiv:2510.12400, 2025.
- [41] K. Osmani and D. Schulz. Comprehensive investigation of unmanned aerial vehicles (UAVs): An in-depth analysis of avionics systems. Sensors, 24(10):3064, 2024.
- [42] A. Xiao et al. Foundation models for remote sensing and earth observation: A survey. IEEE Geoscience and Remote Sensing Magazine, 2024.
- [43] X. Weng et al. Vision-language modeling meets remote sensing: Models, datasets and perspectives. IEEE Geoscience and Remote Sensing Magazine, 2025.
- [44] Y. Bu et al. Advancement challenges in UAV swarm formation control: A comprehensive review. Drones, 8(7):320, 2024.
- [45] L. Zhai et al. Intelligent optimization algorithms for multi-UAV path planning: A comprehensive review. IEEE Access, 13:1–1, 2025.
- [46] W. Y. H. Adoni et al. Investigation of autonomous multi-UAV systems for target detection in distributed environment: Current developments and open challenges. Drones, 7(4):263, 2023.
- [47] Y. Gong et al. Safe and economical UAV trajectory planning in low-altitude airspace: A hybrid DRL-LLM approach with compliance awareness. arXiv preprint arXiv:2506.08532, 2025.
- [48] E. Cereda et al. On-device self-supervised learning of visual perception tasks aboard hardware-limited nano-quadrotors. arXiv preprint arXiv:2403.04071, 2024.
- [49] J. Wang et al. Vision-based deep reinforcement learning of unmanned aerial vehicle (UAV) autonomous navigation using privileged information. Drones, 8(12):782, 2024.
- [50] M. Sartori et al. AI and vision based autonomous navigation of nano-drones in partially-known environments. In 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), 2025.
- [51] P. Doma et al. LLM-enhanced path planning: Safe and efficient autonomous navigation with instructional inputs. arXiv preprint arXiv:2412.02655, 2024.
- [52] F. Frattolillo et al. Scalable and cooperative deep reinforcement learning approaches for multi-UAV systems: A systematic review. Drones, 7(4):236, 2023.
- [53] K. I. Qureshi et al. Multi-agent DRL for air-to-ground communication planning in UAV-enabled IoT networks. Sensors, 24(20):6535, 2024.
- [54] A. Singla et al. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge. IEEE Transactions on Intelligent Transportation Systems, 22(1):107–118, 2021.
- [55] Y. Fan et al. Aerial vision-and-dialog navigation. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 3043–3061, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [56] H. Hong et al. Why only text: Empowering vision-and-language navigation with multi-modal prompts. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 839–847, August 2024.
- [57] Z. Liu et al. ReasonGrounder: LVLM-guided hierarchical feature splatting for open-vocabulary 3D visual grounding and reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [58] C. Huang et al. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9947–9954, 2023.
- [59] X. Liang et al. Real-time semantic octree mapping under aerial-ground cooperative system. Intelligent Service Robotics, 18(3):567–578, 2025.
- [60] H. Shi et al. DAgger Diffusion Navigation: DAgger boosted diffusion policy for vision-language navigation. arXiv preprint arXiv:2508.09444, 2025.
- [61] H. Wu et al. Model-free UAV navigation in unknown complex environments using vision-based reinforcement learning. Drones, 9(8):566, 2025.
- [62] X. Wang et al. UAV-Flow Colosseo: A real-world benchmark for flying-on-a-word UAV imitation learning. In Thirty-Ninth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2025.
- [63] X. Song et al. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12078–12088, 2025.
- [64] P. Saxena et al. UAV-VLN: End-to-end vision language guided navigation for UAVs. In 2025 European Conference on Mobile Robots (ECMR), pp. 1–6, 2025.
- [65] J. Krantz et al. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15162–15171, 2021.
- [66] O. Sautenkov et al. UAV-VLA: Vision-language-action system for large scale aerial mission generation. arXiv preprint arXiv:2501.05014, 2025.
- [67] T. Manjunath et al. ReproHRL: Towards multi-goal navigation in the real world using hierarchical agents. In AAAI Conference on Artificial Intelligence, RL Ready for Production Workshop, 2023.
- [68] F. Zhao et al. Autonomous localized path planning algorithm for UAVs based on TD3 strategy. Scientific Reports, 14(1):763, 2024.
- [69] L. Jiang et al. Improving multi-UAV cooperative path-finding through multiagent experience learning. Applied Intelligence, 54:11103–11119, 2024.
- [70]
- [71] S. Sanyal and K. Roy. ASMA: An adaptive safety margin algorithm for vision-language drone navigation via scene-aware control barrier functions. IEEE Robotics and Automation Letters, 10(8):7536–7543, 2025.
- [72] M. Ramezani and J. L. Sanchez-Lopez. Human-centric aware UAV trajectory planning in search and rescue missions employing multi-objective reinforcement learning with AHP and similarity-based experience replay. arXiv preprint arXiv:2402.18487, 2024.
- [73] C. Wang et al. UAV path planning in multi-task environments with risks through natural language understanding. Drones, 7(3):147, 2023.
- [74] X. Wang et al. GPS denied IBVS-based navigation and collision avoidance of UAV using a low-cost RGB camera. arXiv preprint arXiv:2509.17435, 2025.
- [75] Z. Xue and T. Gonsalves. Vision based drone obstacle avoidance by deep reinforcement learning. AI, 2(3):366–380, 2021.
- [76] S. Chen et al. History aware multimodal transformer for vision-and-language navigation. In Advances in Neural Information Processing Systems, 2021.
- [77] H. Xu et al. GeoNav: Empowering MLLMs with dual-scale geospatial reasoning for language-goal aerial navigation. Pattern Recognition, 177:113365, 2026.
- [78] Y. Su et al. Target-grounded graph-aware transformer for aerial vision-and-dialog navigation. arXiv preprint arXiv:2308.11561, 2023.
- [79]
- [80] T. Li et al. SkyVLN: Vision-and-language navigation and NMPC control for UAVs in urban environments. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 17199–17206, 2025.