pith. machine review for the scientific record.

arxiv: 2604.07705 · v1 · submitted 2026-04-09 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.RO
keywords Aerial VLN · Vision-Language Navigation · UAV navigation · Large language models · Vision-language models · Open problems · Taxonomy · Simulation platforms

The pith

Aerial VLN methods fall into five architectural categories, yet even when scaled with LLMs they face seven specific gaps spanning language grounding, continuous control, and real-world deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes the growing body of work on aerial vision-and-language navigation into a clear taxonomy and then isolates the practical barriers that prevent drones from reliably following natural-language commands in three-dimensional space. A sympathetic reader cares because the identified gaps directly limit applications such as search-and-rescue or inspection, where verbal instructions replace detailed programming. The paper shows that recent LLM and VLM approaches improve grounding yet expose shortfalls in long-horizon reasoning, viewpoint changes, and onboard execution. It therefore supplies a shared map of what must be solved next, rather than proposing incremental model tweaks.

Core claim

The authors formally define Aerial VLN with single-instruction and dialog-based interaction paradigms and place existing methods into five architectural categories: sequence-to-sequence and attention-based, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based. They analyze design rationales and performance trade-offs on shared benchmarks, document shortcomings in current datasets and metrics, and synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation.
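To keep the survey's organization in view while reading the sections below, here is a minimal sketch of the five categories and seven open problems as plain Python data structures. The identifier names are shorthand invented for this review page, not anything the paper defines.

    from enum import Enum

    class AerialVLNCategory(Enum):
        """The five architectural categories named in the core claim."""
        SEQ2SEQ_ATTENTION = "sequence-to-sequence and attention-based"
        END_TO_END_LLM_VLM = "end-to-end LLM/VLM"
        HIERARCHICAL = "hierarchical"
        MULTI_AGENT = "multi-agent"
        DIALOG_BASED = "dialog-based navigation"

    # The seven open problems the survey synthesizes, in its own order.
    OPEN_PROBLEMS = (
        "long-horizon instruction grounding",
        "viewpoint robustness",
        "scalable spatial representation",
        "continuous 6-DoF action execution",
        "onboard deployment",
        "benchmark standardization",
        "multi-UAV swarm navigation",
    )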

What carries the argument

A five-category taxonomy of Aerial VLN architectures that structures the comparison of discrete versus continuous actions, end-to-end versus hierarchical designs, and simulation-to-reality gaps.

If this is right

  • Progress on long-horizon instruction grounding would allow drones to execute multi-step missions from a single command without intermediate human input.
  • Solutions for continuous 6-DoF action execution would reduce the simulation-to-reality gap compared with discrete action spaces (the two action interfaces are sketched after this list).
  • Standardized benchmarks with greater environmental diversity would enable direct cross-method comparisons that current platforms do not support.
  • Onboard deployment research would shift focus from cloud-dependent models to resource-constrained UAV hardware.
  • Multi-UAV swarm navigation methods would extend single-agent techniques to coordinated teams following shared language instructions.
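To make the discrete-versus-continuous contrast in the bullets above concrete, the sketch below pairs a toy discrete action set of the kind indoor VLN benchmarks use with a continuous 6-DoF velocity command of the kind a real flight controller tracks. All names, step sizes, and limits here are illustrative assumptions, not values from the paper.

    import numpy as np

    # Toy discrete action set: each action is a fixed motion primitive.
    DISCRETE_ACTIONS = {
        "forward": np.array([1.0, 0.0, 0.0]),   # 1 m step along body x
        "ascend": np.array([0.0, 0.0, 1.0]),    # 1 m step along z
        "descend": np.array([0.0, 0.0, -1.0]),  # 1 m step along -z
    }

    def continuous_command(v_linear, v_angular):
        """A continuous 6-DoF command: 3 linear + 3 angular rates.

        v_linear is in m/s, v_angular in rad/s; a low-level controller
        tracks these at high rate, so no fixed step size has to be
        matched against the real vehicle dynamics.
        """
        v_linear = np.clip(np.asarray(v_linear, dtype=float), -2.0, 2.0)
        v_angular = np.clip(np.asarray(v_angular, dtype=float), -1.0, 1.0)
        return np.concatenate([v_linear, v_angular])

The survey's argument, as summarized above, is that policies emitting the second kind of command face a smaller simulation-to-reality gap than policies restricted to the first.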

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The survey's emphasis on continuous actions suggests that hierarchical methods may scale better to real flights than purely end-to-end LLM pipelines.
  • Benchmark standardization could accelerate progress in the same way shared simulators did for ground-based VLN.
  • Viewpoint robustness and scalable spatial representation together point to the need for explicit 3D world models rather than 2D image features alone.
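As a concrete picture of the last point, the sketch below back-projects per-pixel 2D image features into a sparse 3D voxel map using depth and camera pose, one simple form an explicit 3D world model could take. This is our illustration of the general idea, not a method from the paper; the voxel size and the running-mean fusion are arbitrary choices.

    import numpy as np

    def backproject_to_voxels(depth, features, K, cam_to_world, voxel=0.5):
        """Fuse 2D per-pixel features into a sparse 3D voxel map.

        depth:        (H, W) metric depth in meters
        features:     (H, W, C) per-pixel features (e.g. VLM embeddings)
        K:            (3, 3) camera intrinsics
        cam_to_world: (4, 4) camera pose
        Returns a dict {(i, j, k): mean feature} keyed by voxel indices.
        """
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        z = depth.reshape(-1)
        pix = np.stack([u.reshape(-1) * z, v.reshape(-1) * z, z])
        cam = np.linalg.inv(K) @ pix                  # camera-frame points
        world = (cam_to_world @ np.vstack([cam, np.ones_like(z)]))[:3].T
        keys = np.floor(world / voxel).astype(int)
        feats = features.reshape(-1, features.shape[-1])
        vox = {}
        for key, f in zip(map(tuple, keys), feats):
            mean, n = vox.get(key, (np.zeros_like(f), 0))
            vox[key] = ((mean * n + f) / (n + 1), n + 1)
        return {k: m for k, (m, n) in vox.items()}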

Load-bearing premise

The reviewed literature is complete enough that the five-category taxonomy covers the field and the seven listed gaps are the most important ones.

What would settle it

Publication of an Aerial VLN method that cannot be placed in any of the five categories, or release of a benchmark that already solves all seven listed open problems, would show the survey's organization and gap analysis are incomplete.

Figures

Figures reproduced from arXiv: 2604.07705 by Hai Zhu, Lekai Zhou, Wen Yao, Xiaozhou Zhu, Xingyu Xia, Yujie Tang.

Figure 1: The evolution from UAV Navigation to Aerial VLN.
Figure 2: Structure and organization of the survey paper.
Figure 3: The problem definition of Aerial VLN. State space: the full state of the system at time t is s_t = (x_t, ẋ_t, ω_t, q_t, E), where x_t, ẋ_t, ω_t ∈ ℝ³ are the UAV's position, linear velocity, and angular velocity, q_t ∈ SO(3) (or, parameterized by Euler angles, (φ, θ, ψ)) is its orientation, and E represents the full environment state including the geometry, semantics, and dynamics …
Figure 4: The evolution of VLN. It progresses from the foundation phase, characterized by discrete indoor navigation and …
Figure 5: Interaction paradigms of Aerial VLN: aerial vision-and-instruction navigation (AVIN) vs. aerial vision-and-dialog navigation (AVDN).
Figure 6: The taxonomy of Aerial VLN methods.
Figure 7: Typical simulation platforms: Gazebo [132], [133], Habitat [134], AirSim [45], Isaac Sim [135], [136], Unity [137], [138], …
Figure 8: Open problems of Aerial VLN.
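Figure 3's state definition translates directly into a container type. The sketch below is one literal reading of that definition in Python, with field names of our choosing; the orientation could equally be stored as a quaternion.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class UAVState:
        """The full state s_t = (x_t, ẋ_t, ω_t, q_t, E) from Figure 3."""
        position: np.ndarray          # x_t ∈ R^3
        linear_velocity: np.ndarray   # ẋ_t ∈ R^3
        angular_velocity: np.ndarray  # ω_t ∈ R^3
        orientation: np.ndarray       # q_t ∈ SO(3), a 3x3 rotation matrix,
                                      # or Euler angles (phi, theta, psi)
        environment: object           # E: geometry, semantics, dynamics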
read the original abstract

Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms: single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation, with specific research directions grounded in the evidence presented throughout the survey.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a critical survey of Aerial Vision-Language Navigation (Aerial VLN) for UAVs. It formally defines the problem and two interaction paradigms (single-instruction and dialog-based), organizes existing methods into a five-category taxonomy (sequence-to-sequence/attention-based, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based), analyzes design rationales, trade-offs, and reported performance for each, evaluates datasets/simulators/metrics and their limitations, consolidates cross-method comparisons on shared benchmarks, and synthesizes seven open problems (long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation) with grounded research directions.

Significance. If the literature coverage and analysis hold, the survey provides substantial value by structuring an emerging interdisciplinary field at the intersection of VLN, LLMs/VLMs, and aerial robotics. Its chief contribution is the synthesis of actionable open problems derived from cross-category comparisons and evaluation gaps, offering a reference that can guide targeted research on real-world deployment challenges such as continuous control and swarm coordination.

minor comments (3)
  1. [Taxonomy section] The taxonomy introduction would benefit from an explicit justification or decision tree explaining why the five categories are mutually exclusive and exhaustive, particularly regarding overlap between hierarchical and multi-agent approaches.
  2. [Evaluation section] In the evaluation infrastructure assessment, the discussion of metric coverage could include a table summarizing which metrics are used across the reviewed papers to make the identified gaps more quantifiable (an example metric is sketched after this list).
  3. [Cross-method comparisons] The consolidated cross-method comparisons on shared benchmarks would be strengthened by noting the number of papers per benchmark and any statistical significance tests applied to performance differences.
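On minor comment 2: one metric such a table would almost certainly include is success weighted by path length (SPL), a standard metric inherited from ground-based VLN. A minimal reference implementation, ours rather than the paper's:

    def spl(successes, shortest_lengths, path_lengths):
        """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i).

        successes:        S_i, 1 if episode i reached the goal, else 0
        shortest_lengths: l_i, geodesic shortest-path length to the goal
        path_lengths:     p_i, length the agent actually traveled
        """
        terms = [
            s * l / max(p, l)
            for s, l, p in zip(successes, shortest_lengths, path_lengths)
        ]
        return sum(terms) / len(terms)

    # Example: spl([1, 0, 1], [10.0, 8.0, 12.0], [14.0, 20.0, 12.0]) ≈ 0.571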

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our survey, which correctly identifies the taxonomy, evaluation analysis, and open problems. We appreciate the recommendation for minor revision and will incorporate any editorial or minor clarifications in the next version. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in this literature survey

full rationale

This is a survey paper that formally defines the Aerial VLN problem, organizes existing methods into a five-category taxonomy, analyzes design trade-offs and performance on shared benchmarks, critiques datasets and metrics, and synthesizes seven open problems as an analytical summary of gaps identified across the reviewed literature. No mathematical derivations, equations, fitted parameters, or predictions appear. The synthesis of open problems is explicitly grounded in external prior work rather than reducing to self-defined inputs or self-citations by construction, and the taxonomy and gaps are presented as critical review, not as a load-bearing uniqueness theorem or an ansatz imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper synthesizing existing research on Aerial VLN; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5588 in / 1179 out tokens · 63776 ms · 2026-05-10T18:16:49.138235+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

158 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1] X. Li, J. Tan, A. Liu, P. Vijayakumar, N. Kumar, and M. Alazab, “A novel UAV-enabled data collection scheme for intelligent transportation system through UAV speed control,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 4, pp. 2100–2110, 2021.

  2. [2] F. Betti Sorbelli, “UAV-based delivery systems: A systematic review, current trends, and research challenges,” ACM Journal on Autonomous Transportation Systems, vol. 1, no. 3, pp. 1–40, 2024.

  3. [3] X. Liu, S. W. Chen, G. V. Nardari, C. Qu, F. Cladera, C. J. Taylor, and V. Kumar, “Challenges and opportunities for autonomous micro-UAVs in precision agriculture,” IEEE Micro, vol. 42, no. 1, pp. 61–68, 2022.

  4. [4] H. Huang, H. Zhu, X. Zhu, W. Mei, and B. Deng, “Online path planning for multi-robot multi-source seeking using distributed Gaussian processes,” IET Cyber-Systems and Robotics, vol. 7, no. 1, p. e70030, 2025.

  5. [5] H. Zhu, J. J. Chung, N. R. Lawrance, R. Siegwart, and J. Alonso-Mora, “Online informative path planning for active information gathering of a 3D surface,” in IEEE International Conference on Robotics and Automation, pp. 1488–1494, 2021.

  6. [6] H. Zhu, Q. Chen, X. Zhu, W. Yao, and X. Chen, “Edge computing powers aerial swarms in sensing, communication, and planning,” The Innovation, p. 100506, 2023.

  7. [7] X. Huang, “The small-drone revolution is coming — scientists need to ensure it will be safe,” Nature, vol. 637, no. 8044, pp. 29–30, 2025.

  8. [8] Y. Tian, F. Lin, Y. Li, T. Zhang, Q. Zhang, X. Fu, J. Huang, X. Dai, Y. Wang, C. Tian, B. Li, Y. Lv, L. Kovács, and F.-Y. Wang, “UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility,” Information Fusion, vol. 122, p. 103158, 2025.

  9. [9] S. Nahavandi, R. Alizadehsani, D. Nahavandi, S. Mohamed, N. Mohajer, M. Rokonuzzaman, and I. Hossain, “A comprehensive review on autonomous navigation,” ACM Computing Surveys, vol. 57, no. 9, 2025.

  10. [10] Y. Ren, Y. Cai, H. Li, N. Chen, F. Zhu, L. Yin, F. Kong, R. Li, and F. Zhang, “A survey on lidar-based autonomous aerial vehicles,” IEEE/ASME Transactions on Mechatronics, pp. 1–17, 2025.

  11. [11] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, “Learning high-speed flight in the wild,” Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.

  12. [12] L. He, N. Aouf, and B. Song, “Explainable deep reinforcement learning for UAV autonomous path planning,” Aerospace Science and Technology, vol. 118, p. 107052, 2021.

  13. [13] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, and Q. Wu, “AerialVLN: Vision-and-language navigation for UAVs,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15338–15348, 2023.

  14. [14] Y. Fan, W. Chen, T. Jiang, C. Zhou, Y. Zhang, and X. E. Wang, “Aerial vision-and-dialog navigation,” Findings of the Association for Computational Linguistics: ACL 2023, pp. 3043–3061, 2023.

  15. [15] G. Zheng, Y. Ban, M. Zhang, J. Zheng, and B. Zhou, “Onfly: On-board zero-shot aerial vision-language navigation toward safety and efficiency,” arXiv:2603.10682, 2026.

  16. [16] D. Zhang, P. Chen, X. Xia, X. Su, R. Zhen, J. Xiao, and S. Yang, “Apex: A decoupled memory-based explorer for asynchronous aerial object goal navigation,” arXiv:2602.00551, 2026.

  17. [17] X. Zhang, Y. Tian, F. Lin, Y. Liu, J. Ma, K. S. Szatmáry, and F.-Y. Wang, “LogisticsVLN: Vision-language navigation for low-altitude terminal delivery based on agentic UAVs,” in IEEE 28th International Conference on Intelligent Transportation Systems, pp. 4437–4442, 2025.

  18. [18] Y. Ping, T. Liang, H. Ding, G. Lei, J. Wu, X. Zou, K. Shi, R. Shao, C. Zhang, W. Zhang, W. Yuan, and T. Zhang, “Multimodal large language models-enabled UAV swarm: Towards efficient and intelligent autonomous aerial systems,” IEEE Wireless Communications, vol. 33, no. 1, pp. 89–97, 2025.

  19. [19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod…, “Language models are few-shot learners.”

  20. [20] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. G…, “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.”

  21. [21] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, …, “Qwen Technical Report.”

  22. [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.

  23. [23] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3992–4003, 2023.

  24. [24] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,” in Proceedings of the European Conference on Computer Vision, vol. 15105, pp. 38–55, 2025.

  25. [25] G. Zhao, G. Li, and Y. Yu, “Navgemini: A multi-modal LLM agent for vision-and-language navigation,” Visual Intelligence, vol. 4, 2026.

  26. [26] C. Huang, L. Tang, Z. Zhan, L. Yu, R. Zeng, Z. Liu, Z. Wang, and J. Li, “Unemo: Collaborative visual-language reasoning and navigation via a multimodal world model,” arXiv:2511.18845, 2025.

  27. [27] R. Schumann, W. Zhu, W. Feng, T.-J. Fu, S. Riezler, and W. Y. Wang, “VELMA: Verbalization embodiment of LLM agents for vision and language navigation in street view,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, pp. 18924–18933, 2024.

  28. [28] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” in Proceedings of the 6th Conference on Robot Learning, pp. 492–504, 2023.

  29. [29] F. Yao, Y. Liu, W. Zhang, Z. Zhu, C. Li, N. Liu, P. Hu, Y. Yue, K. Wei, X. He, X. Zhao, Z. Wei, H. Xu, Z. Wang, G. Shao, L. Yang, D. Zhao, and Y. Yang, “AeroVerse-review: Comprehensive survey on aerial embodied vision-and-language navigation,” The Innovation Informatics, vol. 1, no. 1, p. 100015, 2025.

  30. [30] H. Cai, J. Dong, J. Tan, J. Deng, S. Li, Z. Gao, H. Wang, Z. Su, A. Sumalee, and R. Zhong, “FlightGPT: Towards generalizable and interpretable UAV vision-and-language navigation with vision-language models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 6659–6676, 2025.

  31. [31] Y. Liu, F. Yao, Y. Yue, G. Xu, X. Sun, and K. Fu, “NavAgent: Multi-scale urban street view fusion for UAV embodied vision-and-language navigation,” arXiv:2411.08579, 2024.

  32. [32] T. Li, T. Huai, Z. Li, Y. Gao, H. Li, and X. Zheng, “SkyVLN: Vision-and-language navigation and NMPC control for UAVs in urban environments,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 17199–17206, 2025.

  33. [33] Y. Gao, C. Li, Z. You, J. Liu, Z. Li, P. Chen, Q. Chen, Z. Tang, Y. Tang, Y. Tang, S. Liang, S. Zhu, Z. Xiong, Y. Su, X. Ye, J. Li, Y. Ding, D. Wang, Z. Wang, B. Zhao, and X. Li, “OpenFly: A comprehensive platform for aerial vision-language navigation,” arXiv:2502.18041, 2025.

  34. [34] Z. Wang, J. Chen, X. Zheng, Q. Liao, L. Huang, and S. Liu, “‘Hi AirStar, guide me to the badminton court’,” in ACM International Conference on Multimedia, pp. 13477–13479, 2025.

  35. [35] O. Sautenkov, Y. Yaqoot, M. A. Mustafa, F. Batool, J. Sam, A. Lykov, C.-Y. Wen, and D. Tsetserukou, “UAV-CodeAgents: Scalable UAV mission planning via multi-agent ReAct and vision-language reasoning,” arXiv:2505.07236, 2025.

  36. [36] Z. Zhang, M. Chen, S. Zhu, T. Han, and Z. Yu, “MMCNav: MLLM-empowered multi-agent collaboration for outdoor visual language navigation,” in Proceedings of the International Conference on Multimedia Retrieval, pp. 1767–1776, 2025.

  37. [37] J. Gu, E. Stefani, Q. Wu, J. Thomason, and X. E. Wang, “Vision-and-language navigation: A survey of tasks, methods, and future directions,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7606–7623, 2022.

  38. [38] W. Wu, T. Chang, and X. Li, “Vision-language navigation: A survey and taxonomy,” Neural Computing and Applications, vol. 36, pp. 3291–3316, 2024.

  39. [39] Y. Zhang, Z. Ma, J. Li, Y. Qiao, Z. Wang, J. Chai, Q. Wu, M. Bansal, and P. Kordjamshidi, “Vision-and-language navigation today and tomorrow: A survey in the era of foundation models,” Transactions on Machine Learning Research, 2024.

  40. [40] D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, “Mapping instructions to actions in 3D environments with visual goal prediction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2667–2678, 2018.

  41. [41] V. Blukis, D. Misra, R. A. Knepper, and Y. Artzi, “Mapping navigation instructions to continuous control actions with position-visitation prediction,” in Proceedings of the 2nd Conference on Robot Learning, vol. 87, pp. 505–518, 2018.

  42. [42] Q. Chen, N. Gao, S. Huang, J. Low, T. Chen, J. Sun, and M. Schwager, “GRAD-NAV++: Vision-language model enabled visual drone navigation with Gaussian radiance fields and differentiable dynamics,” IEEE Robotics and Automation Letters, vol. 11, no. 2, pp. 1418–1425, 2026.

  43. [43] G. Zhao, G. Li, J. Pan, and Y. Yu, “Aerial vision-and-language navigation with grid-based view selection and map construction,” arXiv:2503.11091, 2025.

  44. [45] X. Wang, D. Yang, Z. Wang, H. Kwan, J. Chen, W. Wu, H. Li, Y. Liao, and S. Liu, “Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology,” in The 13th International Conference on Learning Representations, pp. 75433–75451, 2025.

  45. [46] W. Zhang, C. Gao, S. Yu, R. Peng, B. Zhao, Q. Zhang, J. Cui, X. Chen, and Y. Li, “CityNavAgent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31292–31309, 2025.

  46. [47] J. Lee, T. Miyanishi, S. Kurita, K. Sakamoto, D. Azuma, Y. Matsuo, and N. Inoue, “CityNav: A large-scale dataset for real-world aerial navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5912–5922, 2025.

  47. [48] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3674–3683, 2018.

  48. [49] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the nav-graph: Vision-and-language navigation in continuous environments,” in Proceedings of the European Conference on Computer Vision, vol. 12373, pp. 104–120, 2020.

  49. [50] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, “TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12530–12539, 2019.

  50. [51] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell, “The StreetLearn Environment and Dataset,” arXiv:1903.01292, 2019.

  51. [52] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied AI: From simulators to research tasks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022.

  52. [53] S. Shuang-Lin, H. Yan, H. Ke-Ji, A. Dong, Y. Hui, and W. Liang, “Recent advances in vision-and-language navigation,” Acta Automatica Sinica, vol. 49, no. 1, pp. 1–14, 2023.

  53. [54] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in The 9th International Conference on Learning Representations, 2021.

  54. [55] S.-M. Park and Y.-G. Kim, “Visual language navigation: A survey and open challenges,” Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, 2023.

  55. [56] V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge, “Stay on the path: Instruction fidelity in vision-and-language navigation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1862–1872, 2019.

  56. [57] Y. Hong, C. Rodriguez, Y. Qi, Q. Wu, and S. Gould, “Language and visual entity relationship graph for agent navigation,” in Advances in Neural Information Processing Systems, vol. 33, pp. 7685–7696, 2020.

  57. [58] K. M. Hermann, M. Malinowski, P. Mirowski, A. Banki-Horvath, K. Anderson, and R. Hadsell, “Learning to follow directions in street view,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11773–11781, 2020.

  58. [59] S. Aggarwal and N. Kumar, “Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges,” Computer Communications, vol. 149, pp. 270–299, 2020.

  59. [60] S. A. H. Mohsan, N. Q. H. Othman, Y. Li, M. H. Alsharif, and M. A. Khan, “Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends,” Intelligent Service Robotics, vol. 16, pp. 109–137, 2023.

  60. [61] S. Li and H. Tang, “Multimodal alignment and fusion: A survey,” arXiv:2411.17040, 2025.

  61. [62] G. Qiao, D. Yi, L. Wu, H. Wu, and J. Wang, “Enhancing visual aligning and grounding for aerial vision-and-dialog navigation,” IEEE Signal Processing Letters, vol. 32, pp. 2853–2857, 2025.

  62. [63] C. Gao, X. Peng, M. Yan, H. Wang, L. Yang, H. Ren, H. Li, and S. Liu, “Adaptive zone-aware hierarchical planner for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14911–14920, 2023.

  63. [64] X. Song, W. Chen, Y. Liu, W. Chen, G. Li, and L. Lin, “Towards long-horizon vision-language navigation: Platform, benchmark and method,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12078–12088, 2025.

  64. [65] L. Zhou, R. Xue, and X. Luo, “Structured instruction parsing and scene alignment for UAV vision-language navigation,” in IEEE International Conference on Image Processing, pp. 2600–2605, 2025.

  65. [66] W. Zhang, Y. Liu, X. Wang, X. Chen, C. Gao, and X. Chen, “Demo abstract: Embodied aerial agent for city-level visual language navigation using large language model,” in The 23rd ACM/IEEE International Conference on Information Processing in Sensor Networks, pp. 265–266, 2024.

  66. [67] G. Chen, X. Yu, N. Ling, and L. Zhong, “Typefly: Low-latency drone planning with large language models,” IEEE Transactions on Mobile Computing, vol. 24, no. 9, pp. 9068–9079, 2025.

  67. [68] G. S. XU Yueyue, DU Huajun, “Research progress on embodied navigation of low-altitude UAV,” Aerospace Control, vol. 43, no. 4, pp. 7–14, 2025.

  68. [69] V. Blukis, N. Brukhim, A. Bennett, R. Knepper, and Y. Artzi, “Following high-level navigation instructions on a simulated quadcopter with imitation learning,” in Robotics: Science and Systems XIV, 2018.

  69. [70] Y. Su, D. An, Y. Xu, K. Chen, and Y. Huang, “Target-grounded graph-aware transformer for aerial vision-and-dialog navigation,” arXiv:2308.11561, 2023.

  70. [71] Y. Su, D. An, K. Chen, W. Yu, B. Ning, Y. Ling, Y. Huang, and L. Wang, “Learning fine-grained alignment for aerial vision-dialog navigation,” in Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, pp. 7060–7068, 2025.

  71. [72] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker-follower models for vision-and-language navigation,” in Advances in Neural Information Processing Systems, vol. 31, 2018.

  72. [73] A. S. Huang, S. Tellex, A. Bachrach, T. Kollar, D. Roy, and N. Roy, “Natural language command of an autonomous micro-air vehicle,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2663–2669, 2010.

  73. [74] V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi, “Learning to map natural language instructions to physical quadcopter control using simulated flight,” in Proceedings of the Conference on Robot Learning, vol. 100, pp. 1415–1438, 2020.

  74. [75] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, JMLR Workshop and Conference Proceedings, 2011.

  75. [76] X. Ding, J. Gao, C. Pan, W. Wang, and J. Qin, “History-enhanced two-stage transformer for aerial vision-and-language navigation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 18225–18233, 2026.

  76. [77] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113–12132, 2023.

  77. [78] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

  78. [79] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

  79. [80] Z. Wang, “Dual-branch dynamic perception and interaction framework for aerial vision-and-language navigation,” in The 4th International Conference on Artificial Intelligence, Internet and Digital Economy, pp. 307–310, 2025.

  80. [81] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.

Showing first 80 references.