pith. machine review for the scientific record.

arxiv: 2605.01371 · v1 · submitted 2026-05-02 · 💻 cs.RO


ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue


Pith reviewed 2026-05-09 14:42 UTC · model grok-4.3

classification 💻 cs.RO
keywords UAV search and rescue · Embodied AI · Multimodal large language models · Benchmark · Agent navigation · Spatial reasoning · Aerial robotics

The pith

ESARBench is the first benchmark to evaluate multimodal language model agents on embodied unmanned aerial search and rescue tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called Embodied Search and Rescue in which aerial agents must explore unknown terrain, detect clues, and infer victim locations through autonomous decisions. It supplies ESARBench with four large photorealistic environments built directly from real GIS data inside Unreal Engine 5, plus 600 tasks that vary weather, time of day, and clue positions to mimic operational conditions. Evaluation of both classical planners and current MLLM-based agents shows consistent failures in maintaining spatial memory across flights, adapting to top-down views, and managing the tension between rapid coverage and collision avoidance. These gaps matter because they provide a shared, reproducible way to measure whether AI systems can eventually support faster, safer disaster response instead of leaving each research group to build its own isolated test scenes.

Core claim

We present ESARBench, the first comprehensive benchmark for MLLM-driven UAV agents in highly realistic SAR scenarios. The benchmark is built from four GIS-mapped Unreal Engine environments and incorporates dynamic weather, time of day, and stochastic clue placement across 600 tasks; its baseline evaluations expose critical bottlenecks in spatial memory, aerial adaptation, and the efficiency-safety trade-off.

What carries the argument

ESARBench, a unified evaluation suite that supplies photorealistic open-world environments derived from GIS data, 600 rescue-modeled tasks with variable dynamics, and metrics that jointly score exploration coverage, clue detection, victim localization, and flight safety.
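The joint scoring described above can be sketched as a single episode-level record. The field names, weights, and the weighted-sum form below are illustrative assumptions for concreteness, not the paper's metric definitions.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    # Hypothetical per-episode quantities; the paper's exact metrics may differ.
    area_covered_m2: float      # terrain the UAV actually observed
    searchable_area_m2: float   # total searchable terrain in the task
    clues_found: int            # unique ground-truth clues matched
    clues_total: int
    victim_localized: bool      # final localization within tolerance
    crashed: bool               # a collision terminated the flight

def joint_score(r: EpisodeResult, w=(0.25, 0.25, 0.4, 0.1)) -> float:
    """Weighted combination of the four axes; weights are made up."""
    coverage = min(r.area_covered_m2 / r.searchable_area_m2, 1.0)
    detection = r.clues_found / r.clues_total if r.clues_total else 0.0
    success = 1.0 if r.victim_localized else 0.0
    safety = 0.0 if r.crashed else 1.0
    return w[0] * coverage + w[1] * detection + w[2] * success + w[3] * safety
```

Scoring coverage, detection, localization, and safety jointly rather than reporting success rate alone is what lets the benchmark surface the efficiency-safety trade-off the baselines exhibit.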

If this is right

  • MLLM agents currently lag traditional heuristics on many efficiency and safety metrics, indicating that semantic reasoning alone does not yet solve embodied aerial navigation.
  • Spatial memory across long flights emerges as a measurable bottleneck that future model architectures must address explicitly.
  • The documented trade-off between search thoroughness and collision risk supplies a concrete target for reward shaping in reinforcement learning or planning layers.
  • Dynamic weather and lighting conditions in the benchmark test robustness that static indoor navigation benchmarks omit.
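The trade-off named in the third bullet is the kind of target a shaped reward could encode directly. A minimal per-step sketch, with every coefficient an illustrative assumption rather than anything prescribed by the paper:

```python
def shaped_reward(new_area_m2: float,
                  min_obstacle_dist_m: float,
                  collided: bool,
                  alpha: float = 0.001,     # reward per m^2 of newly seen terrain
                  beta: float = 0.5,        # penalty scale for thin safety margins
                  safe_dist_m: float = 5.0, # clearance below which penalties start
                  crash_penalty: float = 100.0) -> float:
    """Per-step reward trading search coverage against collision risk."""
    r = alpha * new_area_m2                   # encourage rapid coverage
    if min_obstacle_dist_m < safe_dist_m:     # penalize shaving clearance margins
        r -= beta * (safe_dist_m - min_obstacle_dist_m)
    if collided:
        r -= crash_penalty                    # hard penalty on collision
    return r
```

Tuning alpha against beta and the crash penalty is exactly the thoroughness-versus-safety dial the benchmark's results make measurable.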

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents that succeed on ESARBench could be fine-tuned further on real flight logs to close the remaining sim-to-real gap in disaster zones.
  • The benchmark design could be extended to multi-UAV coordination or integration with ground teams without changing the core task definition.
  • Public release of the environments and tasks may accelerate standardized comparisons across research groups working on aerial embodied AI.

Load-bearing premise

The four Unreal Engine 5 environments built from GIS data, together with the 600 tasks and dynamic variables, sufficiently capture the essential difficulties of real-world UAV search and rescue operations.

What would settle it

Deploying the same MLLM agents on physical UAVs in outdoor areas that match the benchmark environments and checking whether the observed failure modes in spatial memory and safety decisions match the simulated results.
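One hedged way to operationalize that check: compare an agent's simulated and field crash rates with a two-proportion z-statistic. The counts below are placeholders, not reported data.

```python
from math import sqrt

def crash_rate_gap(sim_crashes: int, sim_trials: int,
                   real_crashes: int, real_trials: int) -> tuple[float, float]:
    """Return (rate difference, two-proportion z-statistic) for sim vs. real.

    A large |z| would suggest the simulated failure mode does not transfer.
    """
    p1 = sim_crashes / sim_trials
    p2 = real_crashes / real_trials
    # Pooled proportion under the null hypothesis of equal crash rates.
    p = (sim_crashes + real_crashes) / (sim_trials + real_trials)
    se = sqrt(p * (1 - p) * (1 / sim_trials + 1 / real_trials))
    return p1 - p2, (p1 - p2) / se
```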

Figures

Figures reproduced from arXiv: 2605.01371 by Daoxuan Zhang, Jianyi Zhou, Ping Chen, Shuo Yang.

Figure 1: Illustration of the Embodied Search and Rescue (ESAR) task workflow. Modeled after real-world cases, the ESAR mission unfolds across four sequential phases: Mission Start, Exploration, Clue Discovery, and Life Search. The UAV agent is initialized with basic environmental conditions and a textual prompt describing the target's last known trajectory. Throughout the flight, the agent utilizes continuous perce…

Figure 2: Overview of the UAV-ESAR Simulator and Benchmark construction pipeline. The framework consists of two parallel processes: (1) Environment Construction, which utilizes satellite imagery and DEM data to reconstruct high-fidelity terrains in Unreal Engine 5. (2) Task Generation, which discretizes continuous real-world SAR events into static time snapshots with varying parameters (weather, time of day, startin…

Figure 3: Environment Construction and Scenario Variations. The simulation environments are constructed by integrating real-world GIS data to ensure high terrain fidelity. The figure illustrates four distinct geographic environments with varying physical scales, ranging from 2 km × 2 km to 5 km × 5 km. The platform also features dynamic environmental configurations, supporting 13 different weather types and customizable t…

Figure 4: Dataset Statistics. (a) Word cloud analysis of the task prompts. (b) Proportion of tasks across different difficulty levels. (c) Distribution counts of various visual clues. (d) Histogram showing the distribution of initial distances to the goal. (e) Comprehensiveness of our evaluation metrics, quantified based on a confluence of factors, including weather severity, sky illumination, the average Euclidean …

Figure 5: More experimental results analysis. Crash rate, task time, and safe flight distance of different baseline methods. Larger red dots indicate higher crash rates. The results reveal a clear trade-off between search duration and flight safety: methods with stronger exploration ability often require longer task time. Meanwhile, the crash rates across most baselines indicate that safe long-horizon UAV operation …

Figure 6: Future Development of Embodied Search and Rescue (ESAR). This diagram illustrates the field's future development across four dimensions: (1) Operational Scenarios: Scaling from stable, unconstrained open spaces to unpredictable, restricted disaster zones. (2) Architectural Progression: Advancing from robust single-UAV to multi-UAV swarm collaboration. (3) Task Formulation: Diverging from the core mission o…

Figure 7: Visualization of four representative event examples. An event represents a complete, longitudinal real-world search and rescue incident that unfolds over an extended period. Each event is discretized into multiple static time snapshots.
Original abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive and unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to execute informed decision-making. Additionally, we present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, our benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the challenges in ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource to advance research on the Embodied Search and Rescue domain. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Embodied Search and Rescue (ESAR) task for MLLM-driven UAV agents and presents ESARBench as the first comprehensive benchmark for it. The benchmark comprises four large-scale, photorealistic environments constructed in Unreal Engine 5 from real GIS data, 600 tasks modeled on real-world rescue cases, dynamic variables (weather, time of day, stochastic clue placement), a set of evaluation metrics, and baseline evaluations ranging from heuristics to ground/aerial MLLM ObjectNav agents. Experimental results are said to reveal critical bottlenecks in spatial memory, aerial adaptation, and the search-efficiency versus flight-safety trade-off.

Significance. If the simulated environments and tasks can be shown to faithfully capture the essential difficulties of real UAV SAR operations, ESARBench would constitute a valuable standardized resource for the emerging area of embodied MLLM agents in robotics. The use of GIS-mapped UE5/AirSim environments, dynamic stochastic elements, and a sizable task set (600 instances) provides a concrete platform that could accelerate reproducible progress beyond traditional vision/path-planning methods.

major comments (2)
  1. [Abstract] Abstract: The headline finding that the benchmark 'reveals critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety' is not accompanied by any concrete metric definitions, baseline implementation details, quantitative results, or statistical support in the manuscript description, making it impossible to verify whether the reported bottlenecks are data-driven.
  2. [Environment Construction] Environment construction and task design: The four GIS-derived UE5 environments with dynamic variables are presented as highly realistic and representative of actual rescue operations, yet no quantitative comparison of task statistics or failure modes against real SAR incident reports, no expert validation of photorealism/dynamics, and no ablation demonstrating that the dynamic elements measurably alter agent behavior in field-mirroring ways are supplied; this directly undermines the central claim that the benchmark exposes representative bottlenecks.
minor comments (2)
  1. The manuscript would benefit from explicit definitions of the proposed evaluation metrics (e.g., success rate, efficiency, safety) and their formulas in a dedicated section or table.
  2. Baseline implementation details (e.g., prompt templates for MLLM agents, exact AirSim integration parameters) should be expanded to support reproducibility, even if code is linked.
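On minor comment 1, one standard candidate formula from the embodied-navigation literature is Success weighted by Path Length (Anderson et al., 2018). Whether ESARBench adopts it, or a variant of it, is an assumption here; it is sketched only to show what an explicit metric definition would look like.

```python
def spl(successes: list[bool], shortest: list[float], taken: list[float]) -> float:
    """Success weighted by Path Length (Anderson et al., 2018):

        SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)

    where S_i is episode success, l_i the shortest-path length to the goal,
    and p_i the path length the agent actually flew.
    """
    n = len(successes)
    return sum(
        (l / max(p, l)) if s else 0.0
        for s, l, p in zip(successes, shortest, taken)
    ) / n
```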

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline finding that the benchmark 'reveals critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety' is not accompanied by any concrete metric definitions, baseline implementation details, quantitative results, or statistical support in the manuscript description, making it impossible to verify whether the reported bottlenecks are data-driven.

    Authors: We agree that the abstract, being a concise summary, does not embed the supporting details. The full manuscript defines the evaluation metrics in Section 4.2, describes baseline implementations (heuristics and MLLM ObjectNav agents) in Section 5.1, and reports quantitative results with statistical analysis, tables, and figures in Section 5.3. To improve immediate verifiability of the headline claims, we will revise the abstract to include key quantitative highlights and specific observations drawn from the experimental results. revision: yes

  2. Referee: [Environment Construction] Environment construction and task design: The four GIS-derived UE5 environments with dynamic variables are presented as highly realistic and representative of actual rescue operations, yet no quantitative comparison of task statistics or failure modes against real SAR incident reports, no expert validation of photorealism/dynamics, and no ablation demonstrating that the dynamic elements measurably alter agent behavior in field-mirroring ways are supplied; this directly undermines the central claim that the benchmark exposes representative bottlenecks.

    Authors: Sections 3.1 and 3.2 detail the GIS-based construction and the modeling of the 600 tasks on real-world rescue cases. The manuscript does not include direct quantitative comparisons to real SAR incident statistics, formal expert validation of photorealism, or a dedicated ablation isolating dynamic variables. Our experiments in Section 5 do demonstrate performance differences under varying weather, time-of-day, and clue-placement conditions. We will revise the manuscript to expand the discussion of design rationale with additional references to SAR literature, clarify limitations regarding real-world validation, and add an ablation study on the dynamic elements using existing experimental data to show their measurable impact on agent behavior. revision: partial
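The ablation-from-existing-logs the authors propose could be as simple as grouping episode outcomes by one dynamic variable at a time. A minimal sketch, with the record keys ("weather", "success") as illustrative assumptions about the log format:

```python
from collections import defaultdict

def ablate(results: list[dict], factor: str) -> dict[str, float]:
    """Mean success rate per level of one dynamic variable (e.g. 'weather'),
    computed from already-collected episode records."""
    buckets: defaultdict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r[factor]].append(1.0 if r["success"] else 0.0)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

A measurable spread across levels (e.g. clear vs. fog) would support the claim that the dynamic elements alter agent behavior, which is what the referee asked to see demonstrated.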

Circularity Check

0 steps flagged

No circularity: benchmark construction is self-contained with no derivations or self-referential reductions

Full rationale

The paper introduces a new task (ESAR) and benchmark (ESARBench) by constructing four GIS-mapped UE5 environments, 600 tasks, dynamic variables, and evaluation metrics, then runs external baselines (heuristics and MLLM agents) to report empirical performance. No equations, parameter fitting, predictions, or uniqueness theorems appear in the provided text. Claims about bottlenecks follow directly from the defined simulation runs without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The representativeness of the environments is an external-validity question, not a circularity issue in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that photorealistic simulation environments and real-case-modeled tasks are representative of actual SAR operations; no free parameters, mathematical axioms, or new invented entities are introduced.

axioms (1)
  • domain assumption: Unreal Engine 5 and AirSim environments built from real GIS data, together with stochastic weather, time, and clue placement, provide a sufficiently realistic proxy for evaluating embodied UAV SAR agents.
    Invoked when constructing the four large-scale open environments and the 600 tasks.

pith-pipeline@v0.9.0 · 5619 in / 1332 out tokens · 39188 ms · 2026-05-09T14:42:22.683963+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 41 canonical work pages · 3 internal anchors
