pith. machine review for the scientific record.

arxiv: 2605.01371 · v1 · submitted 2026-05-02 · 💻 cs.RO


ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue


Pith reviewed 2026-05-09 14:42 UTC · model grok-4.3

classification 💻 cs.RO
keywords UAV search and rescue · Embodied AI · Multimodal large language models · Benchmark · Agent navigation · Spatial reasoning · Aerial robotics

The pith

ESARBench is the first benchmark to evaluate multimodal language model agents on embodied unmanned aerial search and rescue tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called Embodied Search and Rescue in which aerial agents must explore unknown terrain, detect clues, and infer victim locations through autonomous decisions. It supplies ESARBench with four large photorealistic environments built directly from real GIS data inside Unreal Engine 5, plus 600 tasks that vary weather, time of day, and clue positions to mimic operational conditions. Evaluation of both classical planners and current MLLM-based agents shows consistent failures in maintaining spatial memory across flights, adapting to top-down views, and managing the tension between rapid coverage and collision avoidance. These gaps matter because they provide a shared, reproducible way to measure whether AI systems can eventually support faster, safer disaster response instead of leaving each research group to build its own isolated test scenes.

Core claim

We present ESARBench, the first comprehensive benchmark for MLLM-driven UAV agents in highly realistic SAR scenarios. The benchmark is built from four GIS-mapped Unreal Engine environments and incorporates dynamic weather, time of day, and stochastic clue placement across 600 tasks; its baseline evaluations expose critical bottlenecks in spatial memory, aerial adaptation, and the efficiency-safety trade-off.

What carries the argument

ESARBench, a unified evaluation suite that supplies photorealistic open-world environments derived from GIS data, 600 rescue-modeled tasks with variable dynamics, and metrics that jointly score exploration coverage, clue detection, victim localization, and flight safety.
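The joint scoring described above can be sketched as a single episode-level record. The field names, weights, and the weighted-sum form below are illustrative assumptions for concreteness, not the paper's metric definitions.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    # Hypothetical per-episode quantities; the paper's exact metrics may differ.
    area_covered_m2: float      # terrain the UAV actually observed
    searchable_area_m2: float   # total searchable terrain in the task
    clues_found: int            # unique ground-truth clues matched
    clues_total: int
    victim_localized: bool      # final localization within tolerance
    crashed: bool               # a collision terminated the flight

def joint_score(r: EpisodeResult, w=(0.25, 0.25, 0.4, 0.1)) -> float:
    """Weighted combination of the four axes; weights are made up."""
    coverage = min(r.area_covered_m2 / r.searchable_area_m2, 1.0)
    detection = r.clues_found / r.clues_total if r.clues_total else 0.0
    success = 1.0 if r.victim_localized else 0.0
    safety = 0.0 if r.crashed else 1.0
    return w[0] * coverage + w[1] * detection + w[2] * success + w[3] * safety
```

Scoring coverage, detection, localization, and safety jointly rather than reporting success rate alone is what lets the benchmark surface the efficiency-safety trade-off the baselines exhibit.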

If this is right

  • MLLM agents currently lag traditional heuristics on many efficiency and safety metrics, indicating that semantic reasoning alone does not yet solve embodied aerial navigation.
  • Spatial memory across long flights emerges as a measurable bottleneck that future model architectures must address explicitly.
  • The documented trade-off between search thoroughness and collision risk supplies a concrete target for reward shaping in reinforcement learning or planning layers.
  • Dynamic weather and lighting conditions in the benchmark test robustness that static indoor navigation benchmarks omit.
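The trade-off named in the third bullet is the kind of target a shaped reward could encode directly. A minimal per-step sketch, with every coefficient an illustrative assumption rather than anything prescribed by the paper:

```python
def shaped_reward(new_area_m2: float,
                  min_obstacle_dist_m: float,
                  collided: bool,
                  alpha: float = 0.001,     # reward per m^2 of newly seen terrain
                  beta: float = 0.5,        # penalty scale for thin safety margins
                  safe_dist_m: float = 5.0, # clearance below which penalties start
                  crash_penalty: float = 100.0) -> float:
    """Per-step reward trading search coverage against collision risk."""
    r = alpha * new_area_m2                   # encourage rapid coverage
    if min_obstacle_dist_m < safe_dist_m:     # penalize shaving clearance margins
        r -= beta * (safe_dist_m - min_obstacle_dist_m)
    if collided:
        r -= crash_penalty                    # hard penalty on collision
    return r
```

Tuning alpha against beta and the crash penalty is exactly the thoroughness-versus-safety dial the benchmark's results make measurable.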

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents that succeed on ESARBench could be fine-tuned further on real flight logs to close the remaining sim-to-real gap in disaster zones.
  • The benchmark design could be extended to multi-UAV coordination or integration with ground teams without changing the core task definition.
  • Public release of the environments and tasks may accelerate standardized comparisons across research groups working on aerial embodied AI.

Load-bearing premise

The four Unreal Engine 5 environments built from GIS data, together with the 600 tasks and dynamic variables, sufficiently capture the essential difficulties of real-world UAV search and rescue operations.

What would settle it

Deploying the same MLLM agents on physical UAVs in outdoor areas that match the benchmark environments and checking whether the observed failure modes in spatial memory and safety decisions match the simulated results.
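One hedged way to operationalize that check: compare an agent's simulated and field crash rates with a two-proportion z-statistic. The counts below are placeholders, not reported data.

```python
from math import sqrt

def crash_rate_gap(sim_crashes: int, sim_trials: int,
                   real_crashes: int, real_trials: int) -> tuple[float, float]:
    """Return (rate difference, two-proportion z-statistic) for sim vs. real.

    A large |z| would suggest the simulated failure mode does not transfer.
    """
    p1 = sim_crashes / sim_trials
    p2 = real_crashes / real_trials
    # Pooled proportion under the null hypothesis of equal crash rates.
    p = (sim_crashes + real_crashes) / (sim_trials + real_trials)
    se = sqrt(p * (1 - p) * (1 / sim_trials + 1 / real_trials))
    return p1 - p2, (p1 - p2) / se
```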

Figures

Figures reproduced from arXiv: 2605.01371 by Daoxuan Zhang, Jianyi Zhou, Ping Chen, Shuo Yang.

Figure 1: Illustration of the Embodied Search and Rescue (ESAR) task workflow. Modeled after real-world cases, the ESAR mission unfolds across four sequential phases: Mission Start, Exploration, Clue Discovery, and Life Search. The UAV agent is initialized with basic environmental conditions and a textual prompt describing the target's last known trajectory. Throughout the flight, the agent utilizes continuous perce…

Figure 2: Overview of the UAV-ESAR Simulator and Benchmark construction pipeline. The framework consists of two parallel processes: (1) Environment Construction, which utilizes satellite imagery and DEM data to reconstruct high-fidelity terrains in Unreal Engine 5. (2) Task Generation, which discretizes continuous real-world SAR events into static time snapshots with varying parameters (weather, time of day, startin…

Figure 3: Environment Construction and Scenario Variations. The simulation environments are constructed by integrating real-world GIS data to ensure high terrain fidelity. The figure illustrates four distinct geographic environments with varying physical scales, ranging from 2 km × 2 km to 5 km × 5 km. The platform also features dynamic environmental configurations, supporting 13 different weather types and customizable t…

Figure 4: Dataset Statistics. (a) Word cloud analysis of the task prompts. (b) Proportion of tasks across different difficulty levels. (c) Distribution counts of various visual clues. (d) Histogram showing the distribution of initial distances to the goal. (e) Comprehensiveness of our evaluation metrics, quantified based on a confluence of factors, including weather severity, sky illumination, the average Euclidean …

Figure 5: More experimental results analysis. Crash rate, task time, and safe flight distance of different baseline methods. Larger red dots indicate higher crash rates. The results reveal a clear trade-off between search duration and flight safety: methods with stronger exploration ability often require longer task time. Meanwhile, the crash rates across most baselines indicate that safe long-horizon UAV operation …

Figure 6: Future Development of Embodied Search and Rescue (ESAR). This diagram illustrates the field's future development across four dimensions: (1) Operational Scenarios: Scaling from stable, unconstrained open spaces to unpredictable, restricted disaster zones. (2) Architectural Progression: Advancing from robust single-UAV to multi-UAV swarm collaboration. (3) Task Formulation: Diverging from the core mission o…

Figure 7: Visualization of four representative event examples. An event represents a complete, longitudinal real-world search and rescue incident that unfolds over an extended period. Each event is discretized into multiple static time snapshots.
Original abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive and unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to execute informed decision-making. Additionally, we present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, our benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the challenges in ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource to advance research on the Embodied Search and Rescue domain. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Embodied Search and Rescue (ESAR) task for MLLM-driven UAV agents and presents ESARBench as the first comprehensive benchmark for it. The benchmark comprises four large-scale, photorealistic environments constructed in Unreal Engine 5 from real GIS data, 600 tasks modeled on real-world rescue cases, dynamic variables (weather, time of day, stochastic clue placement), a set of evaluation metrics, and baseline evaluations ranging from heuristics to ground/aerial MLLM ObjectNav agents. Experimental results are said to reveal critical bottlenecks in spatial memory, aerial adaptation, and the search-efficiency versus flight-safety trade-off.

Significance. If the simulated environments and tasks can be shown to faithfully capture the essential difficulties of real UAV SAR operations, ESARBench would constitute a valuable standardized resource for the emerging area of embodied MLLM agents in robotics. The use of GIS-mapped UE5/AirSim environments, dynamic stochastic elements, and a sizable task set (600 instances) provides a concrete platform that could accelerate reproducible progress beyond traditional vision/path-planning methods.

major comments (2)
  1. [Abstract] Abstract: The headline finding that the benchmark 'reveals critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety' is not accompanied by any concrete metric definitions, baseline implementation details, quantitative results, or statistical support in the manuscript description, making it impossible to verify whether the reported bottlenecks are data-driven.
  2. [Environment Construction] Environment construction and task design: The four GIS-derived UE5 environments with dynamic variables are presented as highly realistic and representative of actual rescue operations, yet no quantitative comparison of task statistics or failure modes against real SAR incident reports, no expert validation of photorealism/dynamics, and no ablation demonstrating that the dynamic elements measurably alter agent behavior in field-mirroring ways are supplied; this directly undermines the central claim that the benchmark exposes representative bottlenecks.
minor comments (2)
  1. The manuscript would benefit from explicit definitions of the proposed evaluation metrics (e.g., success rate, efficiency, safety) and their formulas in a dedicated section or table.
  2. Baseline implementation details (e.g., prompt templates for MLLM agents, exact AirSim integration parameters) should be expanded to support reproducibility, even if code is linked.
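On minor comment 1, one standard candidate formula from the embodied-navigation literature is Success weighted by Path Length (Anderson et al., 2018). Whether ESARBench adopts it, or a variant of it, is an assumption here; it is sketched only to show what an explicit metric definition would look like.

```python
def spl(successes: list[bool], shortest: list[float], taken: list[float]) -> float:
    """Success weighted by Path Length (Anderson et al., 2018):

        SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)

    where S_i is episode success, l_i the shortest-path length to the goal,
    and p_i the path length the agent actually flew.
    """
    n = len(successes)
    return sum(
        (l / max(p, l)) if s else 0.0
        for s, l, p in zip(successes, shortest, taken)
    ) / n
```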

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline finding that the benchmark 'reveals critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety' is not accompanied by any concrete metric definitions, baseline implementation details, quantitative results, or statistical support in the manuscript description, making it impossible to verify whether the reported bottlenecks are data-driven.

    Authors: We agree that the abstract, being a concise summary, does not embed the supporting details. The full manuscript defines the evaluation metrics in Section 4.2, describes baseline implementations (heuristics and MLLM ObjectNav agents) in Section 5.1, and reports quantitative results with statistical analysis, tables, and figures in Section 5.3. To improve immediate verifiability of the headline claims, we will revise the abstract to include key quantitative highlights and specific observations drawn from the experimental results. revision: yes

  2. Referee: [Environment Construction] Environment construction and task design: The four GIS-derived UE5 environments with dynamic variables are presented as highly realistic and representative of actual rescue operations, yet no quantitative comparison of task statistics or failure modes against real SAR incident reports, no expert validation of photorealism/dynamics, and no ablation demonstrating that the dynamic elements measurably alter agent behavior in field-mirroring ways are supplied; this directly undermines the central claim that the benchmark exposes representative bottlenecks.

    Authors: Sections 3.1 and 3.2 detail the GIS-based construction and the modeling of the 600 tasks on real-world rescue cases. The manuscript does not include direct quantitative comparisons to real SAR incident statistics, formal expert validation of photorealism, or a dedicated ablation isolating dynamic variables. Our experiments in Section 5 do demonstrate performance differences under varying weather, time-of-day, and clue-placement conditions. We will revise the manuscript to expand the discussion of design rationale with additional references to SAR literature, clarify limitations regarding real-world validation, and add an ablation study on the dynamic elements using existing experimental data to show their measurable impact on agent behavior. revision: partial
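The ablation-from-existing-logs the authors propose could be as simple as grouping episode outcomes by one dynamic variable at a time. A minimal sketch, with the record keys ("weather", "success") as illustrative assumptions about the log format:

```python
from collections import defaultdict

def ablate(results: list[dict], factor: str) -> dict[str, float]:
    """Mean success rate per level of one dynamic variable (e.g. 'weather'),
    computed from already-collected episode records."""
    buckets: defaultdict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r[factor]].append(1.0 if r["success"] else 0.0)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

A measurable spread across levels (e.g. clear vs. fog) would support the claim that the dynamic elements alter agent behavior, which is what the referee asked to see demonstrated.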

Circularity Check

0 steps flagged

No circularity: benchmark construction is self-contained with no derivations or self-referential reductions

Full rationale

The paper introduces a new task (ESAR) and benchmark (ESARBench) by constructing four GIS-mapped UE5 environments, 600 tasks, dynamic variables, and evaluation metrics, then runs external baselines (heuristics and MLLM agents) to report empirical performance. No equations, parameter fitting, predictions, or uniqueness theorems appear in the provided text. Claims about bottlenecks follow directly from the defined simulation runs without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The representativeness of the environments is an external-validity question, not a circularity issue in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that photorealistic simulation environments and real-case-modeled tasks are representative of actual SAR operations; no free parameters, mathematical axioms, or new invented entities are introduced.

axioms (1)
  • domain assumption: Unreal Engine 5 and AirSim environments built from real GIS data, together with stochastic weather, time, and clue placement, provide a sufficiently realistic proxy for evaluating embodied UAV SAR agents.
    Invoked when constructing the four large-scale open environments and the 600 tasks.

pith-pipeline@v0.9.0 · 5619 in / 1332 out tokens · 39188 ms · 2026-05-09T14:42:22.683963+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 41 canonical work pages · 3 internal anchors
