pith. machine review for the scientific record.

arxiv: 2604.07973 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3

classification 💻 cs.AI
keywords embodied navigation · large multimodal models · urban airspace · spatial decision-making · benchmark · decision bifurcation · 3D navigation · goal-oriented action

The pith

Large multimodal models show initial spatial navigation skills in urban 3D airspace but remain far from human performance, with errors diverging rapidly after critical decision points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark of 5,037 goal-oriented navigation samples in complex urban 3D environments to measure whether large multimodal models can match human spatial action and decision-making. Tests across 17 models reveal emerging action capabilities that still lag human baselines by a wide margin. The work finds that navigation mistakes do not grow steadily but instead cause rapid divergence from the target once a key choice point is mishandled. Analysis of model behavior at these bifurcation points reveals specific weaknesses, and four targeted improvement directions are tested experimentally.

Core claim

Large multimodal models possess emerging capacities for goal-oriented embodied navigation in complex urban three-dimensional spaces, demonstrated through performance on a dataset of 5,037 samples that emphasizes vertical actions and rich semantic cues, yet they remain substantially below human baselines. Navigation errors do not accumulate linearly but diverge rapidly from the destination after a critical decision bifurcation, with limitations traceable to behavior at those points.

What carries the argument

The critical decision bifurcation: the point in a navigation path where a single choice causes subsequent errors to branch away rapidly from the goal rather than accumulate gradually.
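
To make this concrete, here is a minimal sketch of how such a bifurcation step might be located in a logged episode, assuming access to a per-step distance-to-goal trace. The function name, window, and ratio threshold are illustrative choices, not the paper's definition.

    import numpy as np

    def find_bifurcation_step(dist_to_goal, window=3, ratio=2.5):
        """Flag the step after which distance-to-goal diverges rather than
        shrinking: step t is flagged when the mean distance over the next
        `window` steps exceeds `ratio` times the best (minimum) distance
        reached so far. Returns the step index, or None if the episode
        never diverges. `window` and `ratio` are illustrative knobs."""
        d = np.asarray(dist_to_goal, dtype=float)
        best_so_far = np.minimum.accumulate(d)
        for t in range(len(d) - window):
            ahead = d[t + 1 : t + 1 + window].mean()
            if ahead > ratio * max(best_so_far[t], 1e-6):
                return t
        return None

    # Toy episode: steady approach, then a mishandled choice at step 5,
    # after which the error diverges instead of accumulating linearly.
    episode = [100, 80, 60, 45, 30, 25, 60, 110, 170, 240]
    print(find_bifurcation_step(episode))  # -> 5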

If this is right

  • Improvements in geometric perception directly address the models' difficulty with 3D vertical movements and urban structures.
  • Better cross-view understanding allows models to integrate information across different camera angles during navigation.
  • Stronger spatial imagination and long-term memory reduce the rapid divergence that follows wrong choices at bifurcation points.
  • Agent-based and vision-language-action approaches show partial gains but still require the same targeted upgrades to approach human performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rapid divergence pattern suggests that future benchmarks should isolate and score performance specifically at decision bifurcation points rather than only final goal distance (one possible formulation is sketched after this list).
  • The same evaluation approach could be extended to other embodied tasks such as drone-based delivery or robotic inspection in 3D environments.
  • Architectural changes focused on 3D reasoning may prove more effective than simply increasing model scale for closing the gap to human spatial action.
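
On the first point above, a hedged sketch of what bifurcation-localized scoring could look like, reusing find_bifurcation_step from the earlier sketch. The episode fields (dist_to_goal, chosen_action, oracle_action) are a hypothetical logging schema, not the benchmark's actual format.

    def bifurcation_score(episodes):
        """Score a model only at detected choice points rather than by
        final goal distance: the fraction of diverging episodes in which
        the model's action at the bifurcation step matches what an
        oracle planner would do there."""
        hits = total = 0
        for ep in episodes:
            t = find_bifurcation_step(ep["dist_to_goal"])
            if t is None:
                continue  # episode never diverges -> no signal for this metric
            total += 1
            hits += ep["chosen_action"][t] == ep["oracle_action"][t]
        return hits / total if total else float("nan")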

Load-bearing premise

The 5,037 samples and associated human baselines accurately capture the full scope of human-level spatial decision-making in urban 3D airspace without selection bias or annotation artifacts.

What would settle it

Measure whether models that receive explicit training or prompting to detect and correct choices at the identified decision bifurcations achieve substantially higher success rates and reduced divergence compared with the baseline models on the same 5,037 tasks.
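
A minimal harness for that comparison might look like the following, under the same hypothetical episode schema as above, with a "guided" arm whose prompt adds an instruction to pause and reconsider at suspected choice points. This is a sketch of the proposed test, not the paper's protocol.

    import numpy as np

    def compare_arms(baseline_eps, guided_eps):
        """Summarize an A/B run over the same tasks: success rate plus
        mean post-bifurcation divergence, i.e. how far the final
        distance-to-goal ends up above the distance at the detected
        choice point."""
        def summarize(eps):
            succ = np.mean([ep["success"] for ep in eps])
            divs = []
            for ep in eps:
                t = find_bifurcation_step(ep["dist_to_goal"])
                if t is not None:
                    divs.append(ep["dist_to_goal"][-1] - ep["dist_to_goal"][t])
            return succ, (np.mean(divs) if divs else 0.0)

        for name, eps in (("baseline", baseline_eps), ("guided", guided_eps)):
            s, d = summarize(eps)
            print(f"{name}: success {s:.1%}, mean post-bifurcation divergence {d:.1f}")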

Figures

Figures reproduced from arXiv: 2604.07973 by Baining Zhao, Chen Gao, Jiacheng Xu, Jianjie Fang, Qian Zhang, Weichen Zhang, Xinlei Chen, Yanggang Xu, Yatai Ji, Zile Zhou, Ziyou Wang.

Figure 1. Overview of the proposed benchmark. Goal-oriented embodied navigation in urban airspace is defined as: given …

Figure 2. a. Dataset Construction Pipeline. b. The length distribution of navigation trajectories. c. Proportion of various types …

Figure 3. The change in navigation completion progress (%) as a function of navigation steps. Navigation completion progress is …

Figure 4. A goal-oriented embodied navigation case: GPT …

Figure 5. The gaps between LMMs and humans in spatial actions can be summarized into four aspects: a. Insufficient ability in …

Figure 6. Experimental design to enhance spatial action capability of LMMs: a. Geometric Perception Enhancement. b. Cross …

Figure 7. Navigation dataset examples.

Figure 8. Navigation prompt details.
read the original abstract

Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. The limitations of LMMs are investigated by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a dataset of 5,037 goal-oriented navigation samples in urban 3D airspace over 500 hours; evaluates 17 LMMs and related methods (non-reasoning, reasoning, agent-based, and vision-language-action models); reports that current LMMs exhibit emerging action capabilities but remain far from human-level performance; identifies a non-linear divergence of navigation errors after critical decision bifurcations; analyzes model behavior at these points; and experimentally tests four improvement directions (geometric perception, cross-view understanding, spatial imagination, and long-term memory).

Significance. If the benchmark holds, the work offers a timely empirical assessment of spatial decision-making gaps in LMMs for complex 3D embodied tasks, with the bifurcation phenomenon providing a concrete lens on failure modes and the tested directions offering actionable paths forward for model improvement in embodied AI.

major comments (2)
  1. [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.
  2. [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.
minor comments (2)
  1. The abstract and results could more explicitly define the navigation success metrics and how error divergence is quantified (e.g., distance thresholds or trajectory metrics); one standard formulation is sketched after this list.
  2. A summary table listing the 17 models by category, size, and key results would improve readability of the model coverage.
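
On the first minor point: a widely used choice in embodied navigation is Success weighted by Path Length (SPL) from Anderson et al. (2018), one of the works this paper builds on. A minimal reference implementation, offered as an example of the explicitness the referee asks for, not as the paper's actual metric:

    def spl(successes, shortest_lengths, path_lengths):
        """Success weighted by Path Length (Anderson et al., 2018):
            SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)
        with S_i the binary success flag, l_i the shortest-path length
        from start to goal, and p_i the length of the path the agent
        actually took."""
        return sum(
            s * l / max(p, l)
            for s, l, p in zip(successes, shortest_lengths, path_lengths)
        ) / len(successes)

    # Toy: a clean success, a success on a 2x-longer-than-optimal path, a failure.
    print(spl([1, 1, 0], [10.0, 20.0, 15.0], [10.0, 40.0, 30.0]))  # -> 0.5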

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, proposing specific revisions to improve the manuscript's rigor and transparency.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.

    Authors: We acknowledge that the current manuscript provides limited quantitative characterization of the dataset beyond the total sample count, construction time, and qualitative emphasis on 3D vertical actions and urban semantics. To address verifiability, the revised version will include a new subsection in the methods detailing the sampling frame, quantitative diversity metrics (e.g., distributions of vertical displacement, action sequence lengths, and semantic category coverage), and any available alignment with real-world urban airspace statistics. These additions will strengthen support for the reported error divergence and human-level gap claims. (A sketch of such a characterization follows this exchange.) revision: yes

  2. Referee: [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.

    Authors: We agree that the human baseline protocol requires fuller documentation to substantiate the performance comparison. The revised manuscript will expand the evaluation section with explicit details on the collection method, number of participants, inter-annotator agreement measures, and quality controls employed. This will make the human-LMM gap more transparent and verifiable without altering the reported results. (An agreement-statistic sketch also follows this exchange.) revision: yes
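
On point 1, the promised quantitative characterization could be as simple as the report sketched below, under an assumed per-sample schema (the field names waypoints, actions, and semantic_tags are hypothetical, not the dataset's actual format).

    from collections import Counter
    import numpy as np

    def dataset_diversity_report(samples):
        """Summarize vertical displacement, action sequence length, and
        semantic category coverage for a list of navigation samples.
        Each sample is assumed to carry (x, y, z) waypoints, a list of
        action tokens, and a set of urban semantic tags."""
        vert = [abs(s["waypoints"][-1][2] - s["waypoints"][0][2]) for s in samples]
        lengths = [len(s["actions"]) for s in samples]
        tags = Counter(t for s in samples for t in s["semantic_tags"])
        print(f"vertical displacement: median {np.median(vert):.1f} m, "
              f"90th pct {np.percentile(vert, 90):.1f} m")
        print(f"action sequence length: mean {np.mean(lengths):.1f}, max {max(lengths)}")
        print(f"semantic categories: {len(tags)} distinct; top 5: {tags.most_common(5)}")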
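
On point 2, inter-annotator agreement for the human baseline could be reported with Fleiss' kappa, a standard statistic when several raters label the same items; a self-contained sketch:

    import numpy as np

    def fleiss_kappa(counts):
        """Fleiss' kappa: `counts` is an (items x categories) array where
        counts[i, j] is how many of the n raters put item i in category j
        (n constant across items). kappa = (P_bar - P_e) / (1 - P_e)."""
        counts = np.asarray(counts, dtype=float)
        n = counts[0].sum()                      # raters per item
        p_j = counts.sum(axis=0) / counts.sum()  # marginal category proportions
        p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
        p_bar, p_e = p_i.mean(), np.square(p_j).sum()
        return (p_bar - p_e) / (1 - p_e)

    # Toy check: 4 items, 3 raters, 2 categories (e.g., "optimal action" or not).
    print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]), 3))  # -> 0.333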

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential predictions

full rationale

This is an empirical benchmark paper that constructs a dataset of 5,037 navigation samples over 500 hours and evaluates 17 external models against human baselines. No mathematical derivations, equations, fitted parameters, or first-principles claims are present. The reported phenomena (emerging capabilities, non-linear error divergence) are direct experimental observations, not reductions to inputs by construction. No load-bearing self-citations or ansatz smuggling occur; the work is a self-contained empirical evaluation of external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that models remain far from human-level rests primarily on the unverified quality and representativeness of the newly constructed dataset and the assumption that the tested models are representative of current LMM capabilities.

axioms (1)
  • domain assumption: The 5,037 navigation samples accurately capture the distribution of human-level spatial decision-making required for goal-oriented tasks in urban 3D airspace.
    Invoked to support the claim that current models fall short of human performance.

pith-pipeline@v0.9.0 · 5560 in / 1398 out tokens · 58146 ms · 2026-05-10T18:06:45.181182+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.



    Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12689–12699. A Details of Dataset The following Figure 7 presents the first-person view images of the goal-riented embodied navigation, showing several examples and the navigation tasks. The tasks ar...