pith. machine review for the scientific record.

arxiv: 2604.07973 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3

classification 💻 cs.AI
keywords embodied navigation · large multimodal models · urban airspace · spatial decision-making · benchmark · decision bifurcation · 3D navigation · goal-oriented action

The pith

Large multimodal models show initial spatial navigation skills in urban 3D airspace but remain far from human performance, with errors diverging rapidly after critical decision points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark of 5,037 goal-oriented navigation samples in complex urban 3D environments to measure whether large multimodal models can match human spatial action and decision-making. Tests across 17 models reveal emerging action capabilities that still lag human baselines by a wide margin. The work finds that navigation mistakes do not grow steadily but instead cause rapid divergence from the target once a key choice point is mishandled. Analysis of model behavior at these bifurcation points reveals specific weaknesses, and four targeted improvement directions are tested experimentally.

Core claim

Large multimodal models possess emerging capacities for goal-oriented embodied navigation in complex urban three-dimensional spaces, demonstrated through performance on a dataset of 5,037 samples that emphasizes vertical actions and rich semantic cues, yet they remain substantially below human baselines. Navigation errors do not accumulate linearly but diverge rapidly from the destination after a critical decision bifurcation, with limitations traceable to behavior at those points.

What carries the argument

The critical decision bifurcation: the point in a navigation path where a single choice causes subsequent errors to branch away rapidly from the goal rather than accumulate gradually.
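
To make this concrete, here is a minimal sketch of how such a bifurcation step might be located in a logged episode, assuming access to a per-step distance-to-goal trace. The function name, window, and ratio threshold are illustrative choices, not the paper's definition.

    import numpy as np

    def find_bifurcation_step(dist_to_goal, window=3, ratio=2.5):
        """Flag the step after which distance-to-goal diverges rather than
        shrinking: step t is flagged when the mean distance over the next
        `window` steps exceeds `ratio` times the best (minimum) distance
        reached so far. Returns the step index, or None if the episode
        never diverges. `window` and `ratio` are illustrative knobs."""
        d = np.asarray(dist_to_goal, dtype=float)
        best_so_far = np.minimum.accumulate(d)
        for t in range(len(d) - window):
            ahead = d[t + 1 : t + 1 + window].mean()
            if ahead > ratio * max(best_so_far[t], 1e-6):
                return t
        return None

    # Toy episode: steady approach, then a mishandled choice at step 5,
    # after which the error diverges instead of accumulating linearly.
    episode = [100, 80, 60, 45, 30, 25, 60, 110, 170, 240]
    print(find_bifurcation_step(episode))  # -> 5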

If this is right

  • Improvements in geometric perception directly address the models' difficulty with 3D vertical movements and urban structures.
  • Better cross-view understanding allows models to integrate information across different camera angles during navigation.
  • Stronger spatial imagination and long-term memory reduce the rapid divergence that follows wrong choices at bifurcation points.
  • Agent-based and vision-language-action approaches show partial gains but still require the same targeted upgrades to approach human performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rapid divergence pattern suggests that future benchmarks should isolate and score performance specifically at decision bifurcation points rather than only final goal distance (one possible formulation is sketched after this list).
  • The same evaluation approach could be extended to other embodied tasks such as drone-based delivery or robotic inspection in 3D environments.
  • Architectural changes focused on 3D reasoning may prove more effective than simply increasing model scale for closing the gap to human spatial action.
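
On the first point above, a hedged sketch of what bifurcation-localized scoring could look like, reusing find_bifurcation_step from the earlier sketch. The episode fields (dist_to_goal, chosen_action, oracle_action) are a hypothetical logging schema, not the benchmark's actual format.

    def bifurcation_score(episodes):
        """Score a model only at detected choice points rather than by
        final goal distance: the fraction of diverging episodes in which
        the model's action at the bifurcation step matches what an
        oracle planner would do there."""
        hits = total = 0
        for ep in episodes:
            t = find_bifurcation_step(ep["dist_to_goal"])
            if t is None:
                continue  # episode never diverges -> no signal for this metric
            total += 1
            hits += ep["chosen_action"][t] == ep["oracle_action"][t]
        return hits / total if total else float("nan")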

Load-bearing premise

The 5,037 samples and associated human baselines accurately capture the full scope of human-level spatial decision-making in urban 3D airspace without selection bias or annotation artifacts.

What would settle it

Measure whether models that receive explicit training or prompting to detect and correct choices at the identified decision bifurcations achieve substantially higher success rates and reduced divergence compared with the baseline models on the same 5,037 tasks.
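
A minimal harness for that comparison might look like the following, under the same hypothetical episode schema as above, with a "guided" arm whose prompt adds an instruction to pause and reconsider at suspected choice points. This is a sketch of the proposed test, not the paper's protocol.

    import numpy as np

    def compare_arms(baseline_eps, guided_eps):
        """Summarize an A/B run over the same tasks: success rate plus
        mean post-bifurcation divergence, i.e. how far the final
        distance-to-goal ends up above the distance at the detected
        choice point."""
        def summarize(eps):
            succ = np.mean([ep["success"] for ep in eps])
            divs = []
            for ep in eps:
                t = find_bifurcation_step(ep["dist_to_goal"])
                if t is not None:
                    divs.append(ep["dist_to_goal"][-1] - ep["dist_to_goal"][t])
            return succ, (np.mean(divs) if divs else 0.0)

        for name, eps in (("baseline", baseline_eps), ("guided", guided_eps)):
            s, d = summarize(eps)
            print(f"{name}: success {s:.1%}, mean post-bifurcation divergence {d:.1f}")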

Figures

Figures reproduced from arXiv: 2604.07973 by Baining Zhao, Chen Gao, Jiacheng Xu, Jianjie Fang, Qian Zhang, Weichen Zhang, Xinlei Chen, Yanggang Xu, Yatai Ji, Zile Zhou, Ziyou Wang.

Figure 1. Overview of the proposed benchmark. Goal-oriented embodied navigation in urban airspace is defined as: given …

Figure 2. a. Dataset Construction Pipeline. b. The length distribution of navigation trajectories. c. Proportion of various types …

Figure 3. The change in navigation completion progress (%) as a function of navigation steps. Navigation completion progress is …

Figure 4. A goal-oriented embodied navigation case: GPT …

Figure 5. The gaps between LMMs and humans in spatial actions can be summarized into four aspects: a. Insufficient ability in …

Figure 6. Experimental design to enhance spatial action capability of LMMs: a. Geometric Perception Enhancement. b. Cross …

Figure 7. Navigation dataset examples.

Figure 8. Navigation prompt details.
read the original abstract

Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. The limitations of LMMs are investigated by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a dataset of 5,037 goal-oriented navigation samples in urban 3D airspace over 500 hours; evaluates 17 LMMs and related methods (non-reasoning, reasoning, agent-based, and vision-language-action models); reports that current LMMs exhibit emerging action capabilities but remain far from human-level performance; identifies a non-linear divergence of navigation errors after critical decision bifurcations; analyzes model behavior at these points; and experimentally tests four improvement directions (geometric perception, cross-view understanding, spatial imagination, and long-term memory).

Significance. If the benchmark holds, the work offers a timely empirical assessment of spatial decision-making gaps in LMMs for complex 3D embodied tasks, with the bifurcation phenomenon providing a concrete lens on failure modes and the tested directions offering actionable paths forward for model improvement in embodied AI.

major comments (2)
  1. [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.
  2. [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.
minor comments (2)
  1. The abstract and results could more explicitly define the navigation success metrics and how error divergence is quantified (e.g., distance thresholds or trajectory metrics); one standard formulation is sketched after this list.
  2. A summary table listing the 17 models by category, size, and key results would improve readability of the model coverage.
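
On the first minor point: a widely used choice in embodied navigation is Success weighted by Path Length (SPL) from Anderson et al. (2018), one of the works this paper builds on. A minimal reference implementation, offered as an example of the explicitness the referee asks for, not as the paper's actual metric:

    def spl(successes, shortest_lengths, path_lengths):
        """Success weighted by Path Length (Anderson et al., 2018):
            SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)
        with S_i the binary success flag, l_i the shortest-path length
        from start to goal, and p_i the length of the path the agent
        actually took."""
        return sum(
            s * l / max(p, l)
            for s, l, p in zip(successes, shortest_lengths, path_lengths)
        ) / len(successes)

    # Toy: a clean success, a success on a 2x-longer-than-optimal path, a failure.
    print(spl([1, 1, 0], [10.0, 20.0, 15.0], [10.0, 40.0, 30.0]))  # -> 0.5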

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, proposing specific revisions to improve the manuscript's rigor and transparency.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.

    Authors: We acknowledge that the current manuscript provides limited quantitative characterization of the dataset beyond the total sample count, construction time, and qualitative emphasis on 3D vertical actions and urban semantics. To address verifiability, the revised version will include a new subsection in the methods detailing the sampling frame, quantitative diversity metrics (e.g., distributions of vertical displacement, action sequence lengths, and semantic category coverage), and any available alignment with real-world urban airspace statistics. These additions will strengthen support for the reported error divergence and human-level gap claims. (A sketch of such a characterization follows this exchange.) revision: yes

  2. Referee: [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.

    Authors: We agree that the human baseline protocol requires fuller documentation to substantiate the performance comparison. The revised manuscript will expand the evaluation section with explicit details on the collection method, number of participants, inter-annotator agreement measures, and quality controls employed. This will make the human-LMM gap more transparent and verifiable without altering the reported results. (An agreement-statistic sketch also follows this exchange.) revision: yes
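
On point 1, the promised quantitative characterization could be as simple as the report sketched below, under an assumed per-sample schema (the field names waypoints, actions, and semantic_tags are hypothetical, not the dataset's actual format).

    from collections import Counter
    import numpy as np

    def dataset_diversity_report(samples):
        """Summarize vertical displacement, action sequence length, and
        semantic category coverage for a list of navigation samples.
        Each sample is assumed to carry (x, y, z) waypoints, a list of
        action tokens, and a set of urban semantic tags."""
        vert = [abs(s["waypoints"][-1][2] - s["waypoints"][0][2]) for s in samples]
        lengths = [len(s["actions"]) for s in samples]
        tags = Counter(t for s in samples for t in s["semantic_tags"])
        print(f"vertical displacement: median {np.median(vert):.1f} m, "
              f"90th pct {np.percentile(vert, 90):.1f} m")
        print(f"action sequence length: mean {np.mean(lengths):.1f}, max {max(lengths)}")
        print(f"semantic categories: {len(tags)} distinct; top 5: {tags.most_common(5)}")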
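
On point 2, inter-annotator agreement for the human baseline could be reported with Fleiss' kappa, a standard statistic when several raters label the same items; a self-contained sketch:

    import numpy as np

    def fleiss_kappa(counts):
        """Fleiss' kappa: `counts` is an (items x categories) array where
        counts[i, j] is how many of the n raters put item i in category j
        (n constant across items). kappa = (P_bar - P_e) / (1 - P_e)."""
        counts = np.asarray(counts, dtype=float)
        n = counts[0].sum()                      # raters per item
        p_j = counts.sum(axis=0) / counts.sum()  # marginal category proportions
        p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
        p_bar, p_e = p_i.mean(), np.square(p_j).sum()
        return (p_bar - p_e) / (1 - p_e)

    # Toy check: 4 items, 3 raters, 2 categories (e.g., "optimal action" or not).
    print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]), 3))  # -> 0.333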

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential predictions

full rationale

This is an empirical benchmark paper that constructs a dataset of 5,037 navigation samples over 500 hours and evaluates 17 external models against human baselines. No mathematical derivations, equations, fitted parameters, or first-principles claims are present. The reported phenomena (emerging capabilities, non-linear error divergence) are direct experimental observations, not reductions to inputs by construction. No load-bearing self-citations or ansatz smuggling occur; the work is a self-contained empirical evaluation of external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that models remain far from human-level rests primarily on the unverified quality and representativeness of the newly constructed dataset and the assumption that the tested models are representative of current LMM capabilities.

axioms (1)
  • domain assumption: The 5,037 navigation samples accurately capture the distribution of human-level spatial decision-making required for goal-oriented tasks in urban 3D airspace.
    Invoked to support the claim that current models fall short of human performance.

pith-pipeline@v0.9.0 · 5560 in / 1398 out tokens · 58146 ms · 2026-05-10T18:06:45.181182+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.



    Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12689–12699. A Details of Dataset The following Figure 7 presents the first-person view images of the goal-riented embodied navigation, showing several examples and the navigation tasks. The tasks ar...