How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3
The pith
Large multimodal models show initial spatial navigation skills in urban 3D airspace but remain far from human performance, with errors diverging rapidly after critical decision points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large multimodal models possess emerging capacities for goal-oriented embodied navigation in complex urban three-dimensional spaces, demonstrated through performance on a dataset of 5,037 samples that emphasizes vertical actions and rich semantic cues, yet they remain substantially below human baselines. Navigation errors do not accumulate linearly but diverge rapidly from the destination after a critical decision bifurcation, and the models' limitations are traceable to their behavior at those points.
What carries the argument
The critical decision bifurcation: the point in a navigation path where a single choice causes subsequent errors to branch rapidly away from the goal rather than accumulate gradually.
If this is right
- Improvements in geometric perception directly address the models' difficulty with 3D vertical movements and urban structures.
- Better cross-view understanding allows models to integrate information across different camera angles during navigation.
- Stronger spatial imagination and long-term memory reduce the rapid divergence that follows wrong choices at bifurcation points.
- Agent-based and vision-language-action approaches show partial gains but still require the same targeted upgrades to approach human performance.
Where Pith is reading between the lines
- The rapid divergence pattern suggests that future benchmarks should isolate and score performance specifically at decision bifurcation points rather than only final goal distance.
- The same evaluation approach could be extended to other embodied tasks such as drone-based delivery or robotic inspection in 3D environments.
- Architectural changes focused on 3D reasoning may prove more effective than simply increasing model scale for closing the gap to human spatial action.
Load-bearing premise
The 5,037 samples and associated human baselines accurately capture the full scope of human-level spatial decision-making in urban 3D airspace without selection bias or annotation artifacts.
What would settle it
Measure whether models that receive explicit training or prompting to detect and correct choices at the identified decision bifurcations achieve substantially higher success rates and reduced divergence compared with the baseline models on the same 5,037 tasks.
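As a rough illustration of that settling experiment, the snippet below applies a two-proportion z-test to success counts for a baseline model versus a bifurcation-aware variant on the same task set. The counts, the function name, and the significance threshold are all hypothetical placeholders, not numbers from the paper.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for H0: the two variants have equal success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled success rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error under H0
    return (p_a - p_b) / se

# Hypothetical counts on the 5,037 tasks: baseline vs. bifurcation-aware prompting.
z = two_proportion_z(1612, 5037, 2116, 5037)
print(round(z, 2))  # a large |z| suggests the gain is unlikely to be chance
```

A |z| above roughly 2 would reject equal success rates at the usual 5% level; with samples this large, even small rate differences register as significant, so effect size should be reported alongside.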
Original abstract
Large multimodal models (LMMs) show strong visual-linguistic reasoning but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like human through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. The limitations of LMMs are investigated by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper spends over 500 hours constructing a dataset of 5,037 goal-oriented navigation samples in urban 3D airspace and evaluates 17 LMMs and related methods (non-reasoning, reasoning, agent-based, and vision-language-action models). It reports that current LMMs exhibit emerging action capabilities but remain far from human-level performance, identifies a non-linear divergence of navigation errors after critical decision bifurcations, analyzes model behavior at these points, and experimentally tests four improvement directions (geometric perception, cross-view understanding, spatial imagination, and long-term memory).
Significance. If the benchmark holds, the work offers a timely empirical assessment of spatial decision-making gaps in LMMs for complex 3D embodied tasks, with the bifurcation phenomenon providing a concrete lens on failure modes and the tested directions offering actionable paths forward for model improvement in embodied AI.
Major comments (2)
- [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.
- [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.
Minor comments (2)
- The abstract and results could more explicitly define the navigation success metrics and how error divergence is quantified (e.g., distance thresholds or trajectory metrics).
- A summary table listing the 17 models by category, size, and key results would improve readability of the model coverage.
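One way the error-divergence quantification the first minor comment asks for could look in practice: compute per-step distance to goal along a trajectory and locate the last step at which the agent was still closing on the goal. The heuristic below is an assumption about how such a metric might be defined, not the paper's actual definition; the trajectory and function names are invented for the sketch.

```python
import math

def distance_to_goal(traj, goal):
    """Euclidean distance from each 3D waypoint to the goal."""
    return [math.dist(p, goal) for p in traj]

def bifurcation_step(traj, goal):
    """Heuristic bifurcation point: the last step at which the agent still
    reduced its distance to the goal, provided every later step moves away
    (rapid divergence rather than linear error accumulation). Returns None
    if the trajectory never diverges."""
    d = distance_to_goal(traj, goal)
    last_improving = None
    for i in range(1, len(d)):
        if d[i] < d[i - 1]:
            last_improving = i
    if last_improving is None or last_improving == len(d) - 1:
        return None
    if all(d[i] >= d[i - 1] for i in range(last_improving + 1, len(d))):
        return last_improving
    return None

# Toy trajectory: approaches the goal for two steps, then turns away.
traj = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (2, 2, 2), (2, 4, 2)]
goal = (3, 0, 3)
print(bifurcation_step(traj, goal))  # → 2
```

Scoring each trajectory at this step, rather than only at the endpoint, would let a benchmark isolate decision quality at the bifurcation from accumulated drift.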
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, proposing specific revisions to improve the manuscript's rigor and transparency.
Point-by-point responses
Referee: [Dataset construction] Dataset construction (abstract and corresponding methods section): the 5,037 samples are described as high-quality with emphasis on 3D vertical actions and urban semantics, but no quantitative diversity metrics, sampling frame, or validation against real-world urban flight distributions are provided. This directly affects verifiability of the human-level gap and the reported non-linear error divergence after bifurcations.
Authors: We acknowledge that the current manuscript provides limited quantitative characterization of the dataset beyond the total sample count, construction time, and qualitative emphasis on 3D vertical actions and urban semantics. To address verifiability, the revised version will include a new subsection in the methods detailing the sampling frame, quantitative diversity metrics (e.g., distributions of vertical displacement, action sequence lengths, and semantic category coverage), and any available alignment with real-world urban airspace statistics. These additions will strengthen support for the reported error divergence and human-level gap claims. Revision: yes.
Referee: [Evaluation] Human baseline protocol (abstract and evaluation section): no details on collection method, number of participants, inter-annotator agreement, or quality controls are given. Without these, the central claim that LMMs remain 'far from human-level' rests on an unverified comparison.
Authors: We agree that the human baseline protocol requires fuller documentation to substantiate the performance comparison. The revised manuscript will expand the evaluation section with explicit details on the collection method, number of participants, inter-annotator agreement measures, and quality controls employed. This will make the human-LMM gap more transparent and verifiable without altering the reported results. Revision: yes.
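For concreteness, inter-annotator agreement on discrete action choices is often reported with a statistic such as Cohen's kappa. The sketch below is illustrative only: the annotator labels are invented, and nothing suggests this is the statistic the authors would actually adopt.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels on the same items:
    1.0 is perfect agreement, 0.0 is chance-level agreement."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators for the action at six decision points.
ann1 = ["up", "up", "left", "forward", "up", "left"]
ann2 = ["up", "down", "left", "forward", "up", "left"]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.76
```

Reporting a chance-corrected statistic like this, rather than raw percent agreement, is what makes the human baseline auditable.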
Circularity Check
No circularity: purely empirical benchmark with no derivations or self-referential predictions
Full rationale
This is an empirical benchmark paper that constructs a dataset of 5,037 navigation samples over 500 hours and evaluates 17 external models against human baselines. No mathematical derivations, equations, fitted parameters, or first-principles claims are present. The reported phenomena (emerging capabilities, non-linear error divergence) are direct experimental observations, not quantities reduced to inputs by construction. No self-citation load-bearing steps or ansatz smuggling occur; the work stands on its own as an evaluation of external models.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The 5,037 navigation samples accurately capture the distribution of human-level spatial decision-making required for goal-oriented tasks in urban 3D airspace.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. 2018. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- [3] Alibaba Cloud. 2025. Qwen Documentation. https://tongyi.aliyun.com/. Accessed: 2025-09-24.
- [4] Malik Doole, Joost Ellerbroek, and Jacco Hoekstra. 2020. Estimation of traffic density from drone-based delivery in very low level urban airspace. Journal of Air Transport Management 88, 101862.
- [6] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
- [9] Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, and Xin Wang. 2023. Aerial Vision-and-Dialog Navigation. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 3043–3061. doi:10.18653/v1/2023.findings-acl.190.
- [11] Jie Feng, Jinwei Zeng, Qingyue Long, Hongyi Chen, Jie Zhao, Yanxin Xi, Zhilun Zhou, Yuan Yuan, Shengyuan Wang, Qingbin Zeng, et al. 2025. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents, smart cities, and earth science. arXiv preprint arXiv:2504.09848.
- [12] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. 2023. CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23171–23181.
- [13] Chen Gao, Baining Zhao, Weichen Zhang, Jun Zhang, Jinzhu Mao, Zhiheng Zheng, Fanhang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, Xinlei Chen, and Yong Li. 2024. EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment. arXiv preprint arXiv:2410.09604.
- [16] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793.
- [17] Google. 2025. Gemini API. https://ai.google.dev/gemini-api. Accessed: 2025-04-12.
- [18] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. 2023. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973.
- [20] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. 2024. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
- [23] Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, and Mohit Bansal. 2024. VLN-Video: Utilizing driving videos for outdoor vision-and-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18517–18526.
- [25] Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, and Xiaodan Liang. 2024. CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation. In Findings of the Association for Computational Linguistics: ACL 2024. 12538–12559.
- [27] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 38–55.
- [28] Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. 2023. AerialVLN: Vision-and-Language Navigation for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15384–15394.
- [29] Xiangguo Liu, Qiuhuan Yuan, Guoying Wang, Yuan Bian, Feng Xu, and Yuguo Chen. 2023. Drones delivering automated external defibrillators: A new strategy to improve the prognosis of out-of-hospital cardiac arrest. Resuscitation 182, 109669.
- [31] Yi Ma, Xiaotian Hao, Jianye Hao, Jiawen Lu, Xing Liu, Tong Xialiang, Mingxuan Yuan, Zhigang Li, Jie Tang, and Zhaopeng Meng. 2021. A hierarchical reinforcement learning based optimization framework for large-scale dynamic pickup and delivery problems. Advances in Neural Information Processing Systems 34, 23609–23620.
- [32] Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. 2022. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems 35, 32340–32352.
- [33] Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Tingfan Wu, Jay Vakil, et al. 2023. Where are we in the search for an artificial visual cortex for embodied intelligence? Advances in Neural Information Processing Systems 36, 655–677.
- [34] Riccardo Mangiaracina, Alessandro Perego, Arianna Seghezzi, and Angela Tumino. 2019. Innovative solutions to increase last-mile delivery efficiency in B2C e-commerce: a literature review. International Journal of Physical Distribution & Logistics Management 49, 9, 901–920.
- [35] OpenAI. 2025. GPT-4o API. https://openai.com/api/. Accessed: 2025-04-12.
- [36] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. 2024. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 6892–6903.
- [38] Lei Ren, Jiabao Dong, Shuai Liu, Lin Zhang, and Lihui Wang. 2024. Embodied intelligence toward future smart manufacturing in the era of AI foundation model. IEEE/ASME Transactions on Mechatronics.
- [39] Nathan B Roberts, Emily Ager, Thomas Leith, Isabel Lott, Marlee Mason-Maready, Tyler Nix, Adam Gottula, Nathaniel Hunt, and Christine Brent. 2023. Current summary of the evidence in drone-based emergency medical services care. Resuscitation Plus 13, 100347.
- [40] Ramiz Salama, Fadi Al-Turjman, and Rosario Culmone. 2023. AI-powered drone to address smart city security issues. In International Conference on Advanced Information Networking and Applications. Springer, 292–300.
- [41] Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. 2024. VELMA: Verbalization embodiment of LLM agents for vision and language navigation in street view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18924–18933.
- [42] Dhruv Shah, Błażej Osiński, Sergey Levine, et al. 2023. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning. PMLR, 492–504.
- [44] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2017. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics. arXiv:1705.05065.
- [45] Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, and Liang Lin. 2025. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the Computer Vision and Pattern Recognition Conference. 12078–12088.
- [46] Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. 2024. NoMaD: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 63–70.
- [47] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. 2023. ChatGPT for Robotics: Design Principles and Model Abilities. Technical Report MSR-TR-2023-. Microsoft. https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/
- [50] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
- [52] Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. 2024. Vision-language navigation: a survey and taxonomy. Neural Computing and Applications 36, 7, 3291–3316.
- [53] Jianqiang Xiao, Yuexuan Sun, Yixin Shao, Boxi Gan, Rongqiang Liu, Yanjin Wu, Weili Guan, and Xiang Deng. 2025. UAV-ON: A benchmark for open-world object goal navigation with aerial agents. In Proceedings of the 33rd ACM International Conference on Multimedia. 13023–13029.
- [55] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643.
- [58] Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. 2025. UniGoal: Towards universal zero-shot goal-oriented navigation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 19057–19066.
- [59] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. 2023. L3MVN: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 3554–3560.
- [61] Weichen Zhang, Chen Gao, Shiquan Yu, Ruiying Peng, Baining Zhao, Qian Zhang, Jinqiang Cui, Xinlei Chen, and Yong Li. 2025. CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang ...
- [62] Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. 2025. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567.
- [63] Yue Zhang and Parisa Kordjamshidi. 2023. VLN-Trans: Translator for the Vision and Language Navigation Agent. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 13219–13233.
- [64] Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, et al. 2025. UrbanVideo-Bench: Benchmarking vision-language models on embodied intelligence with video data in urban spaces. arXiv preprint arXiv:2503.06157.
- [65] Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, and Yong Li. 2025. UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume ...
- [66] Zirui Zhao, Wee Sun Lee, and David Hsu. 2024. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems 36.
- [68] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. 2023. ESC: Exploration with soft commonsense constraints for zero-shot object navigation. In International Conference on Machine Learning. PMLR, 42829–42842.
- [69] Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. SOON: Scenario Oriented Object Navigation with Graph-based Exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12689–12699.