pith. machine review for the scientific record.

arxiv: 2605.09441 · v1 · submitted 2026-05-10 · 💻 cs.RO

Recognition: no theorem link

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3

classification 💻 cs.RO
keywords general-purpose navigation · embodied AI · benchmark · cross-embodiment · composite tasks · robot navigation · unified agents · human demonstrations

The pith

Current unified navigation methods struggle with the interleaved, cross-embodiment demands of general-purpose tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents OmniNavBench as a new evaluation platform designed to test navigation agents on tasks that combine multiple skills in sequence within single episodes. It allows testing across different robot types like humanoids, quadrupeds, and wheeled robots, using realistic human-operated demonstration trajectories instead of simple shortest paths. The evaluations reveal that existing approaches, even those claiming unified designs, perform poorly on these complex scenarios, pointing to a mismatch with what real-world deployment requires. This matters because it pushes the field toward developing more versatile agents capable of handling mixed navigation challenges in varied physical forms.

Core claim

OmniNavBench advances evaluation by introducing composite instructions that interleave sub-tasks from PointNav, VLN, ObjectNav, SocialNav, Human Following, and EQA categories, forcing agents to switch between exploration, interaction, and social behaviors. The platform supports multiple robot morphologies through a modular sensor interface across 170 environments mixing synthetic and real scans. Expert trajectories are collected via human teleoperation to capture natural behaviors like exploratory glances and anticipatory avoidance. Evaluations on this setup demonstrate that current methods fail to handle the interleaved nature effectively, underscoring the need for better generalist navigators.
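To make the composite-instruction idea concrete, here is a minimal sketch of how one interleaved episode might be represented, assuming a simple record per sub-task; the class names, fields, and example scene identifier are illustrative assumptions, not the released OmniNavBench format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The six sub-task categories named in the paper.
SUB_TASK_TYPES = {"PointNav", "VLN", "ObjectNav", "SocialNav", "HumanFollowing", "EQA"}

@dataclass
class SubTask:
    task_type: str                # one of SUB_TASK_TYPES
    instruction: str              # natural-language fragment for this step
    goal: Optional[dict] = None   # e.g. {"point": [x, y, z]} or {"object": "sofa"}

    def __post_init__(self):
        if self.task_type not in SUB_TASK_TYPES:
            raise ValueError(f"unknown sub-task type: {self.task_type}")

@dataclass
class CompositeEpisode:
    episode_id: str
    scene_id: str                 # one of the 170 synthetic/real-scan environments
    embodiment: str               # "humanoid", "quadruped", or "wheeled"
    sub_tasks: List[SubTask] = field(default_factory=list)

    def full_instruction(self) -> str:
        """Join sub-task fragments into one interleaved episode instruction."""
        return " Then ".join(t.instruction for t in self.sub_tasks)

# Hypothetical episode chaining VLN, ObjectNav, Human Following, and EQA.
episode = CompositeEpisode(
    episode_id="ep_0001",
    scene_id="example_scene",
    embodiment="quadruped",
    sub_tasks=[
        SubTask("VLN", "walk down the hallway and enter the kitchen"),
        SubTask("ObjectNav", "find the refrigerator", goal={"object": "refrigerator"}),
        SubTask("HumanFollowing", "follow the person who walks past you"),
        SubTask("EQA", "how many chairs are at the dining table?"),
    ],
)
print(episode.full_instruction())
```

The point of the structure is that a single episode carries heterogeneous, ordered goals, so an agent cannot specialize in one skill and still finish the episode.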

What carries the argument

OmniNavBench benchmark, which enables testing of cross-skill coordination via composite instructions from six navigation categories and cross-embodiment generalization across humanoid, quadrupedal, and wheeled robots using human teleoperation data.
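The cross-embodiment half of that claim amounts to presenting the same tasks through different bodies and viewpoints. A hedged sketch of what a modular sensor interface could look like follows; the robot names come from Figure 3, while the heights, fields of view, and class layout are assumptions, not the paper's API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class CameraSpec:
    height_m: float              # camera height above the ground plane
    hfov_deg: float              # horizontal field of view
    resolution: Tuple[int, int]  # (width, height) in pixels

# Per-embodiment camera rigs (robot names from Figure 3; numbers are illustrative).
ROBOT_POOL: Dict[str, CameraSpec] = {
    "wheeled_carter":    CameraSpec(height_m=0.6, hfov_deg=90.0, resolution=(640, 480)),
    "quadruped_aliengo": CameraSpec(height_m=0.5, hfov_deg=90.0, resolution=(640, 480)),
    "humanoid_h1":       CameraSpec(height_m=1.5, hfov_deg=90.0, resolution=(640, 480)),
}

def make_observation_space(embodiment: str) -> Dict[str, tuple]:
    """Describe hypothetical per-embodiment observation shapes an agent would receive."""
    cam = ROBOT_POOL[embodiment]
    w, h = cam.resolution
    return {"rgb": (h, w, 3), "depth": (h, w, 1), "camera_height_m": (1,)}

# The same instruction rendered through different rigs yields very different views,
# which is what a cross-embodiment evaluation has to average over.
print(make_observation_space("humanoid_h1"))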

Load-bearing premise

The composite instructions interleaving the six navigation categories and the human teleoperation trajectories sufficiently represent the demands and behavioral nuances of real-world general-purpose navigation scenarios.

What would settle it

A concrete test would be whether any existing or new navigation method can achieve high success rates on the composite instruction episodes across multiple robot morphologies in the OmniNavBench environments.
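A minimal sketch of that settling test, assuming hypothetical per-episode result records with an embodiment tag and a boolean success flag:

```python
from collections import defaultdict
from typing import Dict, List

def success_rate_by_embodiment(results: List[dict]) -> Dict[str, float]:
    """results: one record per composite episode, e.g. {"embodiment": "humanoid", "success": True}."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["embodiment"]] += 1
        wins[r["embodiment"]] += int(r["success"])
    return {emb: wins[emb] / totals[emb] for emb in totals}

# Toy records; a method would "settle it" only if every value below were high at once.
results = [
    {"embodiment": "humanoid", "success": False},
    {"embodiment": "humanoid", "success": True},
    {"embodiment": "quadruped", "success": False},
    {"embodiment": "wheeled", "success": True},
]
print(success_rate_by_embodiment(results))  # e.g. {"humanoid": 0.5, "quadruped": 0.0, "wheeled": 1.0}
```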

Figures

Figures reproduced from arXiv: 2605.09441 by Chao Liang, Lingming Zhang, Qichen Zhang, Samson Sun, Tengyue Wang, Tianyi Yang, Yikai Xue, Zhengjie Xu, Zhipeng Zhang.

Figure 1
Figure 1: Overview of the proposed OmniNavBench. We introduce a unified benchmark designed to evaluate cross-skill coordination and cross-embodiment generalization in embodied navigation. The benchmark constructs composite instructions by dynamically composing a sequence of primary sub-tasks (i.e., VLN, PointNav, ObjectNav, Human Following), which must be executed while concurrently satisfying overarching constraint…
Figure 3
Figure 3: Cross-embodiment navigation across diverse environments. Each row shows a different robot morphology (Carter, Aliengo, H1); each column shows a different environment source (synthetic, Matterport3D). Insets display egocentric RGB observations, highlighting viewpoint variations across embodiments. Any trajectory failing to capture specified landmarks or objects clearly must be revised by the original ope…
Figure 4
Figure 4: Failure Mode Analysis across Embodiments. Radar charts comparing four failure types for four models. Bar charts comparing total failure on two scenes. Investigating how the structural complexity of different environments influences the distribution of failure modes. Finding 6: Failure Modes Vary across Embodiments. The distribution of failure modes differs substantially across robot morphologies…
Figure 5
Figure 5: Semantically Consistent Instruction Style Generation Pipeline. The figure illustrates the process of transforming a human-annotated original instruction into three distinct variants (Concise, Verbose, and First Person) using Qwen3-Max. Visual annotations highlight specific linguistic modifications: removed redundancy (Concise), added elaborations (Verbose), and viewpoint shifts (First Person). TABLE VI: …
Figure 6
Figure 6: Dynamic human character appearances. The six characters span diverse genders, ethnicities, and attire to improve visual diversity in Social Navigation and Human Following tasks.
Figure 7
Figure 7: Sim vs. Real validation. Same instructions, approxi…
read the original abstract

The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstrations Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glance and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces OmniNavBench, a unified benchmark for general-purpose navigation that shifts from isolated skill evaluation to composite instructions interleaving six categories (PointNav, VLN, ObjectNav, SocialNav, Human Following, EQA) within single episodes. It supports cross-embodiment testing across humanoid, quadrupedal, and wheeled robots via a modular sensor interface in 170 environments (synthetic + real-world scans) and replaces shortest-path trajectories with 1779 human teleoperated demonstrations that capture exploratory and anticipatory behaviors. Evaluations of existing methods on this benchmark show struggles with interleaved tasks, which the authors interpret as exposing a critical gap between current capabilities and real-world deployment demands. The dataset, code, and leaderboard are released publicly.

Significance. If the benchmark's proxy assumptions hold, OmniNavBench could meaningfully advance embodied AI by providing a more realistic testbed that rewards cross-skill coordination and morphological generalization rather than narrow specialization. The open release of data, code, and leaderboard is a clear strength that supports reproducibility and community follow-up. The emphasis on human trajectories over algorithmic paths adds behavioral nuance that is often missing from simulation benchmarks.

major comments (1)
  1. [Abstract] The claim that results 'expose a critical disparity between existing capabilities and real-world deployment demands' is load-bearing for the paper's broader impact statement yet rests on an unvalidated proxy assumption. All reported evaluations occur inside the simulator (including the human trajectories themselves), with no physical-robot experiments, sim-to-real transfer metrics, or quantification of how composite-task performance degrades outside simulation. This gap directly affects whether the observed struggles can be extrapolated to real-world deployment.
minor comments (2)
  1. The description of the 170 environments would benefit from explicit details on how real-world scans are integrated, any domain randomization applied, and quantitative measures of visual or geometric fidelity to the source scans.
  2. Consider adding a dedicated limitations or future-work subsection that explicitly discusses the simulation-only nature of the current results and planned physical-robot validation steps.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on the abstract claim below, acknowledging the simulation-only nature of the evaluations while defending the benchmark's design as a meaningful proxy.

read point-by-point responses
  1. Referee: [Abstract] The claim that results 'expose a critical disparity between existing capabilities and real-world deployment demands' is load-bearing for the paper's broader impact statement yet rests on an unvalidated proxy assumption. All reported evaluations occur inside the simulator (including the human trajectories themselves), with no physical-robot experiments, sim-to-real transfer metrics, or quantification of how composite-task performance degrades outside simulation. This gap directly affects whether the observed struggles can be extrapolated to real-world deployment.

    Authors: We agree that the evaluations, including human teleoperated trajectories, are conducted entirely in simulation and that no physical-robot experiments or explicit sim-to-real transfer metrics are provided. This is a limitation of the current work. However, the benchmark incorporates design choices intended to strengthen its relevance as a proxy: the 170 environments blend synthetic assets with real-world scans, the modular sensor interface supports cross-embodiment testing, and the 1779 trajectories were collected via human teleoperation specifically to capture exploratory glances and anticipatory behaviors absent from shortest-path baselines. The composite instructions further require interleaving of skills in ways that reflect real deployment scenarios. The observed struggles of existing methods even under these controlled yet more realistic conditions suggest that the gap to viable real-world performance is substantial. We will revise the abstract to clarify that the results highlight challenges likely to be amplified outside simulation, rather than directly asserting an unvalidated disparity in deployment demands.

    revision: yes

Circularity Check

0 steps flagged

No circularity: the benchmark's construction is independent of the evaluation results it reports

full rationale

The paper defines OmniNavBench via new composite instructions interleaving 6 navigation categories, a modular simulation platform across embodiments, and 1779 human-teleoperated trajectories in 170 environments. Evaluations measure external methods' performance on these tasks. No equations, fitted parameters, or self-referential derivations appear in the provided text. Claims about method struggles follow directly from measured outcomes on the new benchmark rather than reducing to the benchmark definition itself by construction. Self-citations are absent from the abstract and description; the work is self-contained as an empirical testbed without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a new benchmark resting on standard simulation assumptions and newly collected human data; no free parameters are fitted to produce a result, no new physical entities are postulated, and no ad-hoc axioms beyond domain-standard simulation fidelity are required.

axioms (1)
  • domain assumption: Simulation environments with synthetic and real-world scan assets accurately model physics and sensor observations for navigation tasks.
    Invoked in the description of the 170 environments and modular sensor interface; standard for sim-to-real benchmarks.

pith-pipeline@v0.9.0 · 5596 in / 1285 out tokens · 50648 ms · 2026-05-12T04:30:44.469413+00:00 · methodology

discussion (0)

