pith. machine review for the scientific record.

arxiv: 2604.21363 · v1 · submitted 2026-04-23 · 💻 cs.RO

Recognition: unknown

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Authors on Pith no claims yet

Pith reviewed 2026-05-09 22:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language navigation · embodied robotics · cognitive memory graph · context-aware exploration · deployable VLN · asynchronous modules · weighted traveling repairman problem · real-world robot deployment

The pith

A modular vision-language navigation system separates sensing from reasoning to run efficiently on real robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a deployable VLN system that splits the navigation stack into three asynchronous parts: one for continuous environment sensing, one for building a shared memory graph of space and meaning, and one for high-level decisions using a vision-language model. The memory graph is broken into smaller pieces so the language model can reason without overload, and exploration is turned into the problem of minimizing weighted wait times at important viewpoints. Experiments in simulation and on physical robots show higher success rates and faster goal reaching than prior methods, all while keeping real-time speed on limited hardware. A sympathetic reader would care because it makes sophisticated embodied reasoning practical for robots that cannot carry heavy computers.

Core claim

The system decouples perception, memory integration, and reasoning into asynchronous modules; it incrementally builds a cognitive memory graph that is decomposed into subgraphs for VLM reasoning; and it formulates exploration as a context-aware Weighted Traveling Repairman Problem that minimizes the weighted waiting time of viewpoints. The claimed result is improved navigation success and efficiency, with real-time performance on resource-constrained hardware.
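To make the decoupling concrete, here is a minimal sketch of three asynchronous modules sharing a memory graph through a queue and a lock; the module names, update rates, and message contents are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the asynchronous three-module split (illustrative only;
# names, rates, and message contents are assumptions, not the paper's code).
import queue
import threading
import time

obs_queue = queue.Queue(maxsize=10)        # perception -> memory integration
graph_lock = threading.Lock()
memory_graph = {"nodes": [], "edges": []}  # shared cognitive memory graph

def perception_loop(stop: threading.Event) -> None:
    """Runs at sensor rate and never blocks on reasoning."""
    while not stop.is_set():
        observation = {"t": time.time(), "detections": []}  # placeholder sensing
        try:
            obs_queue.put_nowait(observation)
        except queue.Full:
            pass  # drop a stale frame rather than stall perception
        time.sleep(0.05)  # ~20 Hz

def memory_loop(stop: threading.Event) -> None:
    """Folds observations into the shared spatial-semantic graph."""
    while not stop.is_set():
        try:
            obs = obs_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        with graph_lock:
            memory_graph["nodes"].append(obs)  # placeholder aggregation

def reasoning_loop(stop: threading.Event) -> None:
    """Slow, VLM-driven decisions over a snapshot of the graph."""
    while not stop.is_set():
        with graph_lock:
            snapshot = {k: list(v) for k, v in memory_graph.items()}  # shallow copy
        # ...decompose the snapshot into subgraphs, query the VLM, pick a goal...
        time.sleep(1.0)  # reasoning runs far slower than perception
```

The only point of the sketch is that the slow reasoning loop works on a snapshot while perception keeps streaming; it is not a reconstruction of the paper's implementation.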

What carries the argument

The cognitive memory graph, which aggregates spatial-semantic scene information and is decomposed into subgraphs to support VLM reasoning, together with the asynchronous three-module architecture and the WTRP-based exploration planner.
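The abstract does not state the exploration objective explicitly, but the standard weighted traveling repairman (minimum-latency) reading of "minimize the weighted waiting time of viewpoints" would be the following; the visiting order, travel times, and weights are our notation, not the paper's.

```latex
% Inferred standard WTRP (minimum-latency) objective; notation is ours, not the paper's.
% Choose a visiting order \pi over n candidate viewpoints so that the weighted
% sum of arrival ("waiting") times is minimized.
\[
  \min_{\pi} \sum_{i=1}^{n} w_{\pi(i)} \, T_{\pi(i)},
  \qquad
  T_{\pi(i)} = \sum_{k=1}^{i} t\bigl(\pi(k-1), \pi(k)\bigr),
\]
```

Here pi(0) is the robot's current pose, t(·,·) is the travel time between consecutive viewpoints, T is the accumulated waiting time before a viewpoint is reached, and w is that viewpoint's priority; in this reading, the "context-aware" part of the formulation would live in how the weights are derived from the cognitive memory graph.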

If this is right

  • Higher navigation success and efficiency in both simulated and physical robot tests compared with existing VLN methods.
  • Real-time operation maintained on hardware with limited computation, memory, and energy.
  • Exploration paths that reduce the weighted waiting time at selected viewpoints.
  • Robust high-level decision making without requiring the full environment model at every step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular split could be tested on other embodied tasks such as object manipulation to see whether asynchronous memory sharing scales beyond navigation.
  • If the subgraph decomposition loses critical long-range context, performance would degrade in large or highly dynamic spaces; this remains untested in the reported experiments.
  • Treating exploration as a weighted repairman problem opens the possibility of borrowing exact solvers or approximations from operations research for other robot path-planning problems.

Load-bearing premise

Decoupling the system into asynchronous modules and splitting the memory graph into subgraphs for the vision-language model will keep all needed information intact and avoid delays that hurt performance in changing real environments.
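For concreteness, one plausible shape for such a cognitive memory graph, with nodes carrying a pose, an open-vocabulary label, and a vision-language embedding, and edges carrying traversal cost, is sketched below; the field names and types are assumptions rather than the paper's schema.

```python
# Hypothetical shape of a spatial-semantic cognitive memory graph; field names
# and types are illustrative assumptions, not the paper's data structures.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GraphNode:
    node_id: int
    position: Tuple[float, float, float]  # where the observation was made
    label: str                            # open-vocabulary semantic label
    embedding: List[float]                # vision-language feature vector
    last_seen: float                      # timestamp for staleness checks

@dataclass
class MemoryGraph:
    nodes: Dict[int, GraphNode] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)  # travel cost

    def add_node(self, node: GraphNode) -> None:
        self.nodes[node.node_id] = node

    def connect(self, a: int, b: int, cost: float) -> None:
        # undirected traversal cost between two observed places
        self.edges[(a, b)] = cost
        self.edges[(b, a)] = cost
```

Under this reading, the load-bearing premise amounts to claiming that any subgraph handed to the VLM still contains the nodes and edges needed for the current decision.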

What would settle it

A real-world test in a rapidly changing scene in which the modular system shows lower success rates, or misses real-time deadlines, relative to a single integrated baseline would demonstrate that the decoupling and subgraph approach fails to preserve the necessary information.

Figures

Figures reproduced from arXiv: 2604.21363 by Chen Wang, Denan Liang, Kuan Xu, Lihua Xie, Ruimeng Liu, Shenghai Yuan, Tongxing Jin, Yizhuo Yang.

Figure 1. We develop a deployable vision–language navigation system that …
Figure 2. Overview of the proposed system architecture. The framework decouples perception, memory, and reasoning into three layers, enabling real-time …
Figure 3. The memory graph is decomposed into subgraphs and prioritized …
Figure 4. Failure case analysis of our system on the MP3D and HM3D datasets.
Figure 5. We deploy our system on a quadruped robot, which performs high-level …
read the original abstract

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-language navigation (VLN), existing approaches often face a fundamental trade-off between strong reasoning capabilities and efficient deployment on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and robust high-level reasoning on real-world robotic platforms. To achieve this, we decouple the system into three asynchronous modules: a real-time perception module for continuous environment sensing, a memory integration module for spatial-semantic aggregation, and a reasoning module for high-level decision making. We incrementally construct a cognitive memory graph to encode scene information, which is further decomposed into subgraphs to enable reasoning with a vision-language model (VLM). To further improve navigation efficiency and accuracy, we also leverage the cognitive memory graph to formulate the exploration problem as a context-aware Weighted Traveling Repairman Problem (WTRP), which minimizes the weighted waiting time of viewpoints. Extensive experiments in both simulation and real-world robotic platforms demonstrate improved navigation success and efficiency over existing VLN approaches, while maintaining real-time performance on resource-constrained hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a deployable embodied vision-language navigation (VLN) system that decouples the architecture into three asynchronous modules: real-time perception, memory integration for building a cognitive memory graph, and reasoning using a vision-language model on decomposed subgraphs. Exploration is formulated as a context-aware Weighted Traveling Repairman Problem (WTRP). The paper claims that extensive experiments in simulation and real-world platforms show improved navigation success and efficiency compared to existing VLN approaches, while achieving real-time performance on resource-constrained hardware.

Significance. If the experimental claims hold, this work would be significant as it addresses the key trade-off in VLN between sophisticated reasoning and deployability on embedded systems. The hierarchical cognition approach and WTRP formulation could provide a practical framework for real-world robotic navigation, potentially advancing the field towards more efficient and robust embodied AI systems.

major comments (2)
  1. [Abstract] The central claim that 'extensive experiments... demonstrate improved navigation success and efficiency' is presented without any quantitative metrics, specific baselines, error bars, or statistical analysis, which is load-bearing for evaluating the paper's contribution since the abstract is the primary summary of results.
  2. [Memory Integration Module] The decomposition of the cognitive memory graph into subgraphs for VLM reasoning is described as enabling 'robust high-level reasoning,' but no details are provided on the decomposition criteria (e.g., spatial, semantic) or any analysis showing that critical context is preserved, which directly impacts the validity of the claimed improvements in dynamic real-world environments.
minor comments (1)
  1. [Abstract] The acronym WTRP is introduced without prior expansion, though it is later described as Weighted Traveling Repairman Problem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarification and strengthening of the presentation. We address each major comment point-by-point below and have prepared revisions to the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'extensive experiments... demonstrate improved navigation success and efficiency' is presented without any quantitative metrics, specific baselines, error bars, or statistical analysis, which is load-bearing for evaluating the paper's contribution since the abstract is the primary summary of results.

    Authors: We agree that the abstract would benefit from more specific quantitative anchors to support the central claim. In the revised manuscript, we have updated the abstract to include key performance highlights (e.g., success rate and efficiency gains relative to baselines) drawn directly from the experimental results, while directing readers to the full tables, error bars, and statistical analysis in Section 5. This keeps the abstract concise yet informative. revision: yes

  2. Referee: [Memory Integration Module] The decomposition of the cognitive memory graph into subgraphs for VLM reasoning is described as enabling 'robust high-level reasoning,' but no details are provided on the decomposition criteria (e.g., spatial, semantic) or any analysis showing that critical context is preserved, which directly impacts the validity of the claimed improvements in dynamic real-world environments.

    Authors: The original description in the memory integration module relies on spatial-semantic aggregation to construct the cognitive memory graph before subgraph decomposition. We acknowledge that explicit criteria and preservation analysis were insufficiently detailed. In the revision, we have expanded this section to specify the decomposition criteria (combining spatial distance thresholds with semantic similarity via embedding clustering) and added supporting analysis, including an ablation study on context retention and its effect on navigation performance in dynamic settings. revision: yes
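If the decomposition really does combine a spatial distance threshold with embedding-similarity clustering, as this simulated response describes, a minimal version could look like the following sketch; the thresholds, the cosine-similarity test, and the greedy single-pass grouping are illustrative assumptions, not the paper's method.

```python
# Illustrative subgraph decomposition combining a spatial distance threshold
# with embedding cosine similarity, as the (simulated) rebuttal describes;
# nodes are dicts with "position" and "embedding" keys for this sketch.
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def decompose(nodes: Dict[int, dict],
              max_dist: float = 5.0,
              min_sim: float = 0.7) -> List[List[int]]:
    """Greedy single pass: a node joins the first subgraph whose seed node is
    both spatially close (<= max_dist metres) and semantically similar."""
    subgraphs: List[List[int]] = []
    for nid, node in nodes.items():
        for group in subgraphs:
            seed = nodes[group[0]]
            close = math.dist(node["position"], seed["position"]) <= max_dist
            similar = cosine(node["embedding"], seed["embedding"]) >= min_sim
            if close and similar:
                group.append(nid)
                break
        else:
            subgraphs.append([nid])  # start a new subgraph around this node
    return subgraphs
```

An ablation of the kind the response promises would presumably vary thresholds like max_dist and min_sim and measure whether navigation success degrades as cross-subgraph context is cut.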

Circularity Check

0 steps flagged

No circularity: architectural description with no derivations or self-referential fits

full rationale

The paper describes a system architecture consisting of asynchronous modules, a cognitive memory graph, subgraph decomposition for VLM reasoning, and reformulation of exploration as a context-aware WTRP. No equations, parameter fits, or first-principles derivations are presented in the provided text. Claims of improved performance rest on experimental results rather than any reduction of outputs to inputs by construction. The WTRP formulation is presented as a modeling choice to minimize weighted waiting time, not as a derived prediction equivalent to fitted data. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This is a standard non-circular engineering paper whose central contributions are the proposed decomposition and integration strategy, validated externally via simulation and real-world tests.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; the listed items are inferred from high-level claims.

axioms (1)
  • domain assumption Decomposing the cognitive memory graph into subgraphs preserves sufficient information for effective VLM-based reasoning.
    This is required for the reasoning module to function as described.

pith-pipeline@v0.9.0 · 5552 in / 1201 out tokens · 41762 ms · 2026-05-09T22:06:31.798863+00:00 · methodology

discussion (0)

