pith. sign in

arxiv: 2605.22816 · v1 · pith:HOAJWYO4new · submitted 2026-05-21 · 💻 cs.RO · cs.CV

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Pith reviewed 2026-05-22 04:41 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language navigationself-awarenessstructural reasoningembodied AIHabitat simulatorend-to-end learningtask progress
0
0 comments X

The pith

A structural reasoning module gives navigation agents self-awareness of their state and task progress without building maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AwareVLN to give vision-language navigation models an explicit sense of where the agent is and how much of the instruction remains. It does so by inserting a structural reasoning module that tracks spatial relations and task progress, plus an automatic data engine that splits training examples by progress level. The whole system stays end-to-end and learns only from visual and language inputs. Experiments in the Habitat simulator report higher success rates than earlier vision-language navigation methods across multiple datasets. A reader would care because the work offers a middle path between pure end-to-end prediction and explicit 3D mapping.

Core claim

AwareVLN equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner through a structural reasoning module and an automatic data engine with progress division.

What carries the argument

The structural reasoning module, which processes relationships between the agent, the instruction, and the scene to build spatial and task-oriented self-awareness.

If this is right

  • Navigation success rates rise on standard vision-language navigation benchmarks in the Habitat simulator.
  • Models can be trained without additional 3D sensors or hand-built scene maps.
  • Decision making becomes more explainable because the agent tracks its own state and remaining task.
  • An automatic data engine with progress division supports effective end-to-end training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-awareness mechanism may reduce failures on long or multi-step instructions where agents commonly lose track of progress.
  • Similar reasoning modules could be tested in other embodied tasks that combine vision, language, and physical movement.
  • The gains might be checked in real-world robot platforms to see whether simulator results transfer.

Load-bearing premise

The structural reasoning module can create genuine self-awareness of space and task progress just from visual and language data without extra 3D sensors or explicit scene mapping.

What would settle it

Training and testing a version of the model without the structural reasoning module on the same Habitat datasets and finding that its navigation success rates match or exceed those of the full AwareVLN.

Figures

Figures reproduced from arXiv: 2605.22816 by Hang Yin, Huangxing Chen, Jianjiang Feng, Jie Zhou, Jiwen Lu, Wenxuan Guo, Wenzhao Zheng, Xiangyu Li, Xiuwei Xu, Yichen Liu.

Figure 1
Figure 1. Figure 1: AwareVLN equips a VLN agent with self-aware, structured reasoning that is selectively triggered at key navigation points. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework of AwareVLN. (a) AwareVLN equips a unified vision-language model with both action prediction and self-reflective [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Automatic Data Engine. Key reasoning nodes are automatically identified using room-level semantics and ground-truth waypoints [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Rollout in Habitat simulator. AwareVLN performs self-aware reasoning during navigation. As shown in the left part, the agent [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Rollout in real world. AwareVLN, trained solely on simulator-based navigation data, is deployed on a quadruped robot and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of our multi-turn reasoning supervision process [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example 1 of an automatically collected training trajectory, illustrating three key nodes: path deviation, correction completion, [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example 2 of an automatically collected training trajectory. This case includes multiple subtask completion nodes and room [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AwareVLN, a VLN framework that adds a structural reasoning module to foster spatial and task-oriented self-awareness of the agent's state and task progress, together with an automatic data engine using progress division for training. It claims this enables fully end-to-end, data-driven self-awareness without explicit 3D mapping or additional sensors, and reports significant outperformance over prior SOTA VLN methods on multiple datasets in the Habitat simulator.

Significance. If the structural reasoning module demonstrably produces explicit self-awareness (rather than merely improving navigation metrics through other means), the approach could meaningfully narrow the gap between pure VLM end-to-end policies and map-based planners while preserving scalability for vision-language pre-training. The automatic data engine is a practical contribution for generating progress-aware supervision.

major comments (2)
  1. Abstract and §1: The central claim that the structural reasoning module 'fosters spatial and task-oriented self-awareness' in a fully end-to-end data-driven manner is load-bearing for the paper's novelty, yet the experiments only report improved navigation success rates and SPL. No probes, auxiliary prediction tasks, or ablations are described that isolate whether the module improves explicit representation or prediction of agent pose, progress, or instruction grounding beyond what the base VLM already provides.
  2. §4 (Experiments): The reported gains over prior SOTA are presented without ablations that remove the structural reasoning module while keeping the data engine fixed, or vice versa. This makes it impossible to attribute performance improvements specifically to the self-awareness mechanism rather than to the progress-division training data or other implementation details.
minor comments (2)
  1. The abstract states 'significantly outperforms' but supplies no numerical values; the main text should include a concise table of key metrics (success rate, SPL, etc.) against the strongest baselines in the introduction or abstract for quick assessment.
  2. Notation for the structural reasoning module (e.g., how self-awareness is encoded in the VLM hidden states or loss terms) should be defined explicitly with equations in §3 to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our design choices and committing to revisions that strengthen the evidence for the self-awareness claims without misrepresenting the current results.

read point-by-point responses
  1. Referee: Abstract and §1: The central claim that the structural reasoning module 'fosters spatial and task-oriented self-awareness' in a fully end-to-end data-driven manner is load-bearing for the paper's novelty, yet the experiments only report improved navigation success rates and SPL. No probes, auxiliary prediction tasks, or ablations are described that isolate whether the module improves explicit representation or prediction of agent pose, progress, or instruction grounding beyond what the base VLM already provides.

    Authors: We acknowledge that the primary reported metrics are navigation success rate and SPL, which demonstrate overall performance gains. The structural reasoning module is architecturally designed to enable explicit reasoning over agent state, spatial relations, and task progress within the end-to-end VLM pipeline, distinguishing it from base VLMs that lack this structured component. However, we agree that direct isolation via probes or auxiliary tasks would provide stronger evidence. In the revised manuscript we will add such analyses, including auxiliary prediction heads for agent pose estimation, progress regression, and instruction grounding accuracy, to quantify the explicit self-awareness improvements. revision: yes

  2. Referee: §4 (Experiments): The reported gains over prior SOTA are presented without ablations that remove the structural reasoning module while keeping the data engine fixed, or vice versa. This makes it impossible to attribute performance improvements specifically to the self-awareness mechanism rather than to the progress-division training data or other implementation details.

    Authors: We agree that the current presentation would benefit from more granular ablations to disentangle the two contributions. The manuscript reports end-to-end results against prior SOTA methods that use neither component. To address the concern directly, the revised version will include controlled ablations: (1) the full model minus the structural reasoning module (retaining the progress-division data engine) and (2) the structural reasoning module trained with standard (non-progress-divided) data. These will clarify the specific role of the self-awareness mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical outperformance claims are independent of inputs

full rationale

The paper introduces AwareVLN as a new end-to-end framework with a structural reasoning module and automatic data engine, then reports superior navigation metrics on Habitat datasets. No derivation chain, equations, or load-bearing steps reduce the claimed self-awareness or performance gains to fitted parameters, self-definitions, or self-citation chains by construction. The central innovations are presented as additions to a VLM backbone and are validated externally via simulator experiments rather than being presupposed in the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unverified effectiveness of two newly introduced components and the domain assumption that end-to-end VLM training can produce genuine self-awareness without explicit mapping.

axioms (1)
  • domain assumption Vision-Language Models can be extended with a structural reasoning module to achieve spatial and task-oriented self-awareness in an end-to-end manner
    Invoked when the abstract states that the module fosters self-awareness without additional 3D sensors.
invented entities (2)
  • Structural reasoning module no independent evidence
    purpose: To foster spatial and task-oriented self-awareness in the navigation model
    New component introduced as one of the two key innovations.
  • Automatic data engine with progress division no independent evidence
    purpose: To enable effective training of the self-aware navigation model
    New component introduced as the second key innovation.

pith-pipeline@v0.9.0 · 5761 in / 1596 out tokens · 59632 ms · 2026-05-22T04:41:27.462602+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1]

    Bevbert: Multimodal map pre-training for language-guided navigation

    Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation. InICCV, 2022. 6

  2. [2]

    1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022)

    Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022). InCVPRW, 2022. 6

  3. [3]

    Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 2, 6

  4. [4]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018. 6

  5. [5]

    Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

  6. [6]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024. 2

  7. [7]

    Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017. 6

  8. [8]

    Object goal naviga- tion using goal-oriented semantic exploration

    Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Ab- hinav Gupta, and Russ R Salakhutdinov. Object goal naviga- tion using goal-oriented semantic exploration. InNeurIPS, pages 4247–4258, 2020. 3

  9. [9]

    Affordances-oriented planning using foundation models for continuous vision- language navigation

    Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiao- dan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision- language navigation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025. 6

  10. [10]

    Topological planning with transform- ers for vision-and-language navigation

    Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transform- ers for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11276–11286, 2021. 6

  11. [11]

    Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

    Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

  12. [12]

    a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023

    Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023. 2

  13. [13]

    Navila: Legged robot vision-language-action model for navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 1, 2, 5, 6

  14. [14]

    Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023. 2

  15. [15]

    Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024

    Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024. 2

  16. [16]

    Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025

    Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025. 6

  17. [17]

    Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025

    Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025. 2

  18. [18]

    Cross-modal map learning for vision and language navigation

    Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Dani- ilidis. Cross-modal map learning for vision and language navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460– 15470, 2022. 6, 2

  19. [19]

    Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation

    Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 6

  20. [20]

    Learning navigational visual representations with semantic map su- pervision

    Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernon- court, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map su- pervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023. 6

  21. [21]

    General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019

    Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019. 6

  22. [22]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 2

  23. [23]

    Sim-2-sim transfer for vision- and-language navigation in continuous environments

    Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision- and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022. 2, 6

  24. [24]

    Beyond the nav-graph: Vision and language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. InEuropean Con- ference on Computer Vision (ECCV), 2020. 2, 6

  25. [25]

    Waypoint models for instruction- guided navigation in continuous environments

    Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction- guided navigation in continuous environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 15162–15171, 2021. 6

  26. [26]

    Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing. InEMNLP, 2020. 6

  27. [27]

    Onetwovla: A unified vision-language-action model with adaptive reasoning,

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Jun- ming Zhao, and Yang Gao. Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025. 2

  28. [28]

    Nav-r1: Reasoning and navigation in embodied scenes

    Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025. 2

  29. [29]

    Bird’s-eye-view scene graph for vision-language navigation

    Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InICCV, pages 10968–10980, 2023. 2

  30. [30]

    Dis- cuss before moving: Visual language navigation via multi- expert discussions

    Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Dis- cuss before moving: Visual language navigation via multi- expert discussions. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 17380–17387. IEEE, 2024. 2

  31. [31]

    Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation

    Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation. In2025 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 8249–8257. IEEE, 2025. 2

  32. [32]

    Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

    Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

  33. [33]

    Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments

    Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments. InProceedings of the 2021 conference on em- pirical methods in natural language processing, pages 4018– 4028, 2021. 6, 2

  34. [34]

    Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024

    Moritz Reuss, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024. 2

  35. [35]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 3, 6

  36. [36]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyim- ing Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision- language-action models.arXiv preprint arXiv:2502.19417,

  37. [37]

    Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024. 2

  38. [38]

    Dreamwalker: Mental planning for contin- uous vision-language navigation

    Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for contin- uous vision-language navigation. InProceedings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023. 1, 2, 6

  39. [39]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 5

  40. [40]

    Scaling data generation in vision-and-language navigation

    Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023. 6

  41. [41]

    Gridmm: Grid memory map for vision- and-language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision- and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625– 15636, 2023. 2, 6

  42. [42]

    Lookahead exploration with neural radiance representation for continuous vision- language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13753–13762, 2024. 2

  43. [43]

    Lookahead exploration with neural radiance representation for continuous vision- language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13753–13762, 2024. 6

  44. [44]

    Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 2

  45. [45]

    Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

    Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025. 6

  46. [46]

    Any-point Trajectory Modeling for Policy Learning

    Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023. 2

  47. [47]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025. 2

  48. [48]

    Learning Interactive Real-World Simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learn- ing interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023. 2

  49. [49]

    Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025

    Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, and Hao Dong. Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025. 2

  50. [50]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024. 2

  51. [51]

    Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024. 1, 2, 6

  52. [52]

    Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024. 1, 2, 6

  53. [53]

    Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025. 2

  54. [54]

    I see xxx. And I'm in xxx

    Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understand- ing and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025. 2 AwareVLN: Rea...

  55. [55]

    Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness

    All models are trained exclusively on RxR-CE training set and then evaluated on RxR-CE Val-Unseen split. Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness. C. Visualization of Automatically Collected Trajectories Figure 8 and Figure 9 present two representative navigation trajectories collected by ...