AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Hang Yin; Huangxing Chen; Jianjiang Feng; Jie Zhou; Jiwen Lu; Wenxuan Guo; Wenzhao Zheng; Xiangyu Li; Xiuwei Xu; Yichen Liu

arxiv: 2605.22816 · v1 · pith:HOAJWYO4new · submitted 2026-05-21 · 💻 cs.RO · cs.CV

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Wenxuan Guo , Xiuwei Xu , Yichen Liu , Xiangyu Li , Hang Yin , Huangxing Chen , Wenzhao Zheng , Jianjiang Feng

show 2 more authors

Jie Zhou Jiwen Lu

This is my paper

Pith reviewed 2026-05-22 04:41 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords vision-language navigationself-awarenessstructural reasoningembodied AIHabitat simulatorend-to-end learningtask progress

0 comments

The pith

A structural reasoning module gives navigation agents self-awareness of their state and task progress without building maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AwareVLN to give vision-language navigation models an explicit sense of where the agent is and how much of the instruction remains. It does so by inserting a structural reasoning module that tracks spatial relations and task progress, plus an automatic data engine that splits training examples by progress level. The whole system stays end-to-end and learns only from visual and language inputs. Experiments in the Habitat simulator report higher success rates than earlier vision-language navigation methods across multiple datasets. A reader would care because the work offers a middle path between pure end-to-end prediction and explicit 3D mapping.

Core claim

AwareVLN equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner through a structural reasoning module and an automatic data engine with progress division.

What carries the argument

The structural reasoning module, which processes relationships between the agent, the instruction, and the scene to build spatial and task-oriented self-awareness.

If this is right

Navigation success rates rise on standard vision-language navigation benchmarks in the Habitat simulator.
Models can be trained without additional 3D sensors or hand-built scene maps.
Decision making becomes more explainable because the agent tracks its own state and remaining task.
An automatic data engine with progress division supports effective end-to-end training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-awareness mechanism may reduce failures on long or multi-step instructions where agents commonly lose track of progress.
Similar reasoning modules could be tested in other embodied tasks that combine vision, language, and physical movement.
The gains might be checked in real-world robot platforms to see whether simulator results transfer.

Load-bearing premise

The structural reasoning module can create genuine self-awareness of space and task progress just from visual and language data without extra 3D sensors or explicit scene mapping.

What would settle it

Training and testing a version of the model without the structural reasoning module on the same Habitat datasets and finding that its navigation success rates match or exceed those of the full AwareVLN.

Figures

Figures reproduced from arXiv: 2605.22816 by Hang Yin, Huangxing Chen, Jianjiang Feng, Jie Zhou, Jiwen Lu, Wenxuan Guo, Wenzhao Zheng, Xiangyu Li, Xiuwei Xu, Yichen Liu.

**Figure 1.** Figure 1: AwareVLN equips a VLN agent with self-aware, structured reasoning that is selectively triggered at key navigation points. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Framework of AwareVLN. (a) AwareVLN equips a unified vision-language model with both action prediction and self-reflective [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Automatic Data Engine. Key reasoning nodes are automatically identified using room-level semantics and ground-truth waypoints [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Rollout in Habitat simulator. AwareVLN performs self-aware reasoning during navigation. As shown in the left part, the agent [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Rollout in real world. AwareVLN, trained solely on simulator-based navigation data, is deployed on a quadruped robot and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Example of our multi-turn reasoning supervision process [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Example 1 of an automatically collected training trajectory, illustrating three key nodes: path deviation, correction completion, [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Example 2 of an automatically collected training trajectory. This case includes multiple subtask completion nodes and room [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AwareVLN adds a structural reasoning module and progress-based data engine to VLN, reports Habitat gains, but does not directly test whether the module creates actual self-awareness of state or progress.

read the letter

The main point is that AwareVLN combines a structural reasoning module meant to build spatial and task self-awareness with an automatic data engine that splits training by progress. The paper shows this setup beats prior state-of-the-art on several Habitat VLN datasets and tasks. That combination is the clearest new piece, as it tries to keep the end-to-end VLM style while adding explicit awareness without extra 3D sensors or maps. The experiments across datasets and the reported outperformance are the strongest part of the work; they give concrete evidence that the overall framework moves the needle on navigation metrics in simulation. The data engine in particular looks like a practical way to create better training signals without heavy manual work. The citation pattern follows the usual VLN and VLM references and does not seem circular. The soft spot is the link between the reasoning module and genuine self-awareness. The navigation results improve, but the paper does not include direct probes, auxiliary predictions of agent pose or progress, or ablations that isolate whether the module improves explicit state understanding beyond what the base VLM already provides. The gains could come mostly from the data engine or other training tweaks. This leaves the central claim about self-awareness less tightly supported than the performance numbers. The work is aimed at researchers in vision-language navigation and embodied AI who want to add reasoning to VLM agents without full mapping. Readers focused on practical simulation results and incremental improvements to end-to-end methods will get the most from it. It has enough new components and experimental backing to deserve a serious referee, even if the self-awareness validation needs more attention in review.

Referee Report

2 major / 2 minor

Summary. The paper proposes AwareVLN, a VLN framework that adds a structural reasoning module to foster spatial and task-oriented self-awareness of the agent's state and task progress, together with an automatic data engine using progress division for training. It claims this enables fully end-to-end, data-driven self-awareness without explicit 3D mapping or additional sensors, and reports significant outperformance over prior SOTA VLN methods on multiple datasets in the Habitat simulator.

Significance. If the structural reasoning module demonstrably produces explicit self-awareness (rather than merely improving navigation metrics through other means), the approach could meaningfully narrow the gap between pure VLM end-to-end policies and map-based planners while preserving scalability for vision-language pre-training. The automatic data engine is a practical contribution for generating progress-aware supervision.

major comments (2)

Abstract and §1: The central claim that the structural reasoning module 'fosters spatial and task-oriented self-awareness' in a fully end-to-end data-driven manner is load-bearing for the paper's novelty, yet the experiments only report improved navigation success rates and SPL. No probes, auxiliary prediction tasks, or ablations are described that isolate whether the module improves explicit representation or prediction of agent pose, progress, or instruction grounding beyond what the base VLM already provides.
§4 (Experiments): The reported gains over prior SOTA are presented without ablations that remove the structural reasoning module while keeping the data engine fixed, or vice versa. This makes it impossible to attribute performance improvements specifically to the self-awareness mechanism rather than to the progress-division training data or other implementation details.

minor comments (2)

The abstract states 'significantly outperforms' but supplies no numerical values; the main text should include a concise table of key metrics (success rate, SPL, etc.) against the strongest baselines in the introduction or abstract for quick assessment.
Notation for the structural reasoning module (e.g., how self-awareness is encoded in the VLM hidden states or loss terms) should be defined explicitly with equations in §3 to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our design choices and committing to revisions that strengthen the evidence for the self-awareness claims without misrepresenting the current results.

read point-by-point responses

Referee: Abstract and §1: The central claim that the structural reasoning module 'fosters spatial and task-oriented self-awareness' in a fully end-to-end data-driven manner is load-bearing for the paper's novelty, yet the experiments only report improved navigation success rates and SPL. No probes, auxiliary prediction tasks, or ablations are described that isolate whether the module improves explicit representation or prediction of agent pose, progress, or instruction grounding beyond what the base VLM already provides.

Authors: We acknowledge that the primary reported metrics are navigation success rate and SPL, which demonstrate overall performance gains. The structural reasoning module is architecturally designed to enable explicit reasoning over agent state, spatial relations, and task progress within the end-to-end VLM pipeline, distinguishing it from base VLMs that lack this structured component. However, we agree that direct isolation via probes or auxiliary tasks would provide stronger evidence. In the revised manuscript we will add such analyses, including auxiliary prediction heads for agent pose estimation, progress regression, and instruction grounding accuracy, to quantify the explicit self-awareness improvements. revision: yes
Referee: §4 (Experiments): The reported gains over prior SOTA are presented without ablations that remove the structural reasoning module while keeping the data engine fixed, or vice versa. This makes it impossible to attribute performance improvements specifically to the self-awareness mechanism rather than to the progress-division training data or other implementation details.

Authors: We agree that the current presentation would benefit from more granular ablations to disentangle the two contributions. The manuscript reports end-to-end results against prior SOTA methods that use neither component. To address the concern directly, the revised version will include controlled ablations: (1) the full model minus the structural reasoning module (retaining the progress-division data engine) and (2) the structural reasoning module trained with standard (non-progress-divided) data. These will clarify the specific role of the self-awareness mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical outperformance claims are independent of inputs

full rationale

The paper introduces AwareVLN as a new end-to-end framework with a structural reasoning module and automatic data engine, then reports superior navigation metrics on Habitat datasets. No derivation chain, equations, or load-bearing steps reduce the claimed self-awareness or performance gains to fitted parameters, self-definitions, or self-citation chains by construction. The central innovations are presented as additions to a VLM backbone and are validated externally via simulator experiments rather than being presupposed in the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unverified effectiveness of two newly introduced components and the domain assumption that end-to-end VLM training can produce genuine self-awareness without explicit mapping.

axioms (1)

domain assumption Vision-Language Models can be extended with a structural reasoning module to achieve spatial and task-oriented self-awareness in an end-to-end manner
Invoked when the abstract states that the module fosters self-awareness without additional 3D sensors.

invented entities (2)

Structural reasoning module no independent evidence
purpose: To foster spatial and task-oriented self-awareness in the navigation model
New component introduced as one of the two key innovations.
Automatic data engine with progress division no independent evidence
purpose: To enable effective training of the self-aware navigation model
New component introduced as the second key innovation.

pith-pipeline@v0.9.0 · 5761 in / 1596 out tokens · 59632 ms · 2026-05-22T04:41:27.462602+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

structural reasoning module that fosters spatial and task-oriented self-awareness... triplet-based structural reasoning format: Scene description, Progress assessment, Plan for the next step
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

sparse reasoning mechanism... triggered at key navigation nodes (subtask completion, path deviation, stopping error)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

[1]

Bevbert: Multimodal map pre-training for language-guided navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation. InICCV, 2022. 6

work page 2022
[2]

1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022)

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022). InCVPRW, 2022. 6

work page 2022
[3]

Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 2, 6

work page 2024
[4]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

work page
[6]

RT-H: Action Hierarchies Using Language

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017. 6

work page 2017
[8]

Object goal naviga- tion using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Ab- hinav Gupta, and Russ R Salakhutdinov. Object goal naviga- tion using goal-oriented semantic exploration. InNeurIPS, pages 4247–4258, 2020. 3

work page 2020
[9]

Affordances-oriented planning using foundation models for continuous vision- language navigation

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiao- dan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision- language navigation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025. 6

work page 2025
[10]

Topological planning with transform- ers for vision-and-language navigation

Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transform- ers for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11276–11286, 2021. 6

work page 2021
[11]

Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

work page arXiv
[12]

a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023

Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023. 2

work page arXiv 2023
[13]

Navila: Legged robot vision-language-action model for navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 1, 2, 5, 6

work page arXiv 2024
[14]

Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023. 2

work page 2023
[15]

Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024

Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024. 2

work page arXiv 2024
[16]

Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025. 6

work page arXiv 2025
[17]

Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025. 2

work page arXiv 2025
[18]

Cross-modal map learning for vision and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Dani- ilidis. Cross-modal map learning for vision and language navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460– 15470, 2022. 6, 2

work page 2022
[19]

Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 6

work page 2022
[20]

Learning navigational visual representations with semantic map su- pervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernon- court, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map su- pervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023. 6

work page 2023
[21]

General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019

Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019. 6

work page arXiv 1907
[22]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 2

work page 2025
[23]

Sim-2-sim transfer for vision- and-language navigation in continuous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision- and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022. 2, 6

work page 2022
[24]

Beyond the nav-graph: Vision and language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. InEuropean Con- ference on Computer Vision (ECCV), 2020. 2, 6

work page 2020
[25]

Waypoint models for instruction- guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction- guided navigation in continuous environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 15162–15171, 2021. 6

work page 2021
[26]

Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing. InEMNLP, 2020. 6

work page 2020
[27]

Onetwovla: A unified vision-language-action model with adaptive reasoning,

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Jun- ming Zhao, and Yang Gao. Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025. 2

work page arXiv 2025
[28]

Nav-r1: Reasoning and navigation in embodied scenes

Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025. 2

work page arXiv 2025
[29]

Bird’s-eye-view scene graph for vision-language navigation

Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InICCV, pages 10968–10980, 2023. 2

work page 2023
[30]

Dis- cuss before moving: Visual language navigation via multi- expert discussions

Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Dis- cuss before moving: Visual language navigation via multi- expert discussions. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 17380–17387. IEEE, 2024. 2

work page 2024
[31]

Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation

Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation. In2025 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 8249–8257. IEEE, 2025. 2

work page 2025
[32]

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

work page arXiv
[33]

Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments. InProceedings of the 2021 conference on em- pirical methods in natural language processing, pages 4018– 4028, 2021. 6, 2

work page 2021
[34]

Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024

Moritz Reuss, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024. 2

work page arXiv 2024
[35]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 3, 6

work page 2019
[36]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyim- ing Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision- language-action models.arXiv preprint arXiv:2502.19417,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024. 2

work page internal anchor Pith review arXiv 2024
[38]

Dreamwalker: Mental planning for contin- uous vision-language navigation

Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for contin- uous vision-language navigation. InProceedings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023. 1, 2, 6

work page 2023
[39]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023. 6

work page 2023
[41]

Gridmm: Grid memory map for vision- and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision- and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625– 15636, 2023. 2, 6

work page 2023
[42]

Lookahead exploration with neural radiance representation for continuous vision- language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13753–13762, 2024. 2

work page 2024
[43]

Lookahead exploration with neural radiance representation for continuous vision- language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13753–13762, 2024. 6

work page 2024
[44]

Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 2

work page 2022
[45]

Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025. 6

work page arXiv 2025
[46]

Any-point Trajectory Modeling for Policy Learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learn- ing interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025

Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, and Hao Dong. Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025. 2

work page arXiv 2025
[50]

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024. 1, 2, 6

work page 2024
[52]

Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024. 1, 2, 6

work page 2024
[53]

Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025. 2

work page 2025
[54]

I see xxx. And I'm in xxx

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understand- ing and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025. 2 AwareVLN: Rea...

work page 2025
[55]

Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness

All models are trained exclusively on RxR-CE training set and then evaluated on RxR-CE Val-Unseen split. Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness. C. Visualization of Automatically Collected Trajectories Figure 8 and Figure 9 present two representative navigation trajectories collected by ...

work page

[1] [1]

Bevbert: Multimodal map pre-training for language-guided navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation. InICCV, 2022. 6

work page 2022

[2] [2]

1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022)

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place so- lutions for rxr-habitat vision-and-language navigation com- petition (cvpr 2022). InCVPRW, 2022. 6

work page 2022

[3] [3]

Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 2, 6

work page 2024

[4] [4]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

work page

[6] [6]

RT-H: Action Hierarchies Using Language

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.3DV, 2017. 6

work page 2017

[8] [8]

Object goal naviga- tion using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Ab- hinav Gupta, and Russ R Salakhutdinov. Object goal naviga- tion using goal-oriented semantic exploration. InNeurIPS, pages 4247–4258, 2020. 3

work page 2020

[9] [9]

Affordances-oriented planning using foundation models for continuous vision- language navigation

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiao- dan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision- language navigation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025. 6

work page 2025

[10] [10]

Topological planning with transform- ers for vision-and-language navigation

Kevin Chen, Junshen K Chen, Jo Chuang, Marynel V´azquez, and Silvio Savarese. Topological planning with transform- ers for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11276–11286, 2021. 6

work page 2021

[11] [11]

Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly- supervised multi-granularity map learning for vision-and- language navigation.arXiv preprint arXiv:2210.07506,

work page arXiv

[12] [12]

a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023

Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. a2nav: Action-aware zero-shot robot navigation by exploit- ing vision-and-language ability of foundation models.arXiv preprint arXiv:2308.07997, 2023. 2

work page arXiv 2023

[13] [13]

Navila: Legged robot vision-language-action model for navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 1, 2, 5, 6

work page arXiv 2024

[14] [14]

Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023. 2

work page 2023

[15] [15]

Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024

Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261, 2024. 2

work page arXiv 2024

[16] [16]

Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: To- wards generalist embodied navigation.arXiv preprint arXiv:2506.09839, 2025. 6

work page arXiv 2025

[17] [17]

Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models.arXiv preprint arXiv:2506.17561, 2025. 2

work page arXiv 2025

[18] [18]

Cross-modal map learning for vision and language navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Dani- ilidis. Cross-modal map learning for vision and language navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460– 15470, 2022. 6, 2

work page 2022

[19] [19]

Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridg- ing the gap between learning in discrete and continuous en- vironments for vision-and-language navigation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 6

work page 2022

[20] [20]

Learning navigational visual representations with semantic map su- pervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernon- court, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map su- pervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023. 6

work page 2023

[21] [21]

General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019

Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction con- ditioned navigation using dynamic time warping.arXiv preprint arXiv:1907.05446, 2019. 6

work page arXiv 1907

[22] [22]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 1724–1734, 2025. 2

work page 2025

[23] [23]

Sim-2-sim transfer for vision- and-language navigation in continuous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision- and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022. 2, 6

work page 2022

[24] [24]

Beyond the nav-graph: Vision and language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. InEuropean Con- ference on Computer Vision (ECCV), 2020. 2, 6

work page 2020

[25] [25]

Waypoint models for instruction- guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction- guided navigation in continuous environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 15162–15171, 2021. 6

work page 2021

[26] [26]

Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing. InEMNLP, 2020. 6

work page 2020

[27] [27]

Onetwovla: A unified vision-language-action model with adaptive reasoning,

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Jun- ming Zhao, and Yang Gao. Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025. 2

work page arXiv 2025

[28] [28]

Nav-r1: Reasoning and navigation in embodied scenes

Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025. 2

work page arXiv 2025

[29] [29]

Bird’s-eye-view scene graph for vision-language navigation

Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InICCV, pages 10968–10980, 2023. 2

work page 2023

[30] [30]

Dis- cuss before moving: Visual language navigation via multi- expert discussions

Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Dis- cuss before moving: Visual language navigation via multi- expert discussions. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 17380–17387. IEEE, 2024. 2

work page 2024

[31] [31]

Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation

Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate rep- resentations for robot manipulation. In2025 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 8249–8257. IEEE, 2025. 2

work page 2025

[32] [32]

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,

work page arXiv

[33] [33]

Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments. InProceedings of the 2021 conference on em- pirical methods in natural language processing, pages 4018– 4028, 2021. 6, 2

work page 2021

[34] [34]

Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024

Moritz Reuss, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals.arXiv preprint arXiv:2407.05996, 2024. 2

work page arXiv 2024

[35] [35]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 3, 6

work page 2019

[36] [36]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyim- ing Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision- language-action models.arXiv preprint arXiv:2502.19417,

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024. 2

work page internal anchor Pith review arXiv 2024

[38] [38]

Dreamwalker: Mental planning for contin- uous vision-language navigation

Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for contin- uous vision-language navigation. InProceedings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023. 1, 2, 6

work page 2023

[39] [39]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023. 6

work page 2023

[41] [41]

Gridmm: Grid memory map for vision- and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision- and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625– 15636, 2023. 2, 6

work page 2023

[42] [42]

Lookahead exploration with neural radiance representation for continuous vision- language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13753–13762, 2024. 2

work page 2024

[43] [43]

Lookahead exploration with neural radiance representation for continuous vision- language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13753–13762, 2024. 6

work page 2024

[44] [44]

Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 2

work page 2022

[45] [45]

Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025. 6

work page arXiv 2025

[46] [46]

Any-point Trajectory Modeling for Policy Learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learn- ing interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025

Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, and Hao Dong. Correctnav: Self-correction flywheel empowers vision-language-action navigation model.arXiv preprint arXiv:2508.10416, 2025. 2

work page arXiv 2025

[50] [50]

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks, 2024. 1, 2, 6

work page 2024

[52] [52]

Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.Robotics: Science and Sys- tems, 2024. 1, 2, 6

work page 2024

[53] [53]

Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025. 2

work page 2025

[54] [54]

I see xxx. And I'm in xxx

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understand- ing and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025. 2 AwareVLN: Rea...

work page 2025

[55] [55]

Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness

All models are trained exclusively on RxR-CE training set and then evaluated on RxR-CE Val-Unseen split. Our AwareVLN also achieves leading performance under this transfer setting, demonstrating strong robustness. C. Visualization of Automatically Collected Trajectories Figure 8 and Figure 9 present two representative navigation trajectories collected by ...

work page