Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Gengze Zhou; Jiajun Liu; Qi Wu; Shijie Li; Sihao Lin; Wei Tao; Xunyi Zhao; Zerui Li

arxiv: 2606.03175 · v2 · pith:34EZA57Pnew · submitted 2026-06-02 · 💻 cs.CV · cs.RO


Pith reviewed 2026-06-28 10:58 UTC · model grok-4.3


keywords instance goal navigation · interactive navigation · cost-sensitive interaction · uncertainty reduction · embodied agents · oracle querying · multimodal language models


The pith

An agent in instance goal navigation should ask an oracle question only when its expected reduction in navigation uncertainty exceeds the query's cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats interactive instance goal navigation as a problem of selecting oracle questions that maximize uncertainty reduction per unit cost rather than information alone. It first runs an information-gain study on existing navigation datasets to rank question types by how much each lowers the agent's uncertainty about the target location, then converts those rankings into fixed relative costs. With those costs in hand, the authors build a diagnostic benchmark and a weighted success metric that subtracts a penalty for every query issued. Finally, they present a zero-shot multimodal large language model (MLLM) navigator that, at each step, estimates the expected gain of candidate questions and issues one only when that gain justifies its cost.

Core claim

Interactive instance goal navigation is recast as cost-sensitive uncertainty reduction: the agent selects the question whose answer yields the largest drop in navigation uncertainty relative to its derived penalty. An information-gain analysis performed on prior navigation corpora supplies a compact taxonomy of question types together with empirical weights that quantify each type's typical contribution to uncertainty reduction. These weights are used both to construct a new benchmark that records query cost and to drive a decision rule inside a zero-shot MLLM navigator that queries only when the expected reduction exceeds the penalty.
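A minimal Python sketch of the decision rule described above may help make it concrete. It assumes the expected uncertainty reduction and the cost of each question type are already available as plain numbers on a comparable scale; the function name `select_question`, the question-type labels, and all values are hypothetical and are not taken from the paper.

```python
from typing import Optional


def select_question(expected_gain: dict[str, float],
                    cost: dict[str, float]) -> Optional[str]:
    """Return the question type to ask, or None to keep navigating silently.

    Assumes every cost is strictly positive and that gains and costs
    are expressed on a comparable scale (an assumption of this sketch).
    """
    best: Optional[str] = None
    best_ratio = 0.0
    for question, gain in expected_gain.items():
        c = cost[question]
        if c <= 0:
            raise ValueError("costs must be strictly positive")
        if gain <= c:
            # Asking does not pay: the reduction does not exceed the penalty.
            continue
        ratio = gain / c  # uncertainty reduction per unit cost
        if ratio > best_ratio:
            best, best_ratio = question, ratio
    return best


# Hypothetical numbers, for illustration only.
gains = {"attribute": 0.4, "spatial": 0.9, "route": 1.2}
costs = {"attribute": 0.2, "spatial": 0.6, "route": 1.5}
choice = select_question(gains, costs)
```

With these illustrative numbers the route-level question is excluded because its gain does not exceed its cost, and the remaining candidates are ranked by gain per unit cost. How the paper's navigator actually estimates the expected gain inside the MLLM is not specified in the text above.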

What carries the argument

The information-gain analysis that converts navigation corpora into a ranked set of question types and their relative cost weights for uncertainty reduction.
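For readers unfamiliar with the term, the following sketch shows the textbook definition of expected information gain over a discrete set of candidate targets. It is an illustration under stated assumptions, not the paper's corpus analysis: the oracle is assumed to answer deterministically, and the names `entropy`, `expected_information_gain`, and the example candidates are hypothetical.

```python
import math
from collections import defaultdict


def entropy(probs) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: dict[str, float],
                              answer_of: dict[str, str]) -> float:
    """Prior entropy minus expected posterior entropy for one question.

    prior:     candidate target -> probability (sums to 1).
    answer_of: candidate target -> the answer a deterministic oracle
               would give if that candidate were the true target.
    """
    # Probability mass of each possible answer.
    mass = defaultdict(float)
    for cand, p in prior.items():
        mass[answer_of[cand]] += p

    expected_posterior = 0.0
    for answer, m in mass.items():
        if m == 0:
            continue
        # Posterior: prior restricted to candidates consistent with the answer.
        posterior = [p / m for c, p in prior.items() if answer_of[c] == answer]
        expected_posterior += m * entropy(posterior)
    return entropy(prior.values()) - expected_posterior


# Four equally likely candidate chairs (hypothetical example).
prior = {"chair_a": 0.25, "chair_b": 0.25, "chair_c": 0.25, "chair_d": 0.25}
room = {"chair_a": "kitchen", "chair_b": "kitchen",
        "chair_c": "bedroom", "chair_d": "bedroom"}
room_gain = expected_information_gain(prior, room)
```

In this toy case a room question halves the candidate set, so it removes one of the two bits of prior uncertainty. A question whose answer is identical for every candidate would yield zero gain.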

If this is right

Agents reach target instances with fewer total queries while preserving success rate.
The weighted success metric ranks methods by both accuracy and interaction efficiency.
A single zero-shot MLLM can implement the cost-sensitive policy without task-specific fine-tuning.
Benchmarks that ignore query cost will overestimate the value of high-frequency questioning strategies.
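The weighted success metric mentioned in this list can be sketched as follows. The exact formula is not given in the text above, so this version makes two explicit assumptions: each episode scores its binary success minus the summed cost of its queries, and the per-episode score is clipped at zero. The function name and the cost values are hypothetical.

```python
def weighted_success_rate(episodes, cost) -> float:
    """Average cost-penalized success over episodes.

    episodes: list of (succeeded: bool, queries: list of question-type names).
    cost:     question-type name -> penalty per query.
    Clipping the per-episode score at 0 is an assumption of this sketch.
    """
    if not episodes:
        return 0.0
    total = 0.0
    for succeeded, queries in episodes:
        penalty = sum(cost[q] for q in queries)
        total += max(0.0, float(succeeded) - penalty)
    return total / len(episodes)


# Hypothetical costs and episodes, for illustration only.
cost = {"attribute": 0.1, "route": 0.5}
episodes = [
    (True, []),                  # success without asking
    (True, ["attribute"]),       # success after one cheap question
    (True, ["route", "route"]),  # success bought with expensive guidance
    (False, ["attribute"]),      # failure despite asking
]
wsr = weighted_success_rate(episodes, cost)
plain_sr = sum(s for s, _ in episodes) / len(episodes)
```

Under these numbers three of four episodes succeed, yet the penalized score is well below the plain success rate because one success relied on two expensive queries. This is the kind of gap the final bullet refers to.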

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cost-sensitive selection rule could be applied to other embodied tasks that involve open-ended clarification, such as visual dialog or instruction following.
If the derived weights prove stable across environments, they could serve as a lightweight prior for training future interactive agents rather than learning costs from scratch.
Extending the analysis to include the cost of waiting for an answer or the risk of receiving noisy oracle responses would make the model more realistic for real-world deployment.

Load-bearing premise

The question types and relative weights obtained from information-gain analysis on existing corpora continue to predict useful uncertainty reduction in new, previously unseen environments.

What would settle it

Run the same navigator on a fresh set of episodes drawn from environments never seen in the original corpora; if the weighted success rate drops sharply or the model begins issuing many low-value queries, the derived weights no longer transfer.

Figures

Figures reproduced from arXiv: 2606.03175 by Gengze Zhou, Jiajun Liu, Qi Wu, Shijie Li, Sihao Lin, Wei Tao, Xunyi Zhao, Zerui Li.

**Figure 1.** Benchmark statistics. Overview of the dataset composition, including episode distribution by difficulty, distractor room and instance counts, target object categories, goal-distance distribution, and goal-room distribution.

**Figure 2.** TANDEM decomposes interactive instance image-goal navigation into two coupled stages.

**Figure 3.** Temporal and spatial patterns of interaction for the full TANDEM agent.

**Figure 4.** Effect of spatial interaction. A spatial QA cue helps the agent resolve ambiguity and reach the target more directly, instead of following uncertain exploratory paths.

Original abstract

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
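One way to write the selection rule stated in the abstract is given below. The notation is an assumption introduced here for illustration and is not taken from the paper: $U_t$ is the navigation uncertainty at step $t$, $\mathcal{Q}$ the set of question types, $c(q) > 0$ the derived cost of question type $q$, and $a$ the oracle's answer.

```latex
% Expected uncertainty reduction from asking question q at step t
\Delta_t(q) = U_t - \mathbb{E}_{a \sim p(a \mid q)}\bigl[\, U_t \mid q, a \,\bigr]

% Candidate question: largest reduction relative to its penalty
q_t^{*} = \arg\max_{q \in \mathcal{Q}} \frac{\Delta_t(q)}{c(q)}

% Ask only when the reduction justifies the cost; otherwise act without querying
\text{ask } q_t^{*} \iff \Delta_t(q_t^{*}) > c(q_t^{*})
```

Because $c(q) > 0$, the asking condition is equivalent to the best ratio exceeding 1, so the ranking and the threshold use the same quantity. This presumes uncertainty reduction and cost are measured on a comparable scale.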

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames interactive instance goal navigation as a cost-sensitive uncertainty-reduction task using corpus-derived question weights, but it does not show that those weights transfer to new environments.


The main takeaway is that this work pushes interactive navigation toward efficiency by making the agent weigh query cost against uncertainty reduction. They run information-gain analysis on navigation corpora to pick a small set of question types and assign data-derived penalties, then build a benchmark and Weighted Success Rate metric around that taxonomy. A zero-shot MLLM policy is proposed that only queries when the expected gain justifies the cost.

What stands out is the clear diagnosis of prior interactive methods: they treat all queries the same and let agents rack up success through volume rather than targeted disambiguation. Introducing cost into the decision rule and the evaluation metric is a reasonable step toward practical robot settings where asking is not free.

The soft spot is the transfer assumption on the weights themselves. The stress-test note flags that the question types and costs come from analysis on existing corpora, yet the abstract gives no held-out splits, cross-corpus checks, or sensitivity tests. If those weights do not predict well on new episodes or environments, both the benchmark and the policy evaluation rest on the same unverified foundation. The abstract also supplies no experimental numbers, ablations, or comparisons, so the actual performance of the MLLM policy cannot be judged from the given text.

This is aimed at the embodied navigation community that already works on interactive agents and wants metrics that reflect real deployment costs. A reader who cares about active learning or information-gain methods in robotics might find the taxonomy and metric useful to build on.

I would send it to peer review. The idea is coherent and the problem it targets is real, but the referee will need to see the full experiments and any validation of the weight transfer before the claims can be assessed.

Referee Report

1 major / 1 minor

Summary. The paper claims to recast interactive Instance Goal Navigation (IGN) as a cost-sensitive uncertainty-reduction problem. It performs information-gain analysis on existing navigation corpora to derive a compact set of question types and data-derived weights, constructs a new benchmark for diagnosing interaction behavior together with a Weighted Success Rate metric that penalizes queries by derived cost, and proposes a zero-shot MLLM navigator that selectively queries only when expected uncertainty reduction justifies the interaction cost.

Significance. If the derived weights generalize beyond the source corpora and the selective-query policy is shown to improve efficiency, the work would supply a principled, cost-aware framework for open-ended interaction in embodied navigation that prior methods lack.

major comments (1)

[Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.

minor comments (1)

[Abstract] The abstract states the approach and claims a zero-shot MLLM navigator but supplies no summary of experimental results, ablation studies, or quantitative validation that the derived weights actually improve efficiency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.


Referee: [Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.

Authors: We agree that the absence of held-out splits, cross-corpus validation, and sensitivity checks is a limitation. The current derivation relies on the full corpora without explicit generalization tests. In the revised manuscript we will add held-out episode splits within each corpus, cross-corpus validation across the source navigation datasets, and sensitivity analysis on the resulting weights to confirm they remain predictive on unseen data.

Revision: yes
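As an illustration of what a sensitivity analysis on the weights could look like, the sketch below checks whether the ranking of question types by gain per unit cost survives when every cost is scaled up or down by a fixed relative amount. This is one possible check under stated assumptions, not the procedure the authors describe; the function names and all numbers are hypothetical.

```python
import itertools


def ranking(gain: dict[str, float], cost: dict[str, float]) -> list[str]:
    """Question types ordered by gain per unit cost, best first."""
    return sorted(gain, key=lambda q: gain[q] / cost[q], reverse=True)


def ranking_is_stable(gain: dict[str, float],
                      cost: dict[str, float],
                      rel_perturbation: float = 0.1) -> bool:
    """True if the ranking is unchanged under every combination of
    scaling each cost by (1 - rel_perturbation) or (1 + rel_perturbation)."""
    base = ranking(gain, cost)
    factors = (1 - rel_perturbation, 1 + rel_perturbation)
    names = list(cost)
    for combo in itertools.product(factors, repeat=len(names)):
        perturbed = {q: cost[q] * f for q, f in zip(names, combo)}
        if ranking(gain, perturbed) != base:
            return False
    return True


# Hypothetical gains and costs, for illustration only.
gain = {"attribute": 0.4, "spatial": 0.9, "route": 1.2}
cost = {"attribute": 0.2, "spatial": 0.6, "route": 1.5}
stable_10 = ranking_is_stable(gain, cost, rel_perturbation=0.1)
stable_20 = ranking_is_stable(gain, cost, rel_perturbation=0.2)
```

With these illustrative values the ranking holds under a 10% perturbation but flips under 20%, because the two cheapest question types have similar gain-to-cost ratios. The loop enumerates 2^n corner cases, which is practical only for a compact set of question types.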

Circularity Check

0 steps flagged

No significant circularity; derivation uses fixed corpus-derived weights as external input for new benchmark and zero-shot policy


The paper derives question types and weights via information-gain analysis on existing navigation corpora, then builds a new benchmark and Weighted Success Rate metric that incorporates those fixed derived costs, while proposing a zero-shot MLLM policy. This does not reduce any central claim to a self-fit or self-citation by construction; the weights serve as an independent, precomputed input rather than being refitted to the evaluation episodes or making success tautological. No load-bearing step matches the enumerated circularity patterns with a specific equation or definition that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all such elements would require the full manuscript.



Reference graph

Works this paper leans on

73 extracted references · 30 canonical work pages · 10 internal anchors

[1]

Bevbert: Topo-metric map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Topo-metric map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

work page arXiv 2022
[2]

Etpnav: Evolving topological planning for vision-language navigation in continuous environments.arXiv preprint arXiv:2304.03047, 2023

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments.arXiv preprint arXiv:2304.03047, 2023

work page arXiv 2023
[3]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

The robotslang benchmark: Dialog-guided robot localization and navigation

Shurjo Banerjee, Jesse Thomason, and Jason Corso. The robotslang benchmark: Dialog-guided robot localization and navigation. InConference on Robot Learning, pages 1384–1393. PMLR, 2021

2021
[7]

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. InarXiv:2006.13171, 2020

work page arXiv 2006
[8]

Matterport3d: Learning from rgb-d data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017

2017
[9]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[10]

Mapgpt: Map-guided prompting for unified vision-and-language navigation.arXiv preprint arXiv:2401.07314, 2024

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation.arXiv preprint arXiv:2401.07314, 2024

work page arXiv 2024
[11]

History aware multimodal transformer for vision-and-language navigation.Advances in Neural Information Processing Systems, 34:5834–5847, 2021

Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation.Advances in Neural Information Processing Systems, 34:5834–5847, 2021

2021
[12]

Think global, act local: Dual-scale graph transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022

2022
[13]

Learning from unlabeled 3d environments for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Learning from unlabeled 3d environments for vision-and-language navigation. InEuropean Conference on Computer Vision, pages 638–655. Springer, 2022

2022
[14]

Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 10

work page arXiv 2024
[15]

Just ask: An interactive learning framework for vision and language navigation.arXiv preprint arXiv:1912.00915, 2019

Ta-Chung Chi, Mihail Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. Just ask: An interactive learning framework for vision and language navigation.arXiv preprint arXiv:1912.00915, 2019

work page arXiv 1912
[16]

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Deepro Choudhury, Sinead Williamson, Adam Goli´nski, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. Bed-llm: Intelligent information gathering with llms and bayesian experimental design.arXiv preprint arXiv:2508.21184, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Dialfred: Dialogue-enabled agents for embodied instruction following.IEEE Robotics and Automation Letters, 7(4): 10049–10056, 2022

Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, and Gaurav S Sukhatme. Dialfred: Dialogue-enabled agents for embodied instruction following.IEEE Robotics and Automation Letters, 7(4): 10049–10056, 2022

2022
[18]

A new era of intelligence with Gemini 3.https://blog.google/products-and-platforms/ products/gemini/gemini-3/, 2025

Google. A new era of intelligence with Gemini 3.https://blog.google/products-and-platforms/ products/gemini/gemini-3/, 2025. Accessed: 2026-05-02

2025
[19]

Gemma 4: Our most intelligent open models, built from Gemini 3 research and technol- ogy to maximize intelligence-per-parameter

Google DeepMind. Gemma 4: Our most intelligent open models, built from Gemini 3 research and technol- ogy to maximize intelligence-per-parameter. https://deepmind.google/models/gemma/gemma-4/,
[20]

Accessed: 2026-05-04

2026
[21]

Dialnav: Multi- turn dialog navigation with a remote guide

Leekyeung Han, Hyunji Min, Gyeom Hwangbo, Jonghyun Choi, and Paul Hongsuck Seo. Dialnav: Multi- turn dialog navigation with a remote guide. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8514–8523, 2025

2025
[22]

Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022

2022
[23]

Vl-ln bench: Towards long-horizon goal-oriented navigation with active dialogs

Wensi Huang, Shaohao Zhu, Meng Wei, Jinming Xu, Xihui Liu, Hanqing Wang, Tai Wang, Feng Zhao, and Jiangmiao Pang. Vl-ln bench: Towards long-horizon goal-oriented navigation with active dialogs. arXiv preprint arXiv:2512.22342, 2025

work page arXiv 2025
[24]

Beyond the nav-graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

2020
[25]

Waypoint models for instruction-guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021

2021
[26]

Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020

2020
[27]

Ground- level viewpoint vision-and-language navigation in continuous environments

Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, and Qi Wu. Ground- level viewpoint vision-and-language navigation in continuous environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 5266–5273. IEEE, 2025

2025
[28]

One agent to guide them all: Empowering mllms for vision-and-language navigation via explicit world representation.arXiv preprint arXiv:2602.15400, 2026

Zerui Li, Hongpei Zheng, Fangguo Zhao, Aidan Chan, Jian Zhou, Sihao Lin, Shijie Li, and Qi Wu. One agent to guide them all: Empowering mllms for vision-and-language navigation via explicit world representation.arXiv preprint arXiv:2602.15400, 2026

work page arXiv 2026
[29]

VLNVerse: A benchmark for vision-language navigation with versatile, embodied, real- istic simulation and evaluation.arXiv:2512.19021,

Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, et al. Vlnverse: A benchmark for vision-language navigation with versatile, embodied, realistic simulation and evaluation.arXiv preprint arXiv:2512.19021, 2025

work page arXiv 2025
[30]

Bayesian statistics: A review.SIAM, 1972

Dennis V Lindley. Bayesian statistics: A review.SIAM, 1972

1972
[31]

Discuss before moving: Visual language navigation via multi-expert discussions.arXiv preprint arXiv:2309.11382, 2023

Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions.arXiv preprint arXiv:2309.11382, 2023

work page arXiv 2023
[32]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

work page arXiv 2024
[33]

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation.arXiv preprint arXiv:1901.03035, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[34]

The regretful agent: Heuristic-aided navigation through progress estimation

Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019. 11

2019
[35]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab - A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning.arXiv preprint arXiv:2511.04831, 2025. doi: 10.48550/arXiv.2511.04831. URLhttps://arxiv.org/abs/2511.04831

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.04831 2025
[37]

Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning

Khanh Nguyen and Hal Daumé III. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

2019
[38]

Vision-based navigation with language- based assistance via imitation learning with indirect intervention

Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-based navigation with language- based assistance via imitation learning with indirect intervention. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12537, 2019

2019
[39]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026. Ac- cessed: 2026-05-02

2026
[40]

Teach: Task-driven embodied agents that chat

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2017–2025, 2022

2017
[41]

Langnav: Language as a perceptual representation for navigation.arXiv preprint arXiv:2310.07889, 2023

Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation.arXiv preprint arXiv:2310.07889, 2023

work page arXiv 2023
[42]

Universal Scene Description (USD) project

Pixar Animation Studios. Universal Scene Description (USD) project. https://openusd.org/dev/ intro.html, 2021. Accessed: 2026-05-04

2021
[43]

Reverie: Remote embodied visual referring expression in real indoor environments

Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020

2020
[44]

March in chat: Interactive prompting for remote embodied referring expression

Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, and Qi Wu. March in chat: Interactive prompting for remote embodied referring expression. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15758–15767, 2023

2023
[45]

Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms.arXiv preprint arXiv:2409.18794, 2024

Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms.arXiv preprint arXiv:2409.18794, 2024

work page arXiv 2024
[46]

Navbench: Probing multimodal large language models for embodied navigation.arXiv preprint arXiv:2506.01031, 2025

Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation.arXiv preprint arXiv:2506.01031, 2025

work page arXiv 2025
[47]

Qwen3.5: Towards Native Multimodal Agents

Qwen. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-05-02

2026
[48]

Habitat- matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat- matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Be...

2021
[49]

Rmm: A recursive mental model for dialogue navigation.Findings of the association for computational linguistics: EMNLP, 2020

Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, and Jianfeng Gao. Rmm: A recursive mental model for dialogue navigation.Findings of the association for computational linguistics: EMNLP, 2020

2020
[50]

Habitat: A Platform for Embodied AI Research.ICCV, 2019

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research.ICCV, 2019

2019
[51]

Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation

Xiangyu Shi, Zerui Li, Wenqi Lyu, Jiatong Xia, Feras Dayoub, Yanyuan Qiao, and Qi Wu. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 16923–16930. IEEE, 2025. 12

2025
[52]

View invariant learning for vision-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2026

Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, and Mark Crowley. View invariant learning for vision-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2026

2026
[53]

Collaborative instance object navigation: Leveraging uncertainty-awareness to minimize human-agent dialogues

Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, and Yiming Wang. Collaborative instance object navigation: Leveraging uncertainty-awareness to minimize human-agent dialogues. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18781–18792, 2025

2025
[54]

Vision-and-dialog navigation

Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. InConference on Robot Learning, pages 394–406, 2020

2020
[55]

Vision-and- language navigation via causal learning

Liuyi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, and Qijun Chen. Vision-and- language navigation via causal learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13139–13150, 2024

2024
[56]

Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities

Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025

2025
[57]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

2025
[58]

Gridmm: Grid memory map for vision-and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625–15636, 2023

2023
[59]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023

2023
[60]

Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

2025
[61]

Hypernav: Hybrid perception for object-oriented navigation in unknown environment. arXiv preprint arXiv:2510.22917, 2025

Zecheng Yin, Hao Zhao, and Zhen Li. Hypernav: Hybrid perception for object-oriented navigation in unknown environment. arXiv preprint arXiv:2510.22917, 2025

2025
[62]

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024

2024
[63]

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024

2024
[64]

Spatialnav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation. arXiv preprint arXiv:2601.06806, 2026

Jiwen Zhang, Zejun Li, Siyuan Wang, Xiangyu Shi, Zhongyu Wei, and Qi Wu. Spatialnav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation. arXiv preprint arXiv:2601.06806, 2026

2026
[65]

Spatialant: Autonomous zero-shot robot navigation via active scene reconstruction and visual anticipation. arXiv preprint arXiv:2603.26837, 2026

Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei, and Qi Wu. Spatialant: Autonomous zero-shot robot navigation via active scene reconstruction and visual anticipation. arXiv preprint arXiv:2603.26837, 2026

2026
[66]

MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation. arXiv preprint arXiv:2502.13451, 2025

2025
[67]

Vln-mme: Diagnosing mllms as language-guided visual navigation agents. arXiv preprint arXiv:2512.24851, 2025

Xunyi Zhao, Gengze Zhou, and Qi Wu. Vln-mme: Diagnosing mllms as language-guided visual navigation agents. arXiv preprint arXiv:2512.24851, 2025

2025
[68]

Towards learning a generalist model for embodied navigation. arXiv preprint arXiv:2312.02010, 2023

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. arXiv preprint arXiv:2312.02010, 2023

2023
[69]

Spatial-aware and viewpoint-robust vision-language navigation. IEEE Transactions on Circuits and Systems for Video Technology, 2026

Zhide Zhong, Jia Lu, Xiangchen Liu, Runze Yu, Xinhu Zheng, Zhe Liu, Hesheng Wang, and Haoang Li. Spatial-aware and viewpoint-robust vision-language navigation. IEEE Transactions on Circuits and Systems for Video Technology, 2026

2026
[70]

Navgpt: Explicit reasoning in vision-and-language navigation with large language models

Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024

2024
[71]

Navgpt-2: Unleashing navigational reasoning capability for large vision-language models

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260–278. Springer, 2025

2025
[72]

Same: Learning generic language-guided visual navigation with state-adaptive mixture of experts

Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, and Qi Wu. Same: Learning generic language-guided visual navigation with state-adaptive mixture of experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7794–7807, 2025

2025
[73]

Soon: Scenario oriented object navigation with graph-based exploration

Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12689–12699, 2021.

Appendix A Uncertainty Mining and Question Penalties A.1 Annotation Sources and Protocol The uncer...

2021

