
arxiv: 2606.03175 · v2 · pith:34EZA57Pnew · submitted 2026-06-02 · 💻 cs.CV · cs.RO

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Pith reviewed 2026-06-28 10:58 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords instance goal navigation · interactive navigation · cost-sensitive interaction · uncertainty reduction · embodied agents · oracle querying · multimodal language models

The pith

An agent in instance goal navigation should ask an oracle question only when its expected reduction in navigation uncertainty exceeds the query's cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats interactive instance goal navigation as a problem of selecting oracle questions that maximize uncertainty reduction per unit cost rather than information alone. It first runs an information-gain study on existing navigation datasets to rank question types by how much each lowers the agent's uncertainty about the target location, then converts those rankings into fixed relative costs. With those costs in hand, the authors build a diagnostic benchmark and a weighted success metric that subtracts a penalty for every query issued. Finally, they present a zero-shot multimodal large language model (MLLM) navigator that, at each step, estimates the expected gain of candidate questions and issues one only when the gain-to-cost ratio justifies it.
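One way to write the ask-or-act rule down, as a sketch rather than the paper's own notation: let $b$ be the agent's belief over candidate target locations, $H(b)$ its Shannon entropy, $q$ a candidate question with answer $a$, and $c(q)$ the fixed cost of the question's type. Every symbol here is an assumption introduced for illustration, and the rule assumes that costs are expressed in the same units as the information gain.

```latex
\[
\mathrm{EIG}(q) \;=\; H(b) \;-\; \mathbb{E}_{a \sim p(a \mid q,\, b)}\!\left[ H(b \mid q, a) \right],
\qquad
q^{\star} \;=\; \arg\max_{q} \frac{\mathrm{EIG}(q)}{c(q)},
\qquad
\text{ask } q^{\star} \iff \mathrm{EIG}(q^{\star}) > c(q^{\star}).
\]
```

Under this reading, a question with a large gain can still be skipped if its cost is larger, and the agent keeps navigating without asking when no question clears the threshold.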

Core claim

Interactive instance goal navigation is recast as cost-sensitive uncertainty reduction: the agent selects the question whose answer yields the largest drop in navigation uncertainty relative to its derived penalty. An information-gain analysis performed on prior navigation corpora supplies a compact taxonomy of question types together with empirical weights that quantify each type's typical contribution to uncertainty reduction. These weights are used both to construct a new benchmark that records query cost and to drive a decision rule inside a zero-shot MLLM navigator that queries only when the expected reduction exceeds the penalty.
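The decision rule described above can be illustrated with a small Python sketch. It is not the paper's implementation: the discrete belief over candidate instances, the answer-likelihood tables, the function names (`entropy`, `expected_information_gain`, `choose_query`), and the toy costs are all hypothetical, and costs are assumed to be expressed in bits so that they can be compared with the gain directly.

```python
import math


def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(belief, likelihood):
    """Expected entropy drop from asking one question.

    belief[i]        = P(target is candidate i)
    likelihood[a][i] = P(oracle gives answer a | target is candidate i)
    """
    gain = entropy(belief)
    for row in likelihood:
        p_answer = sum(l * b for l, b in zip(row, belief))
        if p_answer == 0:
            continue  # this answer can never occur under the current belief
        posterior = [l * b / p_answer for l, b in zip(row, belief)]
        gain -= p_answer * entropy(posterior)
    return gain


def choose_query(belief, questions, costs):
    """Return the question with the best gain/cost ratio, or None.

    A question is eligible only if its expected gain exceeds its cost
    (assumption: gain and cost share the same units).
    """
    best, best_ratio = None, 0.0
    for name, likelihood in questions.items():
        gain = expected_information_gain(belief, likelihood)
        ratio = gain / costs[name]
        if gain > costs[name] and ratio > best_ratio:
            best, best_ratio = name, ratio
    return best


# Toy setup: four candidate instances, two hypothetical question types.
questions = {
    # "room" splits the candidates into {0, 1} versus {2, 3}.
    "room": [[1, 1, 0, 0], [0, 0, 1, 1]],
    # "route" identifies the target exactly but is expensive.
    "route": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
}
costs = {"room": 0.4, "route": 2.5}

print(choose_query([0.25, 0.25, 0.25, 0.25], questions, costs))
print(choose_query([0.97, 0.01, 0.01, 0.01], questions, costs))
```

With a uniform belief the cheap room question is worth asking, while the route question never clears its cost in this toy setup; once the belief is already concentrated on one candidate, no question pays and the function returns `None`.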

What carries the argument

The information-gain analysis that converts navigation corpora into a ranked set of question types and their relative cost weights for uncertainty reduction.
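As a rough illustration of how such an analysis might turn a corpus into weights, the sketch below estimates the mutual information between each question type's answer and the target region from empirical counts, then normalizes the gains. The record layout, the function names, and above all the proportional mapping from information gain to relative cost are assumptions made for this example; the summary above does not specify how the paper performs the conversion.

```python
import math
from collections import Counter, defaultdict


def mutual_information(pairs):
    """Empirical mutual information I(X; Y) in bits from (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def derive_weights(records):
    """records: iterable of (question_type, answer, target_region) tuples.

    Assumption: the relative cost of a question type is proportional to
    its information gain, normalized so that the weights sum to 1.
    """
    by_type = defaultdict(list)
    for qtype, answer, target in records:
        by_type[qtype].append((answer, target))
    gains = {q: mutual_information(p) for q, p in by_type.items()}
    total = sum(gains.values())
    if total == 0:
        return gains
    return {q: g / total for q, g in gains.items()}


# Hypothetical toy corpus: room answers determine the target region,
# colour answers are independent of it.
toy_records = [
    ("room", "kitchen", "A"), ("room", "kitchen", "A"),
    ("room", "bedroom", "B"), ("room", "bedroom", "B"),
    ("color", "red", "A"), ("color", "red", "B"),
    ("color", "blue", "A"), ("color", "blue", "B"),
]
print(derive_weights(toy_records))
```

In this toy corpus the room cue carries one full bit about the target and the colour cue carries none, so all of the normalized weight falls on the room question.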

If this is right

  • Agents reach target instances with fewer total queries while preserving success rate.
  • The weighted success metric ranks methods by both accuracy and interaction efficiency.
  • A single zero-shot MLLM can implement the cost-sensitive policy without task-specific fine-tuning.
  • Benchmarks that ignore query cost will overestimate the value of high-frequency questioning strategies.
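The last point can be made concrete with a minimal sketch of a cost-penalized success metric. The exact form of the paper's Weighted Success Rate is not given here, so the version below is an assumption: each episode scores its binary success minus the summed cost of the queries issued, clipped at zero, and the scores are averaged. The function name, cost values, and episode logs are hypothetical.

```python
def weighted_success_rate(episodes, costs):
    """Average of per-episode (success - total query cost), clipped at 0.

    episodes: list of (success: bool, queries: list of question-type names)
    costs:    dict mapping question-type name to its penalty
    Assumed form only; the paper's exact definition may differ.
    """
    if not episodes:
        return 0.0
    total = 0.0
    for success, queries in episodes:
        penalty = sum(costs[q] for q in queries)
        total += max(0.0, float(success) - penalty)
    return total / len(episodes)


def success_rate(episodes):
    """Plain success rate that ignores interaction cost."""
    return sum(1 for s, _ in episodes if s) / len(episodes)


costs = {"attribute": 0.1, "route": 0.5}

# A frugal agent that asks one cheap question and fails one episode.
frugal = [(True, ["attribute"]), (True, []), (False, [])]
# A chatty agent that always succeeds by leaning on route-level guidance.
chatty = [(True, ["route", "route"]), (True, ["route"]), (True, ["route", "attribute"])]

print(success_rate(frugal), weighted_success_rate(frugal, costs))
print(success_rate(chatty), weighted_success_rate(chatty, costs))
```

In this toy comparison the chatty agent wins on plain success rate but loses once queries are penalized, which is the kind of reversal a cost-aware metric is meant to expose.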

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cost-sensitive selection rule could be applied to other embodied tasks that involve open-ended clarification, such as visual dialog or instruction following.
  • If the derived weights prove stable across environments, they could serve as a lightweight prior for training future interactive agents rather than learning costs from scratch.
  • Extending the analysis to include the cost of waiting for an answer or the risk of receiving noisy oracle responses would make the model more realistic for real-world deployment.

Load-bearing premise

The question types and relative weights obtained from information-gain analysis on existing corpora continue to predict useful uncertainty reduction in new, previously unseen environments.

What would settle it

Run the same navigator on a fresh set of episodes drawn from environments never seen in the original corpora; if the weighted success rate drops sharply or the model begins issuing many low-value queries, the derived weights no longer transfer.

Figures

Figures reproduced from arXiv: 2606.03175 by Gengze Zhou, Jiajun Liu, Qi Wu, Shijie Li, Sihao Lin, Wei Tao, Xunyi Zhao, Zerui Li.

Figure 1: Benchmark statistics. Overview of the dataset composition, including episode distribution by difficulty, distractor room and instance counts, target object categories, goal-distance distribution, and goal-room distribution.
Figure 2: TANDEM decomposes interactive instance image-goal navigation into two coupled stages.
Figure 3: Temporal and spatial patterns of interaction for the full TANDEM agent.
Figure 4: Effect of spatial interaction. A spatial QA cue helps the agent resolve ambiguity and reach the target more directly, instead of following uncertain exploratory paths.
Original abstract

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to recast interactive Instance Goal Navigation (IGN) as a cost-sensitive uncertainty-reduction problem. It performs information-gain analysis on existing navigation corpora to derive a compact set of question types and data-derived weights, constructs a new benchmark for diagnosing interaction behavior together with a Weighted Success Rate metric that penalizes queries by derived cost, and proposes a zero-shot MLLM navigator that selectively queries only when expected uncertainty reduction justifies the interaction cost.

Significance. If the derived weights generalize beyond the source corpora and the selective-query policy is shown to improve efficiency, the work would supply a principled, cost-aware framework for open-ended interaction in embodied navigation that prior methods lack.

major comments (1)
  1. [Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.
minor comments (1)
  1. [Abstract] The abstract states the approach and claims a zero-shot MLLM navigator but supplies no summary of experimental results, ablation studies, or quantitative validation that the derived weights actually improve efficiency.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.

  1. Referee: [Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.

    Authors: We agree that the absence of held-out splits, cross-corpus validation, and sensitivity checks is a limitation. The current derivation relies on the full corpora without explicit generalization tests. In the revised manuscript we will add held-out episode splits within each corpus, cross-corpus validation across the source navigation datasets, and sensitivity analysis on the resulting weights to confirm they remain predictive on unseen data. revision: yes
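One simple form such a check could take is a pairwise rank-agreement score between the weights derived on two splits (or two corpora). The sketch below is illustrative only: the function name, the agreement measure, and the example weight values are hypothetical and are not taken from the manuscript or the rebuttal.

```python
from itertools import combinations


def rank_agreement(weights_a, weights_b):
    """Fraction of question-type pairs that both weight sets order the same way.

    Returns 1.0 when the two rankings agree on every pair and 0.0 when they
    disagree on every pair; ties count as disagreement.
    """
    shared = sorted(set(weights_a) & set(weights_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0  # nothing to compare
    same = sum(
        (weights_a[x] - weights_a[y]) * (weights_b[x] - weights_b[y]) > 0
        for x, y in pairs
    )
    return same / len(pairs)


# Hypothetical weights from a training split, a held-out split of the same
# corpus, and a different corpus whose ranking is reversed.
train = {"attribute": 0.10, "room": 0.30, "route": 0.60}
held_out = {"attribute": 0.15, "room": 0.25, "route": 0.60}
other_corpus = {"attribute": 0.50, "room": 0.30, "route": 0.20}

print(rank_agreement(train, held_out))
print(rank_agreement(train, other_corpus))
```

A score near 1 on held-out or cross-corpus weights would support the transfer assumption, while a low score would suggest the taxonomy's ordering is specific to the source data.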

Circularity Check

0 steps flagged

No significant circularity; derivation uses fixed corpus-derived weights as external input for new benchmark and zero-shot policy


The paper derives question types and weights via information-gain analysis on existing navigation corpora, then builds a new benchmark and Weighted Success Rate metric that incorporates those fixed derived costs, while proposing a zero-shot MLLM policy. This does not reduce any central claim to a self-fit or self-citation by construction; the weights serve as an independent, precomputed input rather than being refitted to the evaluation episodes or making success tautological. No load-bearing step matches the enumerated circularity patterns with a specific equation or definition that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all such elements would require the full manuscript.



Reference graph

Works this paper leans on

73 extracted references · 30 canonical work pages · 10 internal anchors

  1. [1]

    Bevbert: Topo-metric map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

    Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Topo-metric map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

  2. [2]

    Etpnav: Evolving topological planning for vision-language navigation in continuous environments.arXiv preprint arXiv:2304.03047, 2023

    Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments.arXiv preprint arXiv:2304.03047, 2023

  3. [3]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018

  4. [4]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    The robotslang benchmark: Dialog-guided robot localization and navigation

    Shurjo Banerjee, Jesse Thomason, and Jason Corso. The robotslang benchmark: Dialog-guided robot localization and navigation. InConference on Robot Learning, pages 1384–1393. PMLR, 2021

  7. [7]

    ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

    Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. InarXiv:2006.13171, 2020

  8. [8]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017

  9. [9]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  10. [10]

    Mapgpt: Map-guided prompting for unified vision-and-language navigation.arXiv preprint arXiv:2401.07314, 2024

    Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation.arXiv preprint arXiv:2401.07314, 2024

  11. [11]

    History aware multimodal transformer for vision-and-language navigation.Advances in Neural Information Processing Systems, 34:5834–5847, 2021

    Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation.Advances in Neural Information Processing Systems, 34:5834–5847, 2021

  12. [12]

    Think global, act local: Dual-scale graph transformer for vision-and-language navigation

    Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022

  13. [13]

    Learning from unlabeled 3d environments for vision-and-language navigation

    Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Learning from unlabeled 3d environments for vision-and-language navigation. InEuropean Conference on Computer Vision, pages 638–655. Springer, 2022

  14. [14]

    Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 10

  15. [15]

    Just ask: An interactive learning framework for vision and language navigation.arXiv preprint arXiv:1912.00915, 2019

    Ta-Chung Chi, Mihail Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. Just ask: An interactive learning framework for vision and language navigation.arXiv preprint arXiv:1912.00915, 2019

  16. [16]

    BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

    Deepro Choudhury, Sinead Williamson, Adam Goli´nski, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. Bed-llm: Intelligent information gathering with llms and bayesian experimental design.arXiv preprint arXiv:2508.21184, 2025

  17. [17]

    Dialfred: Dialogue-enabled agents for embodied instruction following.IEEE Robotics and Automation Letters, 7(4): 10049–10056, 2022

    Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, and Gaurav S Sukhatme. Dialfred: Dialogue-enabled agents for embodied instruction following.IEEE Robotics and Automation Letters, 7(4): 10049–10056, 2022

  18. [18]

    A new era of intelligence with Gemini 3.https://blog.google/products-and-platforms/ products/gemini/gemini-3/, 2025

    Google. A new era of intelligence with Gemini 3.https://blog.google/products-and-platforms/ products/gemini/gemini-3/, 2025. Accessed: 2026-05-02

  19. [19]

    Gemma 4: Our most intelligent open models, built from Gemini 3 research and technol- ogy to maximize intelligence-per-parameter

    Google DeepMind. Gemma 4: Our most intelligent open models, built from Gemini 3 research and technol- ogy to maximize intelligence-per-parameter. https://deepmind.google/models/gemma/gemma-4/,

  20. [20]

    Accessed: 2026-05-04

  21. [21]

    Dialnav: Multi- turn dialog navigation with a remote guide

    Leekyeung Han, Hyunji Min, Gyeom Hwangbo, Jonghyun Choi, and Paul Hongsuck Seo. Dialnav: Multi- turn dialog navigation with a remote guide. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8514–8523, 2025

  22. [22]

    Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation

    Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439–15449, 2022

  23. [23]

    Vl-ln bench: Towards long-horizon goal-oriented navigation with active dialogs

    Wensi Huang, Shaohao Zhu, Meng Wei, Jinming Xu, Xihui Liu, Hanqing Wang, Tai Wang, Feng Zhao, and Jiangmiao Pang. Vl-ln bench: Towards long-horizon goal-oriented navigation with active dialogs. arXiv preprint arXiv:2512.22342, 2025

  24. [24]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

  25. [25]

    Waypoint models for instruction-guided navigation in continuous environments

    Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021

  26. [26]

    Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020

  27. [27]

    Ground- level viewpoint vision-and-language navigation in continuous environments

    Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, and Qi Wu. Ground- level viewpoint vision-and-language navigation in continuous environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 5266–5273. IEEE, 2025

  28. [28]

    One agent to guide them all: Empowering mllms for vision-and-language navigation via explicit world representation.arXiv preprint arXiv:2602.15400, 2026

    Zerui Li, Hongpei Zheng, Fangguo Zhao, Aidan Chan, Jian Zhou, Sihao Lin, Shijie Li, and Qi Wu. One agent to guide them all: Empowering mllms for vision-and-language navigation via explicit world representation.arXiv preprint arXiv:2602.15400, 2026

  29. [29]

    VLNVerse: A benchmark for vision-language navigation with versatile, embodied, real- istic simulation and evaluation.arXiv:2512.19021,

    Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, et al. Vlnverse: A benchmark for vision-language navigation with versatile, embodied, realistic simulation and evaluation.arXiv preprint arXiv:2512.19021, 2025

  30. [30]

    Bayesian statistics: A review.SIAM, 1972

    Dennis V Lindley. Bayesian statistics: A review.SIAM, 1972

  31. [31]

    Discuss before moving: Visual language navigation via multi-expert discussions.arXiv preprint arXiv:2309.11382, 2023

    Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions.arXiv preprint arXiv:2309.11382, 2023

  32. [32]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

    Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

  33. [33]

    Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

    Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation.arXiv preprint arXiv:1901.03035, 2019

  34. [34]

    The regretful agent: Heuristic-aided navigation through progress estimation

    Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019. 11

  35. [35]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  36. [36]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab - A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning.arXiv preprint arXiv:2511.04831, 2025. doi: 10.48550/arXiv.2511.04831. URLhttps://arxiv.org/abs/2511.04831

  37. [37]

    Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning

    Khanh Nguyen and Hal Daumé III. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  38. [38]

    Vision-based navigation with language- based assistance via imitation learning with indirect intervention

    Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-based navigation with language- based assistance via imitation learning with indirect intervention. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12537, 2019

  39. [39]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026. Ac- cessed: 2026-05-02

  40. [40]

    Teach: Task-driven embodied agents that chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2017–2025, 2022

  41. [41]

    Langnav: Language as a perceptual representation for navigation.arXiv preprint arXiv:2310.07889, 2023

    Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation.arXiv preprint arXiv:2310.07889, 2023

  42. [42]

    Universal Scene Description (USD) project

    Pixar Animation Studios. Universal Scene Description (USD) project. https://openusd.org/dev/ intro.html, 2021. Accessed: 2026-05-04

  43. [43]

    Reverie: Remote embodied visual referring expression in real indoor environments

    Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020

  44. [44]

    March in chat: Interactive prompting for remote embodied referring expression

    Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, and Qi Wu. March in chat: Interactive prompting for remote embodied referring expression. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15758–15767, 2023

  45. [45]

    Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms.arXiv preprint arXiv:2409.18794, 2024

    Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open- nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms.arXiv preprint arXiv:2409.18794, 2024

  46. [46]

    Navbench: Probing multimodal large language models for embodied navigation.arXiv preprint arXiv:2506.01031, 2025

    Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation.arXiv preprint arXiv:2506.01031, 2025

  47. [47]

    Qwen3.5: Towards Native Multimodal Agents

    Qwen. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-05-02

  48. [48]

    Habitat- matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat- matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Be...

  49. [49]

    Rmm: A recursive mental model for dialogue navigation.Findings of the association for computational linguistics: EMNLP, 2020

    Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, and Jianfeng Gao. Rmm: A recursive mental model for dialogue navigation.Findings of the association for computational linguistics: EMNLP, 2020

  50. [50]

    Habitat: A Platform for Embodied AI Research.ICCV, 2019

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research.ICCV, 2019

  51. [51]

    Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation

    Xiangyu Shi, Zerui Li, Wenqi Lyu, Jiatong Xia, Feras Dayoub, Yanyuan Qiao, and Qi Wu. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 16923–16930. IEEE, 2025. 12

  52. [52]

    View invariant learning for vision-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2026

    Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, and Mark Crowley. View invariant learning for vision-language navigation in continuous environments.IEEE Robotics and Automation Letters, 2026

  53. [53]

    Collaborative instance object navigation: Leveraging uncertainty-awareness to minimize human-agent dialogues

    Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, and Yiming Wang. Collaborative instance object navigation: Leveraging uncertainty-awareness to minimize human-agent dialogues. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18781–18792, 2025

  54. [54]

    Vision-and-dialog navigation

    Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. InConference on Robot Learning, pages 394–406, 2020

  55. [55]

    Vision-and- language navigation via causal learning

    Liuyi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, and Qijun Chen. Vision-and- language navigation via causal learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13139–13150, 2024

  56. [56]

    Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities

    Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025

  57. [57]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  58. [58]

    Gridmm: Grid memory map for vision-and-language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625–15636, 2023

  59. [59]

    Scaling data generation in vision-and-language navigation

    Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023

  60. [60]

    Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation.arXiv preprint arXiv:2512.08186, 2025

    Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation.arXiv preprint arXiv:2512.08186, 2025

  61. [61]

    Hypernav: Hybrid perception for object-oriented navigation in unknown environment.arXiv preprint arXiv:2510.22917, 2025

    Zecheng Yin, Hao Zhao, and Zhen Li. Hypernav: Hybrid perception for object-oriented navigation in unknown environment.arXiv preprint arXiv:2510.22917, 2025

  62. [62]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024

  63. [63]

    NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024

  64. [64]

    Spatialnav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation

    Jiwen Zhang, Zejun Li, Siyuan Wang, Xiangyu Shi, Zhongyu Wei, and Qi Wu. Spatialnav: Leveraging spatial scene graphs for zero-shot vision-and-language navigation. arXiv preprint arXiv:2601.06806, 2026

  65. [65]

    Spatialant: Autonomous zero-shot robot navigation via active scene reconstruction and visual anticipation

    Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei, and Qi Wu. Spatialant: Autonomous zero-shot robot navigation via active scene reconstruction and visual anticipation. arXiv preprint arXiv:2603.26837, 2026

  66. [66]

    MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

    Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation. arXiv preprint arXiv:2502.13451, 2025

  67. [67]

    Vln-mme: Diagnosing mllms as language-guided visual navigation agents

    Xunyi Zhao, Gengze Zhou, and Qi Wu. Vln-mme: Diagnosing mllms as language-guided visual navigation agents. arXiv preprint arXiv:2512.24851, 2025

  68. [68]

    Towards learning a generalist model for embodied navigation

    Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. arXiv preprint arXiv:2312.02010, 2023

  69. [69]

    Spatial-aware and viewpoint-robust vision-language navigation

    Zhide Zhong, Jia Lu, Xiangchen Liu, Runze Yu, Xinhu Zheng, Zhe Liu, Hesheng Wang, and Haoang Li. Spatial-aware and viewpoint-robust vision-language navigation. IEEE Transactions on Circuits and Systems for Video Technology, 2026

  70. [70]

    Navgpt: Explicit reasoning in vision-and-language navigation with large language models

    Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024

  71. [71]

    Navgpt-2: Unleashing navigational reasoning capability for large vision-language models

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260–278. Springer, 2025

  72. [72]

    Same: Learning generic language-guided visual navigation with state-adaptive mixture of experts

    Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, and Qi Wu. Same: Learning generic language-guided visual navigation with state-adaptive mixture of experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7794–7807, 2025

  73. [73]

    Soon: Scenario oriented object navigation with graph-based exploration

    Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12689–12699, 2021

Appendix A Uncertainty Mining and Question Penalties

A.1 Annotation Sources and Protocol

The uncer...