Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation
Pith reviewed 2026-06-28 10:58 UTC · model grok-4.3
The pith
An agent in instance goal navigation should ask an oracle question only when its expected reduction in navigation uncertainty exceeds the query's cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Interactive instance goal navigation is recast as cost-sensitive uncertainty reduction: the agent selects the question whose answer yields the largest drop in navigation uncertainty relative to its derived penalty. An information-gain analysis performed on prior navigation corpora supplies a compact taxonomy of question types together with empirical weights that quantify each type's typical contribution to uncertainty reduction. These weights are used both to construct a new benchmark that records query cost and to drive a decision rule inside a zero-shot MLLM navigator that queries only when the expected reduction exceeds the penalty.
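A minimal sketch of how such a decision rule could look; it is not the paper's implementation. It assumes a discrete belief over candidate instances, a known answer likelihood for each question type, and penalties expressed in the same unit (bits) as the expected entropy reduction. "Relative to its penalty" is read here as a difference, which is only one possible reading. All names (`expected_gain`, `choose_question`, the question labels) are hypothetical.

```python
import math
from typing import Dict, List, Optional


def entropy(p: List[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_gain(belief: List[float], likelihood: List[List[float]]) -> float:
    """Expected entropy reduction from asking one question.

    belief[i]        : prior probability that candidate i is the target
    likelihood[a][i] : P(answer a | candidate i is the target)
    """
    gain = entropy(belief)
    for row in likelihood:
        p_answer = sum(l * b for l, b in zip(row, belief))
        if p_answer == 0:
            continue  # this answer cannot occur under the current belief
        posterior = [l * b / p_answer for l, b in zip(row, belief)]
        gain -= p_answer * entropy(posterior)
    return gain


def choose_question(
    belief: List[float],
    questions: Dict[str, List[List[float]]],
    penalties: Dict[str, float],
) -> Optional[str]:
    """Return the question with the largest positive net gain, or None (do not ask)."""
    best, best_net = None, 0.0
    for name, likelihood in questions.items():
        net = expected_gain(belief, likelihood) - penalties[name]
        if net > best_net:
            best, best_net = name, net
    return best


# Four equally likely candidate instances (hypothetical numbers).
belief = [0.25, 0.25, 0.25, 0.25]
questions = {
    # An attribute question that splits the candidates two and two.
    "attribute": [[1, 1, 0, 0], [0, 0, 1, 1]],
    # A route-level question whose answer identifies the target outright.
    "route": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
}
penalties = {"attribute": 0.2, "route": 1.5}
print(choose_question(belief, questions, penalties))  # prints "attribute"
```

With these illustrative numbers the route question removes more uncertainty (2 bits against 1), but its higher penalty leaves a smaller net gain, so the cheaper clarification is preferred. When every net gain is non-positive the function returns `None` and the agent keeps navigating without asking.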
What carries the argument
The information-gain analysis that converts navigation corpora into a ranked set of question types and their relative cost weights for uncertainty reduction.
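One way such an analysis could turn annotated corpora into per-type weights is sketched below. The record format (question type, entropy before the answer, entropy after it) and the normalization to a sum of one are assumptions made for illustration; the page does not state how the paper computes or scales its weights. The function name `derive_weights` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def derive_weights(records: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Average the entropy reduction per question type and normalize to sum to 1.

    Each record is (question_type, entropy_before, entropy_after), in bits.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for qtype, h_before, h_after in records:
        totals[qtype] += h_before - h_after
        counts[qtype] += 1
    mean_gain = {q: totals[q] / counts[q] for q in totals}
    norm = sum(mean_gain.values())
    if norm <= 0:
        raise ValueError("no positive uncertainty reduction in the records")
    return {q: g / norm for q, g in mean_gain.items()}


# Hypothetical annotations: two attribute questions and one route question.
records = [
    ("attribute", 2.0, 1.5),
    ("attribute", 2.0, 1.0),
    ("route", 2.0, 0.0),
]
weights = derive_weights(records)
```

In this toy corpus the route question receives the larger weight because it removes more uncertainty on average, which matches the intuition that route-level guidance should cost more than a lightweight clarification.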
If this is right
- Agents reach target instances with fewer total queries while preserving success rate.
- The weighted success metric ranks methods by both accuracy and interaction efficiency.
- A single zero-shot MLLM can implement the cost-sensitive policy without task-specific fine-tuning.
- Benchmarks that ignore query cost will overestimate the value of high-frequency questioning strategies.
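The effect of charging for queries can be illustrated with a small sketch. The abstract says only that the Weighted Success Rate penalizes each query by its derived cost; the exact formula is not given on this page, so the form used here (a successful episode scores one minus its summed query costs, floored at zero) is an assumption, as are the cost values and the function name.

```python
from typing import Dict, List, Tuple


def weighted_success_rate(
    episodes: List[Tuple[bool, List[str]]],
    cost: Dict[str, float],
) -> float:
    """Mean per-episode score; each episode is (success, list of question types asked).

    Assumed scoring: a failed episode scores 0, a successful one scores
    max(0, 1 - sum of the costs of its queries).
    """
    scores = []
    for success, queries in episodes:
        penalty = sum(cost[q] for q in queries)
        scores.append(max(0.0, 1.0 - penalty) if success else 0.0)
    return sum(scores) / len(scores)


cost = {"attribute": 0.1, "route": 0.6}  # hypothetical per-type costs
episodes = [
    (True, []),                          # success without asking
    (True, ["attribute", "attribute"]),  # success after two cheap questions
    (True, ["route"]),                   # success after one expensive question
    (False, []),                         # failure
]
plain_sr = sum(s for s, _ in episodes) / len(episodes)
wsr = weighted_success_rate(episodes, cost)
```

Under these assumed numbers the plain success rate is 0.75 while the weighted score is about 0.55, which shows how a cost-blind benchmark would credit the route-level query as fully as the unaided success.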
Where Pith is reading between the lines
- The same cost-sensitive selection rule could be applied to other embodied tasks that involve open-ended clarification, such as visual dialog or instruction following.
- If the derived weights prove stable across environments, they could serve as a lightweight prior for training future interactive agents rather than learning costs from scratch.
- Extending the analysis to include the cost of waiting for an answer or the risk of receiving noisy oracle responses would make the model more realistic for real-world deployment.
Load-bearing premise
The question types and relative weights obtained from information-gain analysis on existing corpora continue to predict useful uncertainty reduction in new, previously unseen environments.
What would settle it
Run the same navigator on a fresh set of episodes drawn from environments never seen in the original corpora; if the weighted success rate drops sharply or the model begins issuing many low-value queries, the derived weights no longer transfer.
Abstract
Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to recast interactive Instance Goal Navigation (IGN) as a cost-sensitive uncertainty-reduction problem. It performs information-gain analysis on existing navigation corpora to derive a compact set of question types and data-derived weights, constructs a new benchmark for diagnosing interaction behavior together with a Weighted Success Rate metric that penalizes queries by derived cost, and proposes a zero-shot MLLM navigator that selectively queries only when expected uncertainty reduction justifies the interaction cost.
Significance. If the derived weights generalize beyond the source corpora and the selective-query policy is shown to improve efficiency, the work would supply a principled, cost-aware framework for open-ended interaction in embodied navigation that prior methods lack.
major comments (1)
- [Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.
minor comments (1)
- [Abstract] The abstract states the approach and claims a zero-shot MLLM navigator but supplies no summary of experimental results, ablation studies, or quantitative validation that the derived weights actually improve efficiency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below.
Referee: [Abstract and information-gain analysis section] The information-gain analysis on existing navigation corpora is used to produce question types and weights that are then deployed in the new benchmark and zero-shot MLLM policy, yet the manuscript supplies no held-out splits, cross-corpus validation, or sensitivity checks demonstrating that these weights remain predictive on unseen episodes and environments. This is load-bearing for the central claim that the agent should ask only when the answer provides the largest reduction in navigation uncertainty relative to its penalty.
Authors: We agree that the absence of held-out splits, cross-corpus validation, and sensitivity checks is a limitation. The current derivation relies on the full corpora without explicit generalization tests. In the revised manuscript we will add held-out episode splits within each corpus, cross-corpus validation across the source navigation datasets, and sensitivity analysis on the resulting weights to confirm they remain predictive on unseen data.
Revision: yes
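A stability check of the kind discussed in this exchange could start from something like the sketch below. It is only an illustration: the record format (question type, observed entropy reduction), the random half split, and the two summary statistics are assumptions, not the protocol the authors commit to. The function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_gain_by_type(records: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average observed entropy reduction for each question type."""
    totals, counts = defaultdict(float), defaultdict(int)
    for qtype, gain in records:
        totals[qtype] += gain
        counts[qtype] += 1
    return {q: totals[q] / counts[q] for q in totals}


def split_half_agreement(
    records: List[Tuple[str, float]], seed: int = 0
) -> Tuple[bool, float]:
    """Compare per-type mean gains estimated on two random halves.

    Returns (same_ranking, max_abs_diff) over the question types that
    appear in both halves.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    a = mean_gain_by_type(shuffled[:mid])
    b = mean_gain_by_type(shuffled[mid:])
    shared = sorted(set(a) & set(b))
    rank_a = sorted(shared, key=lambda q: a[q], reverse=True)
    rank_b = sorted(shared, key=lambda q: b[q], reverse=True)
    max_diff = max((abs(a[q] - b[q]) for q in shared), default=0.0)
    return rank_a == rank_b, max_diff
```

If the ranking of question types flips between halves, or the per-type gains differ by a large margin, the weights are unlikely to be a reliable prior for unseen episodes. A cross-corpus version would replace the random halves with records drawn from different source datasets.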
Circularity Check
No significant circularity; derivation uses fixed corpus-derived weights as external input for new benchmark and zero-shot policy
The paper derives question types and weights via information-gain analysis on existing navigation corpora, then builds a new benchmark and Weighted Success Rate metric that incorporates those fixed derived costs, while proposing a zero-shot MLLM policy. This does not reduce any central claim to a self-fit or self-citation by construction; the weights serve as an independent, precomputed input rather than being refitted to the evaluation episodes or making success tautological. No load-bearing step matches the enumerated circularity patterns with a specific equation or definition that collapses to its own inputs.