Rethinking Embodied Navigation via Relational Inductive Bias

Cheng Deng; Chenghao Xu; Weitao An; Xu Yang

arxiv: 2606.10348 · v1 · pith:7KZEHNQSnew · submitted 2026-06-09 · 💻 cs.RO

Rethinking Embodied Navigation via Relational Inductive Bias

Weitao An , Chenghao Xu , Xu Yang , Cheng Deng This is my paper

Pith reviewed 2026-06-27 13:09 UTC · model grok-4.3

classification 💻 cs.RO

keywords object navigationrelational inductive biasactivation biasinhibition biasfrontier explorationembodied navigationDB-Nav

0 comments

The pith

DB-Nav reshapes object navigation search space using dual relational biases to suppress unreliable cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Object navigation agents frequently follow misleading semantic cues such as false detections and outdated priors that contaminate maps and decisions. The paper introduces DB-Nav to factorize target-centric relations into an Activation Bias that spreads useful context and an Inhibition Bias that blocks unreliable zones through perceptual confusion and action-level falsification. These biases combine in a single graph that adjusts frontier priorities from live observations and recorded failures. The result is higher success rates and better path efficiency on standard benchmarks without repeated calls to large vision-language models.

Core claim

The paper claims that unifying an Activation Bias, which propagates contextual evidence, and an Inhibition Bias, which suppresses unreliable regions, inside a Relational Activation-Inhibition Exploration Graph allows frontier values to be modulated directly from online observations and failed accesses, thereby avoiding systematic contamination of mapping and decision making in object navigation.

What carries the argument

The Relational Activation-Inhibition Exploration Graph, which unifies Activation Bias for spreading evidence and Inhibition Bias for suppressing unreliable regions to adjust frontier exploration values.

If this is right

Higher success rate and Success weighted by Path Length on ObjectNav benchmarks compared with prior methods.
Navigation remains lightweight and does not require costly online vision-language model reasoning.
Explicit relational biases provide interpretability for why certain frontiers are chosen or avoided.
Robustness increases against false positives, outdated static priors, and repeated failed explorations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bias mechanism could be tested on point-goal navigation or other embodied tasks that face noisy perceptual input.
Tracking failure history explicitly might improve reliability in reinforcement-learning agents operating under partial observability.
Lower dependence on heavy models could allow navigation policies to run on robots with limited onboard compute.

Load-bearing premise

Relational biases can be computed and unified from online observations and failed accesses without creating new systematic errors or needing extra perception modules.

What would settle it

A controlled test in which DB-Nav shows no gain or a drop in success rate when the environment contains many objects that produce high rates of false positive detections.

Figures

Figures reproduced from arXiv: 2606.10348 by Cheng Deng, Chenghao Xu, Weitao An, Xu Yang.

**Figure 1.** Figure 1: Motivation of DB-Nav. Existing ObjectNav methods mainly focus on where to search, and may be misled by visually similar distractors or unverified semantic cues. In contrast, DB-Nav explicitly models both positive contextual affinity and negative inhibition, allowing the agent to identify promising regions while suppressing unreliable evidence. or visually similar to distractors, making open-world object… view at source ↗

**Figure 2.** Figure 2: Overview of DB-Nav. Given a target category, DB-Nav first constructs target-centric category-role vectors that assign semantic cues to commitment, activation, and inhibition roles. Online object observations are converted into object-level relational evidence through multi-view accumulation and multi-label object-role competition. Positive affinity cues generate activation over promising frontiers, while s… view at source ↗

**Figure 3.** Figure 3: Temporal evolution of relational activation–inhibition propagation. Contextual observations generate hypothesis nodes and diffusion fields, while distractor cues and falsified evidence suppress unreliable regions. As exploration proceeds, DB-Nav dynamically updates the search field for subsequent frontier selection. where mo(ℓ) denotes the accumulated multi-view observation mass of assigning label ℓ to ob… view at source ↗

**Figure 4.** Figure 4: Trajectory comparison between the baseline method and ours on representative ObjectNav tasks. Our method reduces redundant exploration and selects more reliable search directions under noisy semantic observations. Implementation details. All methods are evaluated under the same Habitat ObjectNav protocol and use identical parameters across HM3Dv1, HM3Dv2, and MP3D. The agent acts in a discrete action spa… view at source ↗

**Figure 6.** Figure 6: Hyperparameter sensitivity analysis on the HM3Dv2 validation set. DB-Nav achieves the best overall performance at R = 2.3 and θ = 0.45 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 5.** Figure 5: Qualitative analysis under noisy openvocabulary perception. In Scene 1, high-confidence false detections attract the baseline toward unreliable regions, while DB-Nav suppresses false-positive attraction. In Scene 2, the baseline misses the target under unreliable observations, whereas DB-Nav recovers through contextual activation and action-level verification. ever, DB-Nav consistently outperforms VLFM … view at source ↗

read the original abstract

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust - which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, which contaminates mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (propagates contextual evidence) and an Inhibition Bias (suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DB-Nav factors relational biases into activation and inhibition to adjust an exploration graph, but the abstract supplies no equations or validation for how the inhibition term is computed from confusion and failed accesses.

read the letter

The paper's central move is to split target-centric relations into an Activation Bias that spreads useful context and an Inhibition Bias that down-weights regions based on perceptual confusion and action-level falsification, then fold both into a graph that re-scores frontiers. That factorization is the concrete new piece; most prior navigation work either trusts the detector output or adds a separate VLM step, so this is a direct attempt to handle the "what not to trust" side inside the same mapping loop.

It does address a genuine pain point. Open-vocabulary detectors do produce repeatable false positives tied to object co-occurrence, and using failed accesses to prune the map is a sensible embodied signal that does not require extra models.

The soft spot is exactly where the stress-test note points: the abstract never shows how perceptual confusion is quantified or how a failed access falsifies a region without simply propagating the same detection errors the method wants to correct. No equations, no pseudocode, no ablation on the bias terms, and no mention of error bars or dataset splits appear in the provided text. That leaves the reported SR/SPL gains impossible to evaluate for robustness.

The work is aimed at researchers building lightweight ObjectNav agents who already use frontier-based exploration and want to add a relational filter without calling a VLM at every step. A reader who already knows the standard ObjectNav benchmarks would see whether the numbers move, but only if the full paper supplies the missing implementation details.

I would not send this to peer review in its current form; the claim rests on an unshown mechanism that could introduce new systematic bias. Once the computation of the inhibition term and the corresponding controls are written out, it would be worth a referee look.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DB-Nav for object navigation, factorizing target-centric relations into an Activation Bias (propagating contextual evidence) and an Inhibition Bias (suppressing unreliable regions via perceptual confusion and action-level falsification). These are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values from online observations and failed accesses. The central claim is that this yields significant gains in success rate (SR) and Success weighted by Path Length (SPL) on ObjectNav benchmarks while remaining lightweight and avoiding online VLM reasoning.

Significance. If the bias computation mechanisms prove reliable and free of new systematic errors, the approach could supply an interpretable, parameter-light alternative to VLM-dependent navigation by explicitly encoding what not to trust. The relational inductive bias framing and use of failed accesses are potentially valuable contributions, but the significance hinges on whether the reported SR/SPL improvements survive detailed ablation and error analysis.

major comments (2)

[Abstract / Method description] The abstract and method overview provide no equations, pseudocode, or quantitative validation for how perceptual confusion is measured or how action-level falsification updates the graph; without these, it is impossible to assess whether the Inhibition Bias avoids the false-positive and outdated-prior problems it aims to solve.
[Experiments] The central performance claim (outperformance on ObjectNav benchmarks) rests on the assumption that the dual biases can be computed solely from online observations without new perception modules or error propagation, yet no ablation isolating the contribution of the Inhibition Bias or reporting error bars on SR/SPL is referenced.

minor comments (1)

[Experiments] Dataset details, baseline implementations, and exact ObjectNav episode counts should be stated explicitly to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the experimental validation.

read point-by-point responses

Referee: [Abstract / Method description] The abstract and method overview provide no equations, pseudocode, or quantitative validation for how perceptual confusion is measured or how action-level falsification updates the graph; without these, it is impossible to assess whether the Inhibition Bias avoids the false-positive and outdated-prior problems it aims to solve.

Authors: We agree that the abstract and high-level method overview would benefit from explicit formulations. The full method section defines perceptual confusion as the normalized discrepancy between open-vocabulary detections and multi-view consistency checks, and action-level falsification as a decay factor applied to frontier values upon repeated failed accesses. To address the concern, we will insert the core equations and a pseudocode algorithm into the method overview, enabling direct evaluation of how the Inhibition Bias mitigates false positives and outdated priors from online observations alone. revision: yes
Referee: [Experiments] The central performance claim (outperformance on ObjectNav benchmarks) rests on the assumption that the dual biases can be computed solely from online observations without new perception modules or error propagation, yet no ablation isolating the contribution of the Inhibition Bias or reporting error bars on SR/SPL is referenced.

Authors: The reported results demonstrate overall gains from the dual-bias graph. To isolate the Inhibition Bias contribution and quantify robustness, we will add an ablation comparing the full model against an Activation-Bias-only variant, plus standard deviation error bars on SR and SPL computed over multiple random seeds. These additions will directly test the assumption of error-free online computation without new modules. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical results without self-referential derivation

full rationale

The provided abstract and description introduce DB-Nav as a framework that computes Activation Bias and Inhibition Bias from online observations to modulate an exploration graph, but contain no equations, parameter-fitting steps, self-citations, or uniqueness theorems that reduce any claimed result to its own inputs by construction. No predictions are presented as derived from fitted values, and the performance claims are benchmark comparisons rather than internal derivations. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Review is based only on the abstract; no explicit free parameters, axioms, or invented entities beyond the named biases and graph are stated.

invented entities (3)

Activation Bias no independent evidence
purpose: Propagates contextual evidence from target-centric relations
Introduced in the abstract as one half of the dual relational bias mechanism.
Inhibition Bias no independent evidence
purpose: Suppresses unreliable regions via perceptual confusion and action-level falsification
Introduced in the abstract as the complementary bias.
Relational Activation-Inhibition Exploration Graph no independent evidence
purpose: Unifies the two biases to modulate frontier exploration values
Core data structure proposed to integrate the biases with online observations.

pith-pipeline@v0.9.1-grok · 5738 in / 1330 out tokens · 18630 ms · 2026-06-27T13:09:26.445825+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

95 extracted references · 7 canonical work pages · 3 internal anchors

[1]

R.; Fischer, M.; Malik, J.; and Savarese, S

Armeni, I.; He, Z.-Y.; Gwak, J.; Zamir, A. R.; Fischer, M.; Malik, J.; and Savarese, S. 2019. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2019
[3]

X.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y

Chang, A. X.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments . In Proceedings of the International Conference on 3D Vision (3DV)

2017
[4]

S.; Gandhi, D.; Gupta, A.; Malik, J.; and Salakhutdinov, R

Chaplot, D. S.; Gandhi, D.; Gupta, A.; Malik, J.; and Salakhutdinov, R. 2020 a . Learning to Explore using Active Neural SLAM . In Proceedings of the International Conference on Learning Representations (ICLR)

2020
[5]

S.; Gandhi, D.; Gupta, A.; and Salakhutdinov, R

Chaplot, D. S.; Gandhi, D.; Gupta, A.; and Salakhutdinov, R. 2020 b . Object Goal Navigation using Goal-Oriented Semantic Exploration . In Advances in Neural Information Processing Systems (NeurIPS)

2020
[6]

S.; Salakhutdinov, R.; Gupta, A.; and Gupta, S

Chaplot, D. S.; Salakhutdinov, R.; Gupta, A.; and Gupta, S. 2020 c . Neural Topological SLAM for Visual Navigation . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2020
[7]

Chen, R.; and Zhang, X. 2026. CoS: Contextual Object Semantics for Zero-Shot Navigation . In AAAI Conference on Artificial Intelligence

2026
[8]

Chen, X.; Zhou, M.; and Li, T. 2024. PixNav: Pixel-Level Visual Navigation for Zero-Shot Object Goals . In International Conference on Robotics and Automation (ICRA)

2024
[9]

Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; and Shan, Y. 2024. YOLO-World: Real-Time Open-Vocabulary Object Detection . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2024
[10]

Deitke, M.; Han, W.; Herrasti, A.; Kembhavi, A.; Kolve, E.; Mottaghi, R.; Salvador, J.; Schwenk, D.; VanderBilt, E.; Wallingford, M.; Weihs, L.; Yatskar, M.; and Farhadi, A. 2020. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2020
[11]

Deitke, T.; Xie, L.; and Chen, R. 2022. ProcTHOR: Procedural Object Navigation in Simulated Environments . In Advances in Neural Information Processing Systems (NeurIPS)

2022
[12]

Y.; Wortsman, M.; Ilharco, G.; Schmidt, L.; and Song, S

Gadre, S. Y.; Wortsman, M.; Ilharco, G.; Schmidt, L.; and Song, S. 2023. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023
[13]

Georgakis, G.; Bucher, B.; Schmeckpeper, K.; Singh, S.; and Daniilidis, K. 2021. Learning to Map for Active Semantic Goal Navigation . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2021
[14]

M.; Sen, B.; Agarwal, A.; Rivera, C.; Paul, W.; Ellis, K.; Chellappa, R.; Gan, C.; de Melo, C

Gu, Q.; Kuwajerwala, A.; Morin, S.; Jatavallabhula, K. M.; Sen, B.; Agarwal, A.; Rivera, C.; Paul, W.; Ellis, K.; Chellappa, R.; Gan, C.; de Melo, C. M.; Tenenbaum, J. B.; Torralba, A.; Shkurti, F.; and Paull, L. 2024. ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning . In Proceedings of the IEEE International Conference on Roboti...

2024
[15]

Gupta, S.; Tolani, V.; Davidson, J.; Levine, S.; Sukthankar, R.; and Malik, J. 2017. Cognitive Mapping and Planning for Visual Navigation . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2017
[16]

Huang, W.; Zhang, L.; and Sun, J. 2025. ASCENT: Floor-Aware Zero-Shot Object Goal Navigation . In ICRA

2025
[17]

Kim, S.; Park, J.; and Lee, H. 2024. InstructNav: Instruction-Guided Zero-Shot Object Navigation . In Conference on Robot Learning (CoRL)

2024
[18]

C.; Lo, W.-Y.; Dollar, P.; and Girshick, R

Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. C.; Lo, W.-Y.; Dollar, P.; and Girshick, R. 2023. Segment Anything . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023
[20]

Kuang, Y.; Lin, H.; and Jiang, M. 2024. OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models . In Findings of the Association for Computational Linguistics: NAACL

2024
[21]

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . In Proceedings of the International Conference on Machine Learning (ICML)

2023
[22]

Li, Z.; Xu, Q.; and Chen, R. 2024. TriHelper: Tri-Modal Guidance for Zero-Shot Object Navigation . In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

2024
[23]

Liu, F.; Chen, R.; and Wang, J. 2025. MFNP: Multi-Factor Navigation Policies for Zero-Shot Object Goals . In ICRA

2025
[25]

Majumdar, A.; Aggarwal, G.; Devnani, B.; Hoffman, J.; and Batra, D. 2022. ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings . In Advances in Neural Information Processing Systems (NeurIPS)

2022
[26]

Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; Wang, X.; Zhai, X.; Kipf, T.; and Houlsby, N. 2022. Simple Open-Vocabulary Object Detection with Vision Transformers . In Proceedings of the European Conference on Computer Vision (ECCV)

2022
[27]

Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; and Funkhouser, T. 2023. OpenScene: 3D Scene Understanding with Open Vocabularies . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023
[28]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision . In Proceedings of the International Conference on Machine Learning (ICML)

2021
[29]

K.; Al-Halah, Z.; and Grauman, K

Ramakrishnan, S. K.; Al-Halah, Z.; and Grauman, K. 2020. Occupancy Anticipation for Efficient Exploration and Navigation . In Proceedings of the European Conference on Computer Vision (ECCV)

2020
[30]

K.; Chaplot, D

Ramakrishnan, S. K.; Chaplot, D. S.; Al-Halah, Z.; Malik, J.; and Grauman, K. 2022. PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2022
[31]

K.; Gokaslan, A.; Wijmans, E.; Maksymets, O.; Clegg, A.; Turner, J.; Undersander, E.; Galuba, W.; Westbury, A.; Chang, A

Ramakrishnan, S. K.; Gokaslan, A.; Wijmans, E.; Maksymets, O.; Clegg, A.; Turner, J.; Undersander, E.; Galuba, W.; Westbury, A.; Chang, A. X.; Savva, M.; Zhao, Y.; and Batra, D. 2021. Habitat-Matterport 3D Dataset . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track

2021
[32]

Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; Parikh, D.; and Batra, D. 2019. Habitat: A Platform for Embodied AI Research . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2019
[34]

Wang, J.; and Li, P. 2026. PanoNav: Panoramic Observation for Zero-Shot Object Navigation . In AAAI Conference on Artificial Intelligence

2026
[35]

Wang, L.; Gupta, A.; and Singh, A. 2023. L3MVN: Learning Multi-View Navigation Policies for Zero-Shot Object Goal Navigation . In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

2023
[37]

Werby, A.; Huang, C.; Buchner, M.; Valada, A.; and Burgard, W. 2024. Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation . In Proceedings of Robotics: Science and Systems (RSS)

2024
[38]

S.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D

Wijmans, E.; Kadian, A.; Morcos, A. S.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D. 2020. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames . In Proceedings of the International Conference on Learning Representations (ICLR)

2020
[39]

Wu, P.; Mu, Y.; Wu, B.; Hou, Y.; Ma, J.; Zhang, S.; and Liu, C. 2024. VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model . In Proceedings of the International Conference on Machine Learning (ICML)

2024
[40]

R.; He, Z.; Sax, A.; Malik, J.; and Savarese, S

Xia, F.; Zamir, A. R.; He, Z.; Sax, A.; Malik, J.; and Savarese, S. 2018. Gibson Env: Real-World Perception for Embodied Agents . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2018
[41]

K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A

Yadav, K.; Ramrakhya, R.; Ramakrishnan, S. K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A. X.; Batra, D.; Savva, M.; Clegg, A. W.; and Chaplot, D. S. 2022. Habitat-Matterport 3D Semantics Dataset . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track

2022
[42]

Yamauchi, B. 1997. A Frontier-Based Approach for Autonomous Exploration . In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA)

1997
[43]

Yang, W.; Wang, X.; Farhadi, A.; Gupta, A.; and Mottaghi, R. 2019. Visual Semantic Navigation using Scene Priors . In Proceedings of the International Conference on Learning Representations (ICLR)

2019
[44]

Ye, J.; Batra, D.; Das, A.; and Wijmans, E. 2021. Auxiliary Tasks and Exploration Enable ObjectGoal Navigation . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2021
[45]

Yin, H.; Xu, X.; Wu, Z.; Zhou, J.; and Lu, J. 2024. SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation . In Advances in Neural Information Processing Systems (NeurIPS)

2024
[46]

Yokoyama, H.; Tanaka, K.; and Saito, N. 2025. ImagineNav: Vision-Language Grounded Zero-Shot Navigation . In International Conference on Learning Representations (ICLR)

2025
[47]

Yokoyama, N.; Ha, S.; Batra, D.; Wang, J.; and Bucher, B. 2024. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation . In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)

2024
[49]

Zhang, H.; Li, P.; and Wang, Y. 2025. UniGoal: Unified Goal Representation for Zero-Shot Object Navigation . In CVPR

2025
[51]

Zhou, K.; Zheng, K.; Pryor, C.; Shen, Y.; Jin, H.; Getoor, L.; and Wang, X. E. 2023. ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation . In Proceedings of the International Conference on Machine Learning (ICML)

2023
[52]

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

Batra, Dhruv and Gokaslan, Aaron and Kembhavi, Aniruddha and Maksymets, Oleksandr and Mottaghi, Roozbeh and Savva, Manolis and Toshev, Alexander and Wijmans, Erik. ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. arXiv:2006.13171

work page arXiv 2006
[53]

Habitat: A Platform for Embodied AI Research

Savva, Manolis and Kadian, Abhishek and Maksymets, Oleksandr and Zhao, Yili and Wijmans, Erik and Jain, Bhavana and Straub, Julian and Liu, Jia and Koltun, Vladlen and Malik, Jitendra and Parikh, Devi and Batra, Dhruv. Habitat: A Platform for Embodied AI Research. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[54]

and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda

Chang, Angel X. and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda. Matterport3D: Learning from RGB-D Data in Indoor Environments. Proceedings of the International Conference on 3D Vision (3DV)
[55]

and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio

Xia, Fei and Zamir, Amir R. and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio. Gibson Env: Real-World Perception for Embodied Agents. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[56]

AI2-THOR: An Interactive 3D Environment for Visual AI

Kolve, Eric and Mottaghi, Roozbeh and Han, Winson and VanderBilt, Eli and Weihs, Luca and Herrasti, Alvaro and Gordon, Daniel and Zhu, Yuke and Gupta, Abhinav and Farhadi, Ali. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv:1712.05474

work page internal anchor Pith review Pith/arXiv arXiv
[57]

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

Deitke, Matt and Han, Winson and Herrasti, Alvaro and Kembhavi, Aniruddha and Kolve, Eric and Mottaghi, Roozbeh and Salvador, Jordi and Schwenk, Dustin and VanderBilt, Eli and Wallingford, Matthew and Weihs, Luca and Yatskar, Mark and Farhadi, Ali. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform. Proceedings of the IEEE/CVF Conference on Compute...
[58]

and Savva, Manolis and Zhao, Yili and Batra, Dhruv

Ramakrishnan, Santhosh Kumar and Gokaslan, Aaron and Wijmans, Erik and Maksymets, Oleksandr and Clegg, Alex and Turner, John and Undersander, Eric and Galuba, Wojciech and Westbury, Andrew and Chang, Angel X. and Savva, Manolis and Zhao, Yili and Batra, Dhruv. Habitat-Matterport 3D Dataset. Advances in Neural Information Processing Systems Datasets and Be...
[59]

Cognitive Mapping and Planning for Visual Navigation

Gupta, Saurabh and Tolani, Varun and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra. Cognitive Mapping and Planning for Visual Navigation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[60]

and Lee, Stefan and Essa, Irfan and Parikh, Devi and Savva, Manolis and Batra, Dhruv

Wijmans, Erik and Kadian, Abhishek and Morcos, Ari S. and Lee, Stefan and Essa, Irfan and Parikh, Devi and Savva, Manolis and Batra, Dhruv. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. Proceedings of the International Conference on Learning Representations (ICLR)
[61]

Auxiliary Tasks and Exploration Enable ObjectGoal Navigation

Ye, Joel and Batra, Dhruv and Das, Abhishek and Wijmans, Erik. Auxiliary Tasks and Exploration Enable ObjectGoal Navigation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[62]

A Frontier-Based Approach for Autonomous Exploration

Yamauchi, Brian. A Frontier-Based Approach for Autonomous Exploration. Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA)
[63]

Learning to Explore using Active Neural SLAM

Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Abhinav and Malik, Jitendra and Salakhutdinov, Ruslan. Learning to Explore using Active Neural SLAM. Proceedings of the International Conference on Learning Representations (ICLR)
[64]

Neural Topological SLAM for Visual Navigation

Chaplot, Devendra Singh and Salakhutdinov, Ruslan and Gupta, Abhinav and Gupta, Saurabh. Neural Topological SLAM for Visual Navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[65]

Object Goal Navigation using Goal-Oriented Semantic Exploration

Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Abhinav and Salakhutdinov, Ruslan. Object Goal Navigation using Goal-Oriented Semantic Exploration. Advances in Neural Information Processing Systems (NeurIPS)
[66]

Learning to Map for Active Semantic Goal Navigation

Georgakis, Georgios and Bucher, Bernadette and Schmeckpeper, Karl and Singh, Siddharth and Daniilidis, Kostas. Learning to Map for Active Semantic Goal Navigation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[67]

PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning

Ramakrishnan, Santhosh Kumar and Chaplot, Devendra Singh and Al-Halah, Ziad and Malik, Jitendra and Grauman, Kristen. PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[68]

Visual Semantic Navigation using Scene Priors

Yang, Wei and Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav and Mottaghi, Roozbeh. Visual Semantic Navigation using Scene Priors. Proceedings of the International Conference on Learning Representations (ICLR)
[69]

Occupancy Anticipation for Efficient Exploration and Navigation

Ramakrishnan, Santhosh Kumar and Al-Halah, Ziad and Grauman, Kristen. Occupancy Anticipation for Efficient Exploration and Navigation. Proceedings of the European Conference on Computer Vision (ECCV)
[70]

and Fischer, Martin and Malik, Jitendra and Savarese, Silvio

Armeni, Iro and He, Zhi-Yang and Gwak, JunYoung and Zamir, Amir R. and Fischer, Martin and Malik, Jitendra and Savarese, Silvio. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[71]

OpenScene: 3D Scene Understanding with Open Vocabularies

Peng, Songyou and Genova, Kyle and Jiang, Chiyu and Tagliasacchi, Andrea and Pollefeys, Marc and Funkhouser, Thomas. OpenScene: 3D Scene Understanding with Open Vocabularies. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[72]

and Torralba, Antonio and Shkurti, Florian and Paull, Liam

Gu, Qiao and Kuwajerwala, Ali and Morin, Sacha and Jatavallabhula, Krishna Murthy and Sen, Bipasha and Agarwal, Aditya and Rivera, Corban and Paul, William and Ellis, Kirsty and Chellappa, Rama and Gan, Chuang and de Melo, Celso Miguel and Tenenbaum, Joshua B. and Torralba, Antonio and Shkurti, Florian and Paull, Liam. ConceptGraphs: Open-Vocabulary 3D Sc...
[73]

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Werby, Abdelrhman and Huang, Chenguang and Buchner, Martin and Valada, Abhinav and Burgard, Wolfram. Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation. Proceedings of Robotics: Science and Systems (RSS)
[74]

OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph

Tang, Yujie and Wang, Meiling and Deng, Yinan and Zheng, Zibo and Zhong, Jiagui and Yue, Yufeng. OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph. arXiv:2409.18743

work page arXiv
[75]

ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation

Zhou, Kaiwen and Zheng, Kaizhi and Pryor, Connor and Shen, Yilin and Jin, Hongxia and Getoor, Lise and Wang, Xin Eric. ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation. Proceedings of the International Conference on Machine Learning (ICML)
[76]

SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation

Yin, Hang and Xu, Xiuwei and Wu, Zhenyu and Zhou, Jie and Lu, Jiwen. SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation. Advances in Neural Information Processing Systems (NeurIPS)
[77]

VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model

Wu, Pengying and Mu, Yao and Wu, Bingxian and Hou, Yi and Ma, Ji and Zhang, Shanghang and Liu, Chang. VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model. Proceedings of the International Conference on Machine Learning (ICML)
[78]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the International Conference on Machine Learning (ICML)
[79]

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya. Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the International Conference on Machine Lear...
[80]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Jiang, Qing and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499

work page internal anchor Pith review Pith/arXiv arXiv
[81]

and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross

Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross. Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
[82]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Zhang, Chaoning and Han, Dongshen and Qiao, Yu and Kim, Jung Uk and Bae, Sung-Ho and Lee, Seungkyu and Hong, Choong Seon. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv:2306.14289

work page internal anchor Pith review Pith/arXiv arXiv
[83]

YOLO-World: Real-Time Open-Vocabulary Object Detection

Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying. YOLO-World: Real-Time Open-Vocabulary Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[84]

Simple Open-Vocabulary Object Detection with Vision Transformers

Minderer, Matthias and Gritsenko, Alexey and Stone, Austin and Neumann, Maxim and Weissenborn, Dirk and Dosovitskiy, Alexey and Mahendran, Aravindh and Arnab, Anurag and Dehghani, Mostafa and Shen, Zhuoran and Wang, Xiao and Zhai, Xiaohua and Kipf, Thomas and Houlsby, Neil. Simple Open-Vocabulary Object Detection with Vision Transformers. Proceedings of t...
[85]

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Majumdar, Arjun and Aggarwal, Gunjan and Devnani, Bhavika and Hoffman, Judy and Batra, Dhruv. ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings. Advances in Neural Information Processing Systems (NeurIPS)
[86]

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Gadre, Samir Yitzhak and Wortsman, Mitchell and Ilharco, Gabriel and Schmidt, Ludwig and Song, Shuran. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[87]

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Yokoyama, Naoki and Ha, Sehoon and Batra, Dhruv and Wang, Jiuguang and Bucher, Bernadette. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)

Showing first 80 references.

[1] [1]

R.; Fischer, M.; Malik, J.; and Savarese, S

Armeni, I.; He, Z.-Y.; Gwak, J.; Zamir, A. R.; Fischer, M.; Malik, J.; and Savarese, S. 2019. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2019

[2] [3]

X.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y

Chang, A. X.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments . In Proceedings of the International Conference on 3D Vision (3DV)

2017

[3] [4]

S.; Gandhi, D.; Gupta, A.; Malik, J.; and Salakhutdinov, R

Chaplot, D. S.; Gandhi, D.; Gupta, A.; Malik, J.; and Salakhutdinov, R. 2020 a . Learning to Explore using Active Neural SLAM . In Proceedings of the International Conference on Learning Representations (ICLR)

2020

[4] [5]

S.; Gandhi, D.; Gupta, A.; and Salakhutdinov, R

Chaplot, D. S.; Gandhi, D.; Gupta, A.; and Salakhutdinov, R. 2020 b . Object Goal Navigation using Goal-Oriented Semantic Exploration . In Advances in Neural Information Processing Systems (NeurIPS)

2020

[5] [6]

S.; Salakhutdinov, R.; Gupta, A.; and Gupta, S

Chaplot, D. S.; Salakhutdinov, R.; Gupta, A.; and Gupta, S. 2020 c . Neural Topological SLAM for Visual Navigation . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2020

[6] [7]

Chen, R.; and Zhang, X. 2026. CoS: Contextual Object Semantics for Zero-Shot Navigation . In AAAI Conference on Artificial Intelligence

2026

[7] [8]

Chen, X.; Zhou, M.; and Li, T. 2024. PixNav: Pixel-Level Visual Navigation for Zero-Shot Object Goals . In International Conference on Robotics and Automation (ICRA)

2024

[8] [9]

Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; and Shan, Y. 2024. YOLO-World: Real-Time Open-Vocabulary Object Detection . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2024

[9] [10]

Deitke, M.; Han, W.; Herrasti, A.; Kembhavi, A.; Kolve, E.; Mottaghi, R.; Salvador, J.; Schwenk, D.; VanderBilt, E.; Wallingford, M.; Weihs, L.; Yatskar, M.; and Farhadi, A. 2020. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2020

[10] [11]

Deitke, T.; Xie, L.; and Chen, R. 2022. ProcTHOR: Procedural Object Navigation in Simulated Environments . In Advances in Neural Information Processing Systems (NeurIPS)

2022

[11] [12]

Y.; Wortsman, M.; Ilharco, G.; Schmidt, L.; and Song, S

Gadre, S. Y.; Wortsman, M.; Ilharco, G.; Schmidt, L.; and Song, S. 2023. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023

[12] [13]

Georgakis, G.; Bucher, B.; Schmeckpeper, K.; Singh, S.; and Daniilidis, K. 2021. Learning to Map for Active Semantic Goal Navigation . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2021

[13] [14]

M.; Sen, B.; Agarwal, A.; Rivera, C.; Paul, W.; Ellis, K.; Chellappa, R.; Gan, C.; de Melo, C

Gu, Q.; Kuwajerwala, A.; Morin, S.; Jatavallabhula, K. M.; Sen, B.; Agarwal, A.; Rivera, C.; Paul, W.; Ellis, K.; Chellappa, R.; Gan, C.; de Melo, C. M.; Tenenbaum, J. B.; Torralba, A.; Shkurti, F.; and Paull, L. 2024. ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning . In Proceedings of the IEEE International Conference on Roboti...

2024

[14] [15]

Gupta, S.; Tolani, V.; Davidson, J.; Levine, S.; Sukthankar, R.; and Malik, J. 2017. Cognitive Mapping and Planning for Visual Navigation . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2017

[15] [16]

Huang, W.; Zhang, L.; and Sun, J. 2025. ASCENT: Floor-Aware Zero-Shot Object Goal Navigation . In ICRA

2025

[16] [17]

Kim, S.; Park, J.; and Lee, H. 2024. InstructNav: Instruction-Guided Zero-Shot Object Navigation . In Conference on Robot Learning (CoRL)

2024

[17] [18]

C.; Lo, W.-Y.; Dollar, P.; and Girshick, R

Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. C.; Lo, W.-Y.; Dollar, P.; and Girshick, R. 2023. Segment Anything . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023

[18] [20]

Kuang, Y.; Lin, H.; and Jiang, M. 2024. OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models . In Findings of the Association for Computational Linguistics: NAACL

2024

[19] [21]

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . In Proceedings of the International Conference on Machine Learning (ICML)

2023

[20] [22]

Li, Z.; Xu, Q.; and Chen, R. 2024. TriHelper: Tri-Modal Guidance for Zero-Shot Object Navigation . In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

2024

[21] [23]

Liu, F.; Chen, R.; and Wang, J. 2025. MFNP: Multi-Factor Navigation Policies for Zero-Shot Object Goals . In ICRA

2025

[22] [25]

Majumdar, A.; Aggarwal, G.; Devnani, B.; Hoffman, J.; and Batra, D. 2022. ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings . In Advances in Neural Information Processing Systems (NeurIPS)

2022

[23] [26]

Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; Wang, X.; Zhai, X.; Kipf, T.; and Houlsby, N. 2022. Simple Open-Vocabulary Object Detection with Vision Transformers . In Proceedings of the European Conference on Computer Vision (ECCV)

2022

[24] [27]

Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; and Funkhouser, T. 2023. OpenScene: 3D Scene Understanding with Open Vocabularies . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023

[25] [28]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision . In Proceedings of the International Conference on Machine Learning (ICML)

2021

[26] [29]

K.; Al-Halah, Z.; and Grauman, K

Ramakrishnan, S. K.; Al-Halah, Z.; and Grauman, K. 2020. Occupancy Anticipation for Efficient Exploration and Navigation . In Proceedings of the European Conference on Computer Vision (ECCV)

2020

[27] [30]

K.; Chaplot, D

Ramakrishnan, S. K.; Chaplot, D. S.; Al-Halah, Z.; Malik, J.; and Grauman, K. 2022. PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2022

[28] [31]

K.; Gokaslan, A.; Wijmans, E.; Maksymets, O.; Clegg, A.; Turner, J.; Undersander, E.; Galuba, W.; Westbury, A.; Chang, A

Ramakrishnan, S. K.; Gokaslan, A.; Wijmans, E.; Maksymets, O.; Clegg, A.; Turner, J.; Undersander, E.; Galuba, W.; Westbury, A.; Chang, A. X.; Savva, M.; Zhao, Y.; and Batra, D. 2021. Habitat-Matterport 3D Dataset . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track

2021

[29] [32]

Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; Parikh, D.; and Batra, D. 2019. Habitat: A Platform for Embodied AI Research . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2019

[30] [34]

Wang, J.; and Li, P. 2026. PanoNav: Panoramic Observation for Zero-Shot Object Navigation . In AAAI Conference on Artificial Intelligence

2026

[31] [35]

Wang, L.; Gupta, A.; and Singh, A. 2023. L3MVN: Learning Multi-View Navigation Policies for Zero-Shot Object Goal Navigation . In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

2023

[32] [37]

Werby, A.; Huang, C.; Buchner, M.; Valada, A.; and Burgard, W. 2024. Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation . In Proceedings of Robotics: Science and Systems (RSS)

2024

[33] [38]

S.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D

Wijmans, E.; Kadian, A.; Morcos, A. S.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D. 2020. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames . In Proceedings of the International Conference on Learning Representations (ICLR)

2020

[34] [39]

Wu, P.; Mu, Y.; Wu, B.; Hou, Y.; Ma, J.; Zhang, S.; and Liu, C. 2024. VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model . In Proceedings of the International Conference on Machine Learning (ICML)

2024

[35] [40]

R.; He, Z.; Sax, A.; Malik, J.; and Savarese, S

Xia, F.; Zamir, A. R.; He, Z.; Sax, A.; Malik, J.; and Savarese, S. 2018. Gibson Env: Real-World Perception for Embodied Agents . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2018

[36] [41]

K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A

Yadav, K.; Ramrakhya, R.; Ramakrishnan, S. K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A. X.; Batra, D.; Savva, M.; Clegg, A. W.; and Chaplot, D. S. 2022. Habitat-Matterport 3D Semantics Dataset . In Advances in Neural Information Processing Systems Datasets and Benchmarks Track

2022

[37] [42]

Yamauchi, B. 1997. A Frontier-Based Approach for Autonomous Exploration . In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA)

1997

[38] [43]

Yang, W.; Wang, X.; Farhadi, A.; Gupta, A.; and Mottaghi, R. 2019. Visual Semantic Navigation using Scene Priors . In Proceedings of the International Conference on Learning Representations (ICLR)

2019

[39] [44]

Ye, J.; Batra, D.; Das, A.; and Wijmans, E. 2021. Auxiliary Tasks and Exploration Enable ObjectGoal Navigation . In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2021

[40] [45]

Yin, H.; Xu, X.; Wu, Z.; Zhou, J.; and Lu, J. 2024. SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation . In Advances in Neural Information Processing Systems (NeurIPS)

2024

[41] [46]

Yokoyama, H.; Tanaka, K.; and Saito, N. 2025. ImagineNav: Vision-Language Grounded Zero-Shot Navigation . In International Conference on Learning Representations (ICLR)

2025

[42] [47]

Yokoyama, N.; Ha, S.; Batra, D.; Wang, J.; and Bucher, B. 2024. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation . In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)

2024

[43] [49]

Zhang, H.; Li, P.; and Wang, Y. 2025. UniGoal: Unified Goal Representation for Zero-Shot Object Navigation . In CVPR

2025

[44] [51]

Zhou, K.; Zheng, K.; Pryor, C.; Shen, Y.; Jin, H.; Getoor, L.; and Wang, X. E. 2023. ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation . In Proceedings of the International Conference on Machine Learning (ICML)

2023

[45] [52]

ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

Batra, Dhruv and Gokaslan, Aaron and Kembhavi, Aniruddha and Maksymets, Oleksandr and Mottaghi, Roozbeh and Savva, Manolis and Toshev, Alexander and Wijmans, Erik. ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. arXiv:2006.13171

work page arXiv 2006

[46] [53]

Habitat: A Platform for Embodied AI Research

Savva, Manolis and Kadian, Abhishek and Maksymets, Oleksandr and Zhao, Yili and Wijmans, Erik and Jain, Bhavana and Straub, Julian and Liu, Jia and Koltun, Vladlen and Malik, Jitendra and Parikh, Devi and Batra, Dhruv. Habitat: A Platform for Embodied AI Research. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

[47] [54]

and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda

Chang, Angel X. and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda. Matterport3D: Learning from RGB-D Data in Indoor Environments. Proceedings of the International Conference on 3D Vision (3DV)

[48] [55]

and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio

Xia, Fei and Zamir, Amir R. and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio. Gibson Env: Real-World Perception for Embodied Agents. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

[49] [56]

AI2-THOR: An Interactive 3D Environment for Visual AI

Kolve, Eric and Mottaghi, Roozbeh and Han, Winson and VanderBilt, Eli and Weihs, Luca and Herrasti, Alvaro and Gordon, Daniel and Zhu, Yuke and Gupta, Abhinav and Farhadi, Ali. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv:1712.05474

work page internal anchor Pith review Pith/arXiv arXiv

[50] [57]

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

Deitke, Matt and Han, Winson and Herrasti, Alvaro and Kembhavi, Aniruddha and Kolve, Eric and Mottaghi, Roozbeh and Salvador, Jordi and Schwenk, Dustin and VanderBilt, Eli and Wallingford, Matthew and Weihs, Luca and Yatskar, Mark and Farhadi, Ali. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform. Proceedings of the IEEE/CVF Conference on Compute...

[51] [58]

and Savva, Manolis and Zhao, Yili and Batra, Dhruv

Ramakrishnan, Santhosh Kumar and Gokaslan, Aaron and Wijmans, Erik and Maksymets, Oleksandr and Clegg, Alex and Turner, John and Undersander, Eric and Galuba, Wojciech and Westbury, Andrew and Chang, Angel X. and Savva, Manolis and Zhao, Yili and Batra, Dhruv. Habitat-Matterport 3D Dataset. Advances in Neural Information Processing Systems Datasets and Be...

[52] [59]

Cognitive Mapping and Planning for Visual Navigation

Gupta, Saurabh and Tolani, Varun and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra. Cognitive Mapping and Planning for Visual Navigation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

[53] [60]

and Lee, Stefan and Essa, Irfan and Parikh, Devi and Savva, Manolis and Batra, Dhruv

Wijmans, Erik and Kadian, Abhishek and Morcos, Ari S. and Lee, Stefan and Essa, Irfan and Parikh, Devi and Savva, Manolis and Batra, Dhruv. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. Proceedings of the International Conference on Learning Representations (ICLR)

[54] [61]

Auxiliary Tasks and Exploration Enable ObjectGoal Navigation

Ye, Joel and Batra, Dhruv and Das, Abhishek and Wijmans, Erik. Auxiliary Tasks and Exploration Enable ObjectGoal Navigation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

[55] [62]

A Frontier-Based Approach for Autonomous Exploration

Yamauchi, Brian. A Frontier-Based Approach for Autonomous Exploration. Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA)

[56] [63]

Learning to Explore using Active Neural SLAM

Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Abhinav and Malik, Jitendra and Salakhutdinov, Ruslan. Learning to Explore using Active Neural SLAM. Proceedings of the International Conference on Learning Representations (ICLR)

[57] [64]

Neural Topological SLAM for Visual Navigation

Chaplot, Devendra Singh and Salakhutdinov, Ruslan and Gupta, Abhinav and Gupta, Saurabh. Neural Topological SLAM for Visual Navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[58] [65]

Object Goal Navigation using Goal-Oriented Semantic Exploration

Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Abhinav and Salakhutdinov, Ruslan. Object Goal Navigation using Goal-Oriented Semantic Exploration. Advances in Neural Information Processing Systems (NeurIPS)

[59] [66]

Learning to Map for Active Semantic Goal Navigation

Georgakis, Georgios and Bucher, Bernadette and Schmeckpeper, Karl and Singh, Siddharth and Daniilidis, Kostas. Learning to Map for Active Semantic Goal Navigation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

[60] [67]

PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning

Ramakrishnan, Santhosh Kumar and Chaplot, Devendra Singh and Al-Halah, Ziad and Malik, Jitendra and Grauman, Kristen. PONI: Potential Functions for ObjectGoal Navigation with Interaction-Free Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[61] [68]

Visual Semantic Navigation using Scene Priors

Yang, Wei and Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav and Mottaghi, Roozbeh. Visual Semantic Navigation using Scene Priors. Proceedings of the International Conference on Learning Representations (ICLR)

[62] [69]

Occupancy Anticipation for Efficient Exploration and Navigation

Ramakrishnan, Santhosh Kumar and Al-Halah, Ziad and Grauman, Kristen. Occupancy Anticipation for Efficient Exploration and Navigation. Proceedings of the European Conference on Computer Vision (ECCV)

[63] [70]

and Fischer, Martin and Malik, Jitendra and Savarese, Silvio

Armeni, Iro and He, Zhi-Yang and Gwak, JunYoung and Zamir, Amir R. and Fischer, Martin and Malik, Jitendra and Savarese, Silvio. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

[64] [71]

OpenScene: 3D Scene Understanding with Open Vocabularies

Peng, Songyou and Genova, Kyle and Jiang, Chiyu and Tagliasacchi, Andrea and Pollefeys, Marc and Funkhouser, Thomas. OpenScene: 3D Scene Understanding with Open Vocabularies. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[65] [72]

and Torralba, Antonio and Shkurti, Florian and Paull, Liam

Gu, Qiao and Kuwajerwala, Ali and Morin, Sacha and Jatavallabhula, Krishna Murthy and Sen, Bipasha and Agarwal, Aditya and Rivera, Corban and Paul, William and Ellis, Kirsty and Chellappa, Rama and Gan, Chuang and de Melo, Celso Miguel and Tenenbaum, Joshua B. and Torralba, Antonio and Shkurti, Florian and Paull, Liam. ConceptGraphs: Open-Vocabulary 3D Sc...

[66] [73]

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Werby, Abdelrhman and Huang, Chenguang and Buchner, Martin and Valada, Abhinav and Burgard, Wolfram. Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation. Proceedings of Robotics: Science and Systems (RSS)

[67] [74]

OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph

Tang, Yujie and Wang, Meiling and Deng, Yinan and Zheng, Zibo and Zhong, Jiagui and Yue, Yufeng. OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph. arXiv:2409.18743

work page arXiv

[68] [75]

ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation

Zhou, Kaiwen and Zheng, Kaizhi and Pryor, Connor and Shen, Yilin and Jin, Hongxia and Getoor, Lise and Wang, Xin Eric. ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation. Proceedings of the International Conference on Machine Learning (ICML)

[69] [76]

SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation

Yin, Hang and Xu, Xiuwei and Wu, Zhenyu and Zhou, Jie and Lu, Jiwen. SG-Nav: Online 3D Scene Graph Prompting for LLM-Based Zero-Shot Object Navigation. Advances in Neural Information Processing Systems (NeurIPS)

[70] [77]

VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model

Wu, Pengying and Mu, Yao and Wu, Bingxian and Hou, Yi and Ma, Ji and Zhang, Shanghang and Liu, Chang. VoroNav: Voronoi-Based Zero-Shot Object Navigation with Large Language Model. Proceedings of the International Conference on Machine Learning (ICML)

[71] [78]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the International Conference on Machine Learning (ICML)

[72] [79]

Learning Transferable Visual Models From Natural Language Supervision

Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya. Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the International Conference on Machine Lear...

[73] [80]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Jiang, Qing and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499

work page internal anchor Pith review Pith/arXiv arXiv

[74] [81]

and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross

Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Dollar, Piotr and Girshick, Ross. Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

[75] [82]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Zhang, Chaoning and Han, Dongshen and Qiao, Yu and Kim, Jung Uk and Bae, Sung-Ho and Lee, Seungkyu and Hong, Choong Seon. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv:2306.14289

work page internal anchor Pith review Pith/arXiv arXiv

[76] [83]

YOLO-World: Real-Time Open-Vocabulary Object Detection

Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying. YOLO-World: Real-Time Open-Vocabulary Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[77] [84]

Simple Open-Vocabulary Object Detection with Vision Transformers

Minderer, Matthias and Gritsenko, Alexey and Stone, Austin and Neumann, Maxim and Weissenborn, Dirk and Dosovitskiy, Alexey and Mahendran, Aravindh and Arnab, Anurag and Dehghani, Mostafa and Shen, Zhuoran and Wang, Xiao and Zhai, Xiaohua and Kipf, Thomas and Houlsby, Neil. Simple Open-Vocabulary Object Detection with Vision Transformers. Proceedings of t...

[78] [85]

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Majumdar, Arjun and Aggarwal, Gunjan and Devnani, Bhavika and Hoffman, Judy and Batra, Dhruv. ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings. Advances in Neural Information Processing Systems (NeurIPS)

[79] [86]

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Gadre, Samir Yitzhak and Wortsman, Mitchell and Ilharco, Gabriel and Schmidt, Ludwig and Song, Shuran. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[80] [87]

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Yokoyama, Naoki and Ha, Sehoon and Batra, Dhruv and Wang, Jiuguang and Bucher, Bernadette. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)