
arxiv: 2603.05377 · v2 · pith:TRCWJAITnew · submitted 2026-03-05 · 💻 cs.RO · cs.CV

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Pith reviewed 2026-05-21 11:31 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: robot navigation, vision-language models, frontier detection, zero-shot navigation, open-world robotics, subgoal planning, visual grounding

The pith

Robots reach language goals in new spaces by using visual frontiers as anchors for off-the-shelf vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates navigation as sparse subgoal identification and reaching. It argues that visual frontiers can serve as semantic anchors that let high-level vision-language priors guide the robot without dense 3D semantic mapping or any task-specific training. If this holds, robots gain a lightweight way to handle flexible goals across varied everyday environments using existing models. A sympathetic reader would care because the method promises simpler deployment and better generalization than approaches that require reconstruction or fine-tuning.

Core claim

We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models.

What carries the argument

Visual frontiers as semantic anchors: frontier points chosen from the current view to link high-level language priors directly to visual targets for selecting the next subgoal.
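
To make the anchoring idea concrete, a minimal sketch follows. It assumes each frontier already carries a relevance score from a vision-language model and that the next subgoal is chosen by trading that score against travel distance. The `Frontier` class, `select_subgoal`, and the `distance_weight` parameter are hypothetical names chosen for illustration; the paper's actual scoring rule is not given on this page.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frontier:
    # Hypothetical fields, for illustration only.
    position: Tuple[float, float]  # frontier location in the map frame (metres)
    relevance: float               # assumed VLM relevance to the language goal, in [0, 1]


def select_subgoal(
    frontiers: List[Frontier],
    robot_xy: Tuple[float, float],
    distance_weight: float = 0.05,  # assumed trade-off weight, not from the paper
) -> Optional[Frontier]:
    """Pick the frontier with the best relevance-minus-travel-cost score."""
    def score(f: Frontier) -> float:
        return f.relevance - distance_weight * math.dist(f.position, robot_xy)

    # Returns None when there are no candidate frontiers left.
    return max(frontiers, key=score, default=None)


candidates = [
    Frontier(position=(1.0, 0.0), relevance=0.2),  # close but unlikely to lead to the goal
    Frontier(position=(4.0, 3.0), relevance=0.9),  # farther but semantically promising
]
best = select_subgoal(candidates, robot_xy=(0.0, 0.0))
print(best)
```

With these toy numbers the farther, more relevant frontier wins; raising `distance_weight` would make the selection more conservative about long detours.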

If this is right

  • Removes the requirement for dense 3D semantic mapping during operation.
  • Delivers strong zero-shot performance on multiple navigation benchmarks.
  • Permits direct use of different vision-language models without retraining or fine-tuning.
  • Supports real-world deployment on a mobile robot without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frontier anchors could be reused for active exploration when no goal is given.
  • Adding short-term memory of visited frontiers might extend reliable performance over longer horizons.
  • The anchoring idea could transfer to language-directed manipulation by treating graspable objects as frontiers.
  • Classical frontier methods from exploration research gain new utility when paired with modern vision-language models.

Load-bearing premise

That visual frontiers supply enough semantic information for vision-language priors to produce efficient subgoal choices in complex everyday environments.

What would settle it

A controlled test in an unseen cluttered indoor space where the frontier-anchored system shows markedly lower success rate or higher path length than a dense-mapping baseline on the same language goals.

Figures

Figures reproduced from arXiv: 2603.05377 by Boyang Sun, Esteban Padilla-Cerdio, Hermann Blum, Marc Pollefeys.

Figure 1
We present OpenFrontier. Top: OpenFrontier detects and semantically evaluates visual frontiers directly in the image space, enabling flexible language-conditioned goal reasoning without 3D reconstruction. Bottom: These frontiers are grounded into the 3D metric space to drive long-horizon, natural-language–conditioned navigation across diverse goals in a fully zero-shot manner, without training or fine-tuni…
Figure 3
System Overview. Given a posed RGB observation and a natural-language navigation goal, OpenFrontier detects visual frontiers in the image and directly queries a vision–language model to evaluate their relevance using in-image context. The resulting frontiers are then lifted into the 3D metric space with the updated information gain as goal-conditioned candidates and globally managed to update navigation ta…
Figure 4
The detected 2D frontier clusters are jointly queried with the corresponding RGB…
Figure 5
Navigation Results across two representative baseline methods and OpenFrontier on an HM3D validation scene with the goal of finding a bed. The red square and shaded region indicate the ground-truth target location and its success region. OpenFrontier makes more efficient decisions at multi-choice intersections, navigating directly toward the bedroom while avoiding redundant exploration of irrelevant areas.…
Figure 6
Additional Navigation Examples. Top: OVON scenes with goals (left to right) refrigerator, picture, and dishwasher. Middle: MP3D scenes with goals stool, table, and cushion. Bottom: HM3D scenes with goals sofa, toilet, and bed. All experiments are conducted using the same system configuration and parameters across datasets.

VLM                SR (%)↑   SPL (%)↑
Gemini-2.5-flash   77.3      35.6
Gemma-3-4b-it      76.9      33.7
InternVL3_5-…
Figure 7
OpenFrontier Navigation with Different Goal Contexts. Top: target is “plant in the bathroom.” Bottom: target is “plant.” The robot is initialized at the same starting location in both runs. From left to right, we show selected frames along the navigation trajectory together with the corresponding image observations overlaid with detected frontiers. The final image shows the final image observation, which t…
Figure 8
Image-space Frontier Reasoning Examples. Four different vision–language models are evaluated for frontier probability estimation (without normalization) using the set-of-marks querying strategy (left to right: Qwen3-VL, Gemini-2.5, GPT-4o, InternVL-3.5). All models operate on the same real-world image with the same prompt after frontier detection.
Figure 9
Real-world Deployment. Example of a deployment of OpenFrontier queried to find a fire extinguisher. The blue arrows illustrate the path of the robot through the environment. The yellow boxes mark the VLM prompt images for different keyframes, connected to some key 3D frontiers marked during navigation on the map. The red box shows the detection of the target object that signals successful task completion…
Figure 10
Failure Case Analysis. From left to right, we show representative failures from the HM3D, MP3D, and OVON evaluation benchmarks. Across all three benchmarks, the two most common failure modes are false-positive target detections and termination due to exhausting the step budget. A third failure mode, where the robot becomes stuck, primarily arises from planner failures to escape local minima or from target…
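
The Figure 3 caption describes image-space frontiers being lifted into the 3D metric space. A minimal sketch of that kind of step under a standard pinhole camera model is given here. The intrinsics, the camera-to-world pose convention, and the function names `backproject` and `to_world` are assumptions for illustration and are not confirmed by the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam: Vec3,
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> Vec3:
    """Apply an assumed camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and pose, chosen only to make the arithmetic easy to follow.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = to_world(p_cam, identity, (1.0, 2.0, 3.0))
print(p_cam, p_world)
```

A pixel 100 columns right of the principal point at 2 m depth lands 0.4 m to the side in the camera frame; the pose then shifts it into the map frame.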
Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes OpenFrontier, a navigation framework that formulates the problem as sparse subgoal identification by selecting visual frontiers as semantic anchors for high-level vision-language priors. It claims to enable efficient goal-conditioned navigation in open-world environments without dense 3D semantic mapping, task-specific policy training, or model fine-tuning, while integrating diverse pre-trained VL models and demonstrating strong zero-shot performance on benchmarks plus real-world robot deployment.

Significance. If the central claims hold, the work would offer a meaningful contribution to lightweight, generalizable robot navigation by avoiding the computational and data overhead of dense mapping or end-to-end training. The seamless integration of off-the-shelf VL priors without fine-tuning is a clear practical strength that could support flexible task adaptation across environments.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method, frontier detection subsection): The core claim that the system requires 'no dense 3D semantic mapping' is load-bearing for the lightweight design and efficiency arguments. Standard frontier extraction from depth or point clouds typically builds at least a local 2D/3D occupancy grid or voxel map to delineate free versus unknown space; the manuscript does not explicitly show how OpenFrontier avoids this step or quantify any implicit mapping costs, leaving the 'no dense mapping' advantage unverified.
  2. [§5] §5 (Experiments): The abstract asserts 'strong zero-shot performance across multiple navigation benchmarks,' yet the provided evaluation details lack quantitative metrics, error breakdowns, or ablation on frontier selection/reaching success rates. This weakens the ability to assess whether visual frontiers as semantic anchors truly suffice for the claimed efficiency in complex environments.
minor comments (1)
  1. [Abstract and Introduction] The abstract and introduction could more clearly distinguish the proposed visual-frontier approach from prior frontier-based methods that also use semantic cues, to sharpen the novelty claim.
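
For context on major comment 1, the classical occupancy-grid formulation that the referee refers to can be sketched as follows: a frontier cell is a known-free cell with at least one unknown neighbour. This is the textbook grid-based definition, not OpenFrontier's image-space detector; the cell encoding and the function name `find_frontier_cells` are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed cell encoding for this sketch.
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontier_cells(grid: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (row, col) of free cells that touch at least one unknown cell (4-connectivity)."""
    if not grid or not grid[0]:
        return []
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbour is enough
    return frontiers


grid = [
    [FREE, FREE,     UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE,     FREE],
]
frontier_cells = find_frontier_cells(grid)
print(frontier_cells)
```

The sketch makes the referee's point visible: even this simple detector needs a grid that distinguishes free, occupied, and unknown space, which is the implicit mapping cost the comment asks the authors to quantify.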

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to improve our manuscript. We provide point-by-point responses to the major comments below.

  1. Referee: [Abstract and §3] Abstract and §3 (Method, frontier detection subsection): The core claim that the system requires 'no dense 3D semantic mapping' is load-bearing for the lightweight design and efficiency arguments. Standard frontier extraction from depth or point clouds typically builds at least a local 2D/3D occupancy grid or voxel map to delineate free versus unknown space; the manuscript does not explicitly show how OpenFrontier avoids this step or quantify any implicit mapping costs, leaving the 'no dense mapping' advantage unverified.

    Authors: We clarify that our approach specifically avoids dense 3D semantic mapping by using visual frontiers as anchors for pre-trained vision-language models, rather than building explicit semantic 3D reconstructions. Frontier detection is performed on local depth observations to identify boundaries between free and unknown space, but without semantic labeling or global dense mapping. This enables the lightweight design. We will revise the method section to explicitly describe the frontier detection pipeline, including any local geometric processing, and discuss why this does not constitute dense semantic mapping. Additionally, we will include runtime analysis to quantify any mapping-related computations. revision: yes

  2. Referee: [§5] §5 (Experiments): The abstract asserts 'strong zero-shot performance across multiple navigation benchmarks,' yet the provided evaluation details lack quantitative metrics, error breakdowns, or ablation on frontier selection/reaching success rates. This weakens the ability to assess whether visual frontiers as semantic anchors truly suffice for the claimed efficiency in complex environments.

    Authors: The experiments section reports success rates and path efficiency on navigation benchmarks as well as real-robot results. To address the concern, we will expand the evaluation with detailed quantitative metrics, including breakdowns of navigation errors, ablations on the impact of visual-language grounding for frontier selection, and specific success rates for the subgoal reaching component. This will better demonstrate the sufficiency of the approach in complex environments. revision: yes
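
The success rate (SR) and path-efficiency (SPL) figures discussed above follow definitions that are standard in embodied navigation: SR is the fraction of successful episodes, and SPL weights each success by the ratio of shortest-path length to the length actually travelled. A minimal sketch is given here; the episode tuple layout and function names are assumptions for illustration, and the paper's own evaluation code is not shown on this page.

```python
from typing import List, Tuple

# Assumed episode layout: (success, shortest_path_length, travelled_path_length), lengths in metres.
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        return 0.0
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    if not episodes:
        return 0.0
    total = 0.0
    for success, shortest, travelled in episodes:
        if success:
            # Assumes shortest > 0, so the denominator is never zero.
            total += shortest / max(travelled, shortest)
    return total / len(episodes)


episodes = [
    (True, 5.0, 5.0),    # optimal path: contributes 1.0
    (True, 5.0, 10.0),   # twice the shortest path: contributes 0.5
    (False, 5.0, 8.0),   # failure: contributes 0.0
]
print(success_rate(episodes), spl(episodes))
```

Because failed episodes contribute zero and detours shrink the per-episode term, SPL is always at most SR, which is why the two are usually reported side by side.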

Circularity Check

0 steps flagged

No circularity: framework relies on external VL priors and design choices without self-referential reductions

full rationale

The paper formulates navigation as sparse subgoal selection using visual frontiers as anchors for pre-trained vision-language models. No equations, fitted parameters, or predictions are presented that reduce by construction to inputs or self-citations. The core claims rest on integration of off-the-shelf VL models and a lightweight system design rather than any internal derivation loop or ansatz smuggled via prior self-work. The absence of dense mapping is stated as an explicit architectural choice, not a derived result that presupposes the outcome. This is a standard engineering paper whose central contribution is empirical validation on benchmarks, not a closed mathematical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; specific free parameters such as frontier selection thresholds or subgoal reaching tolerances are not detailed. Standard robotics assumptions about sensor noise and environment structure are implicit but unstated.

axioms (1)
  • Domain assumption: visual frontiers provide sufficient semantic anchoring for high-level vision-language priors to guide navigation.
    Core insight stated in the abstract; if false, the sparse-subgoal formulation collapses.

pith-pipeline@v0.9.0 · 5732 in / 1207 out tokens · 32266 ms · 2026-05-21T11:31:25.304059+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

    cs.RO 2026-05 unverdicted novelty 5.0

    TravExplorer couples zero-shot semantic guidance with traversability-aware 3-D planning to enable cross-floor object navigation in unseen indoor environments.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

    Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

  2. [2]

    Go fetch: Mobile manipulation in unstructured environments.arXiv preprint arXiv:2004.00899, 2020

    Kenneth Blomqvist, Michel Breyer, Andrei Cramariuc, Julian Förster, Margarita Grinvald, Florian Tschopp, Jen Jen Chung, Lionel Ott, Juan Nieto, and Roland Siegwart. Go fetch: Mobile manipulation in unstructured environments.arXiv preprint arXiv:2004.00899, 2020

  3. [3]

    Tare: A hierarchical framework for efficiently exploring complex 3d environments

    Chao Cao, Hongbiao Zhu, Howie Choset, and Ji Zhang. Tare: A hierarchical framework for efficiently exploring complex 3d environments. InRobotics: Science and Systems, volume 5, 2021

  4. [4]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  5. [5]

    Goat: Go to any thing

    Matthew Chang, Théophile Gervet, Mukul Khanna, Sri- ramYenamandra,DhruvShah,SoYeonMin,KavitShah, Chris Paxton, Saurabh Gupta, Dhruv Batra, et al. Goat: Go to any thing. 2024

  6. [6]

    Object goal navigation using goal-oriented semantic exploration

    Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33: 4247–4258, 2020

  7. [7]

    Seal: Self- supervised embodied active learning using exploration and 3d consistency.Advances in neural information processing systems, 34:13086–13098, 2021

    Devendra Singh Chaplot, Murtaza Dalal, Saurabh Gupta, Jitendra Malik, and Russ R Salakhutdinov. Seal: Self- supervised embodied active learning using exploration and 3d consistency.Advances in neural information processing systems, 34:13086–13098, 2021

  8. [8]

    Navila: Legged robot vision-language- action model for navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language- action model for navigation. InRSS, 2025

  9. [9]

    Fast frontier- based information-driven autonomous exploration with an mav

    Anna Dai, Sotiris Papatheodorou, Nils Funk, Dimos Tzoumanikas, and Stefan Leutenegger. Fast frontier- based information-driven autonomous exploration with an mav. InICRA, 2020

  10. [10]

    Gemma 3 Technical Report

    Gemma Team et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786

  11. [11]

    Seeing across views: Bench- marking spatial reasoning of vision-language models in robotic scenes.arXiv preprint arXiv:2510.19400, 2025

    Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, et al. Seeing across views: Bench- marking spatial reasoning of vision-language models in robotic scenes.arXiv preprint arXiv:2510.19400, 2025

  12. [12]

    Cows on pas- ture: Baselines and benchmarks for language-driven zero- shot object navigation

    Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Il- harco, Ludwig Schmidt, and Shuran Song. Cows on pas- ture: Baselines and benchmarks for language-driven zero- shot object navigation. InProceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition, pages 23171–23181, 2023

  13. [13]

    Navigating to objects in the real world.Science Robotics, 8(79): eadf6991, 2023

    Theophile Gervet, Soumith Chintala, Dhruv Batra, Ji- tendra Malik, and Devendra Singh Chaplot. Navigating to objects in the real world.Science Robotics, 8(79): eadf6991, 2023

  14. [14]

    End-to-end navigation with vision language models: Transforming spatial reasoning into question- answering.arXiv preprint arXiv:2411.05755, 2024

    Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vision language models: Transforming spatial reasoning into question- answering.arXiv preprint arXiv:2411.05755, 2024

  15. [15]

    History- augmented vision-language models for frontier-based zero-shot object navigation, 2025

    Mobin Habibpour and Fatemeh Afghah. History- augmented vision-language models for frontier-based zero-shot object navigation, 2025

  16. [16]

    Mapex: Indoor structure exploration with probabilistic information gain from global map predictions

    Cherie Ho, Seungchan Kim, Brady Moon, Aditya Paran- dekar, Narek Harutyunyan, Chen Wang, Katia Sycara, Graeme Best, and Sebastian Scherer. Mapex: Indoor structure exploration with probabilistic information gain from global map predictions. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13074–13080. IEEE, 2025

  17. [17]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monoc- ular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  18. [18]

    Hughes, Y

    N. Hughes, Y. Chang, S. Hu, R. Talak, R. Abdulhai, J. Strader, and L. Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems.The International Journal of Robotics Research,2024. doi:10.1177/02783649241229725. URL https://doi.org/10.1177/02783649241229725

  19. [19]

    Efficient frontier detectionforrobotexploration.TheInternationalJournal of Robotics Research, 33(2):215–236, 2014

    Matan Keidar and Gal A Kaminka. Efficient frontier detectionforrobotexploration.TheInternationalJournal of Robotics Research, 33(2):215–236, 2014

  20. [20]

    Goat-bench: A benchmark for multi-modal lifelong navigation

    Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi-modal lifelong navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16373–16383, 2024

  21. [21]

    Openfmnav: Towards open-set zero-shot object navigation via vision- language foundation models, 2024

    Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfmnav: Towards open-set zero-shot object navigation via vision- language foundation models, 2024

  22. [22]

    Sight over site: Perception-aware reinforcement learning for efficient robotic inspection.ArXiv, 2025

    Richard Kuhlmann, Jakob Wolfram, Boyang Sun, Jiaxu Xing, Davide Scaramuzza, Marc Pollefeys, and Cesar Cadena. Sight over site: Perception-aware reinforcement learning for efficient robotic inspection.ArXiv, 2025

  23. [23]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environ- ment, 2024

    Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environ- ment, 2024

  24. [24]

    Cognitive planning for object goal navigation using generative ai models.arXiv preprint arXiv:2404.00318, 2024

    Andrew Melnik, Gora Chand Nandi, et al. Cognitive planning for object goal navigation using generative ai models.arXiv preprint arXiv:2404.00318, 2024

  25. [25]

    A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers

    J A Placed, J Strader, H Carrillo, N Atanasov, V In- delman, L Carlone, and J A Castellanos. A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers. 2023

  26. [26]

    Ippon: Common sense guided informative path planning for object goal navigation

    Kaixian Qu, Jie Tan, Tingnan Zhang, Fei Xia, Cesar Cadena, and Marco Hutter. Ippon: Common sense guided informative path planning for object goal navigation. arXiv preprint arXiv:2410.19697, 2024

  27. [27]

    Poni: Potential functions for objectgoal navigation with interaction-free learning

    Santhosh Kumar Ramakrishnan, Devendra Singh Chap- lot, Ziad Al-Halah, Jitendra Malik, and Kristen Grau- man. Poni: Potential functions for objectgoal navigation with interaction-free learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

  28. [28]

    Efficient volumetric mapping of multi-scale environments using wavelet-based compression

    Victor Reijgwart, Cesar Cadena, Roland Siegwart, and Lionel Ott. Efficient volumetric mapping of multi-scale environments using wavelet-based compression. 2023- 07

  29. [29]

    Fore- sightnav: Learning scene imagination for efficient ex- ploration

    Hardik Shah, Jiaxu Xing, Nico Messikommer, Boyang Sun, Marc Pollefeys, and Davide Scaramuzza. Fore- sightnav: Learning scene imagination for efficient ex- ploration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025

  30. [30]

    Bumble: Unifying reasoning and acting with vision-language models for building- wide mobile manipulation

    Rutav Shah, Albert Yu, Yifeng Zhu, Yuke Zhu, and Roberto Martín-Martín. Bumble: Unifying reasoning and acting with vision-language models for building- wide mobile manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13337–13345. IEEE, 2025

  31. [31]

    S ¸ucan et al

    Ioan A. Şucan, Mark Moll, and Lydia E. Kavraki. The Open Motion Planning Library.IEEE Robotics & Au- tomation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA.2012.2205651. https://ompl.kavrakilab. org

  32. [32]

    Frontier- net: Learning visual cues to explore.IEEE Robotics and Automation Letters, 10(7):6576–6583, 2025

    Boyang Sun, Hanzhi Chen, Stefan Leutenegger, Cesar Cadena, Marc Pollefeys, and Hermann Blum. Frontier- net: Learning visual cues to explore.IEEE Robotics and Automation Letters, 10(7):6576–6583, 2025. doi: 10.1109/LRA.2025.3569122

  33. [33]

    A survey of object goal navigation.IEEE Transactions on Automation Science and Engineering, 22:2292–2308,

    Jingwen Sun, Jing Wu, Ze Ji, and Yu-Kun Lai. A survey of object goal navigation.IEEE Transactions on Automation Science and Engineering, 22:2292–2308,

  34. [34]

    doi: 10.1109/TASE.2024.3378010

  35. [35]

    Seer: Safe efficient exploration for aerial robots using learning to predict information gain

    Yuezhan Tao, Yuwei Wu, Beiming Li, Fernando Cladera, Alex Zhou, Dinesh Thakur, and Vijay Kumar. Seer: Safe efficient exploration for aerial robots using learning to predict information gain. InICRA, 2023

  36. [36]

    What matters in rl-based methods for object-goal navigation? an empirical study and a unified framework.arXiv preprint arXiv:2510.01830, 2025

    Hongze Wang, Boyang Sun, Jiaxu Xing, Fan Yang, Marco Hutter, Dhruv Shah, Davide Scaramuzza, and Marc Pollefeys. What matters in rl-based methods for object-goal navigation? an empirical study and a unified framework.arXiv preprint arXiv:2510.01830, 2025

  37. [37]

    Streamvln: Streaming vision-and-language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

    Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

  38. [38]

    Dd-ppo: Learning near-perfect pointgoal naviga- tors from 2.5 billion frames

    Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal naviga- tors from 2.5 billion frames. InInternational Confer- ence on Learning Representations, 2020. URL https: //openreview.net/forum?id=H1gX8C4YPr

  39. [39]

    Naviformer: A spatio-temporal context-aware transformer for object navigation

    Wei Xie, Haobo Jiang, Yun Zhu, Jianjun Qian, and Jin Xie. Naviformer: A spatio-temporal context-aware transformer for object navigation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14708–14716, 2025

  40. [40]

    Navrl: Learning safe flight in dynamic environments.IEEE Robotics and Automation Letters, 10 (4):3668–3675, 2025

    Zhefan Xu, Xinming Han, Haoyu Shen, Hanyu Jin, and Kenji Shimada. Navrl: Learning safe flight in dynamic environments.IEEE Robotics and Automation Letters, 10 (4):3668–3675, 2025. doi: 10.1109/LRA.2025.3546069

  41. [41]

    Omninav: A unified framework for prospective exploration and visual-language navigation

    Xinda Xue, Junjun Hu, Minghua Luo, Xie Shichao, Jin- tao Chen, Zixun Xie, Quan Kuichen, Guo Wei, Mu Xu, and Zedong Chu. Omninav: A unified framework for prospective exploration and visual-language navigation. arXiv preprint arXiv:2509.25687, 2025

  42. [42]

    abitat challenge 2022

    Karmesh Yadav, Santhosh Kumar Ramakrishnan, John Turner, Aaron Gokaslan, Oleksandr Maksymets, Rishabh Jain, Ram Ramrakhya, Angel X Chang, Alexander Clegg, Manolis Savva, Eric Undersander, Devendra Singh Chap- lot, and Dhruv Batra. abitat challenge 2022. https: //aihabitat.org/challenge/2022/, 2022

  43. [43]

    Habitat challenge

    Karmesh Yadav, Jacob Krantz, Ram Ramrakhya, San- thosh Kumar Ramakrishnan, Jimmy Yang, Austin Wang, John Turner, Aaron Gokaslan, Vincent-Pierre Berges, Roozbeh Mootaghi, Oleksandr Maksymets, An- gel X Chang, Manolis Savva, Alexander Clegg, Deven- dra Singh Chaplot, and Dhruv Batra. Habitat challenge

  44. [44]

    https://aihabitat.org/challenge/2023/, 2023

  45. [45]

    A frontier-based approach for au- tonomous exploration

    Brian Yamauchi. A frontier-based approach for au- tonomous exploration. InProceedings 1997 IEEE In- ternational Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’, pages 146–151. IEEE, 1997

  46. [46]

    iPlanner: Imperative Path Planning

    Fan Yang, Chen Wang, Cesar Cadena, and Marco Hutter. iPlanner: Imperative Path Planning. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.064

  47. [47]

    Fan Yang, Per Frivik, David Hoeller, Chen Wang, Cesar Cadena, and Marco Hutter. Spatially-enhanced recurrent memory for long-range mapless navigation via end-to- end reinforcement learning.The International Journal of Robotics Research, page 02783649251401926, 2025

  48. [48]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chun- yuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

  49. [49]

    Unigoal: Towards universal zero- shot goal-oriented navigation

    Hang Yin, Xiuwei Xu, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Unigoal: Towards universal zero- shot goal-oriented navigation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19057–19066, 2025

  50. [50]

    Vlfm: Vision-language frontier maps for zero-shot semantic navigation

    Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In International Conference on Robotics and Automation (ICRA), 2024

  51. [51]

    Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation

    Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation, 2024

  52. [52]

    Frontier semantic exploration for visual target navigation

    Bangguo Yu, Hamidreza Kasaei, and Ming Cao. Frontier semantic exploration for visual target navigation. arXiv preprint arXiv:2304.05506, 2023

  53. [53]

    Poliformer: Scaling on-policy rl with transformers results in masterful navigators

    Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, and Luca Weihs. Poliformer: Scaling on-policy rl with transformers results in masterful navigators. In Conference on Robot Learning, pages 408–

  54. [54]

    3d-aware object goal navigation via simultaneous exploration and identification

    Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3d-aware object goal navigation via simultaneous exploration and identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6672–6682, 2023

  55. [55]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024

  56. [56]

    Navid: Video-based vlm plans the next step for vision-and-language navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. Robotics: Science and Systems, 2024

  57. [57]

    Vision-and-language navigation today and tomorrow: A survey in the era of foundation models

    Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, and Parisa Kordjamshidi. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. arXiv preprint arXiv:2407.07035, 2024

  58. [58]

    Fuel: Fast uav exploration using incremental frontier structure and hierarchical planning

    Boyu Zhou, Yichen Zhang, Xinyi Chen, and Shaojie Shen. Fuel: Fast uav exploration using incremental frontier structure and hierarchical planning. IEEE Robotics and Automation Letters, 6(2):779–786, 2021

  59. [59]

    Beliefmapnav: 3d voxel-based belief map for zero-shot object navigation

    Zibo Zhou, Yue Hu, Lingkai Zhang, Zonglin Li, and Siheng Chen. Beliefmapnav: 3d voxel-based belief map for zero-shot object navigation, 2025

  60. [60]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

    APPENDIX A. Configuration Settings and Key Parameters

    For simulation evaluation and benchmarking, we use ...

  61. [61]

    A": [0.3,

    Frontier Probability Estimation (Set-of-Marks): Assume the labeled frontiers {labels} represent possible places to go. Each frontier is a detected boundary between explored and unexplored space. Estimate the probability that each frontier leads to (or is already around) a {target_object} when moving towards it and continuing exploration. Unseen labels shoul...
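
As a rough illustration of how the frontier prompt above could be used, the sketch below fills the `{labels}` and `{target_object}` placeholders and parses a JSON reply with one probability per label. The template text is abbreviated. The helper names `build_frontier_prompt` and `parse_frontier_probs`, the reply shape, and the clamping step are assumptions for illustration and are not taken from the paper. The surviving output fragment suggests list-valued entries whose full format cannot be recovered here, so plain numbers are assumed instead.

```python
import json

# Abbreviated stand-in for the frontier prompt; the full wording is in the text above.
FRONTIER_TEMPLATE = (
    "Assume the labeled frontiers {labels} represent possible places to go. "
    "Estimate the probability that each frontier leads to a {target_object}."
)


def build_frontier_prompt(labels, target_object):
    """Fill the two placeholders used by the prompt (hypothetical helper)."""
    return FRONTIER_TEMPLATE.format(
        labels=", ".join(labels), target_object=target_object
    )


def parse_frontier_probs(reply, labels):
    """Parse a JSON object such as {"A": 0.3, "B": 0.7} (assumed reply shape).

    Labels that were never offered are ignored, values are clamped to [0, 1],
    and labels missing from the reply are left out of the result.
    """
    raw = json.loads(reply)
    probs = {}
    for label in labels:
        if label in raw:
            probs[label] = min(1.0, max(0.0, float(raw[label])))
    return probs


prompt = build_frontier_prompt(["A", "B"], "chair")
probs = parse_frontier_probs('{"A": 0.3, "B": 1.4, "Z": 0.9}', ["A", "B"])
best = max(probs, key=probs.get)  # frontier with the highest estimated probability
```

Clamping and dropping unknown labels keeps a malformed reply from selecting a frontier that was never marked in the image; how the paper itself handles such replies is not stated in the recoverable text.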

  62. [62]

    probability

    Target Presence Verification: Based on this image, estimate the probability that a {target_object} is in the camera field of view, within five meters, and reachable. If the object is reflected in a mirror, behind glass, barely visible, heavily occluded, or unreachable, it should be considered absent. Probabilities should be close to 0 (absent) or 1 (presen...
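
Because the verification prompt asks for probabilities near 0 or 1, a simple threshold is one plausible way to turn the reply into a decision. The sketch below assumes a JSON reply with a single `probability` field. The function name `target_present` and the 0.5 cut-off are hypothetical choices, not values reported in the paper.

```python
import json


def target_present(reply, threshold=0.5):
    """Return True if the reported probability reaches the threshold.

    Assumes a JSON reply such as {"probability": 0.9}; the 0.5 cut-off is
    an illustrative choice, not a value taken from the paper.
    """
    p = float(json.loads(reply)["probability"])
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    return p >= threshold


seen = target_present('{"probability": 0.95}')
absent = target_present('{"probability": 0.05}')
```

Rejecting out-of-range values rather than clamping them makes a malformed reply visible to the caller, which seems safer for a check that decides whether the episode should stop.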