
arxiv: 2606.27871 · v1 · pith:QQPTRD3Bnew · submitted 2026-06-26 · 💻 cs.RO

LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation

Pith reviewed 2026-06-29 04:43 UTC · model grok-4.3

classification 💻 cs.RO
keywords Object Goal Navigation · Vision Language Model Distillation · On-Device Robotics · Embodied Reinforcement Learning · Scene Graph Pipeline · Edge Inference Optimization · Token Generation Regularization

The pith

Fine-tuning Qwen3.5-4B on 500 frontier reasoning traces yields 34.5 percent success in object goal navigation, closing most of the gap to a 39.7 percent cloud pipeline while cutting edge latency by 82.8 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a scene-graph pipeline with Claude Sonnet 4.6 that reaches 39.7 percent success rate on the HM3D OVON benchmark for open-vocabulary object navigation. It then transfers the underlying spatial-semantic reasoning into a 4B-parameter local VLM by fine-tuning on only 500 traces generated by the frontier model, producing 34.5 percent success. An additional embodied RL stage with token-generation regularization shrinks output sequences and, together with quantization, delivers an 82.8 percent reduction in inference latency on devices such as the Jetson Orin. The result demonstrates that frontier capabilities can be compressed into lightweight models suitable for on-device robot deployment.

Core claim

Distilling 500 reasoning traces from a frontier VLM into Qwen3.5-4B enables 34.5 percent success rate on HM3D OVON, narrowing the gap to the 39.7 percent achieved by the full Claude Sonnet 4.6 scene-graph pipeline, while E-RLVR with token-generation regularization plus quantization reduces overall inference latency by 82.8 percent for edge execution.

What carries the argument

Distillation of frontier VLM reasoning traces into a 4B local model combined with E-RLVR token-generation regularization for sequence compression.

If this is right

  • A 4B VLM can perform object goal navigation at a success rate about 5 percentage points (5.2) below that of a much larger cloud model.
  • Output sequence lengths drop by 72.1 percent and per-step latency by 71.8 percent after E-RLVR regularization.
  • Quantization combined with the regularization produces an 82.8 percent cumulative latency reduction suitable for embedded GPUs.
  • The distilled model supports low-latency local execution on mobile robots without cloud round-trips.
  • The same distillation pattern can be applied to other open-vocabulary embodied tasks that rely on spatial-semantic reasoning.
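The headline numbers above can be cross-checked with a little arithmetic. The sketch below uses only figures quoted in this summary. It assumes, purely for illustration, that the E-RLVR and quantization latency reductions compose multiplicatively; the summary does not say how the cumulative 82.8 percent was measured, so the implied quantization share is an estimate, not a reported result. The function names are hypothetical.

```python
# Figures quoted in the summary above (all in percent).
CLOUD_SR = 39.7            # Claude Sonnet 4.6 scene-graph pipeline
LOCAL_SR = 34.5            # Qwen3.5-4B fine-tuned on 500 traces
ERLVR_LATENCY_CUT = 71.8   # latency reduction after E-RLVR regularization
TOTAL_LATENCY_CUT = 82.8   # cumulative reduction including quantization


def success_gap(cloud: float, local: float) -> float:
    """Absolute success-rate gap in percentage points."""
    return cloud - local


def implied_extra_cut(first_cut: float, total_cut: float) -> float:
    """Extra reduction (percent) needed on top of first_cut to reach total_cut.

    ASSUMPTION: reductions multiply, i.e.
    (1 - total) = (1 - first) * (1 - extra).
    """
    remaining_after_first = 1.0 - first_cut / 100.0
    remaining_total = 1.0 - total_cut / 100.0
    return 100.0 * (1.0 - remaining_total / remaining_after_first)


if __name__ == "__main__":
    gap = success_gap(CLOUD_SR, LOCAL_SR)
    extra = implied_extra_cut(ERLVR_LATENCY_CUT, TOTAL_LATENCY_CUT)
    print(f"success-rate gap: {gap:.1f} percentage points")
    print(f"local model reaches {100 * LOCAL_SR / CLOUD_SR:.1f}% of cloud SR")
    print(f"implied extra cut from quantization: {extra:.1f}%")
```

Under that multiplicative assumption, quantization would account for roughly a further 39 percent reduction of the latency that remains after E-RLVR.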

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the 500-trace set proves sufficient across benchmarks, the cost of creating local navigation agents drops dramatically because frontier model calls are needed only once.
  • The method implies that embodied RL can serve as a post-distillation compressor rather than a full training regime when high-quality traces already exist.
  • Real-robot deployment would test whether the simulated 34.5 percent success transfers when perception noise and actuation delays appear.

Load-bearing premise

The 500 reasoning traces generated by Claude Sonnet 4.6 contain sufficient transferable spatial-semantic knowledge that fine-tuning alone can close most of the performance gap to the full frontier pipeline.

What would settle it

Evaluating the base Qwen3.5-4B model without the 500-trace fine-tuning on the same HM3D OVON benchmark and finding its success rate remains below 15 percent would show the traces are necessary for the reported performance.

Figures

Figures reproduced from arXiv: 2606.27871 by Boyang Sun, Edoardo Ghignone, Liam Boyle, Luca Benini, Marc Pollefeys, Michele Magno, Nicolas Baumann, Pu Deng.

Figure 1: Schematic overview of the proposed LocalNav system. Supervised Fine-Tuning (SFT) enables Object Goal Navigation (ObjectNav) capabilities by distilling from large cloud frontier models, achieving a 26 percentage-point increase in Success Rate (SR) on a Qwen3.5-4B model. Embodied Reinforcement Learning from Verifiable Rewards (E-RLVR) grounds the agent in its task and regularizes Token Generation (TG) length, decreas…
Figure 2: Overview of the system architecture. A 3D …
Figure 3: Schematic overview of the E-RLVR training loop in Habitat. The model samples four independent completions conditioned on the current camera observation and SG state. Each action trajectory is rolled out individually in the simulator. Because the target ("fireplace") is already present within the SG (G ∈ SG), the reward function penalizes distance-to-goal and verbosity (Rbrevity), omitting exploration incen…
Figure 4: Real-world experiment conducted using the hand-held device (Section …
Figure 5: Panoramic image of an example Habitat interaction, with the SG objects projected into the image. This image corresponds to the VLM outputs in …
Figure 6: Prompt of the proposed architecture; the content matches the input panorama image of …
Figure 7: Qualitative trajectories of the proposed system evaluated on the …
Figure 8: Depiction of the handheld device with the …
Figure 9: Unitree A2-W robot used for robot deployment. The handheld device of …
Figure 10: The proposed system mounted on a Unitree A2-W. The open-vocabulary task is to navigate to a "chair with a person sitting on it". The robot first begins to explore its surroundings, as the goal is not in its immediate FoV. Once the goal has been spotted and anchored in the SG, it can then navigate to it and visually verify it.
Figure 11: Depicts the cumulative reward and the mean token output length of an E-RLVR training run using the Claude SFT pretrained Qwen3.5-4B-SFT-Claude model mentioned in Section 4. The run was performed with the hyperparameters given in Section A.11. As can be seen, the Rbrevity reward pushes the token output length from the initial ∼500 output tokens of the SFT-Claude model down to ∼200 tokens. Note that during …
Figure 12: KLD versus prefill throughput and TG throughput for all available quantization formats in llama.cpp. The quantization formats on the Pareto front are highlighted in red. The chosen IQ4-XS quantization is highlighted in yellow. From the profiling results, we observe that the prefill throughput is largely invariant across different quantization formats. This is because the prefill stage is compute-bound. Th…
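The reward structure described in the Figure 3 caption can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the caption states only that, when the goal is already in the scene graph, the reward penalizes distance-to-goal and verbosity and omits the exploration incentive. The linear penalty shape, all weights, the token budget, the exploration term, and every name below (`StepOutcome`, `brevity_penalty`, `reward`) are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Result of rolling out one sampled completion in the simulator."""
    goal_in_scene_graph: bool      # True when G ∈ SG
    distance_to_goal_m: float      # distance to goal after the rollout
    output_tokens: int             # length of the generated completion
    newly_explored_area_m2: float  # hypothetical exploration measure


def brevity_penalty(output_tokens: int, token_budget: int = 200,
                    weight: float = 0.001) -> float:
    """Hypothetical R_brevity: zero up to token_budget, linear beyond it."""
    return -weight * max(0, output_tokens - token_budget)


def reward(outcome: StepOutcome, w_dist: float = 1.0,
           w_explore: float = 0.1) -> float:
    """Verbosity is always penalized (an assumption). Distance-to-goal is
    penalized only when the goal is in the scene graph; otherwise an
    assumed exploration bonus is used instead."""
    total = brevity_penalty(outcome.output_tokens)
    if outcome.goal_in_scene_graph:
        total -= w_dist * outcome.distance_to_goal_m
    else:
        total += w_explore * outcome.newly_explored_area_m2
    return total


if __name__ == "__main__":
    # Four completions for one observation, as in the Figure 3 caption.
    rollouts = [
        StepOutcome(True, 2.0, 500, 0.0),
        StepOutcome(True, 2.0, 200, 0.0),
        StepOutcome(True, 3.5, 180, 0.0),
        StepOutcome(True, 1.0, 450, 0.0),
    ]
    for r in sorted(rollouts, key=reward, reverse=True):
        print(f"tokens={r.output_tokens:4d} dist={r.distance_to_goal_m:.1f} "
              f"reward={reward(r):+.3f}")
```

With equal distance to the goal, the shorter completion scores higher, which is the pressure toward shorter outputs that the Figure 11 caption attributes to Rbrevity.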
Original abstract

Vision Language Models (VLMs) have emerged in the robotic domain as a powerful tool that enables environmental perception with language context, serving as a catalyst for open-vocabulary tasks like ObjectNav. Yet, their computational footprint typically confines them to cloud execution, hindering low-latency inference with local deployment on resource-constrained robots. To address this challenge, we present a distillation strategy that transfers complex spatial-semantic reasoning from large frontier models into a lightweight, 4B-parameter local VLM for edge execution on embedded GPU devices (e.g., Jetson Orin). We first establish a State of the Art (SotA), Scene Graph (SG)-based pipeline using Claude Sonnet 4.6, achieving a 39.7% Success Rate (SR) on the HM3D OVON benchmark. We then demonstrate that fine-tuning Qwen3.5-4B on just 500 frontier reasoning traces effectively enables navigation capabilities, yielding a SR of 34.5%, narrowing the gap to the performance of large cloud models. Finally, we introduce E-RLVR with Token Generation (TG) regularization to compress output sequence lengths for physical deployment while grounding the agent in its task. This downstream optimization reduces TG overhead by 72.1% and latency by 71.8%. Combined with quantization, this joint strategy yields a cumulative 82.8% reduction in overall inference latency without significantly sacrificing performance, presenting a viable paradigm for local, low-latency VLM execution on mobile robots.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces LocalNav, a distillation approach that transfers spatial-semantic reasoning from a frontier VLM (Claude Sonnet 4.6) via a scene-graph pipeline to a lightweight 4B VLM (Qwen3.5-4B) by fine-tuning on 500 reasoning traces. This achieves a 34.5% success rate on the HM3D OVON benchmark, close to the 39.7% of the frontier model. It further proposes E-RLVR with Token Generation regularization to reduce output lengths and latency for on-device deployment on embedded GPUs, reporting an 82.8% cumulative latency reduction.

Significance. If the results hold, the work would be significant as it shows that effective distillation of complex navigation reasoning can be achieved with a very small number of traces from a frontier model, enabling practical on-device VLM-based ObjectNav on resource-constrained robots without relying on cloud inference. The latency optimizations further support real-world deployment.

major comments (3)
  1. [Abstract] The success rates of 34.5% and 39.7% are presented as single-point values without error bars, confidence intervals, or details on the evaluation protocol (e.g., number of episodes, variance across runs), which is load-bearing for claims of narrowing the performance gap.
  2. [Abstract / distillation experiments] No details are provided on the collection, sampling strategy, diversity (e.g., object categories, scene layouts on HM3D), quality filtering, or coverage metrics of the 500 frontier reasoning traces. This is central to the claim that fine-tuning on these traces transfers sufficient spatial-semantic knowledge, as the transferability assumption lacks supporting ablations or analysis.
  3. [Experiments section] The manuscript does not include ablations on the number of traces, comparison to other distillation methods, or verification that the 500-trace set was not post-hoc selected, making it hard to assess if the performance is due to genuine distillation or dataset-specific effects.
minor comments (1)
  1. [Abstract] The term 'E-RLVR with Token Generation (TG) regularization' is introduced without prior definition or reference to its components.
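To make the first major comment concrete, the sketch below computes a Wilson score interval for a success rate. The number of evaluation episodes is not stated in the text summarized here, so the episode counts are hypothetical placeholders; the point is only how the interval width, and hence the overlap between 34.5% and 39.7%, depends on that count.

```python
import math


def wilson_interval(successes: int, episodes: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = successes / episodes
    denom = 1.0 + z * z / episodes
    center = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes
                         + z * z / (4 * episodes ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # HYPOTHETICAL episode counts; the real count is not given in this summary.
    for n in (500, 1000, 3000):
        local = wilson_interval(round(0.345 * n), n)
        cloud = wilson_interval(round(0.397 * n), n)
        overlap = local[1] >= cloud[0]
        print(f"n={n}: local=({local[0]:.3f}, {local[1]:.3f}) "
              f"cloud=({cloud[0]:.3f}, {cloud[1]:.3f}) overlap={overlap}")
```

Reporting such intervals alongside the point estimates would let a reader judge whether the 5.2-point gap is distinguishable from evaluation noise.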

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas for improving the clarity and rigor of our results and experimental details. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] The success rates of 34.5% and 39.7% are presented as single-point values without error bars, confidence intervals, or details on the evaluation protocol (e.g., number of episodes, variance across runs), which is load-bearing for claims of narrowing the performance gap.

    Authors: We agree that providing evaluation protocol details and statistical measures would strengthen the claims. In the revised manuscript, we will update the abstract and relevant sections to specify the number of episodes evaluated on the HM3D OVON benchmark and include any available variance or confidence interval information from the runs performed.
    Revision: yes

  2. Referee: [Abstract / distillation experiments] No details are provided on the collection, sampling strategy, diversity (e.g., object categories, scene layouts on HM3D), quality filtering, or coverage metrics of the 500 frontier reasoning traces. This is central to the claim that fine-tuning on these traces transfers sufficient spatial-semantic knowledge, as the transferability assumption lacks supporting ablations or analysis.

    Authors: We will expand the methods and distillation experiments sections to include full details on the collection process for the 500 reasoning traces. This will cover the sampling strategy from the scene-graph pipeline, diversity across object categories and HM3D scene layouts, quality filtering steps applied, and any coverage metrics used.
    Revision: yes

  3. Referee: [Experiments section] The manuscript does not include ablations on the number of traces, comparison to other distillation methods, or verification that the 500-trace set was not post-hoc selected, making it hard to assess if the performance is due to genuine distillation or dataset-specific effects.

    Authors: We will revise the experiments section to explicitly describe the trace generation and selection pipeline, confirming it followed a systematic process rather than post-hoc selection. While we did not conduct full ablations on varying numbers of traces or head-to-head comparisons against other distillation baselines (due to computational constraints), we will add discussion of related distillation literature and the rationale for selecting 500 traces based on our preliminary scaling observations.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical distillation evaluated on external benchmark

full rationale

The paper describes an empirical pipeline that generates 500 reasoning traces from Claude Sonnet 4.6, fine-tunes Qwen3.5-4B on them, and reports success rate (34.5%) and latency reductions on the external HM3D OVON benchmark. No equations, fitted parameters, or self-referential definitions appear in the provided text that would reduce the reported metrics to quantities defined by the method itself. The central claims rest on measured outcomes after explicit training rather than any of the enumerated circular patterns (self-definitional, fitted-input-called-prediction, or load-bearing self-citation).

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the transferability of frontier traces and the effectiveness of the newly introduced regularization; both are supported only by the reported single-run benchmark numbers.

free parameters (1)
  • Number of frontier reasoning traces = 500
    The paper selects exactly 500 traces for fine-tuning; this count is chosen by the authors and directly determines the reported performance.
axioms (1)
  • (domain assumption) Reasoning traces produced by Claude Sonnet 4.6 contain the spatial-semantic knowledge required for ObjectNav.
    The entire distillation strategy presupposes that these traces are rich enough to transfer capability to the 4B model.
invented entities (1)
  • E-RLVR with Token Generation (TG) regularization (no independent evidence)
    purpose: Compress output sequence lengths while grounding the agent for physical deployment.
    New method introduced to achieve the 72.1% TG overhead reduction; no independent evidence supplied beyond the reported latency numbers.


