pith. machine review for the scientific record.

arxiv: 2604.08232 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: unknown

HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords embodied navigation · hybrid reasoning · action entropy · large reasoning models · object navigation · reinforcement learning · token efficiency

The pith

HiRO-Nav triggers reasoning only on high-entropy actions to raise navigation success while cutting token use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Embodied agents using large reasoning models must decide when deliberate thinking is worth the cost during long-horizon tasks. HiRO-Nav measures the uncertainty in its action distribution at each step and reserves reasoning for the rare high-entropy moments that typically lead toward new scenes or key objects. Analysis of trajectories shows these uncertain steps contribute disproportionately to task success. The agent is trained first with hybrid supervised fine-tuning and then with online reinforcement learning that activates the reasoning branch only when entropy is high. On the CHORES-S ObjectNav benchmark this selective policy delivers higher success rates at lower average token cost than either always-reason or never-reason baselines.

Core claim

HiRO-Nav decides at every step, based on its own action entropy, whether to engage in deliberate reasoning. Examining entropy evolution reveals that only a small fraction of actions exhibit high entropy, and these often steer the agent toward novel scenes or critical objects. The relationship between action entropy and Q-value further shows that improving high-entropy actions contributes more positively to task success. A tailored training pipeline of hybrid supervised fine-tuning followed by online reinforcement learning with the hybrid reasoning strategy explicitly activates reasoning only for high-entropy actions, significantly reducing computational overhead while improving decision quality.

What carries the argument

Action entropy computed from the policy, used as a dynamic gate that activates deliberate reasoning only at high-uncertainty steps.
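A minimal sketch of that gate as Pith reads it: entropy is computed from the per-step action distribution and compared against the 0.6 threshold reported in the paper's figures. The policy interface and function names below are illustrative assumptions, not the authors' code.

```python
import math

AE_THRESHOLD = 0.6  # high-entropy cutoff reported in Figures 2-3

def action_entropy(probs):
    """Shannon entropy of the per-step action distribution.

    Whether the paper normalizes by log(|action space|) is not visible
    from this page; raw entropy is used here for illustration.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def navigation_step(policy, observation):
    """One entropy-gated step. `policy` is a hypothetical interface:
    `action_probs` is one cheap forward pass; `reason_then_act` emits a
    full reasoning trace before committing to an action."""
    probs = policy.action_probs(observation)       # reflexive pass, no reasoning tokens
    if action_entropy(probs) >= AE_THRESHOLD:      # rare, uncertain step: deliberate
        return policy.reason_then_act(observation)
    # confident step: act directly on the argmax action
    return max(range(len(probs)), key=lambda i: probs[i])
```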

If this is right

  • Success improves because reasoning effort is concentrated on the minority of steps that most affect long-horizon Q-values.
  • Token consumption drops because the majority of low-entropy actions are executed reflexively without full reasoning traces.
  • The two-stage training pipeline stabilizes the entropy-based gate without introducing instability in simple scenes.
  • The same selective mechanism can be applied to other long-horizon embodied tasks that exhibit similar entropy distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Entropy gating may reduce reasoning cost in other LRM applications such as planning or dialogue where uncertainty also clusters at decision forks.
  • In physical robots the entropy signal would need to be computed from fast policy rollouts or approximations to avoid latency.
  • Shared entropy across agents could coordinate when a team should reason collectively versus act independently.

Load-bearing premise

Action entropy reliably marks the exact steps where extra reasoning improves completion rates more than reflexive action, and the hybrid training does not create new failure modes in low-entropy regimes.

What would settle it

An ablation that forces reasoning on low-entropy steps or suppresses it on high-entropy steps and measures whether overall success rate or efficiency degrades on the same CHORES-S episodes.
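A sketch of how such an intervention could be scored. The `Outcome` fields, the gate set, and the `run_episode` callable are assumed interfaces; only the 0.6 threshold and the two boundary baselines come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Outcome:
    success: bool          # episode-level success flag
    tokens_generated: int  # model-generated tokens in the episode

# Gating rules compared at fixed machinery: each maps a step's action
# entropy to a "reason now?" decision. 0.6 is the threshold from Figure 3.
GATES = {
    "learned":    lambda h: h >= 0.6,  # the paper's adaptive rule
    "inverted":   lambda h: h < 0.6,   # force reasoning on low-entropy steps only
    "always_on":  lambda h: True,      # dense-thinking boundary case
    "always_off": lambda h: False,     # no-thinking boundary case
}

def score(run_episode: Callable, episodes: List) -> dict:
    """run_episode(episode, gate) -> Outcome; the caller supplies the agent,
    so everything except the gate stays fixed across conditions."""
    table = {}
    for name, gate in GATES.items():
        outs = [run_episode(ep, gate) for ep in episodes]
        table[name] = {
            "success_rate": sum(o.success for o in outs) / len(outs),
            "tokens_per_episode": sum(o.tokens_generated for o in outs) / len(outs),
        }
    return table
```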

Figures

Figures reproduced from arXiv: 2604.08232 by Chunyan Miao, Deheng Ye, He Zhao, Yijun Yang, Zichuan Lin.

Figure 1
Figure 1. Illustration of the HiRO-Nav agent adaptively determining [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. The distribution of action entropy (AE) over navigation trajectories. We analyze the AE distribution of a VLM agent fine-tuned using expert trajectories on CHORES-S ObjectNav tasks. (a): Only a small fraction (∼30%) of actions exhibits high entropy (AE ≥ 0.6). (b): High-entropy actions (red points in the map) often steer the agent to explore novel areas or approach critical objects. An extended version of … view at source ↗
Figure 3
Figure 3. Mean Q-value of a hybrid fine-tuned model introduced in Sec. 3.3 across various action entropy thresholds. A high threshold means sparse activation of reasoning, resulting in high token efficiency. We conclude that thinking only for high-entropy actions (threshold = 0.6) achieves the best trade-off between task completion and maximizing token efficiency. Lower or higher thresholds can result in "overthink… view at source ↗
Figure 4
Figure 4. Comparison of the HiRO-Nav agent against SOTA baselines in terms of the trade-off between navigation success rate (SR) and token efficiency. We compute the average number of model-generated tokens per episode (#Token/E). HiRO-Nav with the hybrid reasoning strategy (Ours) achieves the best trade-off. Hence, we split the RL training into two stages to optimize the no-thinking and thinking abilities separately. In s… view at source ↗
Figure 5
Figure 5. Overview of the HiRO-Nav training pipeline and the proposed hybrid reasoning strategy, which consists of two parts: (1) [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Comparison of the reasoning efficiency of hybrid reasoning (Ours) and thinking-every-K-steps (Every-K). We divide the navigation tasks into different difficulty levels based on the ground-truth shortest path lengths. Our hybrid reasoning method consistently outperforms the baseline reasoning approach across all difficulty levels while maintaining a lower thinking ratio. TR = Thinking Ratio. SR = Success Rate. g… view at source ↗
Figure 7
Figure 7. Ablation study. Fig. (a)&(b): Vanilla RL fails to effectively enhance hybrid reasoning ability due to a decline in no-thinking ability. In contrast, our two-stage training paradigm improves the agent's no-thinking ability in Stage I and maintains it with a KL constraint in Stage II, which subsequently enhances the hybrid reasoning ability. Fig. (c): The superior performance of NTW>0 highlights its ef… view at source ↗
Figure 8
Figure 8. Pass@k curves of HiRO-Nav with hybrid reasoning. GT and DM refer to the ground-truth ASMs and the ASMs estimated by deep models as in Tab. 2. We evaluate 16 times with temperature = 0.2. The navigation ability upper bound of HiRO-Nav outperforms the task-specific SOTA method PoliFormer [43], even when using noisy deep-model-estimated ASMs, further demonstrating that even with lower-quality ASMs, the… view at source ↗
Figure 9
Figure 9. An example of an annotated semantic map. Task Success Condition: The task is considered successful if the agent terminates navigation by emitting the "end" action within a specified step limit, and the target object is within the agent's view and within a certain distance from its current location. Action Space: We provide details of the action space of the RE-Stretch 1 robot in [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗
Figure 10
Figure 10. Additional visualization examples of action entropy at each navigation waypoint. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Two action entropy calculation methods show the same relationship between action entropy and Q-value. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
read the original abstract

Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberate reasoning before acting. To achieve this, we introduce the Hybrid ReasOning Navigation (HiRO-Nav) agent, the first kind of agent capable of adaptively determining whether to perform thinking at every step based on its own action entropy. Specifically, by examining how the agent's action entropy evolves over the navigation trajectories, we observed that only a small fraction of actions exhibit high entropy, and these actions often steer the agent toward novel scenes or critical objects. Furthermore, studying the relationship between action entropy and task completion (i.e., Q-value) reveals that improving high-entropy actions contributes more positively to task success. Hence, we propose a tailored training pipeline comprising hybrid supervised fine-tuning as a cold start, followed by online reinforcement learning with the proposed hybrid reasoning strategy to explicitly activate reasoning only for high-entropy actions, significantly reducing computational overhead while improving decision quality. Extensive experiments on the CHORES-S ObjectNav benchmark show that HiRO-Nav achieves a better trade-off between success rates and token efficiency than both dense-thinking and no-thinking baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HiRO-Nav, an embodied navigation agent for long-horizon tasks that adaptively activates reasoning only at high action-entropy steps. It reports an observational finding that high-entropy actions are rare and steer toward novel or critical states, and that improving them yields higher marginal gains in task completion (measured by Q-value). A hybrid pipeline of supervised fine-tuning followed by online RL is proposed to implement this selective reasoning, with the central empirical claim being a superior success-rate versus token-efficiency trade-off on the CHORES-S ObjectNav benchmark relative to dense-thinking and no-thinking baselines.

Significance. If the empirical claims and mechanism are substantiated, the work could be significant for efficient deployment of large reasoning models in embodied settings by showing that selective, entropy-triggered reasoning can reduce computational overhead while preserving or improving decision quality in navigation. The hybrid SFT+RL training recipe and the entropy-based activation rule address a practical scaling concern, though the absence of supporting quantitative evidence and controls currently limits assessment of its broader impact.

major comments (3)
  1. [Abstract] The central claim that HiRO-Nav achieves a better success-rate/token-efficiency trade-off is unsupported by any reported numbers, error bars, ablation tables, or statistical tests, rendering the headline result impossible to evaluate for magnitude or reliability.
  2. [Abstract] The relationship between action entropy and Q-value is described only as an observational finding ('studying the relationship... reveals that improving high-entropy actions contributes more positively') without causal experiments, interventions, or controls demonstrating that the entropy selector itself drives the reported gains rather than other elements of the training pipeline.
  3. [Experiments section] No ablation is presented that replaces the entropy-based selector with an alternative sparse activation rule of comparable density (e.g., random or fixed-interval selection) while keeping the rest of the hybrid SFT+RL pipeline fixed; without this, it is impossible to establish that the adaptive entropy rule is load-bearing for the claimed trade-off improvement.
minor comments (2)
  1. The benchmark is referred to as CHORES-S or CHORES-𝕊 without an explicit definition or citation in the abstract; this notation should be clarified on first use in the main text.
  2. [Abstract] The phrase 'the first kind of agent capable of adaptively determining whether to perform thinking' should be tempered with a brief literature comparison to avoid overstatement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and proposed revisions to strengthen the presentation of results and controls.

read point-by-point responses
  1. Referee: [Abstract] The central claim that HiRO-Nav achieves a better success-rate/token-efficiency trade-off is unsupported by any reported numbers, error bars, ablation tables, or statistical tests, rendering the headline result impossible to evaluate for magnitude or reliability.

    Authors: The abstract provides a high-level summary of the empirical findings. Detailed quantitative results—including success rates, token consumption, error bars across multiple seeds, ablation tables, and statistical comparisons—are reported in the Experiments section on the CHORES-S benchmark. To improve accessibility, we will revise the abstract to incorporate key numerical values (e.g., success rate and token efficiency deltas versus baselines) while retaining the summary style. revision: yes

  2. Referee: [Abstract] The relationship between action entropy and Q-value is described only as an observational finding ('studying the relationship... reveals that improving high-entropy actions contributes more positively') without causal experiments, interventions, or controls demonstrating that the entropy selector itself drives the reported gains rather than other elements of the training pipeline.

    Authors: The entropy-Q-value relationship is presented as an observational analysis of trajectories to motivate the method. The primary evidence for the selector's utility is the end-to-end performance of the full HiRO-Nav pipeline (hybrid SFT followed by online RL with entropy-triggered reasoning) against dense-thinking and no-thinking baselines that share the same training recipe. We agree the language could be clarified to emphasize the observational basis and will add a dedicated paragraph discussing potential confounding factors from the training pipeline along with any available supporting statistics. revision: partial

  3. Referee: [Experiments section] No ablation is presented that replaces the entropy-based selector with an alternative sparse activation rule of comparable density (e.g., random or fixed-interval selection) while keeping the rest of the hybrid SFT+RL pipeline fixed; without this, it is impossible to establish that the adaptive entropy rule is load-bearing for the claimed trade-off improvement.

    Authors: The current experiments compare the entropy-triggered approach against the boundary cases of always-reason and never-reason under the identical hybrid training pipeline. We acknowledge that an ablation using non-adaptive sparse rules (random or fixed-interval) at matched activation density would more directly isolate the benefit of entropy-based selection. We will add this control experiment in the revised manuscript. revision: yes
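A minimal sketch of what those matched-density controls could look like, assuming the ∼30% activation rate read off Figure 2; all function names here are illustrative, not from the manuscript.

```python
import random

def entropy_gate(entropies, threshold=0.6):
    """The adaptive rule: reason exactly on high-entropy steps."""
    return [h >= threshold for h in entropies]

def random_gate(entropies, density=0.30, seed=0):
    """Control: same expected activation rate (~30% per Figure 2),
    but the reasoning steps are chosen at random."""
    rng = random.Random(seed)
    return [rng.random() < density for _ in entropies]

def every_k_gate(entropies, k=3):
    """Control: think on every k-th step; k=3 gives roughly the same
    activation density as the entropy gate."""
    return [i % k == 0 for i in range(len(entropies))]
```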

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper's core chain proceeds from empirical observations of action entropy on trajectories (high-entropy steps are rare and correlate with higher marginal Q-value impact) to a design choice for an entropy-gated hybrid reasoning policy, implemented via hybrid SFT+RL training, and finally evaluated on downstream success/token metrics against baselines. These steps rely on external benchmark results rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or uniqueness theorems are invoked that reduce the claimed trade-off to the input statistics by construction. The entropy selector is a motivated heuristic whose effectiveness is tested independently via RL optimization and CHORES-S evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical observations of action entropy and its correlation with task success; no explicit free parameters, axioms, or invented physical entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5606 in / 1098 out tokens · 32269 ms · 2026-05-10T16:55:27.373947+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 36 canonical work pages · 14 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  2. [2] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. ObjectNav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020.
  3. [3] Yuxin Cai, Xiangkun He, Maonan Wang, Hongliang Guo, Wei-Yun Yau, and Chen Lv. CL-CoTNav: Closed-loop hierarchical chain-of-thought for zero-shot object-goal navigation with vision-language models. arXiv preprint arXiv:2504.09000, 2025.
  4. [4] Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. CogNav: Cognitive process modeling for object goal navigation with LLMs. arXiv preprint arXiv:2412.10439, 2024.
  5. [5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  6. [6] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024.
  7. [7] An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453, 2024.
  8. [8] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
  9. [9] DeepMind. Gemini Pro. https://deepmind.google/models/gemini/pro/, 2024.
  10. [10] Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025.
  11. [11] Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, et al. SPOC: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 162…
  12. [12] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978, 2025.
  13. [13] Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. OctoNav: Towards generalist embodied navigation. arXiv preprint arXiv:2506.09839, 2025.
  14. [14] Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with VLMs: Transforming spatial reasoning into question-answering. In Workshop on Language and Robot Learning: Language as an Interface.
  15. [15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  16. [16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  17. [17] Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, and Yi R Fung. AdaCtrl: Towards adaptive and controllable reasoning via difficulty-aware budgeting. arXiv preprint arXiv:2505.18822, 2025.
  18. [18] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  19. [19] Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, and Furu Wei. Think only when you need with large hybrid-reasoning models. arXiv preprint arXiv:2505.14631, 2025.
  20. [20] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
  21. [21] Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning. arXiv preprint arXiv:2503.16188, 2025.
  22. [22] OpenAI. Introducing o3 and o4-mini. https://openai.com/zh-Hans-CN/index/introducing-o3-and-o4-mini/, 2024.
  23. [23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  24. [24] Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. VLN-R1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221, 2025.
  25. [25] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.
  26. [26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  27. [27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  28. [28] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
  29. [29] Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arXiv preprint arXiv:2409.12183, 2024.
  30. [30] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  31. [31] Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, and Dongbin Zhao. Learning when to think: Shaping adaptive reasoning in R1-style models via multi-stage RL. arXiv preprint arXiv:2505.10832, 2025.
  32. [32] Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-Think: Exploring reasoning strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886, 2025.
  33. [33] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939, 2025.
  34. [34] Yunhao Wang, Yuhao Zhang, Tinghao Yu, Can Xu, Feng Zhang, and Fengzong Lian. Adaptive deep reasoning: Triggering deep thinking when needed. arXiv preprint arXiv:2505.20101, 2025.
  35. [35] Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, and Dong Yu. DivScene: Benchmarking LVLMs for object navigation with diverse scenes and objects. arXiv preprint arXiv:2410.02730, 2024.
  36. [36] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
  37. [37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  38. [38] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
  39. [39] Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation. Advances in Neural Information Processing Systems, 37:5285–5307, 2024.
  40. [40] Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. VLFM: Vision-language frontier maps for zero-shot semantic navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024.
  41. [42] Linan Yue, Yichao Du, Yizhi Wang, Weibo Gao, Fangzhou Yao, Li Wang, Ye Liu, Ziyu Xu, Qi Liu, Shimin Di, et al. Don't overthink it: A survey of efficient R1-style large reasoning models. arXiv preprint arXiv:2508.02120, 2025.
  42. [43] Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, and Luca Weihs. PoliFormer: Scaling on-policy RL with transformers results in masterful navigators. In Conference on Robot Learning, pages 408–432. PMLR, 2025.
  43. [44] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024.
  44. [45] Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. NaVid: Video-based VLM plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024.
  45. [46] Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning models can learn when to think. arXiv preprint arXiv:2505.13417, 2025.
  46. [47] Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1…
  47. [48] Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, et al. Mem2Ego: Empowering vision-language models with global-to-ego memory for long-horizon embodied navigation. arXiv preprint arXiv:2502.14254, 2025.
  48. [49] Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, and Boyu Zhou. ApexNav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. arXiv preprint arXiv:2504.14478, 2025.
  49. [50] Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Huimin Ma, Shifeng Zhang, Xu Zhou, and Si Liu. TopV-Nav: Unlocking the top-view spatial reasoning potential of MLLM for zero-shot object navigation. arXiv preprint arXiv:2411.16425, 2024.