TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
Pith reviewed 2026-05-10 15:59 UTC · model grok-4.3
The pith
A retrieval system stores past navigation paths in compact geometric-semantic form and retrieves similar ones to guide large models toward objects in new scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TrajRAG incrementally stores episodic observations as topo-polar trajectories that compactly encode spatial layouts and semantic contexts. Hierarchical chunking groups similar trajectories into summaries that support coarse-to-fine retrieval. At inference time candidate frontiers spawn multiple trajectory hypotheses that query the memory for relevant past experiences; retrieved trajectories then steer large-model reasoning for waypoint selection, after which the new episode is folded back into the store.
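A minimal sketch of what such a topo-polar record could look like. The step format, semantic labels, and redundancy rule below are illustrative assumptions, not the paper's published encoding:

```python
from dataclasses import dataclass
from math import hypot, atan2

@dataclass
class TopoPolarNode:
    r: float      # distance from the previous node (metres)
    theta: float  # bearing from the previous node (radians)
    label: str    # semantic context observed there, e.g. "kitchen"

def encode(path, labels, min_step=0.25):
    """Convert an (x, y) path into topo-polar nodes, dropping
    near-duplicate steps that share a label (redundancy removal)."""
    nodes = []
    for (x0, y0), (x1, y1), lab in zip(path, path[1:], labels[1:]):
        r = hypot(x1 - x0, y1 - y0)
        if nodes and lab == nodes[-1].label and r < min_step:
            continue  # redundant observation: same place, same semantics
        nodes.append(TopoPolarNode(r, atan2(y1 - y0, x1 - x0), lab))
    return nodes

# A short hallway-to-kitchen path; the tiny second step is collapsed away
path = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (1.1, 1.0)]
labels = ["hall", "hall", "hall", "kitchen"]
traj = encode(path, labels)
```

The point of the sketch is only that a polar step plus a label is enough to compare trajectories without storing raw observations.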
What carries the argument
The topological-polar trajectory representation that encodes spatial layouts and semantic contexts while removing redundancies, together with hierarchical chunking that organizes similar trajectories into unified summaries for retrieval.
If this is right
- Zero-shot ObjectNav success rates rise on MP3D, HM3D-v1, and HM3D-v2 when relevant past trajectories are retrieved.
- Episodic observations become reusable lifelong memory instead of being discarded after each episode.
- Large-model waypoint selection receives concrete geometric-semantic examples rather than relying solely on pretrained commonsense.
- New scenes benefit from transfer of structured trajectory patterns without any task-specific fine-tuning.
Where Pith is reading between the lines
- The same structured memory could be queried during exploration or mapping to avoid previously visited dead-ends.
- Over repeated deployments the system might implicitly learn common building archetypes that transfer across different houses.
- Replacing or augmenting the large model with direct trajectory lookup could lower compute cost while preserving performance.
- Extending the representation to include action outcomes or failure cases would allow the agent to learn avoidance strategies.
Load-bearing premise
The topo-polar format and chunking summaries retain enough geometric and semantic detail from raw observations to support useful transfer to unseen environments.
What would settle it
An experiment in which TrajRAG returns trajectories that are geometrically or semantically dissimilar to the current scene, resulting in lower or equal success rates compared with a baseline that ignores the memory.
Original abstract
Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.
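The inference loop the abstract describes can be caricatured in a few lines. Everything here is a stand-in: `featurize`, the flat nearest-neighbour `retrieve`, and the scoring rule are illustrative assumptions, since the paper's actual hierarchical retrieval and LLM prompting are not specified in this excerpt.

```python
import math

def featurize(traj):
    # Toy summary of a list of (r, theta) polar steps: (path length, net turn)
    return (sum(r for r, _ in traj), sum(t for _, t in traj))

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den if den else 0.0

def retrieve(memory, hypothesis, k=2):
    # Flat nearest-neighbour lookup; TrajRAG's real retrieval is hierarchical
    q = featurize(hypothesis)
    return sorted(memory, key=lambda m: -cosine(featurize(m["traj"]), q))[:k]

def select_waypoint(frontiers, hypotheses, memory):
    # Pick the frontier whose hypothesis best matches past experience;
    # the real system feeds retrieved trajectories into an LLM prompt instead
    best, best_score = None, -1.0
    for frontier, hyp in zip(frontiers, hypotheses):
        score = max(cosine(featurize(m["traj"]), featurize(hyp))
                    for m in retrieve(memory, hyp))
        if score > best_score:
            best, best_score = frontier, score
    return best

memory = [{"traj": [(1.0, 0.0), (1.0, 0.0)]},   # past straight corridor run
          {"traj": [(0.1, 1.0)]}]               # past short turning run
frontiers = ["A", "B"]
hypotheses = [[(1.0, 0.0), (1.0, 0.1)],         # near-straight, like memory 1
              [(5.0, 2.0)]]                     # long sweeping turn
choice = select_waypoint(frontiers, hypotheses, memory)
```

Consolidation would then append the completed episode's trajectory to `memory`, which is the lifelong-accumulation step the abstract emphasizes.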
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TrajRAG, a retrieval-augmented generation framework for zero-shot Object Goal Navigation that incrementally accumulates episodic observations, encodes them via a topological-polar (topo-polar) trajectory representation to remove redundancies, organizes similar trajectories with hierarchical chunking for coarse-to-fine retrieval, and queries this store during navigation to guide large-model reasoning for waypoint selection from candidate frontiers. New experiences are consolidated to enable lifelong accumulation. Experiments on MP3D, HM3D-v1, and HM3D-v2 are stated to demonstrate effective retrieval of geometric-semantic experiences and improved ObjectNav performance.
Significance. If the quantitative claims hold and the encoding preserves necessary spatial-semantic information, the work offers a practical mechanism for lifelong embodied experience reuse in navigation, moving beyond static LLM commonsense priors. The hierarchical retrieval design is a clear engineering strength for scaling experience. However, significance is limited by the absence of verifiable performance deltas, baselines, or preservation metrics, leaving the core transfer assumption untested.
Major comments (2)
- [Abstract] The central claim that TrajRAG 'improves zero-shot ObjectNav performance' is asserted without quantitative results, baselines, ablation details, success rates, or error analysis, rendering the empirical contribution unverifiable from the provided text.
- [Abstract] Topo-polar representation: the claim that the encoding 'compactly encodes spatial layouts and semantic contexts, effectively removing redundancies' lacks any reconstruction-error or information-preservation metric, or an ablation over metric details (distances, obstacle densities, local geometry); this directly undermines the transfer assumption to novel MP3D/HM3D layouts.
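One concrete form such a preservation check could take: decode a polar step sequence back into Cartesian coordinates and measure endpoint drift against the raw path. The (r, theta) step format here is an assumption for illustration, not the paper's published encoding.

```python
from math import cos, sin, hypot, pi

def decode(steps, start=(0.0, 0.0)):
    """Replay (r, theta) polar steps from a start point back into (x, y)."""
    x, y = start
    points = [start]
    for r, theta in steps:
        x, y = x + r * cos(theta), y + r * sin(theta)
        points.append((x, y))
    return points

# Raw path went 2 m east then 1 m north; the encoded steps should land there
raw_end = (2.0, 1.0)
steps = [(2.0, 0.0), (1.0, pi / 2)]
end = decode(steps)[-1]
endpoint_error = hypot(end[0] - raw_end[0], end[1] - raw_end[1])
```

Reporting this kind of drift across many stored trajectories would directly address the referee's information-preservation concern.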
Minor comments (1)
- [Abstract] The hierarchical chunking process is mentioned, but the similarity criteria, chunk size, and summarization method are unspecified, hindering reproducibility.
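To make the referee's point concrete, here is one plausible stand-in for the unspecified chunking scheme: greedily group trajectory feature vectors whose cosine similarity to a chunk's running centroid exceeds a threshold, with the centroid doubling as the chunk summary. The criterion and threshold are assumptions, not the paper's method.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den if den else 0.0

def chunk(features, threshold=0.95):
    """Greedily assign each feature to the first chunk whose centroid it
    matches above `threshold`; the centroid serves as the chunk summary."""
    chunks = []
    for f in features:
        for c in chunks:
            if cosine(f, c["centroid"]) >= threshold:
                c["members"].append(f)
                n = len(c["members"])
                c["centroid"] = [sum(col) / n for col in zip(*c["members"])]
                break
        else:
            chunks.append({"members": [f], "centroid": list(f)})
    return chunks

feats = [(1.0, 0.0), (0.98, 0.05), (0.0, 1.0)]  # two alike, one different
summaries = chunk(feats)
```

Specifying even this much (similarity measure, threshold, summary statistic) would answer the reproducibility concern.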
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract can be improved to better convey the empirical results and the rationale behind the topo-polar representation. We address each major comment below and will revise the abstract accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that TrajRAG 'improves zero-shot ObjectNav performance' is asserted without quantitative results, baselines, ablation details, success rates, or error analysis, rendering the empirical contribution unverifiable from the provided text.
Authors: We agree that the abstract should include key quantitative highlights to make the performance gains immediately verifiable. The full manuscript reports detailed results in Section 4, including success rates, SPL, and comparisons against baselines on MP3D and HM3D. In the revised abstract we will add a concise summary of these improvements (e.g., relative gains in success rate) while keeping the length appropriate. revision: yes
- Referee: [Abstract] Topo-polar representation: the claim that the encoding 'compactly encodes spatial layouts and semantic contexts, effectively removing redundancies' lacks any reconstruction-error or information-preservation metric, or an ablation over metric details (distances, obstacle densities, local geometry); this directly undermines the transfer assumption to novel MP3D/HM3D layouts.
Authors: The topo-polar representation is constructed to retain topological connectivity and polar spatial-semantic relations while discarding redundant observations; its design rationale and implementation are detailed in Section 3.1. We do not compute reconstruction error because the representation is not intended for full scene reconstruction but for experience retrieval in navigation. Its information preservation is instead validated empirically through high retrieval accuracy and the resulting navigation performance gains shown in our experiments and ablations (Section 4, Figures 3-4, Table 2). We will revise the abstract to briefly articulate this design choice and reference the supporting empirical evidence. revision: partial
Circularity Check
No circularity: engineering framework with no derivations or fitted predictions
Full rationale
The paper describes TrajRAG as an incremental retrieval-augmented framework built from proposed design choices (topo-polar trajectory encoding and hierarchical chunking) whose value is assessed solely through empirical experiments on MP3D and HM3D. No equations, closed-form derivations, parameter-fitting steps, or first-principles claims appear that could reduce to their own inputs by construction. Self-citations, if present, are not invoked as load-bearing uniqueness theorems or ansatzes that substitute for independent justification. The central performance claim therefore rests on external dataset results rather than any self-referential reduction.