pith. sign in

arxiv: 2605.16899 · v1 · pith:JKNNJPLSnew · submitted 2026-05-16 · 💻 cs.CV

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

Pith reviewed 2026-05-19 21:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords embodied AIcognitive mapspatio-temporal reasoningcontrastive learningvision-language navigationVLN-CEzero-shot generalization
0
0 comments X p. Extension
pith:JKNNJPLS Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{JKNNJPLS}

Prints a linked pith:JKNNJPLS badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A dual-memory architecture with contrastive spatio-temporal learning builds latent cognitive maps for embodied agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to determine whether embodied AI agents develop genuine internal models of spatial structure or simply mimic expert paths. Foundational methods in tasks like vision-language navigation and embodied question answering fail to provide signals for learning fine-grained spatial details such as topology and distance across long, broken experiences. LASAR introduces a dual-memory system that keeps both episodic memories and a semantic cognitive map, trained using Spatio-temporal Contextual Representation Learning to create contrastive pairs from spatio-temporal cues. This results in better zero-shot performance and consistent internal maps. Readers would care because it moves toward agents that truly understand space rather than following patterns.

Core claim

LASAR features a dual-memory system for episodic experiences and a semantic cognitive map, trained with the ST-CRL contrastive objective that uses spatio-temporal cues from annotated contexts in simulation to form the internal cognitive map from the agent's experiences.

What carries the argument

The dual-memory system and ST-CRL contrastive objective, which builds sample pairs from spatio-temporal cues to maintain a semantic cognitive map alongside episodic memory.

Load-bearing premise

That the contrastive objective using simulation-generated spatio-temporal queries will compel the agent to encode actual spatial structures instead of task-specific shortcuts.

What would settle it

Observing no improvement in generalization or low self-consistency in the cognitive map when the ST-CRL objective is ablated would indicate that the spatial encoding is not occurring as claimed.

Figures

Figures reproduced from arXiv: 2605.16899 by Jinzhou Tang, Keze Wang, Sidi Liu, Waikit Xiu, Weixing Chen.

Figure 1
Figure 1. Figure 1: We introduce LASAR, a framework that builds a dynamic Cognitive Map from instructions and observations to capture spatial [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of LASAR. Observations are processed by 2D and 3D Encoders and merged via Geometric-Semantic Fusion for better spatial understanding. These features populate a dual-memory system: (1) a high-fidelity Episodic Memory storing the temporal visual history, and (2) a structured Latent Cognitive Map. This map is generated by querying a learnable codebook (i.e., the Spatial Semantic At￾las) with episodic… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results from VLN-CE [23] and VSI-Bench [49]. We deploy LASAR in the simulation environment (i.e., Habitat) for long-horizon navigation task (i.e., VLN-CE) and Visual-spatial EQA task (i.e., VSI-Bench). We also provide qualitative results of our proposed MindCraft-Test with cognitive queries during navigation. MindCraft Performance Based on the core MindCraft metrics in [PITH_FULL_IMAGE:figures… view at source ↗
Figure 4
Figure 4. Figure 4: t-SNE visualization of the mt latent space, demon￾strating the structural impact of ST-CRL. We compare the Baseline (LASAR (IL+QA), Left Column) against our full model (LASAR (w. ST-CRL), Right Column). (Top-Left, Top-Right) Latent space colored by Query Type. (Bottom-Left, Bottom-Right) La￾tent space colored by Room Type. (Bottom Row) Visualization of the mt state evolution for a single trajectory from ou… view at source ↗
Figure 5
Figure 5. Figure 5: MindCraft Data Statistics. 6.2. MindCraft Dataset Generation For each expert trajectory, our pipeline programmatically gen￾erates a corresponding MindCraft dataset through a structured, three-stage process designed to ensure that all generated queries are contextually relevant, unambiguous, and grounded in the agent’s experiential history. First, during Trajectory Traversal and Logging, a virtual agent pre… view at source ↗
Figure 6
Figure 6. Figure 6: Navigation Failure Case. The model fails to execute the instruction “Walk into the living room and keep walking straight past the living room...” due to a misinterpretation of the global room layout. Navigation Instruction A: Right. GT: Left. A: Yes. GT: No. Q: According to the task instruction, from your current position, will you need to pass through the living room to reach your final destination? Q: Fr… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative analysis of reasoning failures. We il￾lustrate two representative failure modes on the MindCraft-Test set. Left (Prospective Failure): The agent correctly identifies the scene semantics but fails to distinguish the subtle topologi￾cal difference between “passing through” and “going around” an area. Right (Introspective Failure): The agent struggles with fine-grained geometric grounding, misclas… view at source ↗
read the original abstract

A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2\%-3.5\% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LASAR, a dual-memory architecture that maintains both episodic experiences and a semantic latent cognitive map, trained via Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective that constructs sample pairs from spatio-temporal cues generated through annotated context in simulation. It reports 2%-3.5% gains in zero-shot generalization on VLN-CE and VSI-Bench benchmarks together with high self-consistency of the proposed cognitive map.

Significance. If the central claim holds, the work supplies a targeted learning signal for encoding fine-grained spatial topology and metric relations over long-range fragmented trajectories, which is a recognized gap in action-centric and reasoning-centric embodied tasks. The dual-memory design and contrastive formulation are technically coherent and could generalize beyond the reported benchmarks.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the reported 2%-3.5% gains are presented without baselines, error bars, statistical tests, or ablation details, so it is impossible to determine whether the improvements are robust or attributable to the cognitive map rather than generic contrastive regularization or added memory capacity.
  2. [§3.2 and §4.3] §3.2 (ST-CRL objective) and §4.3 (self-consistency results): self-consistency is measured using the same ST-CRL objective that constructs the map, creating a circularity risk; this internal property can hold for systematically distorted maps and does not constitute an independent test of spatial fidelity.
  3. [§4] §4 (Evaluation): no direct measurement—such as map-derived distance queries, path lengths, or topology checks compared against simulator ground truth—is reported, leaving open the possibility that observed gains arise from task-specific mimicry rather than accurate long-range spatial encoding.
minor comments (2)
  1. [§3] Notation for the dual-memory components and the contrastive loss terms could be clarified with an explicit diagram or additional equations to improve readability.
  2. [§3.1] The description of how annotated spatio-temporal context is generated in simulation would benefit from a short pseudocode snippet or example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of result presentation and evaluation that we address below. We have revised the manuscript to incorporate additional experimental details, statistical analyses, and independent evaluations as suggested.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported 2%-3.5% gains are presented without baselines, error bars, statistical tests, or ablation details, so it is impossible to determine whether the improvements are robust or attributable to the cognitive map rather than generic contrastive regularization or added memory capacity.

    Authors: We agree that the original presentation of results required more supporting details to establish robustness. In the revised manuscript, Section 4 now includes additional baselines (including ablations removing the cognitive map or replacing ST-CRL with generic contrastive objectives), error bars computed over five random seeds, and statistical significance tests (paired t-tests with p-values). These additions clarify that the reported gains are attributable to the dual-memory architecture and spatio-temporal contrastive learning rather than generic regularization or capacity increases. revision: yes

  2. Referee: [§3.2 and §4.3] §3.2 (ST-CRL objective) and §4.3 (self-consistency results): self-consistency is measured using the same ST-CRL objective that constructs the map, creating a circularity risk; this internal property can hold for systematically distorted maps and does not constitute an independent test of spatial fidelity.

    Authors: The referee is correct that relying solely on the training objective for self-consistency introduces a circularity concern. We have revised §4.3 to include independent tests of spatial fidelity. These consist of direct comparisons between map-inferred distances, connectivity, and topology against simulator ground truth, which are not derived from the ST-CRL loss. The updated results show that the latent map maintains accurate metric and topological relations beyond what the contrastive objective alone would enforce. revision: yes

  3. Referee: [§4] §4 (Evaluation): no direct measurement—such as map-derived distance queries, path lengths, or topology checks compared against simulator ground truth—is reported, leaving open the possibility that observed gains arise from task-specific mimicry rather than accurate long-range spatial encoding.

    Authors: We acknowledge the value of direct measurements to rule out task-specific mimicry. The revised evaluation section now reports explicit map-derived distance queries, path-length accuracy, and topology verification against simulator ground truth on both VLN-CE and VSI-Bench. These metrics demonstrate that performance improvements correlate with faithful long-range spatial encoding in the cognitive map, providing evidence beyond downstream task success alone. revision: yes

Circularity Check

1 steps flagged

Cognitive map self-consistency is constructed and evaluated via the same ST-CRL objective

specific steps
  1. self definitional [Abstract]
    "ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. ... We also demonstrate that our proposed cognitive map has high self-consistency."

    The map is explicitly formed by the ST-CRL objective; high self-consistency is then claimed as a result. Because the objective directly trains for consistency on the same spatio-temporal query pairs, the reported self-consistency reduces to a property of the training signal rather than an independent test of accurate long-range spatial structure.

full rationale

The paper defines the cognitive map as being formed directly by the ST-CRL contrastive objective that uses annotated spatio-temporal cues to build sample pairs. Self-consistency is then reported as a positive demonstration. This creates a self-referential loop where the internal property optimized during training is presented as independent evidence of fine-grained spatial encoding (topology/distance). No external ground-truth comparison (e.g., map-derived distances vs. simulator metrics) is described in the provided text. Downstream benchmark gains supply some independent signal, preventing a higher circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a contrastive objective operating on simulated spatio-temporal queries will induce a reusable semantic cognitive map; no free parameters are explicitly named and the cognitive map itself is an invented modeling construct.

axioms (1)
  • domain assumption Contrastive learning on spatio-temporal cues extracted from simulation can force encoding of fine-grained spatial relationships such as topology and distance
    Invoked to justify why ST-CRL will succeed where prior action-centric and reasoning-centric methods fail
invented entities (1)
  • Latent Cognitive Map no independent evidence
    purpose: Semantic memory component that encodes spatial structure independently of specific episodic trajectories
    Core architectural addition whose existence and utility are asserted by the method but lack external falsifiable validation in the abstract

pith-pipeline@v0.9.0 · 5733 in / 1305 out tokens · 49722 ms · 2026-05-19T21:31:15.424475+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 12 internal anchors

  1. [1]

    Self-supervised object detection from egocentric videos

    Peri Akiva, Jing Huang, Kevin J Liang, Rama Kovvuri, Xingyu Chen, Matt Feiszli, Kristin Dana, and Tal Hassner. Self-supervised object detection from egocentric videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5225–5237, 2023. 2

  2. [2]

    Vision- and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Joel Bruce, Mark Johnson, Stephen Gould, and Anton van den Hengel. Vision- and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR 2018), Spotlight Oral, 2018. 1

  3. [3]

    Semantic world models.arXiv preprint arXiv:2510.19818, 2025

    Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, and Abhishek Gupta. Semantic world models.arXiv preprint arXiv:2510.19818, 2025. 2

  4. [4]

    Clark, and Aaron Wilber

    Edgar Bermudez-Contreras, Benjamin J. Clark, and Aaron Wilber. The neuroscience of spatial navigation and the rela- tionship to artificial intelligence.Frontiers in Computational Neuroscience, V olume 14 - 2020, 2020. 1

  5. [5]

    Rt-2: Vision-language-action mod- els transfer web knowledge to robotic control, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakr- ishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnik...

  6. [6]

    Neural topological SLAM for visual navigation.CoRR, abs/2005.12256, 2020

    Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation.CoRR, abs/2005.12256, 2020. 2

  7. [7]

    Robot navigation with map-based deep reinforcement learning

    Guangda Chen, Lifan Pan, Yu’an Chen, Pei Xu, Zhiqiang Wang, Peichen Wu, Jianmin Ji, and Xiaoping Chen. Robot navigation with map-based deep reinforcement learning. In Proceedings of the IEEE International Conference on Net- working, Sensing and Control (ICNSC), pages 1–6, 2020. 2

  8. [8]

    Longvila: Scaling long-context vi- sual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context vi- sual language models for long videos. InThe Thirteenth In- ternational Conference on Learning Representations. 6

  9. [9]

    Navila: Legged robot vision-language-action model for navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiao- long Wang. Navila: Legged robot vision-language-action model for navigation. InRSS, 2025. 6

  10. [10]

    An explainable artificial intelligence approach to spatial navigation based on hippocampal circuitry.Neural Networks, 163:97–107, 2023

    Simone Coppolino and Michele Migliore. An explainable artificial intelligence approach to spatial navigation based on hippocampal circuitry.Neural Networks, 163:97–107, 2023. 1

  11. [11]

    Embodied Question Answering

    Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answer- ing.CoRR, abs/1711.11543, 2017. 1, 3

  12. [12]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: an embodie...

  13. [13]

    Epstein, Eva Zita Patai, Joshua B

    Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: spatial nav- igation and beyond.Nature Neuroscience, 20:1504–1513,

  14. [14]

    World Models

    David Ha and J ¨urgen Schmidhuber. World models.CoRR, abs/1803.10122, 2018. 2

  15. [15]

    Vln-bert: A recurrent vision-and-language bert for navigation

    Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln-bert: A recurrent vision-and-language bert for navigation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 2

  16. [16]

    Ponder: Point cloud pre-training via neural rendering

    Di Huang, Sida Peng, Tong He, Honghui Yang, Xiaowei Zhou, and Wanli Ouyang. Ponder: Point cloud pre-training via neural rendering. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 16089– 16098, 2023. 1

  17. [17]

    Spatio-temporal self-supervised representation learning for 3d point clouds

    Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 6535–6545,

  18. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6

  19. [19]

    Mnemosyne: Learning to train transformers with transformers.Advances in Neural Information Process- ing Systems, 36:77331–77358, 2023

    Deepali Jain, Krzysztof M Choromanski, Kumar Avinava Dubey, Sumeet Singh, Vikas Sindhwani, Tingnan Zhang, and Jie Tan. Mnemosyne: Learning to train transformers with transformers.Advances in Neural Information Process- ing Systems, 36:77331–77358, 2023. 2

  20. [20]

    Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

    Vihan Jain, Gabriel Magalh ˜aes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. Stay on the path: Instruction fidelity in vision-and-language navigation. CoRR, abs/1905.12255, 2019. 1

  21. [21]

    Neds-slam: A neural explicit dense semantic slam framework using 3d gaussian splatting.IEEE Robotics and Automation Letters, 2024

    Yiming Ji, Yang Liu, Guanghu Xie, Boyu Ma, Zongwu Xie, and Hong Liu. Neds-slam: A neural explicit dense semantic slam framework using 3d gaussian splatting.IEEE Robotics and Automation Letters, 2024. 2

  22. [22]

    Sukhatme

    Zhiwei Jia, Kaixiang Lin, Yizhou Zhao, Qiaozi Gao, Govind Thattai, and Gaurav S. Sukhatme. Learning to act with affordance-aware multimodal neural SLAM, 2022. 2

  23. [23]

    Beyond the nav-graph: Vision and language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. InEuropean Con- ference on Computer Vision (ECCV), 2020. 5, 7

  24. [24]

    Context- aware deep reinforcement learning for autonomous robotic navigation in unknown area

    Jingsong Liang, Zhichen Wang, Yuhong Cao, Jimmy Chiun, Mengqi Zhang, and Guillaume Adrien Sartoretti. Context- aware deep reinforcement learning for autonomous robotic navigation in unknown area. InProceedings of The 7th Con- ference on Robot Learning, pages 1425–1436. PMLR, 2023. 2

  25. [25]

    Navcot: Boosting llm-based vision-and- language navigation via learning disentangled reasoning

    Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. Navcot: Boosting llm-based vision-and- language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 2025. 6

  26. [26]

    V olumetric environ- ment representation for vision-language navigation

    Rui Liu, Wenguan Wang, and Yi Yang. V olumetric environ- ment representation for vision-language navigation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16328, 2024. 2

  27. [27]

    Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

    Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self- monitoring navigation agent via auxiliary progress estima- tion.CoRR, abs/1901.03035, 2019. 1

  28. [28]

    Openeqa: Embodied question answering in the era of foun- dation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foun- dation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488– 16498, 2024. 2

  29. [29]

    Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023

    Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023. 2

  30. [30]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...

  31. [31]

    Self-supervised joint encoding of motion and appearance for first person action recognition

    Mirco Planamente, Andrea Bottino, and Barbara Caputo. Self-supervised joint encoding of motion and appearance for first person action recognition. In2020 25th International Conference on Pattern Recognition (ICPR), pages 8751–

  32. [32]

    Self-regulated learning for egocentric video activity anticipation.IEEE transactions on pattern analysis and machine intelligence, 45(6):6715–6730, 2021

    Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. Self-regulated learning for egocentric video activity anticipation.IEEE transactions on pattern analysis and machine intelligence, 45(6):6715–6730, 2021. 2

  33. [33]

    Zhang, Bradley Emi, Amir Za- mir, Silvio Savarese, Leonidas J

    Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Za- mir, Silvio Savarese, Leonidas J. Guibas, and Jitendra Malik. Learning to navigate using mid-level visual priors.CoRR, abs/1912.11121, 2019. 1

  34. [34]

    High-fidelity slam using gaussian splat- ting with rendering-guided densification and regularized op- timization

    Shuo Sun, Malcolm Mielle, Achim J Lilienthal, and Mar- tin Magnusson. High-fidelity slam using gaussian splat- ting with rendering-guided densification and regularized op- timization. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10476–10482. IEEE, 2024. 2

  35. [35]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 6

  36. [36]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 3, 6

  37. [37]

    Neural Discrete Representation Learning

    A ¨aron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. 4

  38. [38]

    Representation Learning with Contrastive Predictive Coding

    A ¨aron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.CoRR, abs/1807.03748, 2018. 5

  39. [39]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017. 4

  40. [40]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 4, 6, 2

  41. [41]

    Causality-aware transformer networks for robotic naviga- tion

    Ruoyu Wang, Yao Liu, Yuanjiang Cao, and Lina Yao. Causality-aware transformer networks for robotic naviga- tion. InInternational Conference on Neural Information Processing, pages 403–418. Springer, 2025. 2

  42. [42]

    Research on autonomous robots navigation based on reinforcement learning

    Zixiang Wang, Hao Yan, Yining Wang, Zhengjia Xu, Zhuoyue Wang, and Zhizhong Wu. Research on autonomous robots navigation based on reinforcement learning. In2024 3rd International Conference on Robotics, Artificial Intelli- gence and Intelligent Control (RAIIC), pages 78–81. IEEE,

  43. [43]

    REMI: Reconstruct- ing episodic memory during internally driven path planning

    Zhaoze Wang, Genela Morris, Dori Derdikman, Pratik Chaudhari, and Vijay Balasubramanian. REMI: Reconstruct- ing episodic memory during internally driven path planning. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. 2

  44. [44]

    Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 1

  45. [45]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025. 2

  46. [46]

    Language models meet world models: Embodied experiences enhance lan- guage models.Advances in neural information processing systems, 36:75392–75412, 2023

    Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. Language models meet world models: Embodied experiences enhance lan- guage models.Advances in neural information processing systems, 36:75392–75412, 2023. 2

  47. [47]

    Qwen2 technical report.CoRR, 2024

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.CoRR, 2024. 6

  48. [48]

    Unipad: A uni- versal pre-training paradigm for autonomous driving.arXiv preprint arXiv:2310.08370, 2023

    Honghui Yang, Sha Zhang, Di Huang, Xiaoyang Wu, Haoyi Zhu, Tong He, Shixiang Tang, Hengshuang Zhao, Qibo Qiu, Binbin Lin, Xiaofei He, and Wanli Ouyang. Unipad: A uni- versal pre-training paradigm for autonomous driving.arXiv preprint arXiv:2310.08370, 2023. 1

  49. [49]

    Thinking in Space: How Multi- modal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multi- modal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024. 6, 7

  50. [50]

    Thinking in space: How mul- timodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1

  51. [51]

    Seqafford: Sequential 3d affordance reasoning via multimodal large language model

    Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, and Jingya Wang. Seqafford: Sequential 3d affordance reasoning via multimodal large language model. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 1691–1701, 2025. 2

  52. [52]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024. 6

  53. [53]

    Navid: Video-based vlm plans the next step for vision-and-language navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. InRobotics: Science and Systems, 2024. 2

  54. [54]

    NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.arXiv preprint arXiv:2402.15852, 2024. 6

  55. [55]

    Multi- scale video anomaly detection by multi-grained spatio- temporal representation learning

    Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi- scale video anomaly detection by multi-grained spatio- temporal representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17385–17394, 2024. 2

  56. [56]

    Generative causal interpretation model for spatio- temporal representation learning

    Yu Zhao, Pan Deng, Junting Liu, Xiaofeng Jia, and Jianwei Zhang. Generative causal interpretation model for spatio- temporal representation learning. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3537–3548, 2023. 2

  57. [57]

    Towards learning a generalist model for embod- ied navigation

    Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Li- wei Wang. Towards learning a generalist model for embod- ied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13624– 13634, 2024. 2

  58. [58]

    Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models. InEuropean Con- ference on Computer Vision, pages 260–278. Springer, 2024. 2

  59. [59]

    Navgpt: explicit reasoning in vision-and-language navigation with large lan- guage models

    Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: explicit reasoning in vision-and-language navigation with large lan- guage models. AAAI Press, 2024. 1

  60. [60]

    Navgpt: Explicit reasoning in vision-and-language navigation with large lan- guage models

    Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large lan- guage models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 7641–7649, 2024. 2

  61. [61]

    Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm.arXiv preprint arXiv:2310.08586, 2023

    Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong He, Tong He, Hengshuang Zhao, Chun- hua Shen, Yu Qiao, and Wanli Ouyang. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm.arXiv preprint arXiv:2310.08586, 2023. 1

  62. [62]

    3d-spatial mul- timodal memory

    Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, and Xiaolong Wang. 3d-spatial mul- timodal memory. InThe Thirteenth International Conference on Learning Representations, 2025. 1 LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map Supplementary Material

  63. [63]

    Path Memory Log,

    Model and Training Details 6.1. Train Data To accomplish the dual tasks of visual navigation and spatial rea- soning, our training dataset is constructed directly upon theVLN- CE (Vision-and-Language Navigation in Continuous Environ- ments) training set. Our core training curriculum is theMindCraft-Train dataset. As detailed in the main paper (Section 3.3...

  64. [64]

    Each encoder is designed to process a specific modality (vision, geometry, trajectory), transforming raw sensory input into rich feature representations

    Model Architecture Details Our proposed framework, LASAR, is constructed around a cen- tral Vision-Language Model (VLM) and is augmented by a series of specialized encoders. Each encoder is designed to process a specific modality (vision, geometry, trajectory), transforming raw sensory input into rich feature representations. These features are then proje...

  65. [65]

    Equivalent Probe Set

    Evaluation Protocol Details 8.1. Overall Reasoning Accuracy (QA-Acc) This metric measures the overall accuracy of the agent’s answers to all cognitive queries in the test set. LetQbe the set of all queries in the test set. For any queryq∈ Q, letA q be the agent’s answer and Gq be the ground-truth answer. LetI(·)be the indicator function. The Overall Reaso...

  66. [66]

    Is the sofa to the left of the lamp?

    Template & Relational Equivalence:The query’s phrasing or logical frame of reference is different, but the semantics and ground-truth answer are identical.(e.g., “Is the sofa to the left of the lamp?” vs. “Is the lamp to the right of the sofa?” GT is “Yes” for both)

  67. [67]

    Temporal Misalignment:Queries within the same trajectory that are based on temporally shifted clips (e.g., offset by 1-2 steps), but the core fact and GT answer remain unchanged

  68. [68]

    What color is the vase you see now?

    Cross-Query-Type Equivalence:Targeting the same fact us- ing queries of different cognitive types (e.g., introspective, ret- rospective).(e.g., Anintrospectivequery att= 10: “What color is the vase you see now?” (GT: “Red”) vs. aretrospec- tivequery att= 25: “What color was the vase you passed earlier?” (GT: “Red”))

  69. [69]

    k choose 2

    Cross-Trajectory Metadata:Queries from different trajecto- ries that are determined (via metadata like environment/object ID) to be asking about the same fact with the same GT answer. We partition all queries intoMsuch Equivalent Probe Sets S={S 1, S2, ..., SM }. We only consider sets wherek i ≥2. 1.Calculate Total Consistent Pairs (C total):We sum the nu...

  70. [70]

    All reported scores are the average of three evaluation runs with different random seeds to ensure statistical stability

    for generating all responses to ensure efficiency and determinis- tic outputs. All reported scores are the average of three evaluation runs with different random seeds to ensure statistical stability

  71. [71]

    Walk into the living room and keep walking straight past the living room

    Additional Experimental Results 9.1. Baseline details Specialized Navigation ModelsFor all specialized naviga- tion baselines on VLN-CE, includingNavid,NaviLLM, and NaVILA, we adhered to the following protocol to ensure a fair and reproducible comparison: •Source:We utilized the official codebases and pre-trained model weights released by the respective a...

  72. [72]

    Query Type

    Feature Visualization To qualitatively validate the structural impact of our ST-CRL ob- jective, we conducted a comprehensive feature space analysis as presented in Figure 4 of the main paper. This visualization com- prises two distinct components: a static manifold structure com- parison (Top and Middle panels) and a dynamic trajectory evolu- tion analys...

  73. [73]

    LASAR-Lite

    Limitations and Future Work While our proposed LASAR framework demonstrates significant advancements in equipping embodied agents with spatio-temporal intelligence, we acknowledge several limitations that pave the way for compelling future research. First, our model’s multi-encoder architecture, while power- ful, introduces a notable computational overhea...