MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Jing Zhang; Lingfeng Zhang; Pengwei Wang; Qiang Zhang; Qinwen Xu; Renjing Xu; Shanghang Zhang; Xiaoshuai Hao; Xinyao Zhang; Zhongyuan Wang

arxiv: 2502.13451 · v5 · submitted 2025-02-19 · 💻 cs.RO

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu

Pith reviewed 2026-05-23 02:56 UTC · model grok-4.3

keywords: vision-and-language navigation, semantic maps, memory representation, vision-language models, embodied AI, annotated maps, navigation agents, top-down maps

The pith

MapNav replaces historical observation frames with an Annotated Semantic Map to guide vision-and-language navigation agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that traditional VLN methods incur high storage and compute costs by retaining sequences of past visual frames as context. MapNav instead builds a single top-down semantic map at episode start, updates it each step with new observations, and adds explicit text labels to important regions to create an Annotated Semantic Map. This ASM becomes the sole memory input to a VLM-based agent. Experiments show the approach reaches state-of-the-art success rates in both simulation and real-world settings while eliminating the need to store frame histories. The authors position the ASM as a reusable new memory representation for the VLN task.

Core claim

MapNav constructs a top-down semantic map at the beginning of each episode and updates it at every timestep; key regions receive explicit textual labels that convert abstract semantics into navigation cues, producing the Annotated Semantic Map; the resulting ASM is supplied directly to a VLM-powered agent as its only memory representation, replacing all historical observation frames.
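The per-timestep map update can be pictured with a small sketch. The Figure 3 caption on this page describes aligning a semantic mask with a depth-derived point cloud and projecting it onto a 2D plane using the agent pose; the code below is only an illustration of that idea under stated assumptions, not the authors' implementation. The function name `update_semantic_map`, the pinhole parameters `fx` and `cx`, the square grid centred on the start position, and the omission of any height filtering are all hypothetical simplifications.

```python
import math


def update_semantic_map(grid, pixels, pose, fx, cx, cell_size):
    """Write labelled depth pixels into a square top-down grid, in place.

    grid      -- list of lists of int class ids (0 = unknown), origin at the centre
    pixels    -- iterable of (u, depth_m, class_id): image column, depth, label
    pose      -- (x, y, yaw): agent position on the floor plane, heading in radians
    fx, cx    -- assumed pinhole focal length and principal point (pixels)
    cell_size -- metres per grid cell
    """
    x, y, yaw = pose
    half = len(grid) // 2
    for u, depth, class_id in pixels:
        if depth <= 0:
            continue  # skip invalid depth readings
        # Metres to the right of the optical axis for this image column.
        lateral = (u - cx) * depth / fx
        # Rotate the camera-frame offset (forward, right) into the world frame.
        wx = x + depth * math.cos(yaw) + lateral * math.sin(yaw)
        wy = y + depth * math.sin(yaw) - lateral * math.cos(yaw)
        col = half + math.floor(wx / cell_size)
        row = half + math.floor(wy / cell_size)
        # Points that fall outside the grid are dropped.
        if 0 <= row < len(grid) and 0 <= col < len(grid):
            grid[row][col] = class_id
    return grid
```

Calling this once per timestep on the same `grid` keeps a single map as the running memory, which is the property the core claim relies on. A real pipeline would also need to filter floor and ceiling points and handle conflicting labels in one cell.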

What carries the argument

Annotated Semantic Map (ASM): a top-down semantic map that is initialized once per episode, updated each timestep, and augmented with textual labels on key regions to supply structured navigation cues to the agent.
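A passage quoted later on this page describes the annotation step as semantic region identification via connected component analysis, followed by centroid computation and explicit textual annotations. A minimal sketch of that sequence, assuming a grid of integer class ids and 4-connectivity, might look as follows. The function `annotate_regions`, the `class_names` mapping, and the `min_cells` threshold are hypothetical names introduced for illustration; the paper's actual connectivity rule and label placement are not given here.

```python
from collections import deque


def annotate_regions(grid, class_names, min_cells=1):
    """Return [(label_text, (row_centroid, col_centroid)), ...] for each region.

    A region is a 4-connected set of cells sharing one non-zero class id.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    annotations = []
    for r in range(rows):
        for c in range(cols):
            cls = grid[r][c]
            if cls == 0 or seen[r][c]:
                continue
            # Breadth-first search collects every cell of this region.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] == cls):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(cells) >= min_cells:
                # The centroid is where a text label would be drawn on the map.
                cen_r = sum(p[0] for p in cells) / len(cells)
                cen_c = sum(p[1] for p in cells) / len(cells)
                label = class_names.get(cls, f"class_{cls}")
                annotations.append((label, (cen_r, cen_c)))
    return annotations
```

Raising `min_cells` would suppress labels on tiny, possibly spurious regions; whether the authors apply such a filter is not stated on this page.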

If this is right

Storage and compute overhead from maintaining observation histories is eliminated.
The same ASM construction process yields state-of-the-art navigation success in both simulated and physical environments.
VLMs can be applied directly to the compact annotated map rather than to raw image sequences.
The released ASM generation code and dataset enable other researchers to adopt the representation without re-implementing map construction.
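The storage argument in the first point can be made concrete with back-of-the-envelope arithmetic. The sketch below compares raw pixel storage only, which is an assumption: real frame-history agents usually keep encoded features or tokens rather than raw images, and the page gives no resolutions. Both function names and all parameters are hypothetical.

```python
def frame_history_bytes(steps, height, width, channels=3):
    """Raw bytes needed to keep one uint8 frame per timestep."""
    return steps * height * width * channels


def single_map_bytes(map_height, map_width, channels=3):
    """Raw bytes for one uint8 map image that is overwritten each timestep."""
    return map_height * map_width * channels


def history_to_map_ratio(steps, frame_hw, map_hw):
    """How many times larger the frame history is than the single map."""
    history = frame_history_bytes(steps, frame_hw[0], frame_hw[1])
    return history / single_map_bytes(map_hw[0], map_hw[1])
```

The point of the comparison is the scaling: the frame history grows linearly with episode length, while the map's footprint stays constant regardless of how many steps have elapsed.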

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the ASM proves robust across longer trajectories, similar map-based memory could reduce context length requirements in other embodied tasks such as object manipulation or multi-agent coordination.
Explicit text labels on the map may allow human operators to inspect or correct the agent's internal state more easily than inspecting raw image histories.
The approach implicitly assumes reliable semantic segmentation and mapping; any degradation in those upstream modules would directly limit the ASM's usefulness.

Load-bearing premise

The combination of a top-down semantic map and its textual annotations contains enough structured information to substitute for stored historical frames without losing decision-critical detail.

What would settle it

A controlled ablation in which the textual annotations are removed from the ASM while keeping the geometric map intact, followed by measurement of whether success rate or path efficiency falls in the same environments where the full ASM previously achieved SOTA.
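One way to score such an ablation is sketched below, assuming each episode is logged as a `(success, shortest_path_length, path_taken_length)` tuple. Success rate is taken directly from the text above; for path efficiency the sketch uses SPL (success weighted by path length), a common VLN metric that this page does not name, so treating it as the efficiency measure is an assumption. The function names and the tuple layout are hypothetical.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode earns credit in (0, 1], lower for detours.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


def ablation_delta(full_asm, map_only):
    """Metric drop when annotations are removed (positive = full ASM is better)."""
    return {
        "delta_sr": success_rate(full_asm) - success_rate(map_only),
        "delta_spl": spl(full_asm) - spl(map_only),
    }
```

For the comparison to be meaningful, both episode lists should come from the same environments and instructions, with only the textual annotations toggled.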

Figures

Figures reproduced from arXiv: 2502.13451 by Jing Zhang, Lingfeng Zhang, Pengwei Wang, Qiang Zhang, Qinwen Xu, Renjing Xu, Shanghang Zhang, Xiaoshuai Hao, Xinyao Zhang, Zhongyuan Wang.

**Figure 1.** Illustration of our Annotated Semantic Map (ASM). At each timestep, MapNav agent leverages egocentric observations to capture semantic objects and assign explicit textual labels to key regions, creating the ASM for the current moment. ASM provides information such as physical obstacles, explored regions, the agent’s current position, trajectory and semantic objects.

**Figure 2.** An overview of MapNav framework. We present a top-down Annotated Semantic Map (ASM), updated at each timestep for precise object mapping and structured navigation. It features explicit textual labels for key regions, providing clear navigation cues. The current RGB observation, ASM, and instruction are used as inputs to an end-to-end VLM framework, which generates navigation actions in natural language.

**Figure 3.** ASM Generation Process. Semantic map generation starts with episode initialization. At each timestep, the RGB image is processed by a semantic segmentation module to create a semantic mask aligned with the depth-converted 3D point cloud. By combining this with the previous pose transformation, we project the 3D point cloud onto a 2D plane to update the semantic map. Finally, we convert the semantic map int…

**Figure 4.** Comparison of different VLM’s understanding…

**Figure 5.** Comparison of MapNav using different num…

**Figure 6.** The real-world MapNav robot setup. … to 32,768 tokens and incorporates sliding window attention with a window size of 131,072 tokens. Training Setting. We conducted our training on 8 NVIDIA A100 GPUs for approximately 30 hours, totaling 240 GPU hours (≈500k step-wise samples). During the fine-tuning process, we froze the vision encoder and only fine-tuned the multimodal projector and language model compone…

**Figure 7.** Visualization results of MapNav in the simulator. Panels show the third-person perspective, the egocentric view, and the ASMs at timesteps 0, 17, and 23 for a simple instruction (“Walk forward, turn right and go straight, stop at the door.”) and a semantic instruction (“Walk forward, turn right at the refrigerator, go straight, stop at the wall.”).

**Figure 8.** Visualization results of MapNav in the real-world. … successfully identifies the shortest path while following complex instructions involving multiple waypoints. In contrast, without ASM, the agent struggles to find the correct path, underscoring ASM’s importance in semantic understanding and path planning. In real-world tests, the agent effectively executes simple navigation instructions and excels at com…

**Figure 9.** Visualization of VLM Attention Across Different Map Representations. A comparison of different map representations showing that while Semantic Map exhibits sparse attention patterns without convergence on semantic objects, our ASM successfully leverages textual labels to guide attention towards semantic objects, as evidenced by concentrated attention distributions and the VLM’s responses.

**Figure 10.** Additional Visualizations of VLM Attention Across Different Map Representations.

**Figure 11.** (1/6) Simulator demo results visualization.

**Figure 12.** (2/6) Simulator demo results visualization.

**Figure 13.** (3/6) Simulator demo results visualization.

**Figure 14.** (4/6) Simulator demo results visualization.

**Figure 15.** (5/6) Simulator demo results visualization.

**Figure 16.** (6/6) Simulator demo results visualization.

**Figure 17.** (1/2) Real-world demo results visualization.

**Figure 18.** (2/2) Real-world demo results visualization.

Original abstract

Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and update it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generate our ASM. MapNav agent using the constructed ASM as input, and use the powerful end-to-end capabilities of VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MapNav swaps historical frames for an updated top-down semantic map with text labels fed to a VLM, which is a practical memory tweak but rests on an untested claim that the map loses no critical navigation detail.

The paper's main move is to replace the usual history of egocentric RGB frames with a single Annotated Semantic Map that starts as a top-down semantic layout, gets updated each step, and receives explicit textual labels on key regions before going into a VLM agent. This targets the storage and compute cost of long VLN episodes, which is a real constraint on robots. The concrete engineering choice—dynamic map plus overlaid text for direct language-model input—does not reduce cleanly to the memory mechanisms cited in the abstract, so that part feels like a distinct option worth trying. Releasing the map-generation code and dataset is also useful for anyone who wants to reproduce or extend the representation. The approach is presented plainly as an empirical construction rather than a fitted model, which keeps the circularity burden low. The soft spot is exactly the one in the stress-test note: a top-down semantic map drops viewpoint-specific appearance, texture, partial occlusions, and metric depth that many instructions reference, and nothing in the abstract or claim structure shows an ablation that holds the VLM backbone and training fixed while swapping the memory type. Without those controls or the actual metrics, baselines, and error analysis, the SOTA assertion in both sim and real settings cannot be assessed. This is for people working on memory-efficient VLN or resource-constrained embodied agents. A reader who wants to test lighter representations could extract value from the method description even if the performance numbers need verification. It deserves a serious referee because the idea is testable and the reproducibility step is already planned, though the experiments will need to address the substitution question directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces MapNav, an end-to-end vision-and-language navigation (VLN) model that constructs a top-down semantic map at episode start, updates it each timestep, and adds explicit textual labels to key regions. The resulting Annotated Semantic Map (ASM) is fed to a VLM agent in place of historical observation frames. The authors claim this yields SOTA performance in both simulated and real-world settings while reducing storage and compute overhead.

Significance. If the central performance claims hold after proper controls, the work would supply a concrete alternative memory representation for VLN that trades egocentric history for an explicitly annotated top-down semantic map, potentially lowering the cost of maintaining long-horizon context and offering a reusable resource via the promised code and dataset release.

major comments (2)

[Abstract / method description] Abstract and method overview: the central claim that the ASM fully substitutes for historical egocentric frames without loss of decision-critical detail (viewpoint-dependent appearance, texture, partial occlusions, metric depth referenced by instructions) is not isolated by any described ablation that holds the VLM backbone, training regime, and map-construction oracle fixed while toggling the presence of past RGB observations.
[Abstract] Abstract: the assertion of SOTA results in simulated and real-world environments is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis, preventing evaluation of whether gains arise from the ASM substitution itself.

minor comments (1)

[Abstract] Abstract contains several grammatical issues (e.g., 'update it at each timestep' should be 'updates'; the clause 'transforming abstract semantics into clear navigation cues and generate our ASM' is incomplete).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on MapNav. The comments highlight opportunities to strengthen the presentation of our core claims regarding the Annotated Semantic Map (ASM) as a memory representation. We address each point below and will revise the manuscript to improve clarity and provide additional supporting evidence.

Referee: [Abstract / method description] Abstract and method overview: the central claim that the ASM fully substitutes for historical egocentric frames without loss of decision-critical detail (viewpoint-dependent appearance, texture, partial occlusions, metric depth referenced by instructions) is not isolated by any described ablation that holds the VLM backbone, training regime, and map-construction oracle fixed while toggling the presence of past RGB observations.

Authors: We agree that an explicit ablation isolating the ASM substitution—while holding the VLM backbone, training regime, and map-construction process fixed—is necessary to rigorously support the claim. The current manuscript focuses on end-to-end performance comparisons but does not include this controlled toggle of historical RGB frames. In the revised version, we will add such an ablation study, reporting navigation success rates and other metrics with and without past RGB observations under otherwise identical conditions. This will directly address whether decision-critical details are preserved by the ASM alone.

revision: yes

Referee: [Abstract] Abstract: the assertion of SOTA results in simulated and real-world environments is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis, preventing evaluation of whether gains arise from the ASM substitution itself.

Authors: The abstract was written concisely and therefore omits specific numbers. The full manuscript contains quantitative results, baseline comparisons, and error analyses in the experiments section. To improve accessibility, we will revise the abstract to include key metrics (e.g., success rate improvements over baselines in simulation and real-world settings) while maintaining brevity. We will also ensure the abstract references the relevant tables for full details.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical construction with no derivations or self-referential reductions.

The paper introduces MapNav as an empirical method for VLN that constructs and annotates a top-down semantic map to replace historical frames, then evaluates it experimentally. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central claim rests on experimental SOTA results rather than any step that reduces by construction to its inputs. No self-citations are invoked as load-bearing uniqueness theorems. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces ASM as a new construct; no explicit free parameters, mathematical axioms, or invented physical entities are described in the abstract. Standard VLN assumptions (e.g., availability of depth or semantic segmentation) are implicit but not enumerated.

invented entities (1)

Annotated Semantic Map (ASM): no independent evidence
Purpose: compact memory representation that replaces historical observation frames for VLN decision making.
Introduced in the abstract as the core novel component; no independent falsifiable prediction outside the paper is stated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (relation: unclear)
Paper passage: "We propose a novel end-to-end VLM-based VLN model, MapNav, which leverages Annotated Semantic Maps for innovative memory representation, effectively replacing traditional historical frames."

IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation: unclear)
Paper passage: "The ASM generation pipeline involves two main stages: (1) semantic region identification via connected component analysis... (2) centroid computation... explicit textual annotations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 7.0

Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
cs.RO 2026-03 conditional novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
cs.CV 2026-05 unverdicted novelty 6.0

GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1]

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. 2024. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence

work page 2024
[2]

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3d: Learning from rgb-d data in indoor environments. In International Conference on 3D Vision, pages 667--676

work page 2017
[3]

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, and Kwan-Yee K Wong. 2024. Affordances-oriented planning using foundation models for continuous vision-language navigation. arXiv preprint arXiv:2407.05890

work page arXiv 2024
[4]

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas Li, Mingkui Tan, and Chuang Gan. 2022. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Advances in Neural Information Processing Systems, pages 38149--38161

work page 2022
[5]

Schwing, Alexander Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page 2022
[6]

Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. 2022. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 7606--7623

work page 2022
[7]

Peng Hao, Chaofan Zhang, Dingzhe Li, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. 2025 a . Tla: Tactile-language-action model for contact-rich manipulation. arXiv preprint arXiv:2503.08548

work page arXiv 2025
[8]

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. 2020. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137--13146

work page 2020
[9]

Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yifan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, and Yu Liu. 2025 b . Mapfusion: A novel bev feature fusion network for multi-modal map construction. Information Fusion, 119:103018

work page 2025
[10]

Xiaoshuai Hao, Ruikai Li, Hui Zhang, Dingzhe Li, Rong Yin, Sangil Jung, Seung-In Park, ByungIn Yoo, Haimei Zhao, and Jing Zhang. 2024 a . Mapdistill: Boosting efficient camera-based hd map construction via camera-lidar fusion model distillation. In European Conference on Computer Vision, pages 166--183. Springer

work page 2024
[11]

Xiaoshuai Hao, Guanqun Liu, Yuting Zhao, Yuheng Ji, Mengchuan Wei, Haimei Zhao, Lingdong Kong, Rong Yin, and Yu Liu. 2025 c . Msc-bench: Benchmarking and analyzing multi-sensor corruption for driving perception. arXiv preprint arXiv:2501.01037

work page arXiv 2025
[12]

Xiaoshuai Hao, Hui Zhang, Yifan Yang, Yi Zhou, Sangil Jung, Seung-In Park, and ByungIn Yoo. 2024 b . Mbfusion: A new multi-modal bev feature fusion method for hd map construction. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15922--15928. IEEE

work page 2024
[13]

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. 2022. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15439--15449

work page 2022
[14]

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. 2023. Learning navigational visual representations with semantic map supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055--3067

work page 2023
[15]

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. 2025. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. arXiv preprint arXiv:2502.21257

work page arXiv 2025
[16]

Glenn Jocher, Jing Qiu, and Ayush Chaurasia. 2023. https://github.com/ultralytics/ultralytics Ultralytics YOLO

work page 2023
[17]

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. 2021. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162--15171

work page 2021
[18]

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104--120

work page 2020
[19]

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. 2020 a . Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4392--4412

work page 2020
[20]

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. 2020 b . Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4392--4412

work page 2020
[21]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024 a . Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Dingzhe Li, Yixiang Jin, Yuhao Sun, Hongze Yu, Jun Shi, Xiaoshuai Hao, Peng Hao, Huaping Liu, Fuchun Sun, Jianwei Zhang, et al. 2024 b . What foundation models can bring for robot learning in manipulation: A survey. arXiv preprint arXiv:2404.18201

work page arXiv 2024
[23]

Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. 2024. Navcot: Boosting llm-based vision-and-language navigation via learning disentangled reasoning. arXiv preprint arXiv:2403.07376

work page arXiv 2024
[24]

Rui Liu, Wenguan Wang, and Yi Yang. 2024. Volumetric environment representation for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16317--16328

work page 2024
[25]

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. 2024 a . Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882

work page arXiv 2024
[26]

Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. 2024 b . Discuss before moving: Visual language navigation via multi-expert discussions. In IEEE International Conference on Robotics and Automation, pages 17380--17387

work page 2024
[27]

Sang-Min Park and Young-Gab Kim. 2023. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, pages 365--427

work page 2023
[28]

Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. 2020. Object-and-action aware model for visual language navigation. In European Conference on Computer Vision, pages 303--317

work page 2020
[29]

St \'e phane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the international conference on artificial intelligence and statistics, pages 627--635

work page 2011
[30]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. 2019. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339--9347

work page 2019
[31]

Dhruv Shah, B a \.z ej Osi \'n ski, Sergey Levine, et al. 2023. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pages 492--504

work page 2023
[32]

Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. 2025. Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter. arXiv preprint arXiv:2503.00778

work page arXiv 2025
[33]

Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2020. Vision-and-dialog navigation. In Conference on Robot Learning, pages 394--406

work page 2020
[34]

Arun Balajee Vasudevan, Dengxin Dai, and Luc Van Gool. 2021. Talk2nav: Long-range vision-and-language navigation with dual attention and spatial memory. International Journal of Computer Vision, pages 246--266

work page 2021
[35]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. 2023. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15625--15636

work page 2023
[37]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, pages 24824--24837

work page 2022
[38]

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. 2024. Voronav: Voronoi-based zero-shot object navigation with large language model. arXiv preprint arXiv:2401.02695

work page arXiv 2024
[39]

Siying Wu, Xueyang Fu, Feng Wu, and Zheng-Jun Zha. 2022. Cross-modal semantic alignment pre-training for vision-and-language navigation. In Proceedings of the ACM International Conference on Multimedia, pages 4233--4241

work page 2022
[40]

Yujie Wu, Huaihai Lyu, Yingbo Tang, Lingfeng Zhang, Zhihui Zhang, Wei Zhou, and Siqi Hao. 2025. Evaluating gpt-4o's embodied intelligence: A comprehensive empirical study. TechRxiv preprint techrxiv.174495686.69962588/v1

work page arXiv 2025
[41]

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. 2024. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation, pages 42--48

work page 2024
[42]

Bangguo Yu, Hamidreza Kasaei, and Ming Cao. 2023. L3mvn: Leveraging large language models for visual target navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3554--3560

[43]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975--11986

[44]

Zhaohuan Zhan, Lisha Yu, Sijie Yu, and Guang Tan. 2024. Mc-gpt: Empowering vision-and-language navigation with memory map and reasoning chains. arXiv preprint arXiv:2405.10620

[45]

Chaofan Zhang, Peng Hao, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. 2025. Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation. arXiv preprint arXiv:2505.09577

[46]

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. 2023. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289

[47]

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. 2024a. Navid: Video-based vlm plans the next step for vision-and-language navigation. In Proceedings of Robotics: Science and Systems

[48]

Lingfeng Zhang, Hao Wang, Erjia Xiao, Xinyao Zhang, Qiang Zhang, Zixuan Jiang, and Renjing Xu. 2024b. Multi-floor zero-shot object navigation policy. arXiv preprint arXiv:2409.10906

[49]

Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. 2024c. Trihelper: Zero-shot object navigation with dynamic assistance. arXiv preprint arXiv:2403.15223

[50]

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. 2024. Towards learning a generalist model for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13624--13634

[51]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, pages 46595--46623

[52]

Gengze Zhou, Yicong Hong, and Qi Wu. 2024. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7641--7649

