GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation

Chengnuo Sun; Chenjia Bai; Chi Zhang; Jiankun Dong; Qizhen Weng; Sunyao Zhou; Tianhang Wang; Xiaotao Zhang; Xinhai Li; Xuelong Li

arxiv: 2606.03682 · v1 · pith:OAXXYZZUnew · submitted 2026-06-02 · 💻 cs.RO

Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chengnuo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai, Xuelong Li

Pith reviewed 2026-06-28 09:48 UTC · model grok-4.3

classification 💻 cs.RO

keywords: visual-language navigation, embodied navigation, 3D Gaussian Splatting, reinforcement learning, dataset curation, policy learning, simulation platform, benchmark

The pith

A unified paradigm for visual-language navigation curates large-scale 3D data, simulates with Gaussian splatting, and trains an RL model reported to outperform prior methods on GN-Bench and VLN-CE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to overcome data scarcity that limits generalization and long-horizon performance in vision-and-language navigation systems. It creates the GN-Matrix dataset through an automated curation pipeline on diverse 3D scenes, a 3D Gaussian Splatting simulation platform for interactive navigation, and the GN-Bench benchmark with dynamic avatars. The authors then introduce the Break and Establish model that applies DAgger after supervised learning to break expert distributions and support RL exploration. This produces a single framework that handles instruction following, human following, and goal navigation while claiming superior results on GN-Bench and VLN-CE.

Core claim

The authors claim that four components together establish GN0 as a unified paradigm spanning data generation, evaluation, and policy learning: the GN-Matrix dataset curated by an automated pipeline, a high-fidelity 3DGS simulation engine, the GN-Bench benchmark with dynamic 3DGS avatars, and the GN-BAE model, which uses DAgger to expose agents to rollout states before RL training. They further claim that the paradigm integrates map-based and map-free VLN tasks and outperforms state-of-the-art methods.

What carries the argument

The Break and Establish (BAE) model, which formalizes 3DGS-rendered Bird's Eye View representations as compact memory and applies DAgger after supervised learning to break narrow expert distributions and enable downstream RL exploration.
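The supervised-learning-then-DAgger sequence can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the one-dimensional corridor, the always-move-right `expert_action`, the slip probability, and the tabular majority-vote learner are hypothetical stand-ins, not the paper's environment, expert, or model, and the downstream RL stage is not shown. The point is only that rolling out the learner and relabeling its visited states with the expert widens state coverage beyond the expert's own trajectories.

```python
import random

GOAL, START, N_STATES = 9, 5, 10  # hypothetical corridor layout

def expert_action(state):
    # Hypothetical expert: always step right, toward the goal.
    return +1

def step(state, action, rng, slip=0.3):
    # With probability `slip` the move is reversed, which pushes the agent
    # into states that noise-free expert demonstrations never visit.
    if rng.random() < slip:
        action = -action
    return min(max(state + action, 0), N_STATES - 1)

def fit(dataset):
    # "Supervised learning": majority vote over expert labels per state.
    votes = {}
    for s, a in dataset:
        votes.setdefault(s, []).append(a)
    return {s: max(set(v), key=v.count) for s, v in votes.items()}

def act(policy, state):
    # States missing from the training data fall back to a (wrong) default.
    return policy.get(state, -1)

def rollout(policy, rng, horizon=30):
    state, visited = START, []
    for _ in range(horizon):
        visited.append(state)
        if state == GOAL:
            break
        state = step(state, act(policy, state), rng)
    return visited

def dagger(iterations=5, rollouts_per_iter=20, seed=0):
    rng = random.Random(seed)
    # Stage 1: behaviour cloning on noise-free expert demonstrations.
    dataset = [(s, expert_action(s)) for s in range(START, GOAL)]
    policy = fit(dataset)
    coverage = [len(policy)]  # number of states with a learned action
    for _ in range(iterations):
        # Stage 2: roll out the learner, relabel its states with the expert,
        # aggregate into the dataset, and refit.
        for _ in range(rollouts_per_iter):
            for s in rollout(policy, rng):
                if s != GOAL:
                    dataset.append((s, expert_action(s)))
        policy = fit(dataset)
        coverage.append(len(policy))
    return policy, coverage

policy, coverage = dagger()
```

In this sketch `coverage` starts at the four states the expert visits and can only grow, because the aggregated dataset is never pruned. A policy trained this way would then be handed to an RL stage, which is outside the scope of the example.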

If this is right

The approach integrates instruction following, human following, and goal navigation tasks within one model.
High-fidelity 3DGS simulation supports collision-aware navigation and dynamic human-robot interaction evaluation.
3DGS-rendered BEV memory unlocks latent spatial reasoning inside vision-language models.
The framework spans data, simulation, and learning to advance embodied navigation for both research and applications.
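To make the idea of a Bird's Eye View memory concrete, here is a minimal sketch that rasterizes 3D points into a top-down occupancy grid. It is a simplified stand-in, not 3DGS rendering: it assumes the scene is available as a list of `(x, y, z)` points (for instance Gaussian centers), and the function name `bev_occupancy`, the grid parameters, and the height band are all hypothetical choices.

```python
import math

def bev_occupancy(points, origin, cell, rows, cols, z_min, z_max):
    """Project 3D points onto a top-down grid; 1 marks an occupied cell."""
    grid = [[0] * cols for _ in range(rows)]
    ox, oy = origin
    for x, y, z in points:
        # Keep only points inside the assumed obstacle height band.
        if not (z_min <= z <= z_max):
            continue
        col = math.floor((x - ox) / cell)
        row = math.floor((y - oy) / cell)
        # Points outside the mapped area are ignored.
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = 1
    return grid

points = [
    (0.2, 0.2, 0.5),  # falls in row 0, col 0
    (1.4, 0.1, 1.0),  # falls in row 0, col 2
    (0.6, 1.6, 0.3),  # falls in row 3, col 1
    (0.6, 1.6, 3.0),  # above the height band, skipped
    (5.0, 0.0, 0.5),  # outside the grid, skipped
]
grid = bev_occupancy(points, origin=(0.0, 0.0), cell=0.5,
                     rows=4, cols=4, z_min=0.1, z_max=2.0)
```

A grid like this is compact compared with a history of egocentric frames, which is the property the "compact memory" framing relies on. How the paper's rendered BEV images are actually produced and passed to the VLM is not stated in the abstract.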

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the automated curation scales reliably, training data volume could increase dramatically beyond current manual collection limits.
The DAgger-plus-RL sequence might transfer to other policy domains where expert demonstrations are narrow or expensive.
Real-world deployment could use the same 3DGS pipeline to adapt agents to scanned physical environments without additional annotation.

Load-bearing premise

The automated pipeline for curating diverse 3D scenes produces navigation data of sufficient quality and diversity to overcome the stated limitations in generalization and long-horizon capabilities of existing VLN systems.

What would settle it

If GN-BAE trained on GN-Matrix data fails to outperform state-of-the-art VLN methods on GN-Bench or VLN-CE, the claim that the unified paradigm advances capabilities would be disproven.

Original abstract

Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New dataset and benchmark for VLN via automated 3D pipeline, but data quality and results lack visible validation.

This paper introduces GN-Matrix, a dataset from an automated pipeline curating diverse 3D scenes, GN-Bench as a BEV benchmark with dynamic 3DGS avatars, and the BAE model that runs supervised learning then DAgger then RL on 3DGS-rendered views to unify map-based and map-free VLN tasks.

It does a few things right. The automated data generation directly targets the data scarcity that limits generalization and long-horizon performance in VLN. The 3DGS simulator for collision-aware navigation and the BEV memory format for VLMs are practical choices that could support better spatial reasoning. Using DAgger to expose the model to rollout states before RL is a standard way to move beyond narrow expert distributions, and folding instruction following, human following, and goal navigation into one setup keeps the framing clean.

The soft spots are in the missing checks. The abstract states that the pipeline results in GN-Matrix but reports no trajectory validity rates, diversity statistics, collision realism scores, or human ratings, so it is unclear whether the data actually improves on prior corpora or simply scales quantity. The claim that GN0 outperforms SOTA on GN-Bench and VLN-CE is stated without numbers, ablations, or error analysis, which makes it hard to judge whether the gains are robust. On the evidence of the abstract alone, the concern about unvalidated data quality stands.

This is for VLN and embodied navigation researchers who need new benchmarks and data sources. It deserves peer review so the methods, data metrics, and experimental details can be examined properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces GN0, a unified paradigm for visual-language navigation (VLN) comprising: (1) an automated pipeline to curate diverse 3D scenes into the large-scale GN-Matrix dataset, (2) a 3D Gaussian Splatting (3DGS) simulator supporting interactive roaming and collision-aware navigation, (3) GN-Bench, a BEV-based benchmark with dynamic 3DGS avatars for human-robot interaction, and (4) the GN-BAE foundation model that applies supervised learning, DAgger for distribution breaking, and RL exploration to handle map-based and map-free tasks (instruction following, human following, goal navigation). It claims GN0 outperforms SOTA VLN methods on GN-Bench and VLN-CE, offering a framework spanning data, simulation, and learning.

Significance. If the empirical claims hold and the automated data pipeline produces high-quality, diverse trajectories that demonstrably improve generalization and long-horizon performance, the work could meaningfully advance embodied navigation by scaling data generation beyond manual curation and unifying map-based/map-free paradigms under a single RL-driven model with 3DGS-rendered BEV memory. The integration of high-fidelity simulation with VLM spatial reasoning is a potentially valuable direction.

major comments (2)

[Abstract] The central claim that the automated pipeline 'results in the GN-Matrix dataset' and thereby overcomes 'generalization and long-horizon capabilities' limitations rests on unvalidated data quality. No metrics are reported for trajectory validity rates, scene diversity statistics, collision realism, or human ratings, leaving open whether the generated data differs substantively from prior VLN corpora or merely increases quantity.
[Abstract] The assertion that 'Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods' is presented without any quantitative results, error bars, baseline comparisons, ablation studies, or statistical tests. This absence makes it impossible to evaluate whether the outperformance claim is supported by the data or affected by post-hoc choices.
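As an illustration of the kind of data-quality metric requested in the first major comment, a trajectory validity rate could be computed along the following lines. This is a sketch under stated assumptions, not the paper's pipeline: the occupancy-grid scene representation, the three validity criteria (in bounds, collision free, no jumps between consecutive waypoints), and the names `is_valid` and `validity_rate` are all hypothetical.

```python
def is_valid(trajectory, occupancy):
    """A trajectory is valid if it is non-empty, stays in bounds, avoids
    occupied cells, and moves at most one cell per step (assumed criteria)."""
    rows, cols = len(occupancy), len(occupancy[0])
    for i, (r, c) in enumerate(trajectory):
        if not (0 <= r < rows and 0 <= c < cols):
            return False  # leaves the mapped area
        if occupancy[r][c]:
            return False  # collides with an obstacle
        if i > 0:
            pr, pc = trajectory[i - 1]
            if abs(r - pr) + abs(c - pc) > 1:
                return False  # teleports between non-adjacent cells
    return len(trajectory) > 0

def validity_rate(trajectories, occupancy):
    # Fraction of generated trajectories that pass every check.
    return sum(is_valid(t, occupancy) for t in trajectories) / len(trajectories)

occupancy = [
    [0, 0, 0],
    [0, 1, 0],  # single obstacle in the centre cell
    [0, 0, 0],
]
trajectories = [
    [(0, 0), (0, 1), (0, 2), (1, 2)],  # valid
    [(0, 0), (1, 0), (1, 1)],          # enters the occupied cell
    [(0, 0), (2, 2)],                  # jumps across the grid
    [(2, 0), (2, 1), (2, 2)],          # valid
]
rate = validity_rate(trajectories, occupancy)  # 2 of 4 pass, so 0.5
```

Reporting a number of this kind alongside scene-diversity statistics would let a reader judge whether the automated pipeline adds quality as well as quantity.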

minor comments (2)

[Abstract] The relationship between 'GN0', 'GN-Matrix', 'GN-Bench', and 'GN-BAE' is introduced without a clear nomenclature or diagram; a single overview figure early in the paper would improve readability.
[Abstract] The phrase 'unlocking latent spatial reasoning in VLMs' is used without specifying which VLM backbone is employed or how the BEV representation interfaces with it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments on the abstract point by point below. Where the concerns identify gaps in the presented evidence, we agree to revise the abstract to incorporate additional supporting details from the manuscript.

Referee: [Abstract] The central claim that the automated pipeline 'results in the GN-Matrix dataset' and thereby overcomes 'generalization and long-horizon capabilities' limitations rests on unvalidated data quality. No metrics are reported for trajectory validity rates, scene diversity statistics, collision realism, or human ratings, leaving open whether the generated data differs substantively from prior VLN corpora or merely increases quantity.

Authors: We agree that the abstract would be strengthened by explicit data-quality metrics. The manuscript's Section 3 details the automated pipeline and reports aggregate statistics on GN-Matrix (e.g., number of scenes, trajectory counts, and environment diversity). In the revision we will add concise quantitative indicators (trajectory validity rate, scene diversity measures, and collision statistics) directly into the abstract to make the claim more verifiable. Human ratings were not collected; we therefore cannot add them without new experiments.
revision: yes

Referee: [Abstract] The assertion that 'Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods' is presented without any quantitative results, error bars, baseline comparisons, ablation studies, or statistical tests. This absence makes it impossible to evaluate whether the outperformance claim is supported by the data or affected by post-hoc choices.

Authors: The abstract is a high-level summary; the full experimental sections (4 and 5) contain the requested quantitative results, baseline comparisons, ablations, and performance tables on both GN-Bench and VLN-CE. To address the concern, we will revise the abstract to include the key numerical improvements (e.g., success-rate and SPL gains versus the strongest baselines) so that the outperformance claim is immediately supported by concrete figures.
revision: yes
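For readers unfamiliar with the two metrics named in the response, the sketch below computes success rate and SPL (Success weighted by Path Length) as defined in "On Evaluation of Embodied Navigation Agents" (Anderson et al., 2018): SPL averages, over episodes, the success indicator times the ratio of shortest-path length to the larger of the path actually taken and the shortest path. The episode values and the dictionary field names are made up for illustration and are not results from the paper.

```python
def success_rate(episodes):
    # Fraction of episodes in which the agent reached the goal.
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    # SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the
    # shortest-path length and p_i the length of the path the agent took.
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)

# Illustrative episodes only (lengths in metres, values invented).
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},   # optimal path
    {"success": True, "shortest": 10.0, "taken": 20.0},   # twice as long
    {"success": False, "shortest": 8.0, "taken": 5.0},    # failed episode
    {"success": True, "shortest": 6.0, "taken": 8.0},     # mild detour
]

sr = success_rate(episodes)  # 3 of 4 succeed: 0.75
score = spl(episodes)        # (1.0 + 0.5 + 0.0 + 0.75) / 4 = 0.5625
```

Because SPL penalizes detours while success rate does not, reporting both against the strongest baselines makes it easier to tell whether a gain comes from reaching more goals or from taking shorter paths.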

Circularity Check

0 steps flagged

No circularity: paper presents empirical framework without derivations or self-referential predictions

The manuscript introduces GN-Matrix dataset curation, 3DGS simulator, GN-Bench, and BAE model via descriptive pipeline and RL/DAgger training, with outperformance claims resting on external evaluations (GN-Bench, VLN-CE). No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All components are presented as novel constructions evaluated against independent benchmarks, with no reduction of results to inputs by construction. This is the common case of a self-contained systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no equations, training details, or modeling choices, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5840 in / 1271 out tokens · 34365 ms · 2026-06-28T09:48:51.117994+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning
cs.RO 2026-06 unverdicted novelty 6.0

SpaceVLN proposes a stagewise closed-loop framework using Spatial Cognitive Memory and Spatial-CoT for zero-shot vision-and-language navigation and object-goal navigation, reporting SOTA results on R2R-CE, RxR-CE, GN-...

Reference graph

Works this paper leans on

19 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024.

[2] Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 31(9):6100–6111, 2024a. Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhal...

[3] Xinhai Li, Huaibin Wang, and Kuo-Kun Tseng. Gaussiandiffusion: 3d gaussian splatting for denoising diffusion probabilistic models with structured noise. arXiv preprint arXiv:2311.11221,

[4] Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202,

[5] Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen Feng, Lu Shi, et al. Robo-gs: A physics consistent spatial-temporal model for robotic arm with hybrid representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15379–15386. IEEE, 2025.

[6] Xinhai Li, Jialin Li, Ziheng Zhang, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Kuo-Kun Tseng, and Ruiping Wang. Robogsim: A real2sim2real robotic gaussian splatting simulator. arXiv preprint arXiv:2411.11839,

[7] Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. arXiv preprint arXiv:2507.13019, 2025a. Xiaohan Lei, Min Wang, Wengang Zhou, and Houqiang Li. Gaussnav: Gaussian splatting...

[8] Sikuang Li, Chen Yang, Jiemin Fang, Taoran Yi, Jia Lu, Jiazhong Cen, Lingxi Xie, Wei Shen, and Qi Tian. Worldgrow: Generating infinite 3d world. arXiv preprint arXiv:2510.21682,

[9] Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K. Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024b. Wenzhe Cai, Jiaqi Peng, Yuqiang Yang, Yujian Zhang, Meng Wei, Hanqing Wang, Yilun...

[10] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-vl technical report. arXiv preprint arXiv:2511.21631,

[11] Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, and He Wang. Trackvla: Embodied visual tracking in the wild. arXiv preprint arXiv:2505.23189, 2025b. Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, and Kai Xu. Cognav: Cognitive process modeling for object goal navig...

[12] Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, and Si Liu. Octonav: Towards generalist embodied navigation. arXiv preprint arXiv:2506.09839,

[13] Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, et al. Trackvla++: Unleashing reasoning and memory capabilities in vla models for embodied visual tracking. arXiv preprint arXiv:2510.07134,

[14] Sunyao Zhou, Yunzi Wu, Tianhang Wang, Xinhai Li, Guang Chen, Lizheng Liu, Chenjia Bai, and Xuelong Li. Deconav: Dialog enhanced long-horizon collaborative vision-language navigation. arXiv preprint arXiv:2604.12486,

[15] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16516–16526, 2022a. Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-...

[16] Tianyu Xu, Jiawei Chen, Jiazhao Zhang, Wenyao Zhang, Zekun Qi, Minghan Li, Zhizheng Zhang, and He Wang. Mm-nav: Multi-view vla model for robust visual navigation via multi-expert learning. arXiv preprint arXiv:2510.03142,

[17] URL https://arxiv.org/abs/2512.01009. Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17294–17303,

[18] Zhi Jing, Jinbin Qiao, Ouyang Lu, Jicong Ao, Shuang Qiu, Yu-Gang Jiang, and Chenjia Bai. Assemlm: Spatial reasoning multimodal large language models for robotic assembly. arXiv preprint arXiv:2604.08983,

[19] doi: 10.1147/sj.41.0025. Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018b. Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitatio...
