MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

Krish Pandya; Madhava Krishna; Muhammad Haris Khan; Rohit Jayanti; Sarthak Chittawar; Siddharth Tourani; Sourav Garg; Vansh Garg

arxiv: 2605.24111 · v1 · pith:EPGZICUWnew · submitted 2026-05-22 · 💻 cs.RO · cs.AI

Vansh Garg, Rohit Jayanti, Krish Pandya, Sarthak Chittawar, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

Pith reviewed 2026-06-30 16:10 UTC · model grok-4.3

keywords: visual navigation · relative 3D maps · pixel correspondences · costmap conditioning · image matching · trajectory prediction · robot control

The pith

Pixel-relative connectivity from 3D image matching yields a dense costmap that conditions more accurate navigation control than image- or object-level maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing map representations for visual navigation either enforce globally consistent 3D geometry or discard most geometric structure in favor of topological graphs. The paper constructs an alternative from pairwise pixel correspondences placed in each image pair's relative 3D frame, producing a pixel-level connectivity graph that remains geometrically accurate without global consistency. This graph is sparsified into a WayPixel Costmap that directly conditions a learned controller to output trajectory rollouts. Validation across four simulator navigation tasks plus real-world trials shows the pixel-level costmap supplies better conditioning than coarser image- or object-based inputs.

Core claim

The central claim is that a map built solely from inter-image pixel correspondences in relative 3D coordinate systems supports global path planning and high-performance control. By approximating intra-image pixel connectivity and deriving the WayPixel Costmap, the representation preserves dense geometric cues locally while avoiding any requirement for a single consistent 3D reconstruction, resulting in measurably more accurate trajectory predictions than image- or object-level alternatives.

What carries the argument

The WayPixel Costmap, obtained by sparsifying the pixel-relative connectivity graph constructed from 3D-grounded image matching on image sequences.
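
As a rough illustration of how such a graph could support planning, the sketch below follows the edge weighting given in the Figure 1 caption: inter-image correspondences become zero-cost edges, intra-image edges are weighted by 3D Euclidean distance, and a shortest-path search from the goal assigns a cost to every matched pixel. The node layout `(image_index, pixel_index)`, the function names `build_pixel_graph` and `costs_to_goal`, and the fully connected intra-image edges are assumptions made for the example, not the authors' implementation (which sparsifies intra-image connectivity).

```python
import heapq
import math
from collections import defaultdict


def build_pixel_graph(points_3d, matches):
    """Build an undirected weighted graph over matched pixels.

    points_3d: {image_index: {pixel_index: (x, y, z)}}, the 3D points of the
        matched pixels of each image, expressed in that image's own frame.
    matches: iterable of ((img_a, pix_a), (img_b, pix_b)) correspondences.
    Nodes are hypothetical (image_index, pixel_index) tuples.
    """
    graph = defaultdict(dict)
    # Inter-image edges: a correspondence links the same scene point, cost 0.
    for a, b in matches:
        graph[a][b] = 0.0
        graph[b][a] = 0.0
    # Intra-image edges: 3D Euclidean distance between pixels of one image.
    # Assumption: fully connected here; no sparsification is applied.
    for img, pts in points_3d.items():
        pixels = list(pts)
        for i, p in enumerate(pixels):
            for q in pixels[i + 1:]:
                d = math.dist(pts[p], pts[q])
                graph[(img, p)][(img, q)] = d
                graph[(img, q)][(img, p)] = d
    return graph


def costs_to_goal(graph, goal):
    """Dijkstra from the goal node; returns {node: shortest path cost}."""
    cost = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        c, node = heapq.heappop(heap)
        if c > cost.get(node, math.inf):
            continue  # stale heap entry, a cheaper path was already found
        for nxt, w in graph[node].items():
            nc = c + w
            if nc < cost.get(nxt, math.inf):
                cost[nxt] = nc
                heapq.heappush(heap, (nc, nxt))
    return cost
```

Because distances are only ever measured between pixels of the same image, each edge weight lives in a single relative frame, which is why no globally consistent reconstruction is needed for the search to run.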

If this is right

Global path planning remains possible by sparsifying the intra-image portion of the pixel connectivity graph.
A controller trained on the dense pixel costmap produces more accurate trajectory rollouts than one trained on image- or object-level maps.
The same representation supports four distinct navigation task types in simulation and transfers to real-world robot demonstrations.
Navigation capability exceeds teach-and-repeat limits while still avoiding the need for globally consistent geometry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could enable incremental map building in large or changing environments where drift prevents global consistency.
Failure modes would concentrate on image pairs where matching produces unreliable correspondences rather than on global alignment errors.
Combining the costmap with online matching updates might allow reactive replanning without rebuilding an entire map.

Load-bearing premise

Reliable pixel correspondences can be extracted from 3D-grounded image matching applied to the input image sequences.
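
One generic way to screen correspondences before they enter a map is a mutual nearest-neighbour check on per-pixel descriptors: a pair is kept only if each pixel is the other's best match. The sketch below shows that heuristic as an assumption for illustration. It does not call MASt3R, it is not claimed to be the paper's procedure, and `mutual_nearest_matches` and its descriptor arrays are hypothetical names.

```python
import numpy as np


def mutual_nearest_matches(desc_a, desc_b):
    """Return index pairs (i, j) that are each other's nearest neighbour.

    desc_a: (Na, D) and desc_b: (Nb, D) arrays of L2-normalised descriptors.
    """
    sim = desc_a @ desc_b.T                 # cosine similarity, shape (Na, Nb)
    best_b_for_a = sim.argmax(axis=1)       # best match in b for each a
    best_a_for_b = sim.argmax(axis=0)       # best match in a for each b
    idx_a = np.arange(len(desc_a))
    # Keep a -> b only if b's own best match points back to a.
    keep = best_a_for_b[best_b_for_a] == idx_a
    return np.stack([idx_a[keep], best_b_for_a[keep]], axis=1)
```

A check like this trades match density for reliability, which matters here because every surviving correspondence becomes an edge in the map.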

What would settle it

An ablation that replaces the pixel-level costmap with an image-level or object-level map on the same four navigation tasks and measures the resulting drop in success rate or increase in trajectory error.

Figures

Figures reproduced from arXiv: 2605.24111 by Krish Pandya, Madhava Krishna, Muhammad Haris Khan, Rohit Jayanti, Sarthak Chittawar, Siddharth Tourani, Sourav Garg, Vansh Garg.

**Figure 1.** Overview of MASt3R-Nav. (1) Consecutive RGB frames are matched using MASt3R to obtain dense pixel correspondences and relative 3D point maps. (2) Matched pixels are composed into a pixel-level topological graph, where inter-image correspondences form zero-cost edges, and intra-image edges are weighted by inter-pixel 3D Euclidean distance. (3) Shortest-path planning over this graph yields dense pixel-wise costs f…

**Figure 2.** Through our proposed pixel-relative map representa…

**Figure 3.** MASt3R-Nav Architecture. The Mapping phase involves constructing a pixel-level topological graph by linking correspondences across frames and encoding traversal costs using 3D geometry from MASt3R. In the Execution phase, the agent localizes itself against the map and generates a fine-grained pixel costmap by matching current observations and propagating their costs to all pixels. Planning is performed by compu…

**Figure 4.** WayPixel Costmap generation. Given the pixel-relative map representation and a query (1), we obtain pixel-level planning costs through a series of steps that form a path highlighted in white background (2) from $p^q_u$ to $p_g$ through $p^q_m$, $p^r_m$ and $p^r_{m^*}$. (3) We show the flow of cost gradients from each pixel to its closest least-cost matched pixel and (4) the final dense WayPixel Costmap on which we condition o…

**Figure 5.** Real World Demonstration. We show RGB observations, their WayPixel costmaps and the controller waypoints towards the goal object (shaded in blue in image 4) at four different locations along the robot trajectory.

… in under five minutes (294.41 s), requiring 540.18 MB of storage.

E. Real-World Demonstration. To validate the practical applicability and sim-to-real transfer capabilities of our proposed method, we …
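
The Figure 4 caption describes costs flowing from each pixel to its closest least-cost matched pixel. One plausible reading, sketched below under stated assumptions, gives every pixel the minimum, over all matched pixels in the same image, of that match's planning cost plus the 3D distance to it. The function name `dense_costmap`, the array layout, and the exact min-plus rule are assumptions for illustration; the truncated caption does not confirm the authors' formula.

```python
import numpy as np


def dense_costmap(pointmap, matched_uv, matched_cost):
    """Propagate sparse planning costs to every pixel of one image.

    pointmap: (H, W, 3) array of per-pixel 3D points in the image's own frame.
    matched_uv: (M, 2) integer array of (row, col) matched pixels.
    matched_cost: (M,) array of planning costs at those matched pixels.
    Returns an (H, W) array: min over m of cost_m + ||P_pixel - P_m||.
    """
    h, w, _ = pointmap.shape
    flat = pointmap.reshape(-1, 3)                           # (H*W, 3)
    anchors = pointmap[matched_uv[:, 0], matched_uv[:, 1]]   # (M, 3)
    # Pairwise 3D distances between every pixel and every matched pixel.
    dist = np.linalg.norm(flat[:, None, :] - anchors[None, :, :], axis=-1)
    total = dist + matched_cost[None, :]                     # (H*W, M)
    return total.min(axis=1).reshape(h, w)
```

The dense pairwise array costs O(H·W·M) memory, so a full-resolution image would need chunking or a spatial index; the brute-force form is kept here only for readability.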

Original abstract

Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. But, this comes at the cost of navigation capability, often limiting it to merely teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a “WayPixel Costmap” representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real world demonstrations.
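
The abstract does not spell out how intra-image pixel connectivity is approximated and sparsified. As a stand-in, the sketch below keeps only each matched pixel's k nearest neighbours in 3D rather than all pairs, which reduces the edge count from quadratic to roughly linear in the number of pixels. The k-nearest-neighbour rule, the function name `knn_intra_image_edges`, and the parameter `k` are assumptions chosen for illustration, not the paper's rule.

```python
import numpy as np


def knn_intra_image_edges(points, k):
    """Keep, for each matched pixel, only edges to its k nearest neighbours.

    points: (N, 3) array of 3D points for the matched pixels of one image.
    Returns a sorted list of undirected edges (i, j, distance) with i < j.
    """
    n = len(points)
    k = max(0, min(k, n - 1))
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-edges
    edges = {}
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            a, b = (i, int(j)) if i < j else (int(j), i)
            edges[(a, b)] = float(dist[i, j])  # duplicates collapse on the key
    return sorted((a, b, d) for (a, b), d in edges.items())
```

A small `k` can split an image's pixels into disconnected groups, so any rule of this kind has to be checked against the connectivity that global planning needs.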

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces pixel-relative connectivity via MASt3R matches to build a WayPixel Costmap that keeps local geometry without global consistency, and claims it conditions control better than image or object baselines.

The central new piece is the construction of a navigation map from pairwise relative 3D pixel correspondences, then turning the resulting graph into a dense costmap for trajectory prediction. This sits between rigid global 3D maps and purely topological graphs, which is a reasonable practical move for robots that cannot afford consistent reconstruction.

The work does a clean job of describing how inter-image matches are turned into intra-image connectivity and then sparsified for planning. The four simulator tasks plus real-world demos give some indication that the controller can handle varied scenarios when conditioned on this representation.

The load-bearing claim is that the pixel-level costmap is strictly more accurate for control than the image- and object-level alternatives. That claim only holds if the underlying correspondences remain reliable enough that local geometric errors do not accumulate into the costmap. The stress-test note correctly flags that no transitivity or drift correction is enforced across the sequence, so matching failures under texture or lighting changes would directly degrade the signal. Without seeing detailed ablations on correspondence quality or error propagation in the full experiments, it is hard to judge how much of the reported advantage survives those conditions.

This is the kind of paper that robotics groups working on visual navigation would want to read. It is coherent on its own terms and engages the right literature, so it deserves a serious referee even if the experiments need tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MASt3R-Nav for visual navigation using a novel pixel-relative connectivity map derived from pairwise 3D correspondences obtained via MASt3R image matching on image sequences. The method constructs an inter-image graph in relative 3D coordinates, approximates and sparsifies intra-image pixel connectivity to form a WayPixel Costmap, and trains a controller to predict trajectory rollouts conditioned on this costmap. The central claim is that this dense pixel-level costmap based on relative geometry provides a more accurate conditioning signal for control than image- or object-level alternatives, enabling capable performance on four simulator navigation tasks and real-world demonstrations without requiring globally consistent 3D geometry.

Significance. If the superiority claim holds under the stated assumptions, the work offers a useful intermediate representation between globally consistent metric maps and purely topological graphs, potentially improving navigation robustness in environments where full SLAM is impractical. The grounding in recent 3D image matching and the multi-task validation are positive elements; the approach could influence downstream work on relative-geometry navigation if the error-propagation issues are addressed.

major comments (3)

[Abstract / Method (pixel-relative connectivity)] Abstract and method description: the superiority of the pixel-level WayPixel Costmap over image- and object-level baselines is the load-bearing claim, yet the construction relies exclusively on pairwise MASt3R correspondences without transitivity enforcement or drift correction across the sequence. Local matching failures (common under varying texture, lighting, or motion in the four tasks) propagate directly into the sparsified costmap used for global planning, which directly risks the accuracy advantage asserted.
[Experiments] Experiments section: the validation on four simulator tasks and real-world demos is cited to support the 'highly capable' claim, but no quantitative comparison (e.g., success rate deltas, trajectory error, or conditioning accuracy metrics) versus the image- and object-level baselines is referenced; without these, the central empirical assertion cannot be evaluated.
[Method] Method (approximating and sparsifying intra-image pixel connectivity): the step that converts the pixel graph into the costmap for control prediction is described at a high level only; any approximation that discards geometric fidelity would undermine the 'dense pixel-level' advantage, yet no error bounds or sensitivity analysis is indicated.

minor comments (2)

[Abstract] Abstract: the phrase 'highly capable navigation system' is qualitative; replace or qualify with reference to specific performance numbers once the experiments section is expanded.
Notation: 'WayPixel Costmap' is introduced without an explicit definition or equation; add a short formal description or diagram reference for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below. Where the comments identify gaps in analysis or presentation, we agree that revisions are warranted and will incorporate them.

Referee: [Abstract / Method (pixel-relative connectivity)] Abstract and method description: the superiority of the pixel-level WayPixel Costmap over image- and object-level baselines is the load-bearing claim, yet the construction relies exclusively on pairwise MASt3R correspondences without transitivity enforcement or drift correction across the sequence. Local matching failures (common under varying texture, lighting, or motion in the four tasks) propagate directly into the sparsified costmap used for global planning, which directly risks the accuracy advantage asserted.

Authors: The use of purely pairwise MASt3R correspondences without transitivity or global drift correction is an intentional design choice that enables navigation without requiring globally consistent 3D geometry, which is a central contribution. We acknowledge that local matching failures can propagate and that this could affect the asserted accuracy advantage over baselines. In the revision we will add a dedicated discussion subsection analyzing error propagation through the pipeline and the role of the sparsification step in limiting its impact.
Revision: yes

Referee: [Experiments] Experiments section: the validation on four simulator tasks and real-world demos is cited to support the 'highly capable' claim, but no quantitative comparison (e.g., success rate deltas, trajectory error, or conditioning accuracy metrics) versus the image- and object-level baselines is referenced; without these, the central empirical assertion cannot be evaluated.

Authors: The experiments section reports success rates and trajectory metrics for MASt3R-Nav on the four tasks and real-world demonstrations. Direct quantitative comparisons (including deltas) against the image- and object-level baselines are not presented in tabular form with the requested metrics. We will revise the experiments section to include an explicit comparison table reporting success-rate deltas, trajectory error, and conditioning accuracy versus the baselines.
Revision: yes

Referee: [Method] Method (approximating and sparsifying intra-image pixel connectivity): the step that converts the pixel graph into the costmap for control prediction is described at a high level only; any approximation that discards geometric fidelity would undermine the 'dense pixel-level' advantage, yet no error bounds or sensitivity analysis is indicated.

Authors: The approximation and sparsification procedure is outlined in Section 3.4 via a connectivity-strength sampling rule. We agree that the current description is high-level and that the absence of error bounds or sensitivity analysis weakens the claim of preserving dense pixel-level fidelity. In the revision we will expand the method section with a more detailed algorithmic description, provide approximate error bounds derived from the sampling density, and add a sensitivity analysis on the sparsification threshold.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of costmap superiority is independent of construction

The derivation proceeds by constructing a pixel-relative connectivity graph from MASt3R pairwise matches, sparsifying it into a WayPixel Costmap, and training a controller on that representation. Superiority over image/object baselines is shown via direct experimental comparison on four navigation tasks plus real-world demos. No equation or claim reduces the output to a fitted input by construction, no self-citation is load-bearing for the central result, and the graph construction step is presented as an engineering choice rather than a uniqueness theorem. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5770 in / 1085 out tokens · 41047 ms · 2026-06-30T16:10:45.422007+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 2 canonical work pages · 2 internal anchors

[1]

Using occupancy grids for mobile robot perception and navigation

A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989

1989
[2]

Semi-parametric Topological Memory for Navigation

N. Savinov, A. Dosovitskiy, and V. Koltun, “Semi-parametric topological memory for navigation,” arXiv preprint arXiv:1803.00653, 2018

2018 · internal anchor
[3]

GNM: A general navigation model to drive any robot

D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A general navigation model to drive any robot,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 7226–7233

2023
[4]

Visual language maps for robot navigation

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10608–10615

2023
[5]

RoboHop: Segment-based topological map representation for open-world visual navigation

S. Garg, K. Rana, M. Hosseinzadeh, L. Mares, N. Sünderhauf, F. Dayoub, and I. Reid, “RoboHop: Segment-based topological map representation for open-world visual navigation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 4090–4097

2024
[6]

ObjectReact: Learning object-relative control for visual navigation

S. Garg, D. Craggs, V. Bhat, L. Mares, S. Podgorski, M. Krishna, F. Dayoub, and I. Reid, “ObjectReact: Learning object-relative control for visual navigation,” in Conference on Robot Learning. PMLR, 2025

2025
[7]

TANGO: Traversability-aware navigation with local metric control for topological goals

S. Podgorski, S. Garg, M. Hosseinzadeh, L. Mares, F. Dayoub, and I. Reid, “TANGO: Traversability-aware navigation with local metric control for topological goals,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025

2025
[8]

DUSt3R: Geometric 3D vision made easy

S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “DUSt3R: Geometric 3D vision made easy,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20697–20709

2024
[9]

Grounding image matching in 3D with MASt3R

V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3D with MASt3R,” in European Conference on Computer Vision. Springer, 2024, pp. 71–91

2024
[10]

ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras

R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017

2017
[11]

LSD-SLAM: Large-scale direct monocular SLAM

J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849

2014
[12]

Parallel tracking and mapping for small AR workspaces

G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007, pp. 225–234

2007
[13]

SLAM++: Simultaneous localisation and mapping at the level of objects

R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359

2013
[14]

QuadricSLAM: Dual quadrics from object detections as landmarks in object-oriented SLAM

L. Nicholson, M. Milford, and N. Sünderhauf, “QuadricSLAM: Dual quadrics from object detections as landmarks in object-oriented SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018

2018
[15]

Kimera: From SLAM to spatial perception with 3D dynamic scene graphs

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From SLAM to spatial perception with 3D dynamic scene graphs,” The International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1510–1546, 2021

2021
[16]

3D scene graph: A structure for unified semantics, 3D space, and camera

I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3D scene graph: A structure for unified semantics, 3D space, and camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5664–5673

2019
[17]

Hierarchical representations and explicit memory: Learning effective navigation policies on 3D scene graphs using graph neural networks

Z. Ravichandran, L. Peng, N. Hughes, J. D. Griffith, and L. Carlone, “Hierarchical representations and explicit memory: Learning effective navigation policies on 3D scene graphs using graph neural networks,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 9272–9279

2022
[18]

SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation

H. Yin, X. Xu, Z. Wu, J. Zhou, and J. Lu, “SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation,” Advances in Neural Information Processing Systems, vol. 37, pp. 5285–5307, 2024

2024
[19]

VGGT: Visual geometry grounded transformer

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “VGGT: Visual geometry grounded transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306

2025
[20]

MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors

R. Murai, E. Dexheimer, and A. J. Davison, “MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16695–16705

2025
[21]

MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V. Leroy, Y. Cabon, and J. Revaud, “MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion,” in 2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 1–10

2025
[22]

ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints

D. Shah and S. Levine, “ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints,” in Proceedings of Robotics: Science and Systems, 2022

2022
[23]

LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action

D. Shah, B. Osinski, B. Ichter, and S. Levine, “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” in 6th Annual Conference on Robot Learning, 2022

2022
[24]

Learning view and target invariant visual servoing for navigation

Y. Li and J. Košecka, “Learning view and target invariant visual servoing for navigation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 658–664

2020
[25]

Scaling local control to large-scale topological navigation

X. Meng, N. Ratliff, Y. Xiang, and D. Fox, “Scaling local control to large-scale topological navigation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 672–678

2020
[26]

Zero-shot visual imitation

D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-shot visual imitation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2050–2053

2018
[27]

A tutorial on visual servo control

“A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, 2002

2002
[28]

Path planning for robust image-based control

Y. Mezouar and F. Chaumette, “Path planning for robust image-based control,” IEEE Transactions on Robotics and Automation, vol. 18, no. 4, pp. 534–549, 2002

2002
[29]

Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill

W. Cai, S. Huang, G. Cheng, Y. Long, P. Gao, C. Sun, and H. Dong, “Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 5228–5234

2024
[30]

Self-attention with relative position representations

P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 464–468

2018
[31]

Attention is all you need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

2017
[32]

On Evaluation of Embodied Navigation Agents

P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., “On evaluation of embodied navigation agents,” arXiv preprint arXiv:1807.06757, 2018

2018 · internal anchor
[33]

Integrating egocentric localization for more realistic point-goal navigation agents

S. Datta, O. Maksymets, J. Hoffman, S. Lee, D. Batra, and D. Parikh, “Integrating egocentric localization for more realistic point-goal navigation agents,” in Conference on Robot Learning. PMLR, 2021, pp. 313–328

2021

[1] [1]

Using occupancy grids for mobile robot perception and navigation,

A. Elfes, “Using occupancy grids for mobile robot perception and navigation,”Computer, vol. 22, no. 6, pp. 46–57, 1989

1989

[2] [2]

Semi-parametric Topological Memory for Navigation

N. Savinov, A. Dosovitskiy, and V . Koltun, “Semi-parametric topolog- ical memory for navigation,”arXiv preprint arXiv:1803.00653, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Gnm: A general navigation model to drive any robot,

D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “Gnm: A general navigation model to drive any robot,” in2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 7226–7233

2023

[4] [4]

Visual language maps for robot navigation,

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10 608–10 615

2023

[5] [5]

Robohop: Segment-based topological map representation for open-world visual navigation,

S. Garg, K. Rana, M. Hosseinzadeh, L. Mares, N. S ¨underhauf, F. Dayoub, and I. Reid, “Robohop: Segment-based topological map representation for open-world visual navigation,” in2024 IEEE In- ternational Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 4090–4097

2024

[6] [6]

Objectreact: Learning object-relative control for visual navigation,

S. Garg, D. Craggs, V . Bhat, L. Mares, S. Podgorski, M. Krishna, F. Dayoub, and I. Reid, “Objectreact: Learning object-relative control for visual navigation,” inConference on Robot Learning. PMLR, 2025

2025

[7] [7]

Tango: Traversablility-aware navigation with local metric control for topological goals,

S. Podgorski, S. Garg, M. Hosseinzadeh, L. Mares, F. Dayoub, and I. Reid, “Tango: Traversablility-aware navigation with local metric control for topological goals,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025

2025

[8] [8]

Dust3r: Geometric 3d vision made easy,

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 697–20 709

2024

[9] [9]

Grounding image matching in 3d with mast3r,

V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 71–91

2024

[10] [10]

Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,

R. Mur-Artal and J. D. Tard ´os, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,”IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017

2017

[11] [11]

Lsd-slam: Large-scale di- rect monocular slam,

J. Engel, T. Sch ¨ops, and D. Cremers, “Lsd-slam: Large-scale di- rect monocular slam,” inEuropean conference on computer vision. Springer, 2014, pp. 834–849

2014

[12] [12]

Parallel tracking and mapping for small ar workspaces,

G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in2007 6th IEEE and ACM international symposium on mixed and augmented reality. IEEE, 2007, pp. 225–234

2007

[13] [13]

Slam++: Simultaneous localisation and mapping at the level of objects,

R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 1352–1359

2013

[14] [14]

Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,

L. Nicholson, M. Milford, and N. S ¨underhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018

2018

[15] [15]

Kimera: From slam to spatial perception with 3d dynamic scene graphs,

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y . Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From slam to spatial perception with 3d dynamic scene graphs,”The International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1510–1546, 2021

2021

[16] [16]

3d scene graph: A structure for unified semantics, 3d space, and camera,

I. Armeni, Z.-Y . He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5664–5673

2019

[17] [17]

Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,

Z. Ravichandran, L. Peng, N. Hughes, J. D. Griffith, and L. Carlone, “Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 9272–9279

2022

[18] [18]

Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation,

H. Yin, X. Xu, Z. Wu, J. Zhou, and J. Lu, “Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation,”Advances in neural information processing systems, vol. 37, pp. 5285–5307, 2024

2024

[19] [19]

Vggt: Visual geometry grounded transformer,

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306

2025

[20]

Mast3r-slam: Real-time dense slam with 3d reconstruction priors,

R. Murai, E. Dexheimer, and A. J. Davison, “Mast3r-slam: Real-time dense slam with 3d reconstruction priors,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16695–16705

2025

[21]

Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V. Leroy, Y. Cabon, and J. Revaud, “Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,” in 2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 1–10

2025

[22]

ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints,

D. Shah and S. Levine, “ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints,” in Proceedings of Robotics: Science and Systems, 2022

2022

[23]

LM-nav: Robotic navigation with large pre-trained models of language, vision, and action,

D. Shah, B. Osinski, B. Ichter, and S. Levine, “LM-nav: Robotic navigation with large pre-trained models of language, vision, and action,” in 6th Annual Conference on Robot Learning, 2022

2022

[24]

Learning view and target invariant visual servoing for navigation,

Y. Li and J. Košecka, “Learning view and target invariant visual servoing for navigation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 658–664

2020

[25]

Scaling local control to large-scale topological navigation,

X. Meng, N. Ratliff, Y. Xiang, and D. Fox, “Scaling local control to large-scale topological navigation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 672–678

2020

[26]

Zero-shot visual imitation,

D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-shot visual imitation,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 2050–2053

2018

[27]

A tutorial on visual servo control,

“A tutorial on visual servo control,” IEEE transactions on robotics and automation, vol. 12, no. 5, pp. 651–670, 2002

2002

[28]

Path planning for robust image-based control,

Y. Mezouar and F. Chaumette, “Path planning for robust image-based control,” IEEE transactions on robotics and automation, vol. 18, no. 4, pp. 534–549, 2002

2002

[29]

Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill,

W. Cai, S. Huang, G. Cheng, Y. Long, P. Gao, C. Sun, and H. Dong, “Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 5228–5234

2024

[30]

Self-attention with relative position representations,

P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 464–468

2018

[31]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017

[32]

On Evaluation of Embodied Navigation Agents

P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al., “On evaluation of embodied navigation agents,” arXiv preprint arXiv:1807.06757, 2018

2018

[33]

Integrating egocentric localization for more realistic point-goal navigation agents,

S. Datta, O. Maksymets, J. Hoffman, S. Lee, D. Batra, and D. Parikh, “Integrating egocentric localization for more realistic point-goal navigation agents,” in Conference on Robot Learning. PMLR, 2021, pp. 313–328

2021