CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

Anh Nguyen; Baoru Huang; Feng Tian; Hesheng Wang; Qi Shao; Shuhao Zhai; Zhihao Cao

arxiv: 2605.30488 · v1 · pith:4E4GJ4MTnew · submitted 2026-05-28 · 💻 cs.RO


Pith reviewed 2026-06-29 06:52 UTC · model grok-4.3

classification 💻 cs.RO

keywords: collaborative SLAM · monocular mapping · 3D reconstruction priors · multi-agent systems · outdoor SLAM · dense fusion · bundle adjustment


The pith

Learned 3D reconstruction priors allow monocular RGB cameras to produce globally consistent metric maps for multiple outdoor robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoMo3R-SLAM as the first collaborative monocular dense RGB SLAM system for outdoor multi-agent teams. It addresses scale ambiguity and unreliable feature matching in low-overlap outdoor scenes by relying on learned feed-forward 3D reconstruction priors instead of depth sensors. Each agent uses the priors for real-time tracking and local dense fusion, while a central coordinator matches pointmaps across agents to create cross-agent constraints and perform global optimization. This results in robust, consistent maps from monocular RGB alone, achieving competitive accuracy on benchmarks like Tanks and Temples and Waymo while running at 8 FPS.

Core claim

CoMo3R-SLAM demonstrates that learned feed-forward 3D reconstruction priors can provide the robust geometric information required for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment, enabling the production of globally consistent metric maps from monocular RGB inputs in outdoor multi-agent systems without depth sensors or parametric intrinsics.

What carries the argument

Learned feed-forward 3D reconstruction priors that guide the front-end tracking and enable dense pointmap matching for cross-agent constraints.

If this is right

The system achieves the best absolute trajectory error on three of four Tanks and Temples scenes.
It matches or exceeds state-of-the-art RGB-D methods on Waymo sequences.
The system operates online at 8 FPS using only monocular RGB.
Robust cross-agent constraints are generated without relying on traditional feature matching.
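
Absolute trajectory error (ATE), the metric behind the first implication, is commonly reported as the root-mean-square distance between estimated and ground-truth camera positions. The sketch below assumes the two trajectories are already time-associated and aligned (for monocular systems, typically with a similarity transform); the function name `ate_rmse` and the toy data are illustrative and not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """RMSE of position differences between two aligned, associated trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance for each pair of corresponding positions.
    squared = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: every estimated position is offset by 0.1 m along x.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (2.1, 0.0, 0.0)]
print(round(ate_rmse(est, gt), 6))
```

Because the result depends on the alignment applied beforehand, comparisons between methods are only meaningful when the same alignment protocol is used for all of them.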

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such systems could reduce hardware requirements and costs for multi-robot deployments in large outdoor areas.
Extensions might include handling dynamic environments or integrating with semantic understanding.
The approach suggests learned priors can substitute for depth in other SLAM variants.

Load-bearing premise

Learned feed-forward 3D reconstruction priors supply reliable geometric information sufficient for cross-agent verification and scale recovery in outdoor scenes with low overlap and repetitive structures.

What would settle it

Demonstrating a set of outdoor scenes where the learned priors produce inaccurate geometry that leads to failed cross-agent data association or inconsistent scale in the final map.

Figures

Figures reproduced from arXiv: 2605.30488 by Anh Nguyen, Baoru Huang, Feng Tian, Hesheng Wang, Qi Shao, Shuhao Zhai, Zhihao Cao.

**Figure 2.** Overview of CoMo3R-SLAM. Each agent runs a prior-guided front-end (Sec. …)

**Figure 3.** Qualitative comparison of dense reconstructions on the Tanks and Temples dataset …

**Figure 4.** Dense reconstruction results of CoMo3R-SLAM on three Waymo driving scenes …

**Figure 5.** CoMo3R-SLAM performs collaborative monocular dense SLAM from two RGB-only agent streams …

**Figure 6.** Qualitative comparison of the three depth parameterisations of Table …

**Figure 7.** Qualitative comparison of the calibrated and uncalibrated variants of CoMo3R-SLAM on the same …

**Figure 8.** All-time robustness on in-the-wild handheld iPhone captures. We run CoMo3R-SLAM as two inde…

**Figure 9.** Multi-agent reconstruction with three and four agents on Tanks and Temples …

Original abstract

Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult because of scale ambiguity and unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable; this motivates the use of robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoMo3R-SLAM claims the first learned-prior approach to collaborative monocular dense outdoor SLAM and shows competitive ATE numbers, but the key evidence gap is whether the priors actually deliver reliable scale and cross-agent geometry where feature matching fails.


The paper introduces a pipeline where each agent uses feed-forward 3D priors for tracking and local fusion, then a coordinator does dense pointmap matching, closed-form Sim(3) sync, and global BA with segment depth optimization. That architecture is new for the monocular collaborative outdoor setting and removes the need for depth sensors or known intrinsics.

It reports the best ATE on three of four Tanks and Temples scenes plus competitive Waymo numbers while running at 8 FPS online. Those results are worth noting because they come from monocular RGB alone and match or beat some RGB-D baselines.

The soft spot is the missing check on the priors themselves. The central claim rests on the priors supplying metric-consistent geometry accurate enough for verification and scale recovery in low-overlap outdoor scenes with repetitive texture. The abstract gives no pointmap error numbers against LiDAR or other ground truth under those exact conditions, no ablation on prior quality, and no error bars on the ATE figures. Without that, it is hard to tell whether the reported performance comes from the priors doing what is claimed or from other parts of the pipeline.

This is a paper for robotics groups working on multi-agent mapping who want to explore learned priors as a way around depth hardware. It deserves a serious referee because the system is described in enough detail to evaluate and the results are on standard benchmarks, even if the priors' contribution needs tighter quantification.
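
The pointmap check that the letter finds missing could take the form of a per-pixel comparison between depths read from the predicted pointmaps and sparse reference depths such as projected LiDAR returns. The sketch below computes two widely used depth metrics, the absolute relative error and the fraction of pixels within a ratio threshold. The function name, the 1.25 threshold, and the toy values are assumptions for illustration, not the paper's evaluation protocol.

```python
from typing import Sequence, Tuple


def depth_error_metrics(
    predicted: Sequence[float],
    reference: Sequence[float],
    ratio_threshold: float = 1.25,
) -> Tuple[float, float]:
    """Return (absolute relative error, inlier fraction) over valid depth pairs."""
    # Keep only pixels where both depths are positive (0 marks a missing value).
    pairs = [(p, r) for p, r in zip(predicted, reference) if r > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    abs_rel = sum(abs(p - r) / r for p, r in pairs) / len(pairs)
    inliers = sum(1 for p, r in pairs if max(p / r, r / p) < ratio_threshold)
    return abs_rel, inliers / len(pairs)


# Toy depths in metres; a reference depth of 0 marks a pixel without a LiDAR return.
pred = [10.0, 22.0, 5.0, 40.0]
ref = [10.0, 20.0, 0.0, 80.0]
abs_rel, inlier_frac = depth_error_metrics(pred, ref)
print(f"AbsRel={abs_rel:.3f}, inliers={inlier_frac:.2f}")
```

Reporting such numbers separately for low-overlap and repetitive-structure frames would address the concern more directly than end-to-end ATE alone.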

Referee Report

3 major / 1 minor

Summary. The manuscript presents CoMo3R-SLAM, a collaborative monocular dense RGB SLAM system for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion; a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. The system requires neither depth sensors nor known intrinsics and claims to produce robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences it reports the best ATE on three of four Tanks and Temples scenes, competitive Waymo accuracy, and online operation at 8 FPS.

Significance. If the central claims hold after validation, the work would be significant for multi-robot perception: it demonstrates that learned feed-forward 3D priors can enable metric-scale collaborative dense mapping with only monocular cameras in large-scale outdoor scenes, addressing payload and calibration costs of depth sensors. The online 8 FPS performance and explicit handling of low-overlap repetitive-structure regimes are practical strengths.

major comments (3)

[Abstract] The claim of best ATE on three of four Tanks and Temples scenes is presented without error bars, standard deviations across runs, or ablation details, making it impossible to assess whether the reported gains over baselines are statistically meaningful or robust.
[Abstract] No derivation, equations, or algorithmic description is supplied for the closed-form Sim(3) synchronization step, which is load-bearing for the metric consistency claim across agents.
[Abstract / Methods] The central claim that learned feed-forward 3D reconstruction priors supply reliable geometry for cross-agent verification and scale recovery in low-overlap outdoor scenes is not supported by any independent quantification (e.g., pointmap error versus LiDAR ground truth under the exact conditions where traditional matching fails); any systematic bias in the priors would directly undermine the Sim(3) constraints and global BA.
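
The first major comment asks for run-to-run statistics. A minimal sketch of that reporting step follows, assuming each scene has been run several times and the per-run ATE values are available. The scene name and the numbers are made-up placeholders, not results from the paper.

```python
import statistics
from typing import Dict, List, Tuple


def summarize_runs(ate_by_scene: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) of ATE per scene."""
    summary = {}
    for scene, values in ate_by_scene.items():
        if len(values) < 2:
            raise ValueError(f"{scene}: need at least two runs for a standard deviation")
        summary[scene] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Placeholder values in metres; NOT taken from the paper.
runs = {"scene_a": [0.10, 0.12, 0.14]}
for scene, (mean, std) in summarize_runs(runs).items():
    print(f"{scene}: {mean:.3f} ± {std:.3f} m")
```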

minor comments (1)

[Abstract] The statement 'matching or exceeding state-of-the-art RGB-D methods' would be strengthened by explicit citations to the specific RGB-D baselines being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to strengthen the manuscript while maintaining scientific accuracy.


Referee: [Abstract] The claim of best ATE on three of four Tanks and Temples scenes is presented without error bars, standard deviations across runs, or ablation details, making it impossible to assess whether the reported gains over baselines are statistically meaningful or robust.

Authors: We agree that error bars and standard deviations would improve assessment of robustness. The reported ATE values are from single runs, consistent with much of the SLAM literature, but we will rerun experiments with varied initializations on the Tanks and Temples sequences, report means and standard deviations, and expand ablation details on the priors' contribution in the revised manuscript.

Revision: yes

Referee: [Abstract] No derivation, equations, or algorithmic description is supplied for the closed-form Sim(3) synchronization step, which is load-bearing for the metric consistency claim across agents.

Authors: The closed-form Sim(3) synchronization is derived in Section 3.2 using pointmap correspondences to solve the similarity transformation via a closed-form least-squares formulation. We will add a brief summary of the key equations and a pointer to Section 3.2 directly in the abstract to make the metric consistency claim more self-contained.

Revision: partial

Referee: [Abstract / Methods] The central claim that learned feed-forward 3D reconstruction priors supply reliable geometry for cross-agent verification and scale recovery in low-overlap outdoor scenes is not supported by any independent quantification (e.g., pointmap error versus LiDAR ground truth under the exact conditions where traditional matching fails); any systematic bias in the priors would directly undermine the Sim(3) constraints and global BA.

Authors: The end-to-end ATE results on challenging low-overlap scenes provide indirect support for the priors' utility, as the system outperforms baselines that rely on traditional matching. Direct independent pointmap error quantification against LiDAR ground truth in failure regimes of traditional methods is not present in the current experiments. We will add a limitations discussion on potential prior biases and their impact on Sim(3) constraints.

Revision: partial
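
For readers unfamiliar with closed-form similarity alignment, the sketch below shows one standard way to solve for a Sim(3) transform from 3D point correspondences in the least-squares sense, following Umeyama's SVD-based method. It is an assumption that the paper's Section 3.2 uses this or an equivalent formulation; the function name `umeyama_sim3` and the synthetic data are illustrative only.

```python
import numpy as np


def umeyama_sim3(src: np.ndarray, dst: np.ndarray):
    """Least-squares scale s, rotation R, translation t with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # avoid returning a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]     # mean squared spread of src
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: apply a known similarity transform and recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = 2.0 * src @ R_true.T + t_true
s, R, t = umeyama_sim3(src, dst)
print(np.isclose(s, 2.0), np.allclose(R, R_true), np.allclose(t, t_true))
```

With noisy or biased pointmaps the recovered scale inherits that bias, which is why the referee's request for independent pointmap error numbers bears directly on this step.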

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract and system description present an architecture that applies external learned feed-forward 3D reconstruction priors to front-end tracking, dense pointmap matching, Sim(3) synchronization, and global BA. No equations, fitted parameters, or self-citations are shown that would make any claimed output (cross-agent constraints, metric maps, ATE scores) equivalent to the inputs by construction. The performance claims rest on benchmark results (Tanks and Temples, Waymo) that are independent of any internal fitting loop described here, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5796 in / 1065 out tokens · 21761 ms · 2026-06-29T06:52:17.517370+00:00 · methodology


Appendix / Supplemental Materials (excerpts)

Run global pose LM-IRLS over Eq. (5) (depths frozen) for at most 100 iterations; repeat up to 5 passes until $\max_i \lVert \Delta t_{WC_i} \rVert_2 < 5$ mm.

w/ Loop Closure

For $r = 1, \dots, R$ with $R = 100$: (a) build segments from each keyframe's current canonical pointmap via Eq. (19); (b) refine $s$ by Eq. (24) (poses frozen, 10 inner iterations); (c) write the updated $\hat{X}$ back into the canonical pointmaps; (d) run a short pose pass (Eq. (5), 50 iterations) so the poses absorb the depth change; (e) broadcast the new global poses to all ...
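
The stopping rule quoted for the global pose step (at most 100 iterations per pass, up to 5 passes, stop once the largest per-camera translation update falls below 5 mm) can be sketched as a small control loop. The optimizer itself is not reproduced here: `optimize_pass` is a hypothetical callback standing in for one LM-IRLS pass, and `fake_pass` is a stand-in used only to exercise the loop.

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def run_pose_passes(
    optimize_pass: Callable[[int], List[Vec3]],
    max_passes: int = 5,
    max_iterations: int = 100,
    tolerance_m: float = 0.005,
) -> int:
    """Repeat pose passes until the largest translation update is below tolerance.

    optimize_pass(max_iterations) is a hypothetical callback that runs one pass
    of the pose optimizer and returns each camera's translation change in metres.
    Returns the number of passes executed.
    """
    for pass_index in range(1, max_passes + 1):
        updates = optimize_pass(max_iterations)
        largest = max(math.sqrt(x * x + y * y + z * z) for x, y, z in updates)
        if largest < tolerance_m:
            return pass_index
    return max_passes


# Stand-in optimizer whose updates shrink tenfold on every pass.
state = {"step": 0.2}


def fake_pass(max_iterations: int) -> List[Vec3]:
    state["step"] /= 10.0
    return [(state["step"], 0.0, 0.0), (0.0, state["step"] / 2.0, 0.0)]


print(run_pose_passes(fake_pass))
```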

[1] [1]

Knapitsch, J

A. Knapitsch, J. Park, Q.-Y . Zhou, and V . Koltun. Tanks and temples: Benchmarking large- scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017

2017

[2] [2]

Schmuck, T

P. Schmuck, T. Ziegler, M. Karrer, J. Perraudin, and M. Chli. Covins: Visual-inertial slam for centralized collaboration.arXiv preprint arXiv:2108.05756, 2021

work page arXiv 2021

[3] [3]

Y . Tian, Y . Chang, F. H. Arias, C. Nieto-Granda, J. P. How, and L. Carlone. Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems.IEEE transactions on robotics, 38(4), 2022

2022

[4] [4]

Yugay, T

V . Yugay, T. Gevers, and M. R. Oswald. Magic-slam: Multi-agent gaussian globally consistent slam. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6741–6750, 2025

2025

[5] [5]

T. Deng, G. Shen, C. Xun, S. Yuan, T. Jin, H. Shen, Y . Wang, J. Wang, H. Wang, D. Wang, et al. Mne-slam: Multi-agent neural slam for mobile robots. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1485–1494, 2025

2025

[6] [6]

J. Hu, M. Mao, H. Bao, G. Zhang, and Z. Cui. Cp-slam: Collaborative neural point-based slam system.Advances in Neural Information Processing Systems, 36:39429–39442, 2023

2023

[7] [7]

X. Xu, F. Xue, S. Zhao, Y . Pan, S. Scherer, and X. Huang. Mac-ego3d: Multi-agent gaussian consensus for real-time collaborative ego-motion and photorealistic 3d reconstruction. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 854–863, 2025

2025

[8] [8]

L. Chen, Y . Su, J. Wang, P. Han, Z. Xia, S. Bu, K. Li, B. Hu, S. Meng, and G. Wang. Coma- slam: Collaborative multi-agent gaussian slam with geometric consistency. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2922–2929, 2026

2026

[9] [9]

T. Deng, G. Shen, X. Chen, S. Yuan, H. Shen, G. Peng, Z. Wu, J. Wang, L. Xie, D. Wang, et al. Mcn-slam: Multi-agent collaborative neural slam with hybrid implicit neural scene representation.arXiv preprint arXiv:2506.18678, 2025

work page arXiv 2025

[10] [10]

Thomas, A

A. Thomas, A. Sonawalla, A. Rose, and J. P. How. Grand-slam: Local optimization for globally consistent large-scale multi-agent gaussian slam.IEEE Robotics and Automation Letters, 2025

2025

[11] [11]

Schmuck and M

P. Schmuck and M. Chli. Ccm-slam: Robust and efficient centralized collaborative monocular simultaneous localization and mapping for robotic teams.Journal of Field Robotics, 36(4): 763–781, 2019

2019

[12] [12]

Lipson and J

L. Lipson and J. Deng. Multi-session slam with differentiable wide-baseline pose optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19626–19635, 2024

2024

[13] [13]

Zhou and H

Y . Zhou and H. Wu. Multi-agent monocular dense slam with 3d reconstruction priors.arXiv preprint arXiv:2511.19031, 2025

work page arXiv 2025

[14] [14]

Y . Li, P. Ye, and Q. Jia. Mang-slam: Multi-agent neural submap and gaussian representation for dense mapping.IEEE Robotics and Automation Letters, 11(2):2242–2249, 2025

2025

[15] [15]

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

2024

[16] [16]

Leroy, Y

V . Leroy, Y . Cabon, and J. Revaud. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision, pages 71–91. Springer, 2024. 10

2024

[17] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[18] R. Murai, E. Dexheimer, and A. J. Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025.

[19] D. Maggio, H. Lim, and L. Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. Advances in Neural Information Processing Systems, 38:129839–129867, 2026.

[20] P.-Y. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, and G. Beltrame. Door-slam: Distributed, online, and outlier resilient slam for robotic teams. IEEE Robotics and Automation Letters, 5(2):1656–1663, 2020.

[21] P.-Y. Lajoie and G. Beltrame. Swarm-slam: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems. IEEE Robotics and Automation Letters, 9(1):475–482, 2023.

[22] M. M. Li, P.-Y. Lajoie, J. Liu, and G. Beltrame. Compact keyframe-optimized multi-agent gaussian splatting slam. arXiv preprint arXiv:2604.00804, 2026.

[23] J. Yu, T. Chen, and M. Schwager. Hammer: heterogeneous, multi-robot semantic gaussian splatting. IEEE Robotics and Automation Letters, 2025.

[24] Z. Cao, Q. Shao, S. Zhai, J. Zhang, A. Nguyen, and B. Huang. Mags-slam: Monocular multi-agent gaussian splatting slam for geometrically and photometrically consistent reconstruction. arXiv preprint arXiv:2605.10760, 2026.

[25] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam. IEEE transactions on pattern analysis and machine intelligence, 29(6):1052–1067, 2007.

[26] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics, 37(6):1874–1890, 2021.

[27] J. Engel, T. Schöps, and D. Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014.

[28] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017.

[29] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam: Dense tracking and mapping in real-time. In 2011 international conference on computer vision, pages 2320–2327. IEEE, 2011.

[30] Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.

[31] L. Lipson, Z. Teed, and J. Deng. Deep patch visual slam. In European Conference on Computer Vision, pages 424–440. Springer, 2024.

[32] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[33] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison. imap: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6229–6238, 2021.

[34] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12786–12796, 2022.

[35] Z. Zhu, S. Peng, V. Larsson, Z. Cui, M. R. Oswald, A. Geiger, and M. Pollefeys. Nicer-slam: Neural implicit scene encoding for rgb slam. In 2024 International Conference on 3D Vision (3DV), pages 42–52. IEEE, 2024.

[36] M. M. Johari, C. Carta, and F. Fleuret. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17408–17419, 2023.

[37] E. Sandström, Y. Li, L. Van Gool, and M. R. Oswald. Point-slam: Dense neural point cloud-based slam. In Proceedings of the IEEE/CVF international conference on computer vision, pages 18433–18444, 2023.

[38] L. Liso, E. Sandström, V. Yugay, L. Van Gool, and M. R. Oswald. Loopy-slam: Dense neural slam with loop closures. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20363–20373, 2024.

[39] B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[40] N. Keetha, J. Karhade, K. M. Jatavallabhula, G. Yang, S. Scherer, D. Ramanan, and J. Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21357–21366, 2024.

[41] E. Sandström, G. Zhang, K. Tateno, M. Oechsle, M. Niemeyer, Y. Zhang, M. Patel, L. Van Gool, M. Oswald, and F. Tombari. Splat-slam: Globally optimized rgb-only slam with 3d gaussians. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1680–1691, 2025.

[42] W. Zhang, Q. Cheng, D. Skuddis, N. Zeller, D. Cremers, and N. Haala. Hi-slam2: Geometry-aware gaussian slam for fast monocular scene reconstruction. IEEE Transactions on Robotics, 41:6478–6493, 2025.

[43] M. Li, S. Liu, H. Zhou, G. Zhu, N. Cheng, T. Deng, and H. Wang. Sgs-slam: Semantic gaussian splatting for neural dense slam. In European Conference on Computer Vision, pages 163–179. Springer, 2024.

[44] V. Yugay, Y. Li, T. Gevers, and M. R. Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting. arXiv preprint arXiv:2312.10070, 2023.

[45] H. Matsuki, R. Murai, P. H. Kelly, and A. J. Davison. Gaussian splatting slam. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18039–18048, 2024.

[46] Z. Peng, T. Shao, Y. Liu, J. Zhou, Y. Yang, J. Wang, and K. Zhou. Rtg-slam: Real-time 3d reconstruction at scale using gaussian splatting. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024.

[47] Z. Cao, H. Wu, L. W. Tang, Z. Luo, W. Zhang, M. Pollefeys, Z. Zhu, and M. R. Oswald. Mcgs-slam: A multi-camera slam framework using gaussian splatting for high-fidelity mapping. arXiv preprint arXiv:2509.14191, 2025.

[48] B. P. Duisterhof, L. Zust, P. Weinzaepfel, V. Leroy, Y. Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In 2025 International Conference on 3D Vision (3DV), pages 1–10. IEEE, 2025.

[49] Q. Wang, Y. Zhang, A. Holynski, A. A. Efros, and A. Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

[50] Y. Liu, S. Dong, S. Wang, Y. Yin, Y. Yang, Q. Fan, and B. Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16651–16662, 2025.

[51] H. Wang and L. Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025.

[52] Z. Teed and J. Deng. Tangent space backpropagation for 3d transformation groups. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10338–10347, 2021.

[53] G. Tolias, T. Jenicek, and O. Chum. Learning and aggregating deep local descriptors for instance-level recognition. In European Conference on Computer Vision, pages 460–477. Springer, 2020.

[54] G. Tolias, Y. Avrithis, and H. Jégou. To aggregate or not to aggregate: Selective match kernels for image search. In Proceedings of the IEEE international conference on computer vision, pages 1401–1408, 2013.

[55] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4):376–380, 1991.

[56] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.

[57] A. Concha and J. Civera. Dpptam: Dense piecewise planar tracking and mapping from a monocular sequence. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5686–5693. IEEE, 2015.

[58] K. Mazur, G. Bae, and A. J. Davison. Superprimitive: Scene reconstruction at a primitive level. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4979–4989, 2024.

[59] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.

Appendix / Supplemental Materials 3D Reconstruction (Norma...

[60] Run global pose LM-IRLS over Eq. (5) (depths frozen) for at most 100 iterations; repeat up to 5 passes until $\max_i \lVert \Delta t_{WC_i} \rVert_2 < 5$ mm.
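The stopping rule in this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_pose_passes`, `solve_pass`, and `toy_solver` are hypothetical names, translations are assumed to be in metres, and the LM-IRLS solve of Eq. (5) is replaced by a placeholder callable.

```python
import numpy as np


def run_pose_passes(translations, solve_pass, max_passes=5, max_iters=100, tol_m=0.005):
    """Repeat a pose-only solve until the largest translation update is below tol_m.

    translations: (N, 3) array of camera positions in metres (assumed unit).
    solve_pass:   hypothetical callable (translations, max_iters) -> new (N, 3) array;
                  it stands in for the LM-IRLS solve of Eq. (5) with depths frozen.
    """
    t = np.asarray(translations, dtype=float)
    n_pass, max_update = 0, float("inf")
    for n_pass in range(1, max_passes + 1):
        t_new = np.asarray(solve_pass(t, max_iters), dtype=float)
        # max_i ||delta t_i||_2 over all keyframes
        max_update = float(np.linalg.norm(t_new - t, axis=1).max())
        t = t_new
        if max_update < tol_m:
            break
    return t, n_pass, max_update


def toy_solver(t, max_iters):
    """Placeholder solver: moves every position 90% of the way to the origin."""
    return t - 0.9 * t


start = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
final_t, passes_used, last_update = run_pose_passes(start, toy_solver)
```

In this toy run the largest update shrinks by a factor of ten per pass (0.18 m, 0.018 m, 0.0018 m), so the loop stops after the third pass, before the cap of five passes is reached.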

[61] For $r = 1, \dots, R$ with $R = 100$: (a) build segments from each keyframe's current canonical pointmap via Eq. (19); (b) refine $s$ by Eq. (24) (poses frozen, 10 inner iterations); (c) write the updated $\hat{X}$ back into the canonical pointmaps; (d) run a short pose pass (Eq. (5), 50 iterations) so the poses absorb the depth change; (e) broadcast the new global poses to all ...
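The round structure of steps (a) to (e) can be sketched as an alternating loop. This is only a schematic under stated assumptions: every callback name below is hypothetical, the bodies of Eq. (19), Eq. (24), and Eq. (5) are not reproduced, and the recipients of the broadcast in step (e) are left to the caller because the source text is cut off there.

```python
def alternate_depth_and_pose(state, build_segments, refine_segments, write_back,
                             pose_pass, broadcast_poses,
                             rounds=100, inner_iters=10, pose_iters=50):
    """Alternate a depth-side refinement (poses frozen) with a short pose pass."""
    for _ in range(rounds):
        segments = build_segments(state)                          # (a) Eq. (19)
        refined = refine_segments(state, segments, inner_iters)   # (b) Eq. (24), poses frozen
        state = write_back(state, refined)                        # (c) update canonical pointmaps
        state = pose_pass(state, pose_iters)                      # (d) Eq. (5), short pass
        broadcast_poses(state)                                    # (e) publish new global poses
    return state


# Toy run: stub callbacks that only record the order in which they are called.
log = []
final_state = alternate_depth_and_pose(
    state=0,
    build_segments=lambda s: log.append("a") or [s],
    refine_segments=lambda s, seg, n: log.append("b") or seg,
    write_back=lambda s, refined: log.append("c") or s + 1,
    pose_pass=lambda s, n: log.append("d") or s,
    broadcast_poses=lambda s: log.append("e"),
    rounds=2,
)
```

Keeping the pose pass inside every round, rather than once at the end, matches the stated purpose of step (d): the poses absorb each depth change before the next set of segments is built.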