pith. sign in

arxiv: 2605.30488 · v1 · pith:4E4GJ4MTnew · submitted 2026-05-28 · 💻 cs.RO

CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

Pith reviewed 2026-06-29 06:52 UTC · model grok-4.3

classification 💻 cs.RO
keywords collaborative SLAMmonocular mapping3D reconstruction priorsmulti-agent systemsoutdoor SLAMdense fusionbundle adjustment
0
0 comments X

The pith

Learned 3D reconstruction priors allow monocular RGB cameras to produce globally consistent metric maps for multiple outdoor robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoMo3R-SLAM as the first collaborative monocular dense RGB SLAM system for outdoor multi-agent teams. It addresses scale ambiguity and unreliable feature matching in low-overlap outdoor scenes by relying on learned feed-forward 3D reconstruction priors instead of depth sensors. Each agent uses the priors for real-time tracking and local dense fusion, while a central coordinator matches pointmaps across agents to create cross-agent constraints and perform global optimization. This results in robust, consistent maps from monocular RGB alone, achieving competitive accuracy on benchmarks like Tanks and Temples and Waymo while running at 8 FPS.

Core claim

CoMo3R-SLAM demonstrates that learned feed-forward 3D reconstruction priors can provide the robust geometric information required for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment, enabling the production of globally consistent metric maps from monocular RGB inputs in outdoor multi-agent systems without depth sensors or parametric intrinsics.

What carries the argument

Learned feed-forward 3D reconstruction priors that guide the front-end tracking and enable dense pointmap matching for cross-agent constraints.

If this is right

  • The system achieves the best absolute trajectory error on three of four Tanks and Temples scenes.
  • It matches or exceeds state-of-the-art RGB-D methods on Waymo sequences.
  • The system operates online at 8 FPS using only monocular RGB.
  • Robust cross-agent constraints are generated without relying on traditional feature matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems could reduce hardware requirements and costs for multi-robot deployments in large outdoor areas.
  • Extensions might include handling dynamic environments or integrating with semantic understanding.
  • The approach suggests learned priors can substitute for depth in other SLAM variants.

Load-bearing premise

Learned feed-forward 3D reconstruction priors supply reliable geometric information sufficient for cross-agent verification and scale recovery in outdoor scenes with low overlap and repetitive structures.

What would settle it

Demonstrating a set of outdoor scenes where the learned priors produce inaccurate geometry that leads to failed cross-agent data association or inconsistent scale in the final map.

Figures

Figures reproduced from arXiv: 2605.30488 by Anh Nguyen, Baoru Huang, Feng Tian, Hesheng Wang, Qi Shao, Shuhao Zhai, Zhihao Cao.

Figure 1
Figure 1. Figure 1: CoMo3R-SLAM performs collaborative monocular dense SLAM from two RGB-only agent streams [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of CoMo3R-SLAM. Each agent runs a prior-guided front-end (Sec. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of dense reconstructions on the Tanks and Temples dataset [ [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Dense reconstruction results of CoMo3R-SLAM on three Waymo driving scenes [ [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: CoMo3R-SLAM performs collaborative monocular dense SLAM from two RGB-only agent streams [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of the three depth parameterisations of Table [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison of the calibrated and uncalibrated variants of CoMo3R-SLAM on the same [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: All-time robustness on in-the-wild handheld iPhone captures. We run CoMo3R-SLAM as two inde [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Multi-agent reconstruction with three and four agents on Tanks and Temples [ [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
read the original abstract

Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity, unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable, motivating robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents CoMo3R-SLAM, a collaborative monocular dense RGB SLAM system for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion; a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. The system requires neither depth sensors nor known intrinsics and claims to produce robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences it reports the best ATE on three of four Tanks and Temples scenes, competitive Waymo accuracy, and online operation at 8 FPS.

Significance. If the central claims hold after validation, the work would be significant for multi-robot perception: it demonstrates that learned feed-forward 3D priors can enable metric-scale collaborative dense mapping with only monocular cameras in large-scale outdoor scenes, addressing payload and calibration costs of depth sensors. The online 8 FPS performance and explicit handling of low-overlap repetitive-structure regimes are practical strengths.

major comments (3)
  1. [Abstract] Abstract: the claim of best ATE on three of four Tanks and Temples scenes is presented without error bars, standard deviations across runs, or ablation details, making it impossible to assess whether the reported gains over baselines are statistically meaningful or robust.
  2. [Abstract] Abstract: no derivation, equations, or algorithmic description is supplied for the closed-form Sim(3) synchronization step, which is load-bearing for the metric consistency claim across agents.
  3. [Abstract] Abstract / Methods: the central claim that learned feed-forward 3D reconstruction priors supply reliable geometry for cross-agent verification and scale recovery in low-overlap outdoor scenes is not supported by any independent quantification (e.g., pointmap error versus LiDAR ground truth under the exact conditions where traditional matching fails); any systematic bias in the priors would directly undermine the Sim(3) constraints and global BA.
minor comments (1)
  1. [Abstract] Abstract: the statement 'matching or exceeding state-of-the-art RGB-D methods' would be strengthened by explicit citations to the specific RGB-D baselines being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to strengthen the manuscript while maintaining scientific accuracy.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of best ATE on three of four Tanks and Temples scenes is presented without error bars, standard deviations across runs, or ablation details, making it impossible to assess whether the reported gains over baselines are statistically meaningful or robust.

    Authors: We agree that error bars and standard deviations would improve assessment of robustness. The reported ATE values are from single runs, consistent with much of the SLAM literature, but we will rerun experiments with varied initializations on the Tanks and Temples sequences, report means and standard deviations, and expand ablation details on the priors' contribution in the revised manuscript. revision: yes

  2. Referee: [Abstract] Abstract: no derivation, equations, or algorithmic description is supplied for the closed-form Sim(3) synchronization step, which is load-bearing for the metric consistency claim across agents.

    Authors: The closed-form Sim(3) synchronization is derived in Section 3.2 using pointmap correspondences to solve the similarity transformation via a closed-form least-squares formulation. We will add a brief summary of the key equations and a pointer to Section 3.2 directly in the abstract to make the metric consistency claim more self-contained. revision: partial

  3. Referee: [Abstract] Abstract / Methods: the central claim that learned feed-forward 3D reconstruction priors supply reliable geometry for cross-agent verification and scale recovery in low-overlap outdoor scenes is not supported by any independent quantification (e.g., pointmap error versus LiDAR ground truth under the exact conditions where traditional matching fails); any systematic bias in the priors would directly undermine the Sim(3) constraints and global BA.

    Authors: The end-to-end ATE results on challenging low-overlap scenes provide indirect support for the priors' utility, as the system outperforms baselines that rely on traditional matching. Direct independent pointmap error quantification against LiDAR ground truth in failure regimes of traditional methods is not present in the current experiments. We will add a limitations discussion on potential prior biases and their impact on Sim(3) constraints. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract and system description present an architecture that applies external learned feed-forward 3D reconstruction priors to front-end tracking, dense pointmap matching, Sim(3) synchronization, and global BA. No equations, fitted parameters, or self-citations are shown that would make any claimed output (cross-agent constraints, metric maps, ATE scores) equivalent to the inputs by construction. The performance claims rest on benchmark results (Tanks and Temples, Waymo) that are independent of any internal fitting loop described here, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5796 in / 1065 out tokens · 21761 ms · 2026-06-29T06:52:17.517370+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Knapitsch, J

    A. Knapitsch, J. Park, Q.-Y . Zhou, and V . Koltun. Tanks and temples: Benchmarking large- scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017

  2. [2]

    Schmuck, T

    P. Schmuck, T. Ziegler, M. Karrer, J. Perraudin, and M. Chli. Covins: Visual-inertial slam for centralized collaboration.arXiv preprint arXiv:2108.05756, 2021

  3. [3]

    Y . Tian, Y . Chang, F. H. Arias, C. Nieto-Granda, J. P. How, and L. Carlone. Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems.IEEE transactions on robotics, 38(4), 2022

  4. [4]

    Yugay, T

    V . Yugay, T. Gevers, and M. R. Oswald. Magic-slam: Multi-agent gaussian globally consistent slam. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6741–6750, 2025

  5. [5]

    T. Deng, G. Shen, C. Xun, S. Yuan, T. Jin, H. Shen, Y . Wang, J. Wang, H. Wang, D. Wang, et al. Mne-slam: Multi-agent neural slam for mobile robots. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1485–1494, 2025

  6. [6]

    J. Hu, M. Mao, H. Bao, G. Zhang, and Z. Cui. Cp-slam: Collaborative neural point-based slam system.Advances in Neural Information Processing Systems, 36:39429–39442, 2023

  7. [7]

    X. Xu, F. Xue, S. Zhao, Y . Pan, S. Scherer, and X. Huang. Mac-ego3d: Multi-agent gaussian consensus for real-time collaborative ego-motion and photorealistic 3d reconstruction. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 854–863, 2025

  8. [8]

    L. Chen, Y . Su, J. Wang, P. Han, Z. Xia, S. Bu, K. Li, B. Hu, S. Meng, and G. Wang. Coma- slam: Collaborative multi-agent gaussian slam with geometric consistency. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2922–2929, 2026

  9. [9]

    T. Deng, G. Shen, X. Chen, S. Yuan, H. Shen, G. Peng, Z. Wu, J. Wang, L. Xie, D. Wang, et al. Mcn-slam: Multi-agent collaborative neural slam with hybrid implicit neural scene representation.arXiv preprint arXiv:2506.18678, 2025

  10. [10]

    Thomas, A

    A. Thomas, A. Sonawalla, A. Rose, and J. P. How. Grand-slam: Local optimization for globally consistent large-scale multi-agent gaussian slam.IEEE Robotics and Automation Letters, 2025

  11. [11]

    Schmuck and M

    P. Schmuck and M. Chli. Ccm-slam: Robust and efficient centralized collaborative monocular simultaneous localization and mapping for robotic teams.Journal of Field Robotics, 36(4): 763–781, 2019

  12. [12]

    Lipson and J

    L. Lipson and J. Deng. Multi-session slam with differentiable wide-baseline pose optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19626–19635, 2024

  13. [13]

    Zhou and H

    Y . Zhou and H. Wu. Multi-agent monocular dense slam with 3d reconstruction priors.arXiv preprint arXiv:2511.19031, 2025

  14. [14]

    Y . Li, P. Ye, and Q. Jia. Mang-slam: Multi-agent neural submap and gaussian representation for dense mapping.IEEE Robotics and Automation Letters, 11(2):2242–2249, 2025

  15. [15]

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  16. [16]

    Leroy, Y

    V . Leroy, Y . Cabon, and J. Revaud. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision, pages 71–91. Springer, 2024. 10

  17. [17]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual ge- ometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  18. [18]

    Murai, E

    R. Murai, E. Dexheimer, and A. J. Davison. Mast3r-slam: Real-time dense slam with 3d recon- struction priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025

  19. [19]

    Maggio, H

    D. Maggio, H. Lim, and L. Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold.Advances in Neural Information Processing Systems, 38:129839–129867, 2026

  20. [20]

    Lajoie, B

    P.-Y . Lajoie, B. Ramtoula, Y . Chang, L. Carlone, and G. Beltrame. Door-slam: Distributed, online, and outlier resilient slam for robotic teams.IEEE Robotics and Automation Letters, 5 (2):1656–1663, 2020

  21. [21]

    Lajoie and G

    P.-Y . Lajoie and G. Beltrame. Swarm-slam: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems.IEEE Robotics and Automation Letters, 9(1):475–482, 2023

  22. [22]

    M. M. Li, P.-Y . Lajoie, J. Liu, and G. Beltrame. Compact keyframe-optimized multi-agent gaussian splatting slam.arXiv preprint arXiv:2604.00804, 2026

  23. [23]

    J. Yu, T. Chen, and M. Schwager. Hammer: heterogeneous, multi-robot semantic gaussian splatting.IEEE Robotics and Automation Letters, 2025

  24. [24]

    Z. Cao, Q. Shao, S. Zhai, J. Zhang, A. Nguyen, and B. Huang. Mags-slam: Monocular multi- agent gaussian splatting slam for geometrically and photometrically consistent reconstruction. arXiv preprint arXiv:2605.10760, 2026

  25. [25]

    A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single cam- era slam.IEEE transactions on pattern analysis and machine intelligence, 29(6):1052–1067, 2007

  26. [26]

    Campos, R

    C. Campos, R. Elvira, J. J. G. Rodr ´ıguez, J. M. Montiel, and J. D. Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021

  27. [27]

    Engel, T

    J. Engel, T. Sch ¨ops, and D. Cremers. Lsd-slam: Large-scale direct monocular slam. InEuro- pean conference on computer vision, pages 834–849. Springer, 2014

  28. [28]

    Engel, V

    J. Engel, V . Koltun, and D. Cremers. Direct sparse odometry.IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017

  29. [29]

    R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam: Dense tracking and mapping in real-time. In2011 international conference on computer vision, pages 2320–2327. IEEE, 2011

  30. [30]

    Teed and J

    Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021

  31. [31]

    Lipson, Z

    L. Lipson, Z. Teed, and J. Deng. Deep patch visual slam. InEuropean Conference on Computer Vision, pages 424–440. Springer, 2024

  32. [32]

    Mildenhall, P

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  33. [33]

    Sucar, S

    E. Sucar, S. Liu, J. Ortiz, and A. J. Davison. imap: Implicit mapping and positioning in real- time. InProceedings of the IEEE/CVF international conference on computer vision, pages 6229–6238, 2021. 11

  34. [34]

    Z. Zhu, S. Peng, V . Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys. Nice- slam: Neural implicit scalable encoding for slam. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12786–12796, 2022

  35. [35]

    Z. Zhu, S. Peng, V . Larsson, Z. Cui, M. R. Oswald, A. Geiger, and M. Pollefeys. Nicer-slam: Neural implicit scene encoding for rgb slam. In2024 International Conference on 3D Vision (3DV), pages 42–52. IEEE, 2024

  36. [36]

    M. M. Johari, C. Carta, and F. Fleuret. Eslam: Efficient dense slam system based on hybrid rep- resentation of signed distance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17408–17419, 2023

  37. [37]

    Sandstr ¨om, Y

    E. Sandstr ¨om, Y . Li, L. Van Gool, and M. R. Oswald. Point-slam: Dense neural point cloud- based slam. InProceedings of the IEEE/CVF international conference on computer vision, pages 18433–18444, 2023

  38. [38]

    L. Liso, E. Sandstr ¨om, V . Yugay, L. Van Gool, and M. R. Oswald. Loopy-slam: Dense neural slam with loop closures. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20363–20373, 2024

  39. [39]

    Kerbl, G

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, G. Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  40. [40]

    Keetha, J

    N. Keetha, J. Karhade, K. M. Jatavallabhula, G. Yang, S. Scherer, D. Ramanan, and J. Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21357–21366, 2024

  41. [41]

    Sandstr ¨om, G

    E. Sandstr ¨om, G. Zhang, K. Tateno, M. Oechsle, M. Niemeyer, Y . Zhang, M. Patel, L. Van Gool, M. Oswald, and F. Tombari. Splat-slam: Globally optimized rgb-only slam with 3d gaussians. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1680–1691, 2025

  42. [42]

    Zhang, Q

    W. Zhang, Q. Cheng, D. Skuddis, N. Zeller, D. Cremers, and N. Haala. Hi-slam2: Geometry- aware gaussian slam for fast monocular scene reconstruction.IEEE Transactions on Robotics, 41:6478–6493, 2025

  43. [43]

    M. Li, S. Liu, H. Zhou, G. Zhu, N. Cheng, T. Deng, and H. Wang. Sgs-slam: Semantic gaussian splatting for neural dense slam. InEuropean Conference on Computer Vision, pages 163–179. Springer, 2024

  44. [44]

    Gaussian-slam: Photo-realistic dense slam with gaussian splatting,

    V . Yugay, Y . Li, T. Gevers, and M. R. Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting.arXiv preprint arXiv:2312.10070, 2023

  45. [45]

    Matsuki, R

    H. Matsuki, R. Murai, P. H. Kelly, and A. J. Davison. Gaussian splatting slam. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18039–18048, 2024

  46. [46]

    Z. Peng, T. Shao, Y . Liu, J. Zhou, Y . Yang, J. Wang, and K. Zhou. Rtg-slam: Real-time 3d reconstruction at scale using gaussian splatting. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

  47. [47]

    Z. Cao, H. Wu, L. W. Tang, Z. Luo, W. Zhang, M. Pollefeys, Z. Zhu, and M. R. Oswald. Mcgs- slam: A multi-camera slam framework using gaussian splatting for high-fidelity mapping. arXiv preprint arXiv:2509.14191, 2025

  48. [48]

    B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In2025 International Conference on 3D Vision (3DV), pages 1–10. IEEE, 2025. 12

  49. [49]

    Q. Wang, Y . Zhang, A. Holynski, A. A. Efros, and A. Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  50. [50]

    Y . Liu, S. Dong, S. Wang, Y . Yin, Y . Yang, Q. Fan, and B. Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16651–16662, 2025

  51. [51]

    Wang and L

    H. Wang and L. Agapito. 3d reconstruction with spatial memory. In2025 International Con- ference on 3D Vision (3DV), pages 78–89. IEEE, 2025

  52. [52]

    Teed and J

    Z. Teed and J. Deng. Tangent space backpropagation for 3d transformation groups. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10338– 10347, 2021

  53. [53]

    Tolias, T

    G. Tolias, T. Jenicek, and O. Chum. Learning and aggregating deep local descriptors for instance-level recognition. InEuropean Conference on Computer Vision, pages 460–477. Springer, 2020

  54. [54]

    Tolias, Y

    G. Tolias, Y . Avrithis, and H. J´egou. To aggregate or not to aggregate: Selective match kernels for image search. InProceedings of the IEEE international conference on computer vision, pages 1401–1408, 2013

  55. [55]

    S. Umeyama. Least-squares estimation of transformation parameters between two point pat- terns.IEEE Transactions on pattern analysis and machine intelligence, 13(4):376–380, 1991

  56. [56]

    Achanta, A

    R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. S¨usstrunk. Slic superpixels compared to state-of-the-art superpixel methods.IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012

  57. [57]

    Concha and J

    A. Concha and J. Civera. Dpptam: Dense piecewise planar tracking and mapping from a monocular sequence. In2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5686–5693. IEEE, 2015

  58. [58]

    Mazur, G

    K. Mazur, G. Bae, and A. J. Davison. Superprimitive: Scene reconstruction at a primitive level. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4979–4989, 2024

  59. [59]

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 13 Appendix / Supplemental Materials 3D Reconstruction (Norma...

  60. [60]

    (5) (depths frozen) for at most100iterations; repeat up to5passes untilmax i∥∆tW Ci ∥2 <5mm

    Run global pose LM-IRLS over Eq. (5) (depths frozen) for at most100iterations; repeat up to5passes untilmax i∥∆tW Ci ∥2 <5mm

  61. [61]

    w/ Loop Closure

    Forr= 1, . . . , RwithR= 100: (a) build segments from each keyframe’s current canonical pointmap via Eq. (19); (b) refinesby Eq. (24) (poses frozen,10inner iterations); (c) write the updated ˆXback into the canonical pointmaps; (d) run a short pose pass (Eq. (5),50iterations) so the poses absorb the depth change; (e) broadcast the new global poses to all ...