VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, Luca Carlone · 2025 · cs.CV · arXiv 2505.12549

30 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
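The degrees-of-freedom argument in the abstract can be made concrete with a small sketch. A similarity transform has 7 degrees of freedom (3 for translation, 3 for rotation, 1 for scale), whereas a 4x4 homography has 16 entries defined only up to a common scale, which leaves 15. Rescaling the matrix to unit determinant picks a representative in SL(4) without changing how it maps points. The code below is only an illustration under these assumptions: the function names and the example matrix `H` are hypothetical and are not taken from the VGGT-SLAM code, and the normalization assumes a positive determinant, since a real scale factor cannot change the sign of a 4x4 determinant.

```python
# Illustrative sketch only; not the VGGT-SLAM implementation.
# All names and the example matrix are hypothetical.

def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def normalize_to_sl4(h):
    """Rescale a 4x4 homography so its determinant is 1 (assumes det > 0)."""
    d = det(h)
    if d <= 0:
        raise ValueError("expected a positive determinant")
    s = d ** -0.25  # scaling a 4x4 matrix by s multiplies det by s**4
    return [[s * v for v in row] for row in h]

def apply_homography(h, p):
    """Map a 3D point through a 4x4 homography, then dehomogenize."""
    ph = (p[0], p[1], p[2], 1.0)
    q = [sum(h[i][k] * ph[k] for k in range(4)) for i in range(4)]
    return tuple(c / q[3] for c in q[:3])

# Hypothetical transform: anisotropic scale, a shift, and a projective row.
# No similarity transform (7 DoF) can represent this mapping.
H = [[2.0, 0.0, 0.0, 0.1],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0]]

H1 = normalize_to_sl4(H)
point = (1.0, 2.0, 3.0)
mapped_raw = apply_homography(H, point)
mapped_sl4 = apply_homography(H1, point)  # same result: overall scale cancels
```

Because the overall scale cancels during dehomogenization, `H` and `H1` map every point identically; the unit-determinant constraint only removes the redundant sixteenth parameter.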

citation-role summary

background 1 · baseline 1 · dataset 1

citation-polarity summary

background 1 · baseline 1 · use dataset 1

representative citing papers

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

cs.CV · 2026-04-29 · unverdicted · novelty 7.0 · 2 refs

AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

cs.RO · 2026-04-16 · unverdicted · novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

Cotraining on 532 everyday human videos with accurate hand labels improves robot policies by 29.7% when networks specialize to human versus robot embodiments.

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

PRISM-SLAM adds a Plücker Ray-Distance Factor and dynamic uncertainty gating to a VFM-augmented factor graph to deliver scale-consistent metric SLAM at 30 FPS from monocular RGB.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

cs.CV · 2026-04-06 · conditional · novelty 6.0 · 2 refs

ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

Depth Anything 3: Recovering the Visual Space from Any Views

cs.CV · 2025-11-13 · unverdicted · novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

cs.CV · 2025-07-23 · unverdicted · novelty 6.0

PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · novelty 6.0

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 5.0

KiloGS-SLAM is a monocular 3DGS SLAM system with condition-triggered hybrid tracking and probabilistic chunk-based Gaussian mapping that scales to over 10,000 frames in outdoor environments while maintaining accuracy and efficiency.

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

cs.CV · 2026-06-16 · unverdicted · novelty 5.0

MoonSplat adds global Sim(3) loop closure and color residual learning to voxelized online 3D Gaussian Splatting for improved monocular camera tracking and rendering quality.

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

cs.CV · 2026-06-09 · unverdicted · novelty 5.0

DarkVGGT introduces physics-aware thermal factorization and geometry-shared routing modules in an RGB-T feed-forward framework to improve depth and camera pose estimation under degraded RGB conditions.

$R^3$: 3D Reconstruction via Relative Regression

cs.CV · 2026-05-26 · unverdicted · novelty 5.0

R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.

Global Structure-from-Motion Meets Feedforward Reconstruction

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

cs.RO · 2026-04-12 · unverdicted · novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · none · ref 23 · 3 links · internal anchor

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
