hub

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, Luca Carlone · 2025 · cs.CV · arXiv 2505.12549

24 Pith papers cite this work. Polarity classification is still indexing.

24 Pith papers citing it

open full Pith review browse 24 citing papers arXiv PDF

abstract

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 baseline 1 dataset 1

citation-polarity summary

background 1 baseline 1 use dataset 1

representative citing papers

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

cs.CV · 2026-04-29 · conditional · novelty 7.0

AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

cs.RO · 2026-04-16 · unverdicted · novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

cs.CV · 2026-04-06 · conditional · novelty 6.0 · 2 refs

ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

Depth Anything 3: Recovering the Visual Space from Any Views

cs.CV · 2025-11-13 · unverdicted · novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

cs.CV · 2025-07-23 · unverdicted · novelty 6.0

PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · novelty 6.0

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

$R^3$: 3D Reconstruction via Relative Regression

cs.CV · 2026-05-26 · unverdicted · novelty 5.0

R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.

Global Structure-from-Motion Meets Feedforward Reconstruction

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

cs.RO · 2026-04-12 · unverdicted · novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

Metric, inertially aligned monocular state estimation via kinetodynamic priors

cs.RO · 2025-11-25 · unverdicted · novelty 5.0

The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-rigid platforms.

SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

cs.CV · 2025-11-21 · unverdicted · novelty 5.0

SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.

TTT3R: 3D Reconstruction as Test-Time Training

cs.CV · 2025-09-30 · unverdicted · novelty 5.0

TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

ViPE: Video Pose Engine for 3D Geometric Perception

cs.CV · 2025-08-12 · unverdicted · novelty 5.0

ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

VGGT-SLAM++

cs.CV · 2026-04-08 · unverdicted · novelty 4.0

VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

cs.CV · 2025-07-22 · conditional · novelty 4.0

VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

cs.RO · 2026-05-19

citing papers explorer

Showing 24 of 24 citing papers.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory cs.CV · 2026-05-17 · unverdicted · none · ref 17 · internal anchor
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model cs.RO · 2026-05-17 · unverdicted · none · ref 30 · internal anchor
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction cs.CV · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision cs.CV · 2026-04-29 · conditional · none · ref 30 · internal anchor
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye cs.RO · 2026-04-16 · unverdicted · none · ref 18 · internal anchor
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 23 · 3 links · internal anchor
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 10 · internal anchor
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 41 · internal anchor
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging cs.CV · 2026-04-06 · conditional · none · ref 11 · 2 links · internal anchor
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
Depth Anything 3: Recovering the Visual Space from Any Views cs.CV · 2025-11-13 · unverdicted · none · ref 54 · internal anchor
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving cs.CV · 2025-07-23 · unverdicted · none · ref 38 · internal anchor
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV · 2025-07-10 · unverdicted · none · ref 47 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
$R^3$: 3D Reconstruction via Relative Regression cs.CV · 2026-05-26 · unverdicted · none · ref 38 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
Global Structure-from-Motion Meets Feedforward Reconstruction cs.CV · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 25 · internal anchor
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM cs.RO · 2026-04-12 · unverdicted · none · ref 11 · internal anchor
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
Metric, inertially aligned monocular state estimation via kinetodynamic priors cs.RO · 2025-11-25 · unverdicted · none · ref 26 · internal anchor
The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-rigid platforms.
SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors cs.CV · 2025-11-21 · unverdicted · none · ref 20 · internal anchor
SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.
TTT3R: 3D Reconstruction as Test-Time Training cs.CV · 2025-09-30 · unverdicted · none · ref 49 · internal anchor
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
ViPE: Video Pose Engine for 3D Geometric Perception cs.CV · 2025-08-12 · unverdicted · none · ref 47 · internal anchor
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
VGGT-SLAM++ cs.CV · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences cs.CV · 2025-07-22 · conditional · none · ref 19 · internal anchor
VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory cs.CV · 2026-05-20 · unreviewed · ref 44 · internal anchor
PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM cs.RO · 2026-05-19 · unreviewed · ref 49 · internal anchor

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer