VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, Luca Carlone · 2025 · cs.CV · arXiv 2505.12549

30 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
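
As a rough illustration of the 15-degrees-of-freedom claim in the abstract, the NumPy sketch below shows that a 4×4 homography acts on 3D points only up to scale. Fixing its determinant to 1 (membership in SL(4)) therefore leaves 16 - 1 = 15 free parameters, compared with 7 for a similarity transform. This is not code from the paper: the function names `normalize_to_sl4` and `apply_homography`, the constants, and the example matrix are all assumptions made for illustration.

```python
import numpy as np

# Degrees of freedom: a 4x4 matrix has 16 entries, and the unit-determinant
# constraint removes one. A similarity transform has 3 (rotation)
# + 3 (translation) + 1 (scale).
SL4_DOF = 4 * 4 - 1
SIM3_DOF = 3 + 3 + 1


def normalize_to_sl4(H):
    """Rescale a 4x4 matrix with positive determinant so that det(H) == 1."""
    det = np.linalg.det(H)
    if det <= 0:
        # A real scale factor s changes the determinant by s**4 > 0,
        # so a non-positive determinant cannot be rescaled to 1.
        raise ValueError("expected a matrix with positive determinant")
    return H / det ** 0.25


def apply_homography(H, points):
    """Apply a 4x4 homography to an (N, 3) array of 3D points."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ H.T
    # Divide by the last homogeneous coordinate to return to 3D.
    return mapped[:, :3] / mapped[:, 3:4]


# Hypothetical example homography (illustrative values only). The non-zero
# entries in the last row are what a similarity transform cannot represent.
H = np.array([
    [1.00, 0.10, 0.00, 0.20],
    [0.00, 1.10, 0.10, -0.10],
    [0.05, 0.00, 0.90, 0.30],
    [0.01, 0.02, 0.00, 1.00],
])
H_sl4 = normalize_to_sl4(H)

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(5, 3))

# Both matrices map the points identically: the overall scale of H is
# unobservable, which is why only 15 of the 16 entries are free.
same_mapping = np.allclose(apply_homography(H, points),
                           apply_homography(H_sl4, points))
print("det after normalization:", np.linalg.det(H_sl4))
print("same mapping before and after normalization:", same_mapping)
```

Setting the last row of `H` to `[0, 0, 0, 1]` and the upper-left block to a scaled rotation would recover the similarity-transform special case that the abstract argues is inadequate for uncalibrated cameras.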

citation-role summary

background: 1 · baseline: 1 · dataset: 1

citation-polarity summary

background: 1 · baseline: 1 · use dataset: 1

representative citing papers

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

cs.CV · 2026-04-29 · unverdicted · novelty 7.0 · 2 refs

AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

cs.RO · 2026-04-16 · unverdicted · novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

Cotraining on 532 everyday human videos with accurate hand labels improves robot policies by 29.7% when networks specialize to human versus robot embodiments.

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

PRISM-SLAM adds a Plücker Ray-Distance Factor and dynamic uncertainty gating to a VFM-augmented factor graph to deliver scale-consistent metric SLAM at 30 FPS from monocular RGB.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

cs.CV · 2026-04-06 · conditional · novelty 6.0 · 2 refs

ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

Depth Anything 3: Recovering the Visual Space from Any Views

cs.CV · 2025-11-13 · unverdicted · novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

cs.CV · 2025-07-23 · unverdicted · novelty 6.0

PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · novelty 6.0

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 5.0

KiloGS-SLAM is a monocular 3DGS SLAM system with condition-triggered hybrid tracking and probabilistic chunk-based Gaussian mapping that scales to over 10,000 frames in outdoor environments while maintaining accuracy and efficiency.

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

cs.CV · 2026-06-16 · unverdicted · novelty 5.0

MoonSplat adds global Sim(3) loop closure and color residual learning to voxelized online 3D Gaussian Splatting for improved monocular camera tracking and rendering quality.

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

cs.CV · 2026-06-09 · unverdicted · novelty 5.0

DarkVGGT introduces physics-aware thermal factorization and geometry-shared routing modules in an RGB-T feed-forward framework to improve depth and camera pose estimation under degraded RGB conditions.

$R^3$: 3D Reconstruction via Relative Regression

cs.CV · 2026-05-26 · unverdicted · novelty 5.0

R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.

Global Structure-from-Motion Meets Feedforward Reconstruction

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

cs.RO · 2026-04-12 · unverdicted · novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

citing papers explorer

Showing 24 of 24 citing papers after filters.

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
cs.CV · 2026-06-01 · unverdicted · none · ref 36 · internal anchor
TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
cs.CV · 2026-05-17 · unverdicted · none · ref 17 · internal anchor
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
cs.CV · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
cs.CV · 2026-04-29 · unverdicted · none · ref 30 · 2 links · internal anchor
AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping
cs.CV · 2026-06-03 · unverdicted · none · ref 55 · internal anchor
Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
cs.CV · 2026-05-20 · unverdicted · none · ref 44 · 2 links · internal anchor
Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
cs.CV · 2026-05-07 · unverdicted · none · ref 23 · 3 links · internal anchor
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
cs.CV · 2026-04-28 · unverdicted · none · ref 10 · internal anchor
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV · 2026-04-09 · unverdicted · none · ref 41 · internal anchor
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
cs.CV · 2026-04-06 · conditional · none · ref 11 · 2 links · internal anchor
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV · 2025-11-13 · unverdicted · none · ref 54 · internal anchor
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
cs.CV · 2025-07-23 · unverdicted · none · ref 38 · internal anchor
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
cs.CV · 2025-07-10 · unverdicted · none · ref 47 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes
cs.CV · 2026-06-29 · unverdicted · none · ref 25 · internal anchor
KiloGS-SLAM is a monocular 3DGS SLAM system with condition-triggered hybrid tracking and probabilistic chunk-based Gaussian mapping that scales to over 10,000 frames in outdoor environments while maintaining accuracy and efficiency.

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization
cs.CV · 2026-06-16 · unverdicted · none · ref 26 · internal anchor
MoonSplat adds global Sim(3) loop closure and color residual learning to voxelized online 3D Gaussian Splatting for improved monocular camera tracking and rendering quality.

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax
cs.CV · 2026-06-09 · unverdicted · none · ref 29 · internal anchor
DarkVGGT introduces physics-aware thermal factorization and geometry-shared routing modules in an RGB-T feed-forward framework to improve depth and camera pose estimation under degraded RGB conditions.

$R^3$: 3D Reconstruction via Relative Regression
cs.CV · 2026-05-26 · unverdicted · none · ref 38 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.

Global Structure-from-Motion Meets Feedforward Reconstruction
cs.CV · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
cs.CV · 2026-05-22 · unverdicted · none · ref 25 · internal anchor
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
cs.CV · 2025-11-21 · unverdicted · none · ref 20 · internal anchor
SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.

TTT3R: 3D Reconstruction as Test-Time Training
cs.CV · 2025-09-30 · unverdicted · none · ref 49 · internal anchor
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV · 2025-08-12 · unverdicted · none · ref 47 · internal anchor
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

VGGT-SLAM++
cs.CV · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
cs.CV · 2025-07-22 · conditional · none · ref 19 · internal anchor
VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.
