Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
hub
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
24 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-rigid platforms.
SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.
citing papers explorer
-
Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
-
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
-
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
-
Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
-
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
-
PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.
-
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
-
$R^3$: 3D Reconstruction via Relative Regression
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
-
Global Structure-from-Motion Meets Feedforward Reconstruction
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
-
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
-
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
-
Metric, inertially aligned monocular state estimation via kinetodynamic priors
The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-rigid platforms.
-
SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.
-
TTT3R: 3D Reconstruction as Test-Time Training
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
-
ViPE: Video Pose Engine for 3D Geometric Perception
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
-
VGGT-SLAM++
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
-
VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.
- Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
- PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM