TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.
hub
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
30 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
Cotraining on 532 everyday human videos with accurate hand labels improves robot policies by 29.7% when networks specialize to human versus robot embodiments.
Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.
Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.
PRISM-SLAM adds a Plücker Ray-Distance Factor and dynamic uncertainty gating to a VFM-augmented factor graph to deliver scale-consistent metric SLAM at 30 FPS from monocular RGB.
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
KiloGS-SLAM is a monocular 3DGS SLAM system with condition-triggered hybrid tracking and probabilistic chunk-based Gaussian mapping that scales to over 10,000 frames in outdoor environments while maintaining accuracy and efficiency.
MoonSplat adds global Sim(3) loop closure and color residual learning to voxelized online 3D Gaussian Splatting for improved monocular camera tracking and rendering quality.
DarkVGGT introduces physics-aware thermal factorization and geometry-shared routing modules in an RGB-T feed-forward framework to improve depth and camera pose estimation under degraded RGB conditions.
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
citing papers explorer
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.