The Replica Dataset: A Digital Replica of Indoor Spaces
Mixed citation behavior. Most common role is background (41%).
abstract
We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world - for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is 'Habitat-compatible', i.e. can be natively used with AI Habitat for training and testing embodied agents.
citing papers explorer
-
MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
MAGS-SLAM is the first RGB-only multi-agent 3D Gaussian Splatting SLAM framework that matches RGB-D performance via compact submap sharing, geometry-appearance loop verification, and occupancy-aware fusion.
-
Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
Integrating direction-of-arrival spectra and binaural embeddings from passive audio into vision models improves relative camera pose estimation in in-the-wild videos and adds robustness to visual corruption.
-
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.
-
REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image
REST3D reconstructs physically stable 3D scenes from single images via agentic scene-tree understanding and physics-constrained optimization.
-
OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives
OP2GS adds instance identities and dual opacities to 3D Gaussians so that visual rendering and object-mask rendering are handled by separate opacity channels, reducing label contamination while attaching semantics at the object level.
-
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
-
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
-
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
-
Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.
-
MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation
MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
-
Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.
-
A Survey of Spatial Memory Representations for Efficient Robot Navigation
The survey reviews spatial memory methods across 88 references, defines α as peak runtime memory over map size, profiles neural methods showing α from 2.3 to 215 on A100 GPU, and proposes a standardized evaluation protocol plus α-aware budgeting.
-
Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
BDATP enhances generalization in audio-visual navigation by explicitly modeling interaural differences and using auxiliary action prediction, achieving gains of up to 21.6 percentage points in success rate on unheard sounds on the Replica dataset.
-
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
-
VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping
VBGS-SLAM uses variational inference on conjugate Gaussian properties to couple 3DGS map refinement and pose tracking with closed-form updates and posterior uncertainty, reducing drift compared to deterministic methods.
-
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
-
EAG-PT: Emission-Aware Gaussians and Path Tracing for Diffuse Indoor Scene Reconstruction and Editing
EAG-PT reconstructs indoor scenes with emission-separated 2D Gaussians and uses path tracing for physically consistent editing of diffuse global illumination.
-
3AM: 3egment Anything with Geometric Consistency in Videos
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
-
Decoupled Generative Modeling for Human-Object Interaction Synthesis
DecHOI decouples trajectory planning from motion synthesis to produce realistic human-object interactions without prescribed waypoints and with improved contact dynamics.
-
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
-
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
OpenTrack3D achieves state-of-the-art open-vocabulary 3D instance segmentation by generating cross-view consistent proposals online with a visual-spatial tracker and replacing CLIP with an MLLM for improved compositional reasoning.
-
MODEST: Multi-Optics Depth-of-Field Stereo Dataset
MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.
-
SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM
Presents SLAM&Render, a robot-recorded benchmark dataset with 40 multi-modal sequences for testing SLAM, novel view synthesis, and Gaussian Splatting under controlled variations in lighting, arrangements, and occlusions.
-
Déjà View: Looping Transformers for Multi-View 3D Reconstruction
DéjàView applies a single transformer block recurrently for K refinement steps, matching or exceeding larger feed-forward models on five multi-view 3D benchmarks with fewer parameters and comparable compute.
-
DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
DGSG-Mind is a hybrid 3D Gaussian dynamic scene graph system with an embodied reasoning agent for robust instance fusion, dynamic updates, and multimodal grounding in self-reconstructed maps.
-
S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields
S2MDF is a lightweight add-on layer that applies a hard no-intersection constraint to any compositional SDF representation, reducing overlaps to numerical precision without architecture changes.
-
Provably Guaranteed Polytopic Uncertainty Quantification for SLAM
Presents polytopic forward, backward, and compound UQ modules for a full 3D SLAM pipeline, with deterministic containment guarantees and conformal prediction calibration from data.
-
RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots
RGB-only active 3D scene graph generation unifies perception and planning to achieve depth-baseline parity and more than double object detection in active indoor exploration.
-
Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
Fixed external cameras as Common Prior Maps boost initial object recall in 3D scene graph generation by up to 79% and improve active exploration efficiency.
-
CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
Presents COVER, a greedy ERP viewpoint curator with coverage scoring and depth conflict penalization, and releases the CM-EVS dataset of 36k sparse panoramic RGB-D-pose frames from 1,275 indoor scenes plus outdoor data.
-
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.
-
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
-
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x gains on benchmarks and zero-shot transfer to novel scenes.
-
Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
Cross-Attentive Multiview Fusion of Vision-Language Embeddings
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting
PointSplat uses 3D-geometry-only pruning and a dual-branch transformer to reduce Gaussian count in 3DGS scenes, delivering competitive quality and better efficiency without per-scene fine-tuning.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation
RAVN improves audio-visual navigation by learning audio-derived reliability cues via an Acoustic Geometry Reasoner and using them to modulate visual features through Reliability-Aware Geometric Modulation.
-
Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation
POMP is a parallel OctoMap-based mapping method that refines free space at fixed resolution to raise pathfinding success rates and shorten paths while preserving compatibility with existing planners.
-
C3G: Learning Compact 3D Representations with 2K Gaussians
C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.
-
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
-
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.
-
OREN: Octree Residual Network for Real-Time Euclidean Signed Distance Mapping
OREN is a hybrid octree-neural residual method for real-time Euclidean SDF reconstruction that claims efficiency comparable to volumetric approaches and accuracy/differentiability comparable to neural networks.
-
Differentiable Acoustic Radiance Transfer
DART adds differentiability to acoustic radiance transfer, enabling material optimization and improved performance on sparse acoustic field prediction tasks compared to signal processing and neural baselines.
-
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
InternScenes is a new dataset of approximately 40,000 simulatable indoor scenes that combines real scans, procedural, and designer sources, preserves small objects for realistic layouts, and includes processing for simulation and interaction.
-
Compact 3D Gaussian Splatting For Dense Visual SLAM
A compact 3D Gaussian Splatting SLAM system reduces Gaussian count and parameter size via masking and a geometry codebook while preserving SOTA reconstruction quality and pose accuracy.