SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
hub Mixed citations
The Replica Dataset: A Digital Replica of Indoor Spaces
Mixed citation behavior. Most common role is background (41%).
abstract
We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world - for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is `Habitat-compatible', i.e. can be natively used with AI Habitat for training and testing embodied agents.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MAGS-SLAM is the first RGB-only multi-agent 3D Gaussian Splatting SLAM framework that matches RGB-D performance via compact submap sharing, geometry-appearance loop verification, and occupancy-aware fusion.
HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.
GaussLite conditions 3D Gaussian Splatting seeding density, gradient flow, and scaling on task relevance masks derived from LLM-parsed natural language and open-vocabulary detection, yielding +2.72 dB ROI PSNR gains on Replica and +2.23 dB on real hardware at fixed budget.
ActMVS is the first monocular framework for active scene reconstruction that combines view factor graph construction with global depth optimization to generate online, globally consistent dense depth maps competitive with RGB-D methods on Replica datasets.
REST3D reconstructs physically stable 3D scenes from single images via agentic scene-tree understanding and physics-constrained optimization.
OP2GS adds instance identities and dual opacities to 3D Gaussians so that visual rendering and object-mask rendering are handled by separate opacity channels, reducing label contamination while attaching semantics at the object level.
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.
MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.
The survey reviews spatial memory methods across 88 references, defines α as peak runtime memory over map size, profiles neural methods showing α from 2.3 to 215 on A100 GPU, and proposes a standardized evaluation protocol plus α-aware budgeting.
BDATP enhances generalization in audio-visual navigation by explicitly modeling interaural differences and using auxiliary action prediction, achieving up to 21.6 percentage point gains in success rate on unheard sounds in Replica dataset.
SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
VBGS-SLAM uses variational inference on conjugate Gaussian properties to couple 3DGS map refinement and pose tracking with closed-form updates and posterior uncertainty, reducing drift compared to deterministic methods.
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
EAG-PT reconstructs indoor scenes with emission-separated 2D Gaussians and uses path tracing for physically consistent editing of diffuse global illumination.
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
DecHOI decouples trajectory planning from motion synthesis to produce realistic human-object interactions without prescribed waypoints and with improved contact dynamics.
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
citing papers explorer
-
MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
MAGS-SLAM is the first RGB-only multi-agent 3D Gaussian Splatting SLAM framework that matches RGB-D performance via compact submap sharing, geometry-appearance loop verification, and occupancy-aware fusion.
-
Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.