The Replica Dataset: A Digital Replica of Indoor Spaces
31 papers on Pith cite this work; polarity classification is still indexing.
abstract
We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world: for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the Replica renderings, there is hope that ML systems trained on Replica may transfer directly to real-world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is 'Habitat-compatible', i.e., it can be used natively with AI Habitat for training and testing embodied agents.
citing papers explorer
-
MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
MAGS-SLAM is the first RGB-only multi-agent 3D Gaussian Splatting SLAM framework that matches RGB-D performance via compact submap sharing, geometry-appearance loop verification, and occupancy-aware fusion.
-
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.
-
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
A new benchmark reveals MLLMs achieve only 13% or lower accuracy on advanced perspective-conditioned spatial tasks in omnidirectional images, with RL reward shaping raising a 7B model from 31% to 60% in controlled settings.
-
Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.
-
MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation
MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
-
Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.
-
A Survey of Spatial Memory Representations for Efficient Robot Navigation
The survey reviews spatial memory methods across 88 references, defines α as peak runtime memory over stored map size, profiles neural methods with α ranging from 2.3 to 215 on an A100 GPU, and proposes a standardized evaluation protocol plus α-aware budgeting.
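The α metric has a direct reading; a minimal sketch, assuming α is simply peak runtime memory divided by the stored map's size (the byte counts below are illustrative, not figures from the survey):

```python
def alpha(peak_runtime_bytes: int, map_size_bytes: int) -> float:
    """Memory-overhead ratio: alpha > 1 means the runtime footprint
    exceeds the size of the map representation itself."""
    if map_size_bytes <= 0:
        raise ValueError("map size must be positive")
    return peak_runtime_bytes / map_size_bytes

# e.g. a 100 MB neural map that peaks at 230 MB during inference:
print(alpha(230 * 2**20, 100 * 2**20))  # 2.3
```

Under this reading, the survey's reported range (2.3 to 215) says some neural methods need two orders of magnitude more working memory than the map they produce.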
-
Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
BDATP enhances generalization in audio-visual navigation by explicitly modeling interaural differences and using auxiliary action prediction, achieving gains of up to 21.6 percentage points in success rate on unheard sounds in the Replica dataset.
-
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
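Entropy-based probabilistic sampling can be illustrated independently of SparseSplat's network: a toy sketch that keeps a subset of points with probability proportional to a hypothetical per-point entropy score (the scores and weighting are assumptions for illustration, not the paper's formulation):

```python
import numpy as np

def entropy_sample(scores, k, seed=None):
    """Sample k point indices without replacement, weighted by
    per-point entropy scores: high-entropy (informative) points
    are more likely to survive the pruning."""
    rng = np.random.default_rng(seed)
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()  # normalize scores into a probability distribution
    return rng.choice(len(p), size=k, replace=False, p=p)

# keep 2 of 4 candidate points, biased toward the high-entropy ones
kept = entropy_sample([0.1, 5.0, 0.1, 3.0], k=2, seed=0)
```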
-
VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping
VBGS-SLAM uses variational inference on conjugate Gaussian properties to couple 3DGS map refinement and pose tracking with closed-form updates and posterior uncertainty, reducing drift compared to deterministic methods.
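Conjugacy is what makes closed-form updates possible. As a minimal illustration of the principle (not VBGS-SLAM's actual update equations), here is the exact posterior of a Gaussian mean under a Gaussian prior with known observation noise:

```python
def gaussian_posterior(mu0, var0, obs, obs_var):
    """Closed-form Bayesian update for a Gaussian mean with known
    observation variance: precisions add, and the posterior mean is
    a precision-weighted average of prior mean and observation."""
    precision = 1.0 / var0 + 1.0 / obs_var
    var_post = 1.0 / precision
    mu_post = var_post * (mu0 / var0 + obs / obs_var)
    return mu_post, var_post

# prior N(0, 1), observe 2.0 with noise variance 1.0
mu, var = gaussian_posterior(0.0, 1.0, 2.0, 1.0)  # -> (1.0, 0.5)
```

The appeal for SLAM is that such updates need no iterative optimization and carry an explicit posterior variance, i.e. the uncertainty the summary refers to.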
-
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.
-
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
-
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x gains on benchmarks and zero-shot transfer to novel scenes.
-
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x the prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve serialization to predict instance masks from learned queries.
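Morton-curve serialization itself is a standard trick for turning 3D positions into a 1D ordering that preserves spatial locality. A minimal sketch of the 3D Morton (Z-order) encoding, independent of SpaCeFormer's architecture:

```python
def part1by2(x: int) -> int:
    # Spread the low 10 bits of x so two zero bits separate each
    # original bit (standard bit-twiddling for a 30-bit Morton code).
    x &= 0x000003FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave 10-bit voxel coordinates x, y, z into one Z-order key;
    sorting points by this key serializes them along the Morton curve."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# nearby voxels get nearby keys: (0,0,0)->0, (1,0,0)->1, (1,1,0)->3
```

Sorting points by `morton3d` lets window attention operate on contiguous runs of the serialized sequence while still grouping spatial neighbors.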
-
Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
Cross-Attentive Multiview Fusion of Vision-Language Embeddings
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting
PointSplat uses 3D-geometry-only pruning and a dual-branch transformer to reduce Gaussian count in 3DGS scenes, delivering competitive quality and better efficiency without per-scene fine-tuning.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation
RAVN improves audio-visual navigation by learning audio-derived reliability cues via an Acoustic Geometry Reasoner and using them to modulate visual features through Reliability-Aware Geometric Modulation.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
-
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
-
First Shape, Then Meaning: Efficient Geometry and Semantics Learning for Indoor Reconstruction
FSTM improves indoor reconstruction by training geometry first without semantic supervision, then adding semantics, achieving 2.3x faster training and higher object surface recall than joint optimization.
-
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
-
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
-
Audio Spatially-Guided Fusion for Audio-Visual Navigation
Audio Spatially-Guided Fusion improves generalization in audio-visual navigation on unheard sound sources by extracting spatial audio features and adaptively fusing them with visual data.
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and transfer.