The Replica Dataset: A Digital Replica of Indoor Spaces

Anton Clarkson; Brian Budge; Carl Ren; Dhruv Batra; Elias Mueggler; Erik Wijmans; Hauke M. Strasdat; Jakob J. Engel; Jesus Briales; Julian Straub

arXiv: 1906.05797 · v1 · submitted 2019-06-13 · 💻 cs.CV · cs.GR · eess.IV

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, Richard Newcombe

Pith reviewed 2026-05-12 16:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · eess.IV

keywords Replica dataset · 3D indoor reconstruction · photo-realistic scenes · semantic segmentation · embodied agents · machine learning · computer vision · sim-to-real transfer

The pith

Replica is a dataset of 18 photo-realistic 3D indoor scenes designed so machine learning models trained on it may work directly on real-world data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Replica dataset containing 18 highly detailed 3D reconstructions of indoor environments. Each reconstruction includes a dense mesh, high-resolution HDR textures, semantic labels for classes and instances, and information on reflective surfaces like mirrors and glass. The purpose is to create realistic virtual worlds for training AI in computer vision and robotics tasks such as navigation and question answering. A key hope is that the realism allows models to transfer to actual images and videos without extra adjustments. This addresses the challenge of obtaining large amounts of labeled real-world data for such applications.

Core claim

What carries the argument

The Replica dataset of 18 indoor scenes, each supplying a dense mesh, HDR textures, per-primitive semantic labels, and reflector data to act as a realistic generative model for ML training.
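As a rough illustration of how these per-scene components fit together, here is a minimal Python sketch. Every class, field, and function name in it (`Scene`, `Primitive`, `PlanarReflector`, `primitives_per_class`, `instances_per_class`) is hypothetical and chosen for this example only. It does not reflect the file layout of the dataset or the API of the released C++ SDK, and the toy scene is made-up data.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Primitive:
    """One mesh primitive (e.g. a face) with its semantic annotation (hypothetical layout)."""
    vertex_ids: Tuple[int, ...]
    class_name: str   # semantic class, e.g. "chair"
    instance_id: int  # separates distinct objects of the same class


@dataclass
class PlanarReflector:
    """A planar mirror or glass surface, stored here as a polygon outline (assumption)."""
    corners: List[Vec3]
    kind: str  # "mirror" or "glass"


@dataclass
class Scene:
    """Hypothetical container for the components the summary lists per scene."""
    name: str
    vertices: List[Vec3]
    primitives: List[Primitive] = field(default_factory=list)
    reflectors: List[PlanarReflector] = field(default_factory=list)
    hdr_texture_path: str = ""  # placeholder for the HDR texture data


def primitives_per_class(scene: Scene) -> Counter:
    """Count how many primitives carry each semantic class label."""
    return Counter(p.class_name for p in scene.primitives)


def instances_per_class(scene: Scene) -> Counter:
    """Count distinct object instances per semantic class."""
    seen = {(p.class_name, p.instance_id) for p in scene.primitives}
    return Counter(name for name, _ in seen)


# Toy scene with made-up data: two chairs (three primitives) and one table.
demo = Scene(
    name="toy_room",
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    primitives=[
        Primitive((0, 1, 2), "chair", 1),
        Primitive((1, 2, 3), "chair", 1),
        Primitive((0, 2, 3), "chair", 2),
        Primitive((0, 1, 3), "table", 3),
    ],
    reflectors=[PlanarReflector([(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)], "mirror")],
)

print(primitives_per_class(demo))
print(instances_per_class(demo))
```

Keeping the class label and the instance id on each primitive is what makes both counts possible from the same list, which matches the "per-primitive semantic class and instance information" wording of the abstract.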

If this is right

Enables training and evaluation of 2D and 3D semantic segmentation models on accurate per-primitive labels.
Supports geometric inference research using dense, textured 3D meshes.
Allows creation of embodied agents for navigation, instruction following, and question answering in realistic settings.
Provides native compatibility with Habitat for virtual robot training and testing.
Supplies a minimal C++ SDK to facilitate immediate use of the reconstructions and renderings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset may lower reliance on domain-adaptation methods by shrinking the visual gap between simulation and reality.
Direct performance comparisons on Replica versus real data could quantify how much scene fidelity is required for different tasks.
Adding dynamic objects or time-varying lighting to the scenes could extend the work toward video-based and interactive AI.

Load-bearing premise

The 18 scenes achieve sufficient photo-realism and geometric accuracy in meshes, textures, and semantics that ML models trained on them transfer directly to real-world image and video data without domain adaptation.

What would settle it

Train a semantic segmentation or navigation model on Replica renderings and measure its accuracy on real captured indoor images or videos; comparable results to models trained on real data would support the direct-transfer claim.
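A minimal sketch of the scoring step for the segmentation variant of this test, assuming mean intersection-over-union (mIoU) as the accuracy measure. The metric choice, the function names, and the toy label arrays are assumptions made for illustration. They are not results or code from the paper.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes):
    """Rows index ground-truth classes, columns index predicted classes.

    Assumes predictions lie in [0, num_classes); ground-truth labels
    outside that range are treated as "ignore" and skipped.
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    valid = (gt >= 0) & (gt < num_classes)
    idx = gt[valid] * num_classes + pred[valid]
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())


# Toy labels standing in for real-image ground truth and two models' outputs.
gt_real = [0, 0, 1, 1, 2, 2]
pred_replica_trained = [0, 0, 1, 2, 2, 2]  # hypothetical model trained on renderings
pred_real_trained = [0, 0, 1, 1, 2, 2]     # hypothetical model trained on real data

miou_replica = mean_iou(confusion_matrix(pred_replica_trained, gt_real, 3))
miou_real = mean_iou(confusion_matrix(pred_real_trained, gt_real, 3))
print(f"Replica-trained mIoU: {miou_replica:.3f}, real-trained mIoU: {miou_real:.3f}")
```

Evaluating both models on the same real ground truth is the point of the comparison: the gap between the two scores is what the direct-transfer claim predicts should be small.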

Original abstract

We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world - for instance, egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is 'Habitat-compatible', i.e. can be natively used with AI Habitat for training and testing embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Replica dataset consisting of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene provides a dense mesh, high-resolution HDR textures, per-primitive semantic class and instance labels, and information on planar mirrors and glass reflectors. The goal is to support ML research in egocentric computer vision, semantic segmentation, geometric inference, and embodied AI, with the hope that trained models can transfer directly to real-world data. The authors release a minimal C++ SDK and note compatibility with the Habitat simulator.

Significance. The release of this dataset, along with the SDK and Habitat compatibility, represents a useful contribution to the field by providing a resource for training and testing models in highly detailed simulated indoor environments. If the claimed photo-realism holds, it could help advance research on sim-to-real transfer for tasks like navigation and question answering by embodied agents. The provision of semantic labels and reflector information strengthens its applicability to a range of CV and robotics tasks.

major comments (1)

[Abstract] Abstract: The claim that 'due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data' is presented without quantitative support such as FID/KID scores, perceptual similarity metrics, mesh reconstruction error statistics, or side-by-side comparisons to real RGB-D captures of the same rooms. This is load-bearing for the central motivation of direct transfer without domain adaptation.
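For reference, the first metric named above, the Fréchet Inception Distance, has a standard closed form. In the sketch below the subscript r for real captures and s for Replica renderings is a labeling assumption made here; the report does not fix a notation.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s
      - 2\,\bigl( \Sigma_r \Sigma_s \bigr)^{1/2} \right)
```

Here \mu and \Sigma are the mean and covariance of Inception-network features computed over each image set. A lower value means the two feature distributions are closer, so a small FID between renderings and real captures of the same rooms would be one form of the quantitative support the referee asks for.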

minor comments (2)

The manuscript would be strengthened by including a clear description of the data capture and reconstruction pipeline (including any accuracy metrics for geometry and textures) in a dedicated methods section.
Clarify whether the released dataset includes the original captured RGB-D images in addition to the reconstructed meshes and textures, as this affects usability for validation studies.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review and the helpful comment on the abstract. We address the concern point by point below.

Point-by-point responses

Referee: [Abstract] Abstract: The claim that 'due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data' is presented without quantitative support such as FID/KID scores, perceptual similarity metrics, mesh reconstruction error statistics, or side-by-side comparisons to real RGB-D captures of the same rooms. This is load-bearing for the central motivation of direct transfer without domain adaptation.

Authors: We agree that the statement in the abstract is aspirational and lacks the quantitative evidence (FID/KID, perceptual metrics, reconstruction errors, or direct real-world comparisons) that would be needed to substantiate direct sim-to-real transfer. The phrasing uses 'there is hope' to reflect an intended outcome rather than a demonstrated result, and the manuscript's motivation section grounds the realism in the capture pipeline (high-resolution HDR textures, dense meshes, and reflector modeling) rather than in transfer experiments. Because the paper's primary contribution is the dataset release and not a transfer benchmark, we do not have these metrics available. We will therefore revise the abstract to qualify the claim, emphasizing that Replica provides a high-fidelity simulation environment intended to support research on sim-to-real transfer while making clear that direct transfer without adaptation remains an open question to be investigated by the community.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or predictions are present in this dataset release paper.

Full rationale

The paper introduces Replica as a collection of 18 reconstructed indoor scenes with meshes, HDR textures, semantics, and reflectors. Its sole forward-looking statement is an informal hope that renderings may enable direct ML transfer to real data. No equations, fitted parameters, uniqueness theorems, ansatzes, or predictions are defined or derived anywhere in the manuscript. The work is a data resource release whose claims rest on descriptive pipeline details rather than any self-referential reduction or construction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the reconstructions are faithful to real spaces; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The provided 3D reconstructions and renderings are sufficiently accurate and photo-realistic to represent real indoor environments.
Invoked in the abstract when stating the goal of enabling direct transfer to real-world data.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction
cs.RO 2026-05 unverdicted novelty 8.0

MAGS-SLAM is the first RGB-only multi-agent 3D Gaussian Splatting SLAM framework that matches RGB-D performance via compact submap sharing, geometry-appearance loop verification, and occupancy-aware fusion.
Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
cs.CV 2025-12 unverdicted novelty 8.0

Integrating direction-of-arrival spectra and binaural embeddings from passive audio into vision models improves relative camera pose estimation in in-the-wild videos and adds robustness to visual corruption.
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
cs.CV 2021-09 accept novelty 8.0

HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.
OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives
cs.CV 2026-05 unverdicted novelty 7.0

OP2GS adds instance identities and dual opacities to 3D Gaussians so that visual rendering and object-mask rendering are handled by separate opacity channels, reducing label contamination while attaching semantics at ...
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
cs.CV 2026-05 unverdicted novelty 7.0

VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 7.0

PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 conditional novelty 7.0

MLLMs exhibit a large perception-reasoning gap on perspective-conditioned spatial reasoning in omnidirectional images, with accuracy falling from 57% on basic direction tasks to under 1% on compositional reasoning, th...
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains...
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

A new benchmark reveals MLLMs achieve only 13% or lower accuracy on advanced perspective-conditioned spatial tasks in omnidirectional images, with RL reward shaping raising a 7B model from 31% to 60% in controlled settings.
Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.
MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation
cs.RO 2026-05 unverdicted novelty 7.0

MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
cs.CV 2026-05 unverdicted novelty 7.0

S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.
3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO 2026-04 accept novelty 7.0

3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
A Survey of Spatial Memory Representations for Efficient Robot Navigation
cs.CV 2026-04 unverdicted novelty 7.0

The survey reviews spatial memory methods across 88 references, defines α as peak runtime memory over map size, profiles neural methods showing α from 2.3 to 215 on A100 GPU, and proposes a standardized evaluation pro...
Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
cs.SD 2026-04 unverdicted novelty 7.0

BDATP enhances generalization in audio-visual navigation by explicitly modeling interaural differences and using auxiliary action prediction, achieving up to 21.6 percentage point gains in success rate on unheard soun...
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
cs.CV 2026-04 unverdicted novelty 7.0

SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping
cs.CV 2026-04 unverdicted novelty 7.0

VBGS-SLAM uses variational inference on conjugate Gaussian properties to couple 3DGS map refinement and pose tracking with closed-form updates and posterior uncertainty, reducing drift compared to deterministic methods.
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
cs.CV 2026-03 unverdicted novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
EAG-PT: Emission-Aware Gaussians and Path Tracing for Diffuse Indoor Scene Reconstruction and Editing
cs.GR 2026-01 unverdicted novelty 7.0

EAG-PT reconstructs indoor scenes with emission-separated 2D Gaussians and uses path tracing for physically consistent editing of diffuse global illumination.
3AM: 3egment Anything with Geometric Consistency in Videos
cs.CV 2026-01 unverdicted novelty 7.0

3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
Decoupled Generative Modeling for Human-Object Interaction Synthesis
cs.CV 2025-12 unverdicted novelty 7.0

DecHOI decouples trajectory planning from motion synthesis to produce realistic human-object interactions without prescribed waypoints and with improved contact dynamics.
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
cs.CV 2025-12 unverdicted novelty 7.0

A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
cs.CV 2025-12 unverdicted novelty 7.0

OpenTrack3D achieves state-of-the-art open-vocabulary 3D instance segmentation by generating cross-view consistent proposals online with a visual-spatial tracker and replacing CLIP with an MLLM for improved compositio...
MODEST: Multi-Optics Depth-of-Field Stereo Dataset
cs.CV 2025-11 accept novelty 7.0

MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.
SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM
cs.RO 2025-04 unverdicted novelty 7.0

Presents SLAM&Render, a robot-recorded benchmark dataset with 40 multi-modal sequences for testing SLAM, novel view synthesis, and Gaussian Splatting under controlled variations in lighting, arrangements, and occlusions.
RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots
cs.RO 2026-05 unverdicted novelty 6.0

RGB-only active 3D scene graph generation unifies perception and planning to achieve depth-baseline parity and more than double object detection in active indoor exploration.
Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
cs.RO 2026-05 unverdicted novelty 6.0

Fixed external cameras as Common Prior Maps boost initial object recall in 3D scene graph generation by up to 79% and improve active exploration efficiency.
CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
cs.CV 2026-05 unverdicted novelty 6.0

Presents COVER, a greedy ERP viewpoint curator with coverage scoring and depth conflict penalization, and releases the CM-EVS dataset of 36k sparse panoramic RGB-D-pose frames from 1,275 indoor scenes plus outdoor data.
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
cs.CV 2026-05 unverdicted novelty 6.0

VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
cs.LG 2026-05 unverdicted novelty 6.0

MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
cs.LG 2026-05 unverdicted novelty 6.0

MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
cs.RO 2026-04 unverdicted novelty 6.0

FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x ...
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve seriali...
Geometric Context Transformer for Streaming 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
cs.RO 2026-04 unverdicted novelty 6.0

Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
Cross-Attentive Multiview Fusion of Vision-Language Embeddings
cs.CV 2026-04 unverdicted novelty 6.0

CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching S...
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
cs.CV 2026-04 unverdicted novelty 6.0

ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting
cs.CV 2026-04 unverdicted novelty 6.0

PointSplat uses 3D-geometry-only pruning and a dual-branch transformer to reduce Gaussian count in 3DGS scenes, delivering competitive quality and better efficiency without per-scene fine-tuning.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation
cs.SD 2026-04 unverdicted novelty 6.0

RAVN improves audio-visual navigation by learning audio-derived reliability cues via an Acoustic Geometry Reasoner and using them to modulate visual features through Reliability-Aware Geometric Modulation.
Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation
cs.RO 2026-03 unverdicted novelty 6.0

POMP is a parallel OctoMap-based mapping method that refines free space at fixed resolution to raise pathfinding success rates and shorten paths while preserving compatibility with existing planners.
TACO: Temporal Consensus Optimization for Continual Neural Mapping
cs.RO 2026-02 unverdicted novelty 6.0

TACO reformulates neural implicit mapping as temporal consensus optimization to enable continual adaptation to scene changes without data replay or storage.
C3G: Learning Compact 3D Representations with 2K Gaussians
cs.CV 2025-12 unverdicted novelty 6.0

C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
cs.CV 2025-11 unverdicted novelty 6.0

RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV 2025-11 unverdicted novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
cs.RO 2025-11 unverdicted novelty 6.0

Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.
OREN: Octree Residual Network for Real-Time Euclidean Signed Distance Mapping
cs.RO 2025-10 unverdicted novelty 6.0

OREN is a hybrid octree-neural residual method for real-time Euclidean SDF reconstruction that claims efficiency comparable to volumetric approaches and accuracy/differentiability comparable to neural networks.
Differentiable Acoustic Radiance Transfer
cs.SD 2025-09 unverdicted novelty 6.0

DART adds differentiability to acoustic radiance transfer, enabling material optimization and improved performance on sparse acoustic field prediction tasks compared to signal processing and neural baselines.
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
cs.CV 2025-09 unverdicted novelty 6.0

InternScenes is a new dataset of approximately 40,000 simulatable indoor scenes that combines real scans, procedural, and designer sources, preserves small objects for realistic layouts, and includes processing for si...
Compact 3D Gaussian Splatting For Dense Visual SLAM
cs.CV 2024-03 unverdicted novelty 6.0

A compact 3D Gaussian Splatting SLAM system reduces Gaussian count and parameter size via masking and a geometry codebook while preserving SOTA reconstruction quality and pose accuracy.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matchi...
Syn4D: A Multiview Synthetic 4D Dataset
cs.CV 2026-05 unverdicted novelty 5.0

Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
cs.RO 2026-05 unverdicted novelty 5.0

FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
First Shape, Then Meaning: Efficient Geometry and Semantics Learning for Indoor Reconstruction
cs.CV 2026-05 unverdicted novelty 5.0

FSTM improves indoor reconstruction by training geometry first without semantic supervision, then adding semantics, achieving 2.3x faster training and higher object surface recall than joint optimization.
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
cs.RO 2026-04 unverdicted novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
Audio Spatially-Guided Fusion for Audio-Visual Navigation
cs.SD 2026-04 unverdicted novelty 5.0

Audio Spatially-Guided Fusion improves generalization in audio-visual navigation on unheard sound sources by extracting spatial audio features and adaptively fusing them with visual data.
A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

A systematic literature survey that categorizes deep learning architectures for point cloud classification, part segmentation, and semantic segmentation, evaluates them on benchmarks, and discusses innovations, limita...
3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO 2026-04 unverdicted novelty 3.0

The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 59 Pith papers · 1 internal anchor

[1]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir Roshan Zamir. On evaluation of embodied navigation agents. arXiv:1807.06757, 2018

work page internal anchor Pith review arXiv 2018
[2]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018

work page 2018
[3]

VQA: Visual Question Answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

work page 2015
[4]

3D semantic parsing of large-scale indoor spaces

Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016

work page 2016
[5]

Ptex: Per-face texture mapping for production rendering

Brent Burley and Dylan Lacewell. Ptex: Per-face texture mapping for production rendering. In Computer Graphics Forum, volume 27, pages 1155–1164. Wiley Online Library, 2008

work page 2008
[6]

Matterport3D: Learning from RGB-D data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017. https://niessner.github.io/Matterport/

work page 2017
[7]

Kenneth J. W. Craik. The Nature of Explanation. Cambridge University Press, 1943

work page 1943
[8]

Scannet: Richly-annotated 3D reconstructions of indoor scenes

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017. http://www.scan-net.org/

[9]

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018

[10]

Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. TPAMI, 40(3):611–625, 2017

[11]

Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004

[12]

Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan. Example-based synthesis of 3D object arrangements. In ACM SIGGRAPH Asia, 2012

[13]

Alberto Garcia-Garcia, Pablo Martinez-Gonzalez, Sergiu Oprea, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Alvaro Jover-Alvarez. The robotrix: An extremely photorealistic and very-large-scale indoor dataset of sequences with robot trajectories and interactions. In IROS, pages 6790–6797. IEEE, 2018

[14]

A Handa, V Patraucean, V Badrinarayanan, S Stent, and R Cipolla. Scenenet: understanding real world indoor scenes with synthetic data. arXiv preprint arXiv:1511.07041, 2015

[15]

Wenzel Jakob, Marco Tarini, Daniele Panozzo, and Olga Sorkine-Hornung. Instant field-aligned meshes. ACM Transactions on Graphics, 34(6), November 2015

[16]

Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012

[17]

Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. In BMVC, 2018

[18]

Peter Liepa. Filling holes in meshes. In ACM SIGGRAPH Symposium on Geometry Processing, pages 200–205, 2003

[19]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014

[20]

William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, pages 163–169, New York, NY, USA,

[21]

Sébastien Loriot, Jane Tournois, and Ilker O. Yaz. Polygon mesh processing. In CGAL User and Reference Manual. CGAL Editorial Board, 4.14 edition, 2019

[22]

Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. TRO, 31(5):1147–1163, 2015

[23]

Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE, 2011

[24]

Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. arXiv preprint arXiv:1904.01201, 2019

[25]

Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017

[26]

R. S. Sutton and A. G. Barto. An adaptive network that constructs and uses an internal model of its world. Cognition and Brain Theory, 1981

[27]

Thomas Whelan, Michael Goesele, Steven J. Lovegrove, Julian Straub, Simon Green, Richard Szeliski, Steven Butterfield, Shobhit Verma, and Richard Newcombe. Reconstructing scenes with mirror and glass surfaces. ACM Transactions on Graphics (TOG), 37(4):102, 2018

[28]

Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In CVPR, 2018. http://gibsonenv.stanford.edu/database/
