Matterport3D: Learning from RGB-D Data in Indoor Environments

2017 · cs.CV · arXiv 1709.06158

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

70 Pith papers citing it

Baseline 55% of classified citations

abstract

Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
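The totals quoted in the abstract imply a few simple ratios. The following is a minimal sketch that derives them from the three stated counts only; the function name `capture_ratios` and the dictionary keys are illustrative and are not part of any Matterport3D tooling. The per-scene figures are averages over the 90 scenes, not per-building values.

```python
# Totals quoted in the abstract (the only inputs used here).
PANORAMAS = 10_800
RGBD_IMAGES = 194_400
SCENES = 90


def capture_ratios(panoramas, images, scenes):
    """Return the averages implied by the dataset totals (hypothetical helper)."""
    return {
        "images_per_panorama": images / panoramas,
        "panoramas_per_scene": panoramas / scenes,
        "images_per_scene": images / scenes,
    }


ratios = capture_ratios(PANORAMAS, RGBD_IMAGES, SCENES)
for name, value in ratios.items():
    print(f"{name}: {value:g}")
```

With the abstract's numbers this works out to 18 RGB-D images per panoramic view, and on average 120 panoramas and 2,160 images per scene.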

citation-role summary

dataset: 7 · background: 3 · baseline: 1

citation-polarity summary

use dataset: 5 · background: 4 · baseline: 1 · unclear: 1

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds that they are not all-round players, and introduces the DA-Next-5M dataset along with the DA-Next baseline model.

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

cs.CV · 2021-11-17 · accept · novelty 8.0

ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.

UniDAC: Universal Metric Depth Estimation for Any Camera

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

cs.RO · 2026-05-10 · unverdicted · novelty 7.0

OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.

AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.

φ-Scene: Physically Grounded Image-to-3D Scene Reconstruction

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

φ-Scene performs image-to-3D scene reconstruction via topology-driven physical assembly that resolves penetrations with SDF optimization and settles objects with rigid-body simulation.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

Advancing DialNav through Automatic Embodied Dialog Augmentation

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.

Segment and Select: Vision-Language Segmentation in 3D Scenarios

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

SEGA3D improves 3D vision-language segmentation on ScanNet and Matterport3D by operating on fine-grained masks with LLM-assisted selection, claiming gains of 8.3 and 5.3 mIoU over prior top methods.

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.

VLM3: Vision Language Models Are Native 3D Learners

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.

citing papers explorer

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · none · ref 89 · internal anchor

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Advancing DialNav through Automatic Embodied Dialog Augmentation

cs.AI · 2026-06-18 · unverdicted · none · ref 1 · internal anchor

Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

cs.AI · 2026-06-16 · unverdicted · none · ref 7 · internal anchor

EvolveNav adds an agentic rule memory with UCB retrieval and a memory-guided preflection module to enable continuous improvement in zero-shot object goal navigation, reporting a 10.1% success rate gain over baselines.

Why Build an Assistant in Minecraft?

cs.AI · 2019-07-22 · unverdicted · none · ref 16 · internal anchor

A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.
