Matterport3D: Learning from RGB-D Data in Indoor Environments

2017 · cs.CV · arXiv 1709.06158

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

70 Pith papers citing it

Baseline 55% of classified citations

abstract

Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
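
The counts quoted in the abstract imply some simple averages. The sketch below only divides the three totals stated there (194,400 images, 10,800 panoramic views, 90 scenes). The function and variable names are illustrative and do not come from the dataset's tooling, and the results are dataset-wide averages, so individual scenes may differ.

```python
# Totals quoted in the abstract above.
NUM_IMAGES = 194_400
NUM_PANORAMAS = 10_800
NUM_SCENES = 90


def average_ratios(num_images, num_panoramas, num_scenes):
    """Return the average counts implied by the dataset totals.

    Hypothetical helper for illustration only; it performs plain division.
    """
    return {
        "images_per_panorama": num_images / num_panoramas,
        "panoramas_per_scene": num_panoramas / num_scenes,
        "images_per_scene": num_images / num_scenes,
    }


ratios = average_ratios(NUM_IMAGES, NUM_PANORAMAS, NUM_SCENES)
for name, value in ratios.items():
    print(f"{name}: {value:g}")
```

With the abstract's totals this gives 18 images per panoramic view, 120 panoramic views per scene, and 2,160 images per scene on average.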

citation-role summary

dataset: 7 · background: 3 · baseline: 1

citation-polarity summary

use dataset: 5 · background: 4 · baseline: 1 · unclear: 1

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

cs.CV · 2021-11-17 · accept · novelty 8.0

ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.

UniDAC: Universal Metric Depth Estimation for Any Camera

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

cs.RO · 2026-05-10 · unverdicted · novelty 7.0

OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.

AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.

$\phi$-Scene: Physically Grounded Image-to-3D Scene Reconstruction

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

φ-Scene performs image-to-3D scene reconstruction via topology-driven physical assembly that resolves penetrations with SDF optimization and settles objects with rigid-body simulation.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

Advancing DialNav through Automatic Embodied Dialog Augmentation

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.

Segment and Select: Vision-Language Segmentation in 3D Scenarios

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

SEGA3D improves 3D vision-language segmentation on ScanNet and Matterport3D by operating on fine-grained masks with LLM-assisted selection, claiming gains of 8.3 and 5.3 mIoU over prior top methods.

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.

VLM3: Vision Language Models Are Native 3D Learners

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.

citing papers explorer

Showing 21 of 21 citing papers after filters.

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

cs.RO · 2026-06-05 · unverdicted · none · ref 6 · internal anchor

The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

cs.RO · 2026-05-10 · unverdicted · none · ref 5

OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · none · ref 6 · internal anchor

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

cs.RO · 2026-05-31 · unverdicted · none · ref 1 · internal anchor

PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.

Autonomous Frontier-Based Exploration with VLM Guidance

cs.RO · 2026-05-22 · unverdicted · none · ref 19 · internal anchor

A VLM-based method for selecting exploration frontiers in robotics achieves up to 24% better map coverage than standard geometric heuristics in simulated indoor environments.

ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

cs.RO · 2026-03-25 · conditional · none · ref 41 · internal anchor

ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.

Memory Over Maps: 3D Object Localization Without Reconstruction

cs.RO · 2026-03-20 · unverdicted · none · ref 52 · internal anchor

A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation benchmarks with far lower build cost.

C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

cs.RO · 2025-10-23 · unverdicted · none · ref 21 · internal anchor

C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark while using less memory.

Personalized Embodied Navigation for Portable Object Finding

cs.RO · 2024-03-14 · unverdicted · none · ref 8 · internal anchor

Transit-Aware Planning (TAP) enriches navigation policies with object transit data on Dynamic Object Maps, raising success rates by 21.1% in MP3D simulation and 18.3% in real-world tests for finding non-stationary targets.

NavOL: Navigation Policy with Online Imitation Learning

cs.RO · 2026-05-12 · unverdicted · none · ref 2

NavOL collects expert trajectory labels online from a global planner during policy rollouts in simulation to train a diffusion navigation policy, mitigating distribution shift and improving performance on visual navigation tasks.

Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation

cs.RO · 2026-05-07 · unverdicted · none · ref 2

PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.

OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation

cs.RO · 2026-04-14 · unverdicted · none · ref 36

OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

cs.RO · 2026-06-29 · unverdicted · none · ref 78 · internal anchor

FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.

RoamFlow: Reinforcement-Aligned One-Step Action MeanFlow Policy for Image-Goal Navigation

cs.RO · 2026-06-29 · unverdicted · none · ref 32 · internal anchor

RoamFlow applies MeanFlow to predict average velocity fields for one-step action policies in image-goal navigation, trained via expert imitation followed by RL refinement.

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

cs.RO · 2026-06-09 · unverdicted · none · ref 16 · internal anchor

AllDayNav encodes scene dynamics into a large model's parameters via RL and a multimodal memory, achieving near-100% success rates in lifelong navigation and outperforming map-based and VLM baselines.

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

cs.RO · 2026-06-06 · unverdicted · none · ref 25 · internal anchor

IntentNav is a spatial-visual imitation framework that infers human search intent via frontier labeling to train VLM policies for object navigation, reporting SOTA on MP3D and HM3D benchmarks with zero-shot transfer to wheeled, quadruped, and humanoid robots.

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

cs.RO · 2026-05-18 · unverdicted · none · ref 5 · internal anchor

Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.

Think before Go: Hierarchical Reasoning for Image-goal Navigation

cs.RO · 2026-04-19 · unverdicted · none · ref 52

HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.

Flying to Image-Specified Objects: 3D Quadrotor Navigation via Cross-Graph Memory and Viewpoint Planning

cs.RO · 2026-06-29 · unverdicted · none · ref 35 · internal anchor

Proposes a hierarchical navigation framework with viewpoint-aware action nodes, cross-graph memory, and learning-based policy for quadrotor InstanceImageNav, claiming improvements over baselines in simulation and real-world validation.

MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments

cs.RO · 2025-11-06 · unverdicted · none · ref 25 · internal anchor

MacroNav learns multi-scale navigation-centric representations through multi-task self-supervised learning and combines them with graph-based reinforcement learning for efficient action selection, reporting gains in success rate and path efficiency over prior methods.

A Modular Vision-Language-Action Robotics Framework for Indoor Environments

cs.RO · 2026-06-30 · unverdicted · none · ref 6 · internal anchor

Describes a modular VLA framework with semantic voxel mapping via OwlViT and VLM-based command classification and grounding for the CMU VLA Challenge.
