SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
Matterport3D: Learning from RGB-D Data in Indoor Environments
Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.
abstract
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
representative citing papers
ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.
PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.
SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.
φ-Scene performs image-to-3D scene reconstruction via topology-driven physical assembly that resolves penetrations with SDF optimization and settles objects with rigid-body simulation.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.
SEGA3D improves 3D vision-language segmentation on ScanNet and Matterport3D by operating on fine-grained masks with LLM-assisted selection, claiming gains of 8.3 and 5.3 mIoU over prior top methods.
A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.
GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.
Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.
PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
citing papers explorer
- Learning Interactive Real-World Simulators
  UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- Advancing DialNav through Automatic Embodied Dialog Augmentation
  Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.
- EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
  EvolveNav adds an agentic rule memory with UCB retrieval and a memory-guided preflection module to enable continuous improvement in zero-shot object goal navigation, reporting a 10.1% success rate gain over baselines.
- Why Build an Assistant in Minecraft?
  A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.