SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds that no model excels across the board, and introduces the DA-Next-5M dataset and the DA-Next baseline model.
Matterport3D: Learning from RGB-D Data in Indoor Environments
Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.
abstract
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
representative citing papers
ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.
PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.
SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.
φ-Scene performs image-to-3D scene reconstruction via topology-driven physical assembly that resolves penetrations with SDF optimization and settles objects with rigid-body simulation.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
Automatic augmentation turns VLN datasets into 238K multi-turn dialog episodes; combined with dual-strategy training and localization, this doubles success rates on DialNav Val Seen and Val Unseen splits.
SEGA3D improves 3D vision-language segmentation on ScanNet and Matterport3D by operating on fine-grained masks with LLM-assisted selection, claiming gains of 8.3 and 5.3 mIoU over prior top methods.
A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.
GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.
Goal2Pixel grounds VLN-CE goals to image pixels via VLM prediction plus keyframe memory, reaching 54.1% SR on R2R-CE Val-Unseen with 7.75 calls per episode versus 46.62 for action prediction.
PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
citing papers explorer
- Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
  The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
- Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
  OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 170 environments.
- Vesta: A Generalist Embodied Reasoning Model
  Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
- PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making
  PSG-Nav introduces a probabilistic scene graph with multiverse sampling and an evidential calibrator to achieve new state-of-the-art success rates of 66.1%, 44.8%, and 67.9% on MP3D, HM3D, and HSSD open-vocabulary navigation benchmarks.
- Autonomous Frontier-Based Exploration with VLM Guidance
  A VLM-based method for selecting exploration frontiers in robotics achieves up to 24% better map coverage than standard geometric heuristics in simulated indoor environments.
- ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation
  ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.
- Memory Over Maps: 3D Object Localization Without Reconstruction
  A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation benchmarks with far lower build cost.
- C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
  C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark while using less memory.
- Personalized Embodied Navigation for Portable Object Finding
  Transit-Aware Planning (TAP) enriches navigation policies with object transit data on Dynamic Object Maps, raising success rates by 21.1% in MP3D simulation and 18.3% in real-world tests for finding non-stationary targets.
- NavOL: Navigation Policy with Online Imitation Learning
  NavOL collects expert trajectory labels online from a global planner during policy rollouts in simulation to train a diffusion navigation policy, mitigating distribution shift and improving performance on visual navigation tasks.
- Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
  PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
- OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
  OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
- FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation
  FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.
- RoamFlow: Reinforcement-Aligned One-Step Action MeanFlow Policy for Image-Goal Navigation
  RoamFlow applies MeanFlow to predict average velocity fields for one-step action policies in image-goal navigation, trained via expert imitation followed by RL refinement.
- AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning
  AllDayNav encodes scene dynamics into a large model's parameters via RL and a multimodal memory, achieving near-100% success rates in lifelong navigation and outperforming map-based and VLM baselines.
- IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations
  IntentNav is a spatial-visual imitation framework that infers human search intent via frontier labeling to train VLM policies for object navigation, reporting SOTA on MP3D and HM3D benchmarks with zero-shot transfer to wheeled, quadruped, and humanoid robots.
- Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction
  Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.
- Think before Go: Hierarchical Reasoning for Image-goal Navigation
  HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
- Flying to Image-Specified Objects: 3D Quadrotor Navigation via Cross-Graph Memory and Viewpoint Planning
  Proposes a hierarchical navigation framework with viewpoint-aware action nodes, cross-graph memory, and learning-based policy for quadrotor InstanceImageNav, claiming improvements over baselines in simulation and real-world validation.
- MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments
  MacroNav learns multi-scale navigation-centric representations through multi-task self-supervised learning and combines them with graph-based reinforcement learning for efficient action selection, reporting gains in success rate and path efficiency over prior methods.
- A Modular Vision-Language-Action Robotics Framework for Indoor Environments
  Describes a modular VLA framework with semantic voxel mapping via OwlViT and VLM-based command classification and grounding for the CMU VLA Challenge.