AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
hub
Sti-bench: Are mllms ready for precise spatial-temporal world understanding?
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
citing papers explorer
-
AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
-
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
SCP: Spatial Causal Prediction in Video
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
-
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.
-
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
-
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
-
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
-
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.