hub Mixed citations

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan · 2025 · cs.CV · arXiv 2505.23747

Mixed citation behavior. Most common role is background (62%).

52 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 52 citing papers arXiv PDF

abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder-initialized from the backbone of the visual geometry model-to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 2 dataset 1 method 1

citation-polarity summary

background 8 baseline 3 unclear 1 use method 1

representative citing papers

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

Why MLLMs Struggle to Determine Object Orientations

cs.CV · 2026-04-14 · accept · novelty 7.0

Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

cs.CV · 2026-04-10 · unverdicted · novelty 7.0 · 2 refs

PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

cs.AI · 2025-11-26 · unverdicted · novelty 7.0

SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

cs.CV · 2025-06-10 · unverdicted · novelty 7.0

AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.

Agentic Collaborative Cognition for Zero-Shot 3D Understanding

cs.CV · 2026-06-23 · unverdicted · novelty 6.0 · 2 refs

A collaborative Planning-Perception agent framework using MLLMs constructs a holistic cognitive map through iterative viewpoint supplementation and achieves reported SOTA gains on six 3D benchmarks.

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.

OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

OmniSpace is a plug-and-play method that improves spatial reasoning in MLLMs for AV by injecting camera pose, using epipolar attention across views, and distilling 3D geometric knowledge to overcome weak cross-view correspondence and depth estimation.

CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

cs.CV · 2026-06-18 · unverdicted · novelty 6.0 · 2 refs

S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

Occ-VLM reconstructs 3D occupancy from 2D images via a single encoder to ground vision-language reasoning, claiming SOTA occupancy prediction and parity with 3D-input VLMs on VQA and captioning.

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

ReRe boosts open-source MLLMs on spatial reasoning benchmarks VSI-Bench and STI-Bench to rival proprietary SOTA by using a two-phase Reason then Re-reason process with Geometry-to-Video novel view synthesis.

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

CoCoSI is a training-free multi-agent system for collaborative cognitive map construction that improves spatial understanding in arbitrary pretrained MLLMs.

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.

citing papers explorer

Showing 1 of 1 citing paper after filters.

OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 46 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer