hub

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan · 2025 · arXiv 2505.23747

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

Why MLLMs Struggle to Determine Object Orientations

cs.CV · 2026-04-14 · accept · novelty 7.0

Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL samples.

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

cs.CV · 2026-04-19 · unverdicted · novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06 · unverdicted · novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

citing papers explorer

Showing 14 of 14 citing papers.

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos cs.CV · 2026-04-10 · unverdicted · none · ref 36
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 39
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
Exploring Spatial Intelligence from a Generative Perspective cs.CV · 2026-04-22 · unverdicted · none · ref 34
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
Why MLLMs Struggle to Determine Object Orientations cs.CV · 2026-04-14 · accept · none · ref 34
Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 74
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images cs.CV · 2026-05-12 · unverdicted · none · ref 9
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV · 2026-04-29 · unverdicted · none · ref 52
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment cs.CV · 2026-04-12 · unverdicted · none · ref 59
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV · 2026-04-07 · unverdicted · none · ref 42
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CV · 2026-04-01 · unverdicted · none · ref 44
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL samples.
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence cs.CV · 2026-05-11 · unverdicted · none · ref 31
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning cs.CV · 2026-04-19 · unverdicted · none · ref 47
SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 46
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unverdicted · none · ref 133
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer