SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Dongming Wu; Huadai Liu; Jianan Wang; Kaituo Feng; Peiwen Sun; Rui Liu; Shiqiang Lang; Xiangyu Yue; Yi Ding; Yun-hui Liu

arxiv: 2510.09606 · v2 · pith:ATTPJXZLnew · submitted 2025-10-10 · 💻 cs.CV

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

classification 💻 cs.CV

keywords spatial · all-scale · reasoning · across · knowledge · dataset · diverse · scenes

With the current surge in spatial reasoning research, researchers have made significant progress in understanding indoor scenes but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm; to the best of our knowledge, it is the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance and strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning
cs.CV 2026-06 unverdicted novelty 7.0

A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magn...
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
cs.CV 2026-06 unverdicted novelty 7.0

X-Stream benchmark shows state-of-the-art MLLMs achieve only about 50% on multi-stream video tasks and exhibit poor proactive ability.
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
cs.CV 2026-06 unverdicted novelty 7.0

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
cs.CV 2026-05 unverdicted novelty 7.0

SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
cs.CV 2025-12 unverdicted novelty 7.0

4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchm...
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
cs.CV 2026-06 unverdicted novelty 5.0

Presents LongSpace-Bench benchmark and LongSpace framework that chunks long videos, adds 3D structural cues, and builds layer-aware memory to improve spatial reasoning in multimodal LLMs.
OneThinker: All-in-one Reasoning Model for Image and Video
cs.CV 2025-12 unverdicted novelty 5.0

OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.