hub Mixed citations

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.ArXiv, abs/2505.21500

Li, Y · 2025 · arXiv 2505.21500

Mixed citation behavior. Most common role is background (62%).

33 Pith papers citing it

Background 62% of classified citations

read on arXiv browse 33 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 dataset 1 method 1

citation-polarity summary

background 5 baseline 1 extend 1 use dataset 1

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

cs.CV · 2026-06-02 · accept · novelty 7.0

OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

ETCHR: Editing To Clarify and Harness Reasoning

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

cs.CV · 2025-12-29 · unverdicted · novelty 7.0

SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

cs.AI · 2025-11-26 · unverdicted · novelty 7.0

SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.

CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

cs.CV · 2026-06-18 · unverdicted · novelty 6.0 · 2 refs

S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

ProSR adds a Counterfactual Invariance Penalty and a Tail Drift Penalty to shape VLM reasoning trajectories for better visual dependence and stability on spatial tasks.

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

citing papers explorer

Showing 29 of 29 citing papers after filters.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning cs.CV · 2026-06-30 · unverdicted · none · ref 20
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.
Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning cs.CV · 2026-07-01 · unverdicted · none · ref 45
P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.
MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos cs.CV · 2026-07-01 · unverdicted · none · ref 45
MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.
AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration cs.CV · 2026-06-26 · unverdicted · none · ref 21
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning cs.CV · 2026-06-10 · unverdicted · none · ref 29
A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators cs.CV · 2026-06-04 · unverdicted · none · ref 5
Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs cs.CV · 2026-06-02 · accept · none · ref 24
OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes cs.CV · 2026-05-29 · unverdicted · none · ref 10
SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.
Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence cs.CV · 2026-05-25 · unverdicted · none · ref 45
GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.
ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 16
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens? cs.CV · 2026-05-20 · unverdicted · none · ref 12
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 18
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images cs.CV · 2026-05-12 · unverdicted · none · ref 29 · 3 links
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 22
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV · 2026-04-10 · unverdicted · none · ref 15
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 50
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
SpatialMosaic: A Multiview VLM Dataset for Partial Visibility cs.CV · 2025-12-29 · unverdicted · none · ref 21
SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.
AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience cs.CV · 2026-06-30 · unverdicted · none · ref 32
SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.
CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming cs.CV · 2026-06-21 · unverdicted · none · ref 30
CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence cs.CV · 2026-06-18 · unverdicted · none · ref 13 · 2 links
S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.
ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs cs.CV · 2026-05-25 · unverdicted · none · ref 42
ProSR adds a Counterfactual Invariance Penalty and a Tail Drift Penalty to shape VLM reasoning trajectories for better visual dependence and stability on spatial tasks.
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 23
GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments cs.CV · 2026-04-24 · conditional · none · ref 18 · 2 links
SpaMEM is a diagnostic benchmark showing that current vision-language models exhibit a sharp collapse in spatial reasoning when transitioning from text-aided state tracking to purely visual memory in dynamic environments.
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 18
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV · 2026-04-07 · unverdicted · none · ref 25
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision cs.CV · 2026-06-18 · unverdicted · none · ref 12
SpatialSV trains MLLMs to lift 2D visual features into explicit 3D representations via task-oriented supervision for interpretable spatial awareness.
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency cs.CV · 2026-05-18 · unverdicted · none · ref 22
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 65
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning cs.CV · 2026-04-19 · unverdicted · none · ref 26
SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.ArXiv, abs/2505.21500

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer