Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li · 2025 · arXiv 2503.01773

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

3D Primitives are a Spatial Language for VLMs

cs.CV · 2026-05-12 · conditional · novelty 7.0

3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

cs.CL · 2025-09-18 · conditional · novelty 7.0

V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.

citing papers explorer

Showing 4 of 4 citing papers.

3D Primitives are a Spatial Language for VLMs cs.CV · 2026-05-12 · conditional · none · ref 4
3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models cs.CL · 2025-09-18 · conditional · none · ref 9
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 13
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models cs.CV · 2026-02-19 · unverdicted · none · ref 7
EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.

Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer