3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.
citing papers explorer
-
3D Primitives are a Spatial Language for VLMs
3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
-
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
-
EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.