VLMs primarily reason in textual space with limited reliance on visual evidence, shown by consistent performance drops when images are added to text in a controlled aligned benchmark.
Mert Unsal and Aylin Akkus
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.
Citing papers explorer
- Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
  VLMs primarily reason in textual space with limited reliance on visual evidence, shown by consistent performance drops when images are added to text in a controlled aligned benchmark.
- LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression
  LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.