Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

2026 · cs.CV · arXiv 2605.15864

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
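The swap probe described in the abstract can be scored in a simple way. The sketch below is illustrative only and is not the authors' evaluation code: the `SwapTrial` record, its field names, and the `swap_outcomes` function are hypothetical, and it assumes each trial has a known correct answer for both the original and the swapped image. It counts how often the model's final answer matches the swapped image (the model noticed the swap) versus the original image (the model kept reasoning from what it saw first).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SwapTrial:
    """One hypothetical image-swap trial (field names are assumptions)."""
    answer_original: str  # correct answer for the image shown first
    answer_swapped: str   # correct answer for the image swapped in
    model_answer: str     # model's final answer after the swap


def swap_outcomes(trials: List[SwapTrial]) -> Dict[str, float]:
    """Return the fraction of trials in each outcome category."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    # Final answer agrees with the swapped image: the swap was noticed.
    noticed = sum(t.model_answer == t.answer_swapped for t in trials)
    # Final answer agrees with the original image: the swap was missed.
    stale = sum(t.model_answer == t.answer_original for t in trials)
    return {
        "noticed_rate": noticed / n,
        "stale_rate": stale / n,
        "other_rate": (n - noticed - stale) / n,
    }


if __name__ == "__main__":
    demo = [
        SwapTrial("3", "5", "5"),    # noticed the swap
        SwapTrial("A", "B", "A"),    # answered from the original image
        SwapTrial("10", "12", "7"),  # matched neither image
        SwapTrial("x", "y", "x"),    # answered from the original image
    ]
    print(swap_outcomes(demo))
```

Exact string matching is used here for brevity; a real evaluation would likely need answer normalization, and the paper's own metric may be defined differently.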

representative citing papers

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Text knowledge edits in UMMs reach 92% text efficacy but only 18.5% VQA accuracy on images, with reasoning-augmented editing narrowing the cross-modal gap.

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

EASE augments multimodal RLVR with evidence-anchored spatial attention supervision using privileged annotations, improving average benchmark scores by 2.5-3.1 points over DAPO on Qwen VL models.

