Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
Pith reviewed 2026-06-30 19:46 UTC · model grok-4.3
The pith
Vision-language models often fail to detect image swaps during reasoning despite claiming to re-examine visuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After a model reasons over an image, replacing it with a visually similar but semantically different one causes the model to miss the change in the vast majority of cases and continue with its original incorrect reasoning chain; this holds across Qwen3-VL, Kimi-VL, and ERNIE-VL, with thinking variants showing nearly three times the failure rate of instructed ones, and attention analysis confirming that self-reflective text does not elevate focus on visual tokens.
What carries the argument
VisualSwap, an image-swap probing framework that inserts a semantically altered but visually similar replacement image after initial reasoning and measures whether the model revises its output accordingly.
If this is right
- Accuracy on reasoning tasks drops sharply when the image is swapped mid-process.
- Models that generate longer internal thinking traces are substantially more likely to ignore visual changes.
- Increasing model scale does not reduce the tendency to treat reflective statements as pure text patterns.
- Explicit multi-turn user instructions can restore visual grounding, but self-generated reflections cannot.
- Attention to visual tokens rises under user prompts but stays flat during self-reflection.
Where Pith is reading between the lines
- Training objectives may reward fluent reflective language without requiring actual visual verification.
- Architectures could be modified to force re-attention to image tokens whenever reflective phrases appear.
- The same pattern may appear in non-math domains such as chart or diagram reasoning where visual details matter.
Load-bearing premise
The curated image pairs are similar enough visually that a model truly re-examining the visual input would register the semantic difference and change its reasoning.
What would settle it
Running the VS-Bench pairs on a model and observing that it consistently revises its final answer or reasoning steps upon image swap would falsify the central claim.
Figures
read the original abstract
Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLMs produce self-reflective statements (e.g., 'let me check the figure again') during reasoning without triggering genuine visual re-examination. It introduces the VisualSwap probing framework and VS-Bench (800 image pairs from MathVista, MathVerse, MathVision, MMMU-Pro) to test this by swapping the input image mid-reasoning with a visually similar but semantically different one. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL show accuracy drops up to 60%, with thinking models ~3x more vulnerable than instructed ones; scaling does not help. Multi-turn user instructions restore grounding and elevate visual-token attention, but self-generated reflections do not. Attention maps are used to explain the difference. Code and dataset are released.
Significance. If the empirical results hold after methodological details are clarified, the work provides a clear, falsifiable demonstration that current VLMs rely on textual patterns rather than visual re-grounding during self-reflection. The positive-control contrast between self-generated vs. user instructions, combined with attention analysis, strengthens the central claim and offers a concrete diagnostic for future model development. Releasing the benchmark and code is a notable strength that enables direct replication and extension.
major comments (3)
- [§3 (VS-Bench construction)] The curation criteria for VS-Bench pairs (visual similarity vs. semantic difference) are not specified with sufficient detail or quantitative metrics; this is load-bearing for the central claim because the accuracy drops and the interpretation of 'missing the swap' depend on the assumption that genuine re-examination would reliably detect the change.
- [§4 (VisualSwap framework)] The exact timing of the image swap within the generation process (e.g., after which token or reasoning step) and any controls for generation length or continuation artifacts are not described; this directly affects whether the observed 60% drops can be attributed to failure of self-reflection rather than implementation details of the probe.
- [§5 (Experiments)] Statistical controls (number of runs, variance across seeds, comparison to chance-level performance given the multiple-choice or open-ended formats in the source benchmarks) are not reported for the accuracy drops; without these, the magnitude of the effect and the claim that 'thinking models are nearly 3x more vulnerable' cannot be fully evaluated.
minor comments (1)
- [Abstract] The abstract states results for 'Qwen3-VL, Kimi-VL, and ERNIE-VL' but does not clarify whether these are the only models tested or if additional models appear in the full experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on methodological clarity. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.
read point-by-point responses
-
Referee: [§3 (VS-Bench construction)] The curation criteria for VS-Bench pairs (visual similarity vs. semantic difference) are not specified with sufficient detail or quantitative metrics; this is load-bearing for the central claim because the accuracy drops and the interpretation of 'missing the swap' depend on the assumption that genuine re-examination would reliably detect the change.
Authors: We agree that more explicit quantitative metrics would improve the paper. In the revision we will expand §3 with the exact curation protocol, including the perceptual similarity thresholds (e.g., LPIPS < 0.15 combined with SSIM > 0.85) used to ensure visual similarity and the multi-stage verification (CLIP-based semantic distance plus human review) used to confirm semantic difference. We will also add summary statistics (mean and distribution of similarity scores) for the 800 pairs. revision: yes
-
Referee: [§4 (VisualSwap framework)] The exact timing of the image swap within the generation process (e.g., after which token or reasoning step) and any controls for generation length or continuation artifacts are not described; this directly affects whether the observed 60% drops can be attributed to failure of self-reflection rather than implementation details of the probe.
Authors: We will clarify the probe implementation in the revised §4. The swap is triggered immediately after the model emits a self-reflective phrase (detected by a fixed set of trigger tokens), at the exact generation step where re-examination would begin. To mitigate length and continuation artifacts we enforce identical continuation length across conditions and include an ablation that pads or truncates to a fixed token budget; these controls will be documented with pseudocode. revision: yes
-
Referee: [§5 (Experiments)] Statistical controls (number of runs, variance across seeds, comparison to chance-level performance given the multiple-choice or open-ended formats in the source benchmarks) are not reported for the accuracy drops; without these, the magnitude of the effect and the claim that 'thinking models are nearly 3x more vulnerable' cannot be fully evaluated.
Authors: We acknowledge the omission and will add the requested statistics in the revised experiments section. Results will be reported as means and standard deviations over three independent seeds; we will also include chance-level baselines (25 % for four-option items, 20 % for five-option items, and empirical random baselines for open-ended items) drawn from the source benchmarks to contextualize the observed drops. revision: yes
Circularity Check
No significant circularity
full rationale
The paper conducts a purely empirical study via the VisualSwap probing framework and VS-Bench dataset of image pairs, measuring model accuracy drops and attention shifts under image swaps. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text; claims rest on direct experimental measurements against external VLMs rather than any reduction to inputs by construction. Self-citations are absent from the abstract and described content, and the positive control (multi-turn user instructions) provides independent falsifiability.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs
Text knowledge edits in UMMs reach 92% text efficacy but only 18.5% VQA accuracy on images, with reasoning-augmented editing narrowing the cross-modal gap.
-
Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR
EASE augments multimodal RLVR with evidence-anchored spatial attention supervision using privileged annotations, improving average benchmark scores by 2.5-3.1 points over DAPO on Qwen VL models.
Reference graph
Works this paper leans on
-
[1]
Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y ., Tan...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W
Accessed: 2026-01-22. Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W. OpenVLThinker: Complex vision- language reasoning via iterative SFT-RL cycles. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
2026
-
[3]
Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D
Accessed: 2026-01-22. Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,
2026
-
[4]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y ., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., and Lee, Y . J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. Liu, Z., Sun, Z., ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Accessed: 2026-01-22. OpenAI. Learning to reason with llms,
2026
-
[8]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Accessed: 2026-01-22. Shen, H., Liu, P., Li, J., Fang, C., Ma, Y ., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
A thorough examination of decoding methods in the era of llms
Shi, C., Yang, H., Cai, D., Zhang, Z., Wang, Y ., Yang, Y ., and Lam, W. A thorough examination of decoding methods in the era of llms. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8601–8629,
2024
-
[10]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
and Bansal, M
Tan, H. and Bansal, M. Lxmert: Learning cross-modality en- coder representations from transformers. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 5100–5111,
2019
-
[12]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Sori- cut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
11 Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., and Chen, W. Vl-rethinker: Incentivizing self-reflection of vision- language models with reinforcement learning.arXiv preprint arXiv:2504.08837,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and Li, H. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neu- ral Information Processing Systems, 37:95095–95169, 2024a. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yang, C., Shi, C., Liu, Y ., Shui, B., Wang, J., Jing, M., Xu, L., Zhu, X., Li, S., Zhang, Y ., et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation. InInternational Conference on Learning Representations, pp. 26590–26646, 2025a. Yang, Y ., He, X., Pan, H., Jiang, X., Deng, Y ., Yang, X., Lu, H., Yin, D., Rao, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
We employ a sampling temperature of τ= 0.1
on NVIDIA H200 GPUs. We employ a sampling temperature of τ= 0.1 . This value was empirically selected to balance reproducibility and generation quality: higher temperatures introduce excessive stochasticity that confounds the measurement of visual re-examination, while lower temperatures (e.g., greedy decoding) frequently lead to repetition loops and dege...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.