pith. sign in

arxiv: 2605.15864 · v2 · pith:NQCRQ7SSnew · submitted 2026-05-15 · 💻 cs.CV · cs.CL

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Pith reviewed 2026-06-30 19:46 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language modelsvisual re-examinationimage swap probingattention analysisreasoning benchmarksmultimodal evaluation
1
0 comments X

The pith

Vision-language models often fail to detect image swaps during reasoning despite claiming to re-examine visuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether VLMs perform genuine visual re-examination when they produce statements like "let me check the figure again." It introduces VisualSwap, a framework that replaces an image with a visually similar but semantically different one after initial reasoning, and VS-Bench, a set of 800 such pairs from math and multimodal benchmarks. Experiments across several models show they overwhelmingly miss the swap, leading to accuracy drops of up to 60 percent. Self-generated reflections during generation do not increase attention to visual tokens, while explicit multi-turn user instructions do. Scaling and thinking modes provide no protection and may increase vulnerability.

Core claim

After a model reasons over an image, replacing it with a visually similar but semantically different one causes the model to miss the change in the vast majority of cases and continue with its original incorrect reasoning chain; this holds across Qwen3-VL, Kimi-VL, and ERNIE-VL, with thinking variants showing nearly three times the failure rate of instructed ones, and attention analysis confirming that self-reflective text does not elevate focus on visual tokens.

What carries the argument

VisualSwap, an image-swap probing framework that inserts a semantically altered but visually similar replacement image after initial reasoning and measures whether the model revises its output accordingly.

If this is right

  • Accuracy on reasoning tasks drops sharply when the image is swapped mid-process.
  • Models that generate longer internal thinking traces are substantially more likely to ignore visual changes.
  • Increasing model scale does not reduce the tendency to treat reflective statements as pure text patterns.
  • Explicit multi-turn user instructions can restore visual grounding, but self-generated reflections cannot.
  • Attention to visual tokens rises under user prompts but stays flat during self-reflection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives may reward fluent reflective language without requiring actual visual verification.
  • Architectures could be modified to force re-attention to image tokens whenever reflective phrases appear.
  • The same pattern may appear in non-math domains such as chart or diagram reasoning where visual details matter.

Load-bearing premise

The curated image pairs are similar enough visually that a model truly re-examining the visual input would register the semantic difference and change its reasoning.

What would settle it

Running the VS-Bench pairs on a model and observing that it consistently revises its final answer or reasoning steps upon image swap would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15864 by Bo Shui, Cheng Yang, Chufan Shi, Linghao Jin, Taylor Berg-Kirkpatrick, Xuezhe Ma, Yaokang Wu.

Figure 1
Figure 1. Figure 1: Illustrative example of visual re-examination. Left: Given image Ia and query Q, the VLM generates reasoning chain Ra including a self-reflective trigger. Top-right: Under standard inference with Ia, the model validates its logic to reach the correct answer A1. Bottom-right: In the VISUALSWAP condition, Ia is replaced by a visually similar but semantically distinct image Ib post-reflection. Despite explici… view at source ↗
Figure 2
Figure 2. Figure 2: Visual attention score S (l) vis across decoding steps for Qwen3-VL-8B (layers 18-21). Probe shows lower attention than baseline throughout generation. Multi-turn elevates attention substantially after the user instruction. (Multi-turn), we visualize the model’s attention distribution during decoding. We define the Visual Attention Score S (l) vis (t) for a given layer l at decoding step t as the average a… view at source ↗
Figure 3
Figure 3. Figure 3: Visual attention score S (l) vis across decoding steps for Qwen3-VL-235B-A22B (layers 54-57). The same pattern holds at scale: probe suppresses visual attention while multi-turn restores it [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Probe accuracy vs. retained Ra. Performance declines monotonically as length increases, particularly for Thinking mod￾els. Per-task results are detailed in Tab. 13. equivalent but lexically diverse reflective triggers. These variants range from casual prompts (e.g., “Wait, let me look at the image again”) to formal directives (e.g., “Let me ver￾ify by examining the visual information once more”). The compl… view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of the VISUALSWAP framework revealing the error case of visual re-examination. Top: The model generates an initial reasoning chain Ra correctly based on image Ia (60◦ ). Bottom: Upon a self-reflective trigger (“Wait, let me check the figure again”), the input is transparently swapped to image Ib (50◦ ). Despite the explicit prompt to verify, the model fails to ground its reasoning in the new v… view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of the VISUALSWAP framework revealing the error case of visual re-examination. Top: The model correctly reasons on chart Ia that “Burlywood” is not the minimum category. Bottom: Upon the self-reflective trigger, the input is swapped to Ib where the bar lengths are modified such that “Burlywood” becomes the minimum. The model fails to detect this visual change, hallucinating the previous bar le… view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of the VISUALSWAP framework revealing the error case of visual re-examination. Top: The model correctly solves the counting task on image Ia by identifying the purple cube and cylinder to reach the answer ‘1’. Bottom: During reflection, the input is swapped to Ib where the purple cube is replaced by a cyan cube. The model fails to perceive this color change, hallucinating the “purple cube” fro… view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of the VISUALSWAP framework revealing the error case of visual re-examination. Top: The model correctly analyzes the performance curves in Ia to identify “Dense” as the best model. Bottom: During reflection, the input is swapped to Ib where the curves are altered such that “Dense” is outperformed by “Soft”. The model fails to detect this shift, hallucinating the curve positions from Ia due to … view at source ↗
Figure 9
Figure 9. Figure 9: Illustration of the VISUALSWAP framework revealing the good case of visual re-examination. Top: The model acts on image Ia where Angle 1 is 60◦ , generating an initial reasoning chain Ra. Bottom: Upon the self-reflective trigger, the input is swapped to Ib where Angle 1 is 50◦ . In this success case, the model explicitly recognizes the angle value in the new image Ib (50◦ ), distinguishes it from the previ… view at source ↗
Figure 10
Figure 10. Figure 10: Illustration of the VISUALSWAP framework revealing the good case of visual re-examination. Top: The model correctly analyzes chart Ia to determine that “Black” is greater than “Deep Sky Blue”. Bottom: Upon the self-reflective trigger, the input is swapped to Ib where the “Black” bar is shortened to be less than “Deep Sky Blue”. In this success case, the model exhibits genuine visual grounding: it explicit… view at source ↗
Figure 11
Figure 11. Figure 11: Illustration of the VISUALSWAP framework revealing the good case of visual re-examination. Top: The model processes scene Ia which contains a red sphere, correctly subtracting it along with a brown object to calculate 5 remaining items. Bottom: Upon the self-reflective trigger, the input is swapped to Ib where the red sphere is replaced by a green sphere. In this success case, the model exhibits genuine v… view at source ↗
Figure 12
Figure 12. Figure 12: Illustration of the VISUALSWAP framework revealing the good case of visual re-examination. Top: The model analyzes the parabola in Ia (f(x) = x 2 ), correctly deducing that the derivative at x = 2 is smaller than at x = 5. Bottom: Upon the self-reflective trigger, the input is swapped to Ib, which displays an absolute value function (f(x) = |2x − 3| + 1). In this success case, the model exhibits genuine v… view at source ↗
read the original abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that VLMs produce self-reflective statements (e.g., 'let me check the figure again') during reasoning without triggering genuine visual re-examination. It introduces the VisualSwap probing framework and VS-Bench (800 image pairs from MathVista, MathVerse, MathVision, MMMU-Pro) to test this by swapping the input image mid-reasoning with a visually similar but semantically different one. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL show accuracy drops up to 60%, with thinking models ~3x more vulnerable than instructed ones; scaling does not help. Multi-turn user instructions restore grounding and elevate visual-token attention, but self-generated reflections do not. Attention maps are used to explain the difference. Code and dataset are released.

Significance. If the empirical results hold after methodological details are clarified, the work provides a clear, falsifiable demonstration that current VLMs rely on textual patterns rather than visual re-grounding during self-reflection. The positive-control contrast between self-generated vs. user instructions, combined with attention analysis, strengthens the central claim and offers a concrete diagnostic for future model development. Releasing the benchmark and code is a notable strength that enables direct replication and extension.

major comments (3)
  1. [§3 (VS-Bench construction)] The curation criteria for VS-Bench pairs (visual similarity vs. semantic difference) are not specified with sufficient detail or quantitative metrics; this is load-bearing for the central claim because the accuracy drops and the interpretation of 'missing the swap' depend on the assumption that genuine re-examination would reliably detect the change.
  2. [§4 (VisualSwap framework)] The exact timing of the image swap within the generation process (e.g., after which token or reasoning step) and any controls for generation length or continuation artifacts are not described; this directly affects whether the observed 60% drops can be attributed to failure of self-reflection rather than implementation details of the probe.
  3. [§5 (Experiments)] Statistical controls (number of runs, variance across seeds, comparison to chance-level performance given the multiple-choice or open-ended formats in the source benchmarks) are not reported for the accuracy drops; without these, the magnitude of the effect and the claim that 'thinking models are nearly 3x more vulnerable' cannot be fully evaluated.
minor comments (1)
  1. [Abstract] The abstract states results for 'Qwen3-VL, Kimi-VL, and ERNIE-VL' but does not clarify whether these are the only models tested or if additional models appear in the full experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological clarity. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.

read point-by-point responses
  1. Referee: [§3 (VS-Bench construction)] The curation criteria for VS-Bench pairs (visual similarity vs. semantic difference) are not specified with sufficient detail or quantitative metrics; this is load-bearing for the central claim because the accuracy drops and the interpretation of 'missing the swap' depend on the assumption that genuine re-examination would reliably detect the change.

    Authors: We agree that more explicit quantitative metrics would improve the paper. In the revision we will expand §3 with the exact curation protocol, including the perceptual similarity thresholds (e.g., LPIPS < 0.15 combined with SSIM > 0.85) used to ensure visual similarity and the multi-stage verification (CLIP-based semantic distance plus human review) used to confirm semantic difference. We will also add summary statistics (mean and distribution of similarity scores) for the 800 pairs. revision: yes

  2. Referee: [§4 (VisualSwap framework)] The exact timing of the image swap within the generation process (e.g., after which token or reasoning step) and any controls for generation length or continuation artifacts are not described; this directly affects whether the observed 60% drops can be attributed to failure of self-reflection rather than implementation details of the probe.

    Authors: We will clarify the probe implementation in the revised §4. The swap is triggered immediately after the model emits a self-reflective phrase (detected by a fixed set of trigger tokens), at the exact generation step where re-examination would begin. To mitigate length and continuation artifacts we enforce identical continuation length across conditions and include an ablation that pads or truncates to a fixed token budget; these controls will be documented with pseudocode. revision: yes

  3. Referee: [§5 (Experiments)] Statistical controls (number of runs, variance across seeds, comparison to chance-level performance given the multiple-choice or open-ended formats in the source benchmarks) are not reported for the accuracy drops; without these, the magnitude of the effect and the claim that 'thinking models are nearly 3x more vulnerable' cannot be fully evaluated.

    Authors: We acknowledge the omission and will add the requested statistics in the revised experiments section. Results will be reported as means and standard deviations over three independent seeds; we will also include chance-level baselines (25 % for four-option items, 20 % for five-option items, and empirical random baselines for open-ended items) drawn from the source benchmarks to contextualize the observed drops. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper conducts a purely empirical study via the VisualSwap probing framework and VS-Bench dataset of image pairs, measuring model accuracy drops and attention shifts under image swaps. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text; claims rest on direct experimental measurements against external VLMs rather than any reduction to inputs by construction. Self-citations are absent from the abstract and described content, and the positive control (multi-turn user instructions) provides independent falsifiability.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no fitted parameters, no new axioms beyond standard ML evaluation assumptions, and no invented entities.

pith-pipeline@v0.9.1-grok · 5771 in / 1093 out tokens · 35169 ms · 2026-06-30T19:46:32.112115+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    Text knowledge edits in UMMs reach 92% text efficacy but only 18.5% VQA accuracy on images, with reasoning-augmented editing narrowing the cross-modal gap.

  2. Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

    cs.CV 2026-05 unverdicted novelty 6.0

    EASE augments multimodal RLVR with evidence-anchored spatial attention supervision using privileged annotations, improving average benchmark scores by 2.5-3.1 points over DAPO on Qwen VL models.

Reference graph

Works this paper leans on

18 extracted references · 12 canonical work pages · cited by 2 Pith papers · 12 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y ., Tan...

  2. [2]

    Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W

    Accessed: 2026-01-22. Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W. OpenVLThinker: Complex vision- language reasoning via iterative SFT-RL cycles. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  3. [3]

    Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D

    Accessed: 2026-01-22. Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

  5. [5]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y ., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

  6. [6]

    Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., and Lee, Y . J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. Liu, Z., Sun, Z., ...

  7. [7]

    Accessed: 2026-01-22. OpenAI. Learning to reason with llms,

  8. [8]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Accessed: 2026-01-22. Shen, H., Liu, P., Li, J., Fang, C., Ma, Y ., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615,

  9. [9]

    A thorough examination of decoding methods in the era of llms

    Shi, C., Yang, H., Cai, D., Zhang, Z., Wang, Y ., Yang, Y ., and Lam, W. A thorough examination of decoding methods in the era of llms. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8601–8629,

  10. [10]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

  11. [11]

    and Bansal, M

    Tan, H. and Bansal, M. Lxmert: Learning cross-modality en- coder representations from transformers. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 5100–5111,

  12. [12]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Sori- cut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

  13. [13]

    Kimi-VL Technical Report

    Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491,

  14. [14]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,

  15. [15]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    11 Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., and Chen, W. Vl-rethinker: Incentivizing self-reflection of vision- language models with reinforcement learning.arXiv preprint arXiv:2504.08837,

  16. [16]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and Li, H. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neu- ral Information Processing Systems, 37:95095–95169, 2024a. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language ...

  17. [17]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yang, C., Shi, C., Liu, Y ., Shui, B., Wang, J., Jing, M., Xu, L., Zhu, X., Li, S., Zhang, Y ., et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation. InInternational Conference on Learning Representations, pp. 26590–26646, 2025a. Yang, Y ., He, X., Pan, H., Jiang, X., Deng, Y ., Yang, X., Lu, H., Yin, D., Rao, ...

  18. [18]

    We employ a sampling temperature of τ= 0.1

    on NVIDIA H200 GPUs. We employ a sampling temperature of τ= 0.1 . This value was empirically selected to balance reproducibility and generation quality: higher temperatures introduce excessive stochasticity that confounds the measurement of visual re-examination, while lower temperatures (e.g., greedy decoding) frequently lead to repetition loops and dege...