pith. machine review for the scientific record.

arxiv: 2605.08841 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: no theorem link

Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · visual illusions · image preprocessing · prompt engineering · ensemble voting · perception bias · training-free

The pith

A training-free preprocessing and prompting strategy enables vision-language models to correctly perceive visual illusions rather than defaulting to memorized knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that vision-language models tend to recall facts instead of analyzing the actual image when presented with visual illusions. It proposes three strategies: preprocessing the images with transformations tailored to each illusion type to reduce misleading visual cues, designing prompts that encourage direct visual comparison, and using an ensemble of multiple responses for robustness. This approach is training-free and achieves 90.48 percent accuracy on the official test set of 630 images. A reader would care because it shows how to improve model reliability on perceptual tasks without retraining the underlying model.

Core claim

Our method applies type-specific image transformations such as edge extraction, color isolation, morphological processing, and reference-line overlay to weaken illusion-inducing context, combined with anti-illusion prompts and majority voting, to resolve the perception-versus-memory conflict in VLMs and achieve high accuracy on illusion understanding tasks.
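The four transformation families named in the claim can be sketched with plain NumPy. The operations below are illustrative stand-ins, not the paper's actual implementation; every threshold, kernel size, and grid step here is a hypothetical placeholder.

```python
import numpy as np

def edge_extract(img: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Gradient-magnitude edge map: keeps object contours while
    discarding the surrounding context that drives many illusions."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    return (mag > thresh * mag.max()).astype(np.uint8)

def color_isolate(rgb: np.ndarray, channel: int = 0) -> np.ndarray:
    """Keep a single channel to neutralize color-contrast effects."""
    out = np.zeros_like(rgb)
    out[..., channel] = rgb[..., channel]
    return out

def binary_erode(mask: np.ndarray) -> np.ndarray:
    """3x3 erosion: strips thin distractor strokes (a minimal
    stand-in for morphological processing)."""
    p = np.pad(mask, 1, constant_values=0)
    h, w = mask.shape
    views = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.minimum.reduce(views)

def overlay_reference_lines(img: np.ndarray, step: int = 16) -> np.ndarray:
    """Draw a uniform grid so lengths and positions can be read off
    against a neutral ruler (e.g., for length illusions)."""
    out = img.copy()
    out[::step, :] = 255
    out[:, ::step] = 255
    return out
```

Each function maps to one illusion family: edges for contour-context illusions, channel isolation for color illusions, erosion for clutter, and the grid overlay for length or position comparisons.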

What carries the argument

Illusion-aware visual preprocessing using type-specific transformations to weaken illusion context while preserving task-relevant information, paired with anti-illusion prompting that directs the model to qualitative visual comparison.
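As an illustration of what such an anti-illusion prompt could look like, here is a hypothetical template; the wording is a guess at the strategy's shape, not the paper's actual prompt.

```python
# Hypothetical anti-illusion prompt; the paper's exact wording
# is not reproduced here.
ANTI_ILLUSION_PROMPT = (
    "Look only at the image provided. Do not rely on what you know "
    "about famous illusions. Compare the marked elements directly: "
    "describe their appearance in THIS image (relative length, size, "
    "or color), then answer the question.\n\nQuestion: {question}"
)

def build_prompt(question: str) -> str:
    """Fill the template with a per-item question."""
    return ANTI_ILLUSION_PROMPT.format(question=question)
```

The key move is redirecting the model from recall ("this is the Mueller-Lyer illusion, so...") to qualitative comparison of what is actually rendered.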

If this is right

  • The framework improves VLM performance on illusion tasks to over 90% without requiring model fine-tuning.
  • It demonstrates that visual manipulation and prompt design can mitigate reliance on memorized facts.
  • The approach is model-agnostic in principle, though the reported accuracy is demonstrated only with Claude.
  • Ensemble voting further enhances robustness to individual model errors.
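The 5-vote majority ensemble reduces to taking the mode over sampled answers; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer after light normalization.
    Ties typically break toward the earliest-seen answer, since
    sorting by count is stable over insertion order."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]
```

For example, with five samples per question, `majority_vote(["B", "B", "A", "B", "A"])` returns `"b"`. This smooths out occasional decoding errors without touching the model itself.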

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar techniques could address other cases where VLMs hallucinate or ignore visual input in favor of language priors.
  • The success suggests that illusion understanding is more about input preparation than inherent model limitations.
  • Extending this to dynamic video or real-world scenes might reveal how well the transformations hold up outside the challenge dataset.
  • It opens the possibility of developing general visual debiasing methods for other perceptual biases.

Load-bearing premise

The selected type-specific transformations weaken illusion-inducing elements without removing or altering information essential for determining the correct answer on the test images.

What would settle it

A new set of illusion images where the preprocessing transformations either fail to reduce the illusion effect or cause the model to misinterpret non-illusion aspects, leading to accuracy below random guessing or significantly lower than baseline.

Figures

Figures reproduced from arXiv: 2605.08841 by Jiahui Wang, Jinbo Wang, Junli Zha, Xinkai Lu.

Figure 1. Multi-VLM collaborative strategy discovery.
Figure 2. Comparison of two visual examples.
Figure 3. Illustration of per-type image preprocessing.
Figure 4. Overview of the illusion-aware VLM pipeline; the system classifies each question into one of seven illusion types.
read the original abstract

Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies: (1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering guiding VLMs toward qualitative visual comparison, and (3) multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at https://github.com/jasminezz/sf-illusion-aware-vlm.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free framework for the 5th DataCV Challenge Task 1, combining illusion-aware image preprocessing via type-specific transformations (edge extraction, color isolation, morphological processing, reference-line overlay), anti-illusion prompt engineering, and a 5-vote majority ensemble with Claude to improve VLM performance on visual illusions. It reports 90.48% accuracy on the official 630-image held-out test set and 98.41% on a human-verified subset, securing 2nd place without any fine-tuning.

Significance. If the transformations are shown to preserve task-critical cues, the work provides a practical, reproducible zero-shot approach to mitigating VLM reliance on memorized facts over visual perception in illusion tasks. The public code release and strong results on an external challenge test set add immediate utility for robust visual reasoning.

major comments (2)
  1. [illusion-aware image preprocessing] The central performance claim (90.48% accuracy) rests on the unverified assumption that the type-specific preprocessing transformations weaken only illusion-inducing context while retaining every cue needed for correct answers across all illusion categories in the 630-image distribution. No ablation studies, per-type error analysis, or quantitative preservation checks (e.g., VLM accuracy with vs. without each transformation) are described.
  2. [Methods and experimental setup] No details are provided on the process for selecting or tuning the transformation parameters, nor any sensitivity analysis, which is required to substantiate that the reported accuracy is attributable to the claimed mechanism rather than ad-hoc choices.
minor comments (2)
  1. [Abstract] Clarify the exact Claude model version referenced as 'claude-opus-4-6' and ensure consistency with standard naming conventions.
  2. [Results] Consider adding a table or figure showing component-wise contributions (preprocessing, prompting, ensemble) to the final accuracy for improved clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and methodological detail would strengthen the manuscript. We address each major comment below and will incorporate revisions to improve the rigor of the presentation.

read point-by-point responses
  1. Referee: [illusion-aware image preprocessing] The central performance claim (90.48% accuracy) rests on the unverified assumption that the type-specific preprocessing transformations weaken only illusion-inducing context while retaining every cue needed for correct answers across all illusion categories in the 630-image distribution. No ablation studies, per-type error analysis, or quantitative preservation checks (e.g., VLM accuracy with vs. without each transformation) are described.

    Authors: We agree that the current manuscript lacks ablation studies, per-type error analysis, and quantitative checks demonstrating that the transformations preserve task-critical cues while disrupting illusion-inducing context. Although the transformations were selected based on established perceptual mechanisms for each illusion category, the absence of these empirical validations is a genuine limitation. In the revised manuscript we will add (i) ablation results comparing accuracy with and without each individual transformation, (ii) per-illusion-category error breakdowns on the 630-image test set, and (iii) qualitative examples showing retained visual information for representative images from each category. These additions will directly address the concern and substantiate the central performance claim. revision: yes

  2. Referee: [Methods and experimental setup] No details are provided on the process for selecting or tuning the transformation parameters, nor any sensitivity analysis, which is required to substantiate that the reported accuracy is attributable to the claimed mechanism rather than ad-hoc choices.

    Authors: We acknowledge that the manuscript does not describe the parameter selection process or include sensitivity analysis. The parameters were chosen through visual inspection to target illusion-specific features while preserving recognizability, informed by the illusion taxonomy in the challenge data. To resolve this, the revision will include a new subsection detailing the exact parameter values for each transformation, the rationale and iterative selection procedure, and a sensitivity study showing how small perturbations in key parameters affect overall accuracy on the test set. This will clarify that the reported results stem from the intended mechanisms rather than arbitrary choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method evaluated on external held-out test set

full rationale

The paper describes a training-free pipeline of type-specific preprocessing transformations, prompt engineering, and majority-vote ensembling, with accuracy reported on the official 630-image challenge test set that was not used for any fitting or tuning. No equations, parameter estimation steps, self-citations, or uniqueness theorems appear in the provided text; the performance numbers are direct empirical measurements rather than quantities derived from the same inputs by construction. This matches the expected non-finding for a purely empirical challenge submission whose central claims do not reduce to self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that classic illusions can be mitigated by standard image-processing operations and that VLMs respond to prompt wording in predictable ways; no new entities or fitted constants are introduced.

axioms (1)
  • domain assumption VLMs exhibit systematic bias toward visual illusions by recalling memorized facts rather than perceiving actual visual differences
    Stated as the core problem the framework is designed to address.

pith-pipeline@v0.9.0 · 5517 in / 1256 out tokens · 41934 ms · 2026-05-12T01:35:51.220427+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

  1. [1] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.

  2. [2] Anthropic. The Claude model family: Opus, Sonnet, Haiku. Technical report, Anthropic, 2025.

  3. [3] Y. Chen, R. K. Sikka, A. Cober, S. Ji, and S. Divvala. Measuring and improving chain-of-thought reasoning in vision-language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.

  4. [4] T. N. Cornsweet. Visual Perception. Academic Press, New York, 1970.

  5. [5] DataCV Workshop Organizers. The 5th DataCV challenge task 1: Classic illusion understanding. In CVPR 2026 Workshop, 2026. https://sites.google.com/view/datacv-2026-cvpr/challenge.

  6. [6] Google. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026.

  7. [7] R. L. Gregory. Knowledge in perception and illusion. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 352(1358):1121–1127, 1997.

  8. [8] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  9. [9] W. Hong, W. Wang, M. Ding, W. Yu, Q. Lv, Y. Wang, Y. Cheng, S. Huang, and J. Tang. CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.

  10. [10] Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy, and Hehe Fan. Seeing is believing? A benchmark for multimodal large language models on visual illusions and anomalies, 2026.

  11. [11] G. Kanizsa. Subjective contours. Scientific American, 234(4):48–52, 1976.

  12. [12] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  13. [13] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.

  14. [14] Aparna Nair-Kanneganti, Trevor J. Chan, Shir Goldfinger, Emily Mackay, Brian Anthony, and Alison Pouch. Increasing LLM response trustworthiness using voting ensembles. arXiv preprint arXiv:2510.04048, 2025.

  15. [15] Qwen Team. Qwen3-VL technical report: Advanced visual reasoning and agent capabilities. arXiv preprint arXiv:2511.21631, 2025.

  16. [16] S. Rostamkhani, M. Ansari, A. Sahzevari, S. Rahmani, and S. Eetemadi. Illusory VQA: Benchmarking and enhancing multimodal models on visual illusions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

  17. [17] H. S. Shahgir, K. S. Sayeed, A. Bhattacharjee, W. U. Ahmad, Y. Dong, and R. Shahriyar. IllusionVQA: A challenging optical illusion dataset for vision language models. arXiv preprint arXiv:2403.15952, 2024.

  18. [18] X. Sun, M. Li, M. W. Sun, M. Endo, S. Wu, C. Li, Y. Zhang, Z. Wang, and S. Yeung-Levy. Do VLMs perceive or recall? Probing visual perception vs. memory with classic visual illusions. arXiv preprint arXiv:2601.22150, 2026.

  19. [19] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023.

  20. [20] M. Zhang, S. Yin, L. Li, J. Zhang, Z. He, and G. Wan. IllusionBench+: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models. arXiv preprint arXiv:2501.00848, 2025.