arxiv: 2606.18609 · v1 · pith:T6KACKDFnew · submitted 2026-06-17 · 💻 cs.CV

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

Pith reviewed 2026-06-26 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucination detection · medical vision-language models · counter-evidence verification · hallucination correction · bidirectional verification · four-quadrant map · medical VQA · medical report generation

The pith

Counter-evidence verification detects and corrects hallucinations in medical vision-language models by checking each statement against its supporting image region.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that a training-free method can identify and fix hallucinations in medical VLMs by testing whether each generated statement is supported by a corresponding visual evidence region extracted from the input image. CoEV performs bidirectional verification and places each statement on a four-quadrant map that combines text factuality with visual grounding. A sympathetic reader would care because this approach works without retraining the underlying model and produces measurable gains on detection and correction tasks across medical datasets. If the claim holds, clinicians would receive more dependable evidence-based outputs from these models on visual question answering and report generation.

Core claim

CoEV is a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. It performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement into a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining.

What carries the argument

Bidirectional verification between textual assertions and their corresponding visual evidence regions, organized into a four-quadrant map that classifies statements by factuality and grounding.
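The quadrant assignment can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the two scores, the 0.5 thresholds, the descriptive quadrant names (the paper's own Q1 to Q4 numbering is not reproduced on this page), and the rule that anything short of factual-and-grounded gets flagged.

```python
from dataclasses import dataclass
from enum import Enum


class Quadrant(Enum):
    # Descriptive labels chosen for this sketch, not the paper's numbering.
    FACTUAL_GROUNDED = "factual, grounded"
    FACTUAL_UNGROUNDED = "factual, ungrounded"
    NONFACTUAL_GROUNDED = "non-factual, grounded"
    NONFACTUAL_UNGROUNDED = "non-factual, ungrounded"


@dataclass(frozen=True)
class Verdict:
    statement: str
    text_score: float    # hypothetical text-factuality score in [0, 1]
    visual_score: float  # hypothetical visual-grounding score in [0, 1]


def assign_quadrant(v: Verdict, text_thr: float = 0.5, visual_thr: float = 0.5) -> Quadrant:
    """Place one statement on the two-axis map by thresholding each score."""
    factual = v.text_score >= text_thr
    grounded = v.visual_score >= visual_thr
    if factual and grounded:
        return Quadrant.FACTUAL_GROUNDED
    if factual:
        return Quadrant.FACTUAL_UNGROUNDED
    if grounded:
        return Quadrant.NONFACTUAL_GROUNDED
    return Quadrant.NONFACTUAL_UNGROUNDED


def is_flagged(q: Quadrant) -> bool:
    # Assumption: only the factual-and-grounded quadrant passes without review.
    return q is not Quadrant.FACTUAL_GROUNDED
```

Keeping the two axes separate until the final step is the point of the map: a statement that is plausible as text but has no image support lands in a different cell from one that is simply wrong.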

If this is right

  • Hallucination detection improves average PR-AUC by 3.0 and ROC-AUC by 3.9 absolute percentage points across four medical datasets.
  • Detection gains reach up to 18.5 percent in specific medical VQA scenarios.
  • Hallucination correction improves Micro-F1 by up to 12.5 percent.
  • Hallucination rates on medical report generation drop by more than 11.9 percent.
  • Medical VQA accuracy increases as a result of the corrections.
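For readers who want to see what the three headline metrics measure, here is a minimal standard-library sketch. It is not the paper's evaluation code: average precision is used as a common step-wise estimate of PR-AUC, and the labels, scores, and count triples in the usage note are invented.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision values at the rank of each positive (tied scores keep input order)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def micro_f1(counts: Sequence[Tuple[int, int, int]]) -> float:
    """Micro-F1 from per-class (tp, fp, fn) triples, pooled before computing F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

With labels `[1, 0, 1, 0]` (1 = hallucinated) and scores `[0.9, 0.8, 0.7, 0.1]`, `roc_auc` gives 0.75 and `average_precision` gives 5/6; `micro_f1([(8, 2, 1), (3, 1, 4)])` gives 22/30.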

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If region extraction proves reliable, the same verification step could be inserted into pipelines for non-medical image captioning to reduce unsupported claims.
  • The quadrant map could serve as a visual aid that lets a clinician quickly flag which sentences in an AI-generated report lack image support.
  • The approach might lower the cost of deploying VLMs in medicine by avoiding the need for domain-specific retraining on every new dataset.
  • Integrating the verification signal back into model decoding could test whether hallucinations can be prevented at generation time rather than corrected afterward.

Load-bearing premise

It is possible to reliably identify and extract the specific visual evidence region corresponding to each textual statement so that the bidirectional check can accurately test support.

What would settle it

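A controlled test in which the extracted visual regions are deliberately replaced with unrelated image patches. If the method's detection metrics then fall to near-random levels, the reported gains do depend on the evidence regions; if the metrics barely move, the regions are not doing the work the premise assigns them.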
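The shape of such a control can be sketched on toy data. Everything here is hypothetical: regions are stood in for by short strings, the scoring function is a substring check rather than a VLM, and the regions are rotated by one position instead of being drawn at random so the result is deterministic.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str, int]  # (statement, evidence region, label: 1 = hallucinated)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rotate_regions(samples: Sequence[Sample]) -> List[Sample]:
    """Pair each statement with the next sample's region, breaking the original pairing."""
    n = len(samples)
    return [(s[0], samples[(i + 1) % n][1], s[2]) for i, s in enumerate(samples)]


def control_gap(samples: Sequence[Sample],
                score_fn: Callable[[str, str], float]) -> Tuple[float, float]:
    """Return (ROC-AUC with true regions, ROC-AUC with mismatched regions)."""
    def run(batch: Sequence[Sample]) -> float:
        labels = [y for _, _, y in batch]
        scores = [score_fn(stmt, region) for stmt, region, _ in batch]
        return roc_auc(labels, scores)
    return run(samples), run(rotate_regions(samples))


# Toy stand-ins: a region "supports" a statement if it mentions the finding.
toy_samples: List[Sample] = [
    ("effusion", "effusion present", 0),
    ("nodule", "clear lung", 1),
    ("cardiomegaly", "cardiomegaly present", 0),
    ("fracture", "intact rib", 1),
]


def toy_score(stmt: str, region: str) -> float:
    # Hallucination score: 0.0 if the region mentions the finding, else 1.0.
    return 0.0 if stmt in region else 1.0


with_regions, with_mismatch = control_gap(toy_samples, toy_score)
```

On this toy set the detector separates the labels perfectly with the true regions (1.0) and falls to chance (0.5) once the pairing is broken, which is the pattern the test looks for in the real method.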

Figures

Figures reproduced from arXiv: 2606.18609 by Huazhu Fu, Hu Chen, Jiaqi Zhu, Ke Zou, Linchao He, Meng Liu, Nan Zhou, Yi Zhang.

Figure 1: (a) CoEV Overview. The CoEV diagnostic process performs counter-evidence verification and four-quadrant mapping to identify hallucinated claims. (b) Med-VQA Verification: Using diagnostic signals to validate answers and detect hallucinations. (c) Report Refinement: Rewriting hallucinated sentences using CoEV guidance.
… a generated text Tgen and its corresponding image I, CoEV models the hallucination detec…
Figure 2: Representative cases. Case 1 (Q2): description persists after masking, indicating ungrounded bias. Case 2 (Q1): description disappears post-masking, reflecting evidence-based reasoning. CoEV refines reports by aligning claims with visual evidence.
… 3.3 Ablation Study We ablate CoEV’s two core diagnostic dimensions: (1) the textual axis, which evaluates textual consistency, and (2) the visual axis, which ev…
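The masking logic in the Figure 2 cases can be illustrated with a toy sketch. The image type, the box convention, the substring match, and both `describe` stand-ins are assumptions; the paper's actual masking and claim-matching procedure is not given on this page.

```python
from typing import Callable, Tuple

Image = Tuple[Tuple[int, ...], ...]  # toy grayscale image as nested tuples
Box = Tuple[int, int, int, int]      # (row0, col0, row1, col1), end-exclusive


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with every pixel inside the box set to `fill`."""
    r0, c0, r1, c1 = box
    return tuple(
        tuple(fill if r0 <= r < r1 and c0 <= c < c1 else px for c, px in enumerate(row))
        for r, row in enumerate(image)
    )


def persists_after_masking(image: Image, box: Box, claim: str,
                           describe: Callable[[Image], str]) -> bool:
    """True if the claim still appears in the description once its region is hidden."""
    return claim in describe(mask_region(image, box))


# Hypothetical stand-ins for a VLM: one looks at pixels, one ignores them.
def grounded_describe(img: Image) -> str:
    return "opacity" if any(px > 200 for row in img for px in row) else "no finding"


def biased_describe(img: Image) -> str:
    return "opacity"


toy_image: Image = ((0, 0, 0), (0, 255, 0), (0, 0, 0))
toy_box: Box = (1, 1, 2, 2)
```

Here `persists_after_masking(toy_image, toy_box, "opacity", biased_describe)` is true, matching the "persists after masking" case, while the same call with `grounded_describe` is false, matching the "disappears post-masking" case.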
Original abstract

Vision-Language models (VLMs) reliability in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Co}unter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement into a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs.For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Counter-Evidence Verification (CoEV), a training-free, plug-and-play framework for detecting and correcting hallucinations in medical vision-language models (VLMs). CoEV conducts bidirectional verification between textual assertions and their corresponding visual evidence regions to assign statements to a four-quadrant map based on combinations of text factuality and visual grounding. Experiments across four medical datasets report consistent outperformance in hallucination detection (average +3.0% PR-AUC, +3.9% ROC-AUC) and correction (up to +12.5% Micro-F1, -11.9% hallucination rate).

Significance. If the central claims hold, the work provides a practical, retraining-free approach to mitigating hallucinations in medical VLMs, which could improve trustworthiness in clinical applications. The method's emphasis on evidence-based verification addresses a noted gap in existing attention-based or inconsistency-focused approaches.

major comments (1)
  1. [Framework description (bidirectional verification and four-quadrant map)] The four-quadrant classification depends on accurate identification and extraction of the specific visual evidence region for each textual statement. The manuscript does not report any validation, ablation, or error analysis of this region localization step, which is load-bearing for the detection performance claims, especially given the challenges of small, overlapping, or low-contrast anatomy in medical images.
minor comments (2)
  1. [Abstract] Formatting error in abstract: 'Co}unter-Evidence' should read 'Counter-Evidence'.
  2. [Abstract] Missing space in abstract: 'VLMs.For hallucination detection' should read 'VLMs. For hallucination detection'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment. We address it directly below and will revise the manuscript to incorporate additional analysis.

Point-by-point responses
  1. Referee: [Framework description (bidirectional verification and four-quadrant map)] The four-quadrant classification depends on accurate identification and extraction of the specific visual evidence region for each textual statement. The manuscript does not report any validation, ablation, or error analysis of this region localization step, which is load-bearing for the detection performance claims, especially given the challenges of small, overlapping, or low-contrast anatomy in medical images.

    Authors: We agree that the region localization step is central to the four-quadrant map and that the manuscript does not report dedicated validation, ablation, or error analysis of this component. The bidirectional verification relies on accurate mapping of statements to evidence regions, and we will add an ablation study evaluating localization accuracy (where ground-truth regions are available) along with error analysis focused on small, overlapping, or low-contrast medical structures. This will be included in the revised manuscript to better substantiate the detection claims. revision: yes
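One way the promised localization check could be scored, where ground-truth regions exist, is intersection-over-union against a threshold. This is a sketch under assumptions: neither the referee nor the rebuttal names a metric, and the axis-aligned box format and 0.5 threshold are illustrative choices.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(pred: Sequence[Box], gold: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of predicted regions whose IoU with the ground truth reaches `thr`."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and the same length")
    hits = sum(iou(p, g) >= thr for p, g in zip(pred, gold))
    return hits / len(pred)
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7, so that pair would count as a localization miss at the 0.5 threshold. Reporting this fraction separately for small or low-contrast structures would speak to the referee's specific concern.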

Circularity Check

0 steps flagged

No circularity: training-free framework with no fitted parameters or self-referential derivations

full rationale

The paper describes CoEV as a training-free, plug-and-play method that performs bidirectional verification between textual assertions and visual evidence regions to populate a four-quadrant map. No equations, parameter fitting, or self-citations are presented that would reduce the reported AUC or F1 gains to the input data by construction. The central claims rest on external empirical evaluation across four datasets rather than any definitional loop or renamed input. The extraction of evidence regions is an implementation assumption, not a self-defining step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework description does not introduce new physical or mathematical entities or state unproven background lemmas.

pith-pipeline@v0.9.1-grok · 5835 in / 1382 out tokens · 24612 ms · 2026-06-26T21:56:51.412782+00:00 · methodology

