
arxiv: 2606.24115 · v1 · pith:HUXODFEQnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI

A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

Pith reviewed 2026-06-26 01:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords hallucination detection · vision-language models · gastrointestinal endoscopy · VQA · white-box methods · ReXTrust · confident confabulation · medical AI

The pith

White-box method ReXTrust outperforms alternatives at detecting hallucinations in GI endoscopy VLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks nine hallucination detection methods across five VLMs on the Gut-VLM dataset of 4,392 GI endoscopy VQA pairs. It establishes that the white-box method ReXTrust records the highest AUC on every model tested, with a peak of 93.0 and a statistically significant lead over the next-best method in each case. White-box access to hidden states yields an average 19.5-point AUC gain, while token-level gray-box statistics rank as the strongest non-white-box option. The study also identifies confident confabulation as a persistent failure mode that defeats consistency-based and uncertainty-based detectors. These findings address a practical barrier to using VLMs safely in clinical endoscopy by showing which detection approaches work best on this underexplored domain.

Core claim

ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives. The work further identifies confident confabulation as a systemic failure for both consistency and uncertainty-based methods.
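The paired permutation test named in the core claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `auc` and `paired_permutation_test`, the toy scores, and the choice to swap the two detectors' scores per example (which presumes both are on a comparable scale, for instance after rank normalization) are all hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a hallucinated item (label 1) outscores
    a faithful one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_permutation_test(scores_a, scores_b, labels, n_perm=2000, seed=0):
    """Two-sided test of the AUC difference between two detectors that were
    scored on the same examples."""
    rng = random.Random(seed)
    observed = auc(scores_a, labels) - auc(scores_b, labels)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the detector identity is exchangeable,
            # so swap the pair with probability one half.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = auc(perm_a, labels) - auc(perm_b, labels)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (made up for illustration): 1 = hallucinated, 0 = faithful.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores_a = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.6, 0.4]
scores_b = [0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.3, 0.6]
observed_diff, p_value = paired_permutation_test(scores_a, scores_b, labels)
print(f"AUC difference = {observed_diff:.3f}, p = {p_value:.4f}")
```

With only eight toy examples the p-value is coarse; the reported p < 0.001 would need both a much larger test set and enough permutations to resolve that threshold.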

What carries the argument

ReXTrust, a white-box hallucination detector that uses access to internal hidden states of the VLM
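To make "access to internal hidden states" concrete, here is a minimal sketch of a linear probe trained on hidden-state vectors. It is a deliberately simple stand-in and should not be read as ReXTrust's architecture, which this page does not describe. The function names, the two-dimensional toy "hidden states", and the hyperparameters are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_score(weights, bias, x):
    """Estimated probability that the response behind hidden state x is hallucinated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy "hidden states" (hypothetical): real ones have thousands of dimensions
# and would be read out of a chosen transformer layer.
hallucinated = [[2.0, 1.5], [1.8, 2.2], [2.5, 1.9]]
faithful = [[-1.0, -0.5], [-1.5, -1.2], [-0.8, -2.0]]
features = hallucinated + faithful
labels = [1] * len(hallucinated) + [0] * len(faithful)

weights, bias = train_probe(features, labels)
for x, y in zip(features, labels):
    print(y, round(probe_score(weights, bias, x), 3))
```

The design point is that the probe is supervised: unlike consistency or entropy scores, it needs labeled examples from the target domain, which may be part of why white-box methods can pick up errors the model itself is confident about.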

If this is right

  • White-box hidden-state access should be prioritized when reliable hallucination detection is required for medical VLMs.
  • Token-level probability and entropy statistics serve as the best practical fallback when internal states are unavailable.
  • Confident confabulation limits the reliability of black-box and clustering-based detectors on this task.
  • Performance gaps between methods widen on weaker base models such as LLaVA-v1.6-7B.
  • The Gut-VLM dataset supplies a targeted benchmark for evaluating hallucination detectors in gastrointestinal endoscopy.
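The token-level statistics mentioned in the second bullet can be sketched as below. The definitions are assumptions based on common usage (negative log-probability of the generated token and entropy of the per-step distribution, aggregated by mean or max); the page does not spell out the paper's exact formulas, and the toy distributions are invented.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def token_level_scores(step_dists, chosen_ids):
    """Aggregate per-token uncertainty into four response-level scores.

    step_dists: one probability distribution per generated token.
    chosen_ids: index of the token actually generated at each step.
    Higher scores mean more uncertainty, i.e. more suspicion of hallucination.
    """
    nll = [-math.log(dist[i]) for dist, i in zip(step_dists, chosen_ids)]
    ent = [token_entropy(dist) for dist in step_dists]
    return {
        "AvgProb": sum(nll) / len(nll),  # mean negative log-probability
        "MaxProb": max(nll),             # least likely generated token
        "AvgEnt": sum(ent) / len(ent),   # mean per-step entropy
        "MaxEnt": max(ent),              # most uncertain step
    }


# Toy three-token vocabulary (hypothetical values).
confident = token_level_scores([[0.97, 0.02, 0.01], [0.95, 0.03, 0.02]], [0, 0])
uncertain = token_level_scores([[0.97, 0.02, 0.01], [0.40, 0.35, 0.25]], [0, 0])
print(confident)
print(uncertain)
```

The max-based variants react to a single uncertain token, which is one plausible reason they would do well on short VQA answers where one word carries the diagnosis.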

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • White-box advantages observed here may extend to other clinical VLM applications such as radiology or pathology.
  • Integrating ReXTrust-style detectors into clinical workflows could reduce the risk of acting on hallucinated outputs during endoscopy procedures.
  • New detection techniques may be needed to handle confident confabulation cases that current methods miss.
  • The benchmark could be extended to video-based endoscopy sequences to test temporal consistency of detections.

Load-bearing premise

The Gut-VLM test VQA pairs carry reliable ground-truth labels for hallucinations and the five VLMs plus nine detection methods were implemented without systematic bias.

What would settle it

Re-labeling a random subset of the 4,392 Gut-VLM pairs by independent clinicians and re-computing all AUCs to check whether the reported ranking of ReXTrust versus the other eight methods reverses.

Figures

Figures reproduced from arXiv: 2606.24115 by Aminu Lawal, Binod Bhattarai, Maria Carmen Romano, Niyoj Oli, Prashnna Gyawali, Sachin Acharya.

Figure 1: Overview of the benchmark pipeline. An image-question pair from Gut-VLM is passed to five VLMs, which produce hidden states, token probabilities, and generated responses utilized by nine hallucination detection methods across three access categories. Actual hallucination labels are derived independently by the GREEN model, which compares the generated response against the expert-verified reference answer. …

Figure 2: Following the stratified partitioning established by [9], we utilize a 20% …

Figure 3: AUC (%) of hallucination detection methods across five VLMs on the Gut-VLM dataset. Higher values (green) indicate better detection performance, while lower values (red) indicate poor performance.

Figure 4: GREEN model-based hallucination labeling. The GREEN model scores generated answer and ground-truth answer pairs to produce hallucination labels. (a) A non-hallucinated example receiving a GREEN score of 1.0, indicating full factual alignment with the reference. (b) A hallucinated example receiving a GREEN score of 0.0, indicating a critical factual inconsistency with the ground truth.

Figure 5: Confident confabulation in Lingshu-32B. The model predicts Colon on 8 of 10 stochastic samples for an image depicting the Cecum. The high inter-sample consistency yields a low SelfCheckGPT-NLI score (0.1050), causing the hallucination to be misclassified as non-hallucinated. Despite this, the GREEN model correctly identifies the factual error and assigns a hallucination label, highlighting the failure of …
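The Figure 5 scenario can be reproduced numerically with a small sketch. It uses exact string matching as a stand-in for the NLI-based comparison in SelfCheckGPT-NLI and Semantic Entropy, so the numbers are illustrative only. The 8-of-10 split comes from the caption; the two minority answers ("Cecum" and "Ileum") and the function name are assumptions.

```python
import math
from collections import Counter


def consistency_signals(samples):
    """Return the majority answer, the share of samples that disagree with it,
    and the entropy (in nats) over exact-match answer clusters."""
    counts = Counter(samples)
    majority, top = counts.most_common(1)[0]
    n = len(samples)
    disagreement = (n - top) / n
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return majority, disagreement, entropy


# Ten stochastic samples for an image whose reference answer is "Cecum".
samples = ["Colon"] * 8 + ["Cecum", "Ileum"]
reference = "Cecum"

majority, disagreement, entropy = consistency_signals(samples)
max_entropy = math.log(len(samples))  # every sample different

print(f"majority answer: {majority} (correct: {majority == reference})")
print(f"disagreement: {disagreement:.2f}, entropy: {entropy:.3f} of max {max_entropy:.3f}")
```

Both signals look reassuring (little disagreement, entropy well below its maximum) even though the majority answer is wrong, which is the sense in which consistency-based detectors miss confident confabulation.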
Abstract

Vision-language models (VLMs) are prone to hallucination, which remains a major barrier to their safe deployment in clinical practice. To date, most hallucination detection methods have been evaluated on radiology benchmarks such as MIMIC-CXR and VQA-RAD, while gastrointestinal (GI) endoscopy remains largely underexplored. In this paper, we benchmark nine hallucination detection methods on the Gut-VLM dataset, a GI diagnostic Visual Question Answering (VQA) dataset with 4,392 test VQA pairs, across five VLMs (MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B). The methods span three categories: black-box methods (RadFlag, SelfCheckGPT-NLI), gray-box methods (AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, and VASE), and a white-box method (ReXTrust). Our results show that ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average (range: 9.5–33.5), with ReXTrust maintaining strong performance even on LLaVA-v1.6-7B (AUC 79.9), where black-box methods and clustering-based gray-box methods collapse to near-chance performance. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives, outperforming both clustering-based gray-box methods (Semantic Entropy, VASE) and black-box approaches on average. We further identify confident confabulation, a failure mode in which models hallucinate with high inter-sample consistency or high token-level probability, as a systemic failure for both consistency and uncertainty-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript benchmarks nine hallucination detection methods (black-box: RadFlag, SelfCheckGPT-NLI; gray-box: AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, VASE; white-box: ReXTrust) on the Gut-VLM GI endoscopy VQA dataset (4,392 test pairs) across five VLMs (MedGemma-4B/27B, LLaVA-Med-7B, LLaVA-v1.6-7B, Lingshu-32B). It claims ReXTrust attains the highest AUC on all models (peak 93.0 on MedGemma-4B), outperforming the strongest non-white-box alternative on each VLM by a statistically significant margin via paired permutation tests (p < 0.001), with white-box access conferring a 19.5 AUC point average advantage (range 9.5–33.5). Token-level gray-box statistics outperform clustering-based gray-box and black-box methods on average, and the work identifies 'confident confabulation' as a systemic failure mode for consistency- and uncertainty-based detectors.

Significance. If the ground-truth labels are shown to be reliable, the results would be significant by delivering the first systematic hallucination-detection benchmark in the clinically important but underexplored GI endoscopy domain. The consistent, statistically tested superiority of white-box hidden-state methods and the identification of confident confabulation supply concrete guidance for detector selection in safety-critical settings. The multi-VLM evaluation and introduction of the Gut-VLM test set enhance reproducibility and domain coverage.

major comments (1)
  1. [Dataset section] In the description of Gut-VLM construction, the protocol for producing ground-truth hallucination labels on the 4,392 test VQA pairs is not described. No information is supplied on annotator qualifications, number of annotators, inter-rater agreement, or the operational definition of hallucination in the GI context. Because the reported AUC values, 19.5-point advantage, and p < 0.001 claims rest directly on label accuracy, this omission is load-bearing for the central empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting the importance of transparent dataset construction details. We agree that the current manuscript lacks sufficient description of the ground-truth labeling protocol for Gut-VLM, which is critical for supporting the reported AUC results and statistical claims. We will revise the manuscript to address this.

point-by-point responses
  1. Referee: [Dataset section] In the description of Gut-VLM construction, the protocol for producing ground-truth hallucination labels on the 4,392 test VQA pairs is not described. No information is supplied on annotator qualifications, number of annotators, inter-rater agreement, or the operational definition of hallucination in the GI context. Because the reported AUC values, 19.5-point advantage, and p < 0.001 claims rest directly on label accuracy, this omission is load-bearing for the central empirical claims.

    Authors: We acknowledge that the Dataset section in the submitted manuscript does not provide the requested details on how the 4,392 ground-truth hallucination labels were generated. This information is necessary to allow readers to assess label reliability. In the revised version, we will expand the Dataset section with: the operational definition of hallucination applied in the GI endoscopy VQA setting; the number of annotators and their qualifications (e.g., clinical expertise in gastroenterology); the full annotation protocol; and quantitative inter-rater agreement statistics. These additions will directly support the validity of the benchmark results and the statistical comparisons. revision: yes
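One common way to report the inter-rater agreement discussed in this exchange is Cohen's kappa, sketched below for two annotators. This is an assumption about what such a statistic could look like, not something the authors commit to; the function name and the toy labels are invented.

```python
def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Toy hallucination labels from two hypothetical annotators (1 = hallucinated).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The same function could compare automatically derived labels against a clinician-relabeled subset, which is the check proposed under "What would settle it".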

Circularity Check

0 steps flagged

Pure empirical benchmark; no derivations or self-referential predictions

full rationale

The paper is a direct empirical comparison of nine hallucination detection methods (black-box, gray-box, white-box) on the Gut-VLM VQA dataset across five VLMs, reporting AUC values, paired permutation tests, and average advantages. No equations, fitted parameters presented as predictions, ansatzes, or derivation chains appear in the abstract or described content. Central claims rest on implementation and evaluation of external methods against ground-truth labels rather than any self-definition or self-citation reduction. This is the most common honest finding for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; no mathematical axioms, free parameters, or invented entities are introduced or required.



Reference graph

Works this paper leans on

  [1] Arnold, M., Abnet, C.C., Neale, R.E., Vignat, J., Giovannucci, E.L., McGlynn, K.A., Bray, F.: Global burden of 5 major types of gastrointestinal cancer. Gastroenterology (2020). https://doi.org/10.1053/j.gastro.2020.02.068

  [2] Azaria, A., Mitchell, T.M.: The internal state of an LLM knows when it's lying. arXiv abs/2304.13734 (2023). https://doi.org/10.18653/v1/2023.findings-emnlp.68

  [3] Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large language models using semantic entropy. Nature 630(8017), 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0, https://www.nature.com/articles/s41586-024-07421-0

  [4] Gu, C., Zhang, W., Huang, Z., et al.: Lens: Layers of evaluation of hallucination in GenAI systems. In: 2024 7th International Conference on Universal Village (UV). pp. 1–85 (2024). https://doi.org/10.1109/UV63228.2024.11189150

  [5] Hardy, R., Kim, S.E., Ro, D.H., Rajpurkar, P.: ReXTrust: A model for fine-grained hallucination detection in AI-generated radiology reports. In: Wu, J., Zhu, J., Xu, M., Jin, Y. (eds.) Proceedings of The First AAAI Bridge Program on AI for Medicine and Healthcare. Proceedings of Machine Learning Research, vol. 281, pp. 173–182. PMLR (25 Feb 2025), https:/...

  [6] He, P., Gao, J., Chen, W.: DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2111.09543, arXiv:2111.09543

  [7] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020), https://arxiv.org/abs/2003.10286

  [8] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019), https://arxiv.org/abs/1901.07042

  [9] Khanal, B., Pokhrel, S., Bhandari, S., et al.: Hallucination-aware multimodal benchmark for gastrointestinal image analysis with large vision-language models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 235–245. Springer (2025)

  [10] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 180251 (2018). https://doi.org/10.1038/sdata.2018.251, https://www.nature.com/articles/sdata2018251

  [11] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890 (2023), https://arxiv.org/abs/2306.00890

  [12] Li, Q., Geng, J., Lyu, C., Zhu, D., Panov, M., Karray, F.: Reference-free hallucination detection for large vision-language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 4542–4551. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.262, https://aclanthology.org/2...

  [13] Liao, Z., Hu, S., Zou, K., Fu, H., Zhen, L., Xia, Y.: Vision-amplified semantic entropy for hallucination detection in medical visual question answering. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. Lecture Notes in Computer Science, vol. 15964. Springer (2025). https://doi.org/10.1007/978-3-032-04971-1_63

  [14] Liu, B., Zhan, L., Xu, L., Ma, L., Yang, Y., Wu, X.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–… IEEE, Nice, France (2021). https://doi.org/10.1109/ISBI48211.2021.9434010, https://ieeexplore.ieee.org/document/9434010

  [15] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)

  [16] Manakul, P., Liusie, A., Gales, M.: SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 9004–9017. Association for Computational Linguistics, Singapore (2023). https://doi.org/10.18653/v1/2023.emnlp-main.557...

  [17] Mota, J., Almeida, M.J., Mendes, F., et al.: A comprehensive review of artificial intelligence and colon capsule endoscopy: Opportunities and challenges. Diagnostics 14 (2024). https://doi.org/10.3390/diagnostics14182072

  [18] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Michalson, A.E., Moseley, M., Langlotz, C., Chaudhari, A.S., Delbrouck, J.: GREEN: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 374–390. Association for Computational Linguistics, Miami...

  [19] Pogorelov, K., Randel, K.R., Griwodz, C., et al.: Kvasir: A multi-class image-dataset for computer aided gastrointestinal disease detection. In: Proceedings of the 8th ACM Multimedia Systems Conference. ACM, Taiwan (2017). https://doi.org/10.1145/3083187.3083212

  [20] Sawczyn, A., Binkowski, J., Janiak, D., Gabrys, B., Kajdanowicz, T.: FactSelfCheck: Fact-level black-box hallucination detection for LLMs (2025), https://arxiv.org/abs/2503.17229

  [21] Sellergren, A., Kazemzadeh, S., Jaroensri, T., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025), https://arxiv.org/abs/2507.05201

  [22] Shorinwa, O., Mei, Z., Lidard, J., Ren, A.Z., Majumdar, A.: A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. ACM Computing Surveys 57(4), 1–42 (2025). https://doi.org/10.1145/3744238, arXiv:2412.05563

  [23] Xu, W., Chan, H.P., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

  [24] Zhang, S., Sambara, S., Banerjee, O., Acosta, J.N., Fahrner, L.J., Rajpurkar, P.: RadFlag: A black-box hallucination detection method for medical vision language models. In: Proceedings of the 4th Machine Learning for Health Symposium. Proceedings of Machine Learning Research, vol. 259, pp. 1087–1103. PMLR (15–16 Dec 2025), https://proceedings.mlr.press/v...

  [25] Zhang, Y., Xie, R., Sun, X., et al.: DHCP: Detecting hallucinations by cross-modal attention pattern in large vision-language models. Proceedings of the 33rd ACM International Conference on Multimedia (2024)