A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

Aminu Lawal; Binod Bhattarai; Maria Carmen Romano; Niyoj Oli; Prashnna Gyawali; Sachin Acharya

arxiv: 2606.24115 · v1 · pith:HUXODFEQnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 01:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords hallucination detection · vision-language models · gastrointestinal endoscopy · VQA · white-box methods · ReXTrust · confident confabulation · medical AI

The pith

White-box method ReXTrust outperforms alternatives at detecting hallucinations in GI endoscopy VLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks nine hallucination detection methods across five VLMs on the Gut-VLM dataset of 4,392 GI endoscopy VQA pairs. It establishes that the white-box method ReXTrust records the highest AUC on every model tested, with a peak of 93.0 and a statistically significant lead over the next-best method in each case. White-box access to hidden states yields an average 19.5-point AUC gain, while token-level gray-box statistics rank as the strongest non-white-box option. The study also identifies confident confabulation as a persistent failure mode that defeats consistency-based and uncertainty-based detectors. These findings address a practical barrier to using VLMs safely in clinical endoscopy by showing which detection approaches work best on this underexplored domain.

Core claim

ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives. The work further identifies confident confabulation as a systemic failure for both consistency and uncertainty-based methods.
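The significance claim above rests on a paired permutation test over AUC differences. The following is a minimal sketch of one common way to run such a test, not the authors' implementation: the function names, the per-example score swap, and the add-one correction are assumptions, and swapping presumes both detectors' scores sit on comparable scales (for example, after rank normalization).

```python
import random


def auc(labels, scores):
    """AUC as the probability that a hallucinated item outscores a faithful one.

    labels: 1 = hallucinated, 0 = not hallucinated. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_permutation_test(labels, scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided p-value for the AUC difference between detectors A and B."""
    rng = random.Random(seed)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two detectors are exchangeable
            # on every example, so their scores are swapped at random.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = auc(labels, perm_a) - auc(labels, perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_perm + 1)
```

With this estimator the smallest reachable p-value is 1 / (n_perm + 1), so supporting a claim of p < 0.001 needs at least 1,000 permutations.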

What carries the argument

ReXTrust, a white-box hallucination detector that uses access to internal hidden states of the VLM
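To make "uses access to internal hidden states" concrete, here is a minimal sketch of a generic logistic-regression probe trained on hidden-state vectors. It is a simplified stand-in, not ReXTrust's architecture: the function names, the learning rate, and the two-dimensional toy vectors are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with batch gradient descent.

    features: one hidden-state vector per response (toy vectors in this sketch).
    labels: 1 = hallucinated, 0 = not hallucinated.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_score(weights, bias, x):
    """Estimated probability that the response behind hidden state x is hallucinated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


if __name__ == "__main__":
    # Hypothetical 2-D "hidden states"; real ones have thousands of dimensions.
    states = [[1.0, 0.2], [0.9, 0.1], [-1.0, 0.1], [-0.8, 0.3]]
    is_hallucinated = [1, 1, 0, 0]
    w, b = train_probe(states, is_hallucinated)
    print(probe_score(w, b, [0.95, 0.15]))
```

A probe of this kind needs labeled training responses from the same VLM, which is the practical cost of white-box detection compared with the training-free black-box and gray-box scores.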

If this is right

White-box hidden-state access should be prioritized when reliable hallucination detection is required for medical VLMs.
Token-level probability and entropy statistics serve as the best practical fallback when internal states are unavailable.
Confident confabulation limits the reliability of black-box and clustering-based detectors on this task.
Performance gaps between methods widen on weaker base models such as LLaVA-v1.6-7B.
The Gut-VLM dataset supplies a targeted benchmark for evaluating hallucination detectors in gastrointestinal endoscopy.
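The token-level fallback named in the list above can be sketched as follows. This uses one common formulation of the four statistics (negative log-probability of the generated token, and entropy of each per-token distribution); the paper's exact definitions are not given here, so the function name, the dictionary keys, and the sign convention (higher means more suspicious) are assumptions.

```python
import math


def token_level_scores(token_dists, chosen_ids):
    """Gray-box hallucination scores from per-token output distributions.

    token_dists: one probability distribution (list of floats) per generated token.
    chosen_ids: index of the token actually generated at each step.
    Higher values indicate a more suspicious response.
    """
    if not token_dists or len(token_dists) != len(chosen_ids):
        raise ValueError("need one chosen token id per distribution")
    # Negative log-probability of each generated token.
    nll = [-math.log(dist[i]) for dist, i in zip(token_dists, chosen_ids)]
    # Shannon entropy of each per-token distribution (0 * log 0 treated as 0).
    ent = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return {
        "AvgProb": sum(nll) / len(nll),
        "MaxProb": max(nll),
        "AvgEnt": sum(ent) / len(ent),
        "MaxEnt": max(ent),
    }


if __name__ == "__main__":
    # Toy three-word vocabulary: a confident first token, an uncertain second.
    dists = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]
    print(token_level_scores(dists, chosen_ids=[0, 0]))
```

The Max variants react to a single uncertain token, while the Avg variants dilute it across the whole answer, which matters for short VQA responses where one word carries the diagnosis.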

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

White-box advantages observed here may extend to other clinical VLM applications such as radiology or pathology.
Integrating ReXTrust-style detectors into clinical workflows could reduce the risk of acting on hallucinated outputs during endoscopy procedures.
New detection techniques may be needed to handle confident confabulation cases that current methods miss.
The benchmark could be extended to video-based endoscopy sequences to test temporal consistency of detections.

Load-bearing premise

The Gut-VLM test VQA pairs carry reliable ground-truth labels for hallucinations and the five VLMs plus nine detection methods were implemented without systematic bias.

What would settle it

Re-labeling a random subset of the 4,392 Gut-VLM pairs by independent clinicians and re-computing all AUCs to check whether the reported ranking of ReXTrust versus the other eight methods reverses.
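The check described above can be sketched as: draw a random subset, recompute each method's AUC under the original and the clinician-assigned labels, and see whether the leading method changes. All names and the toy data are hypothetical; nothing here reproduces the paper's numbers.

```python
import random


def auc(labels, scores):
    """Pairwise AUC; labels use 1 = hallucinated, ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sample_subset(n_items, k, seed=0):
    """Indices of a reproducible random subset to send for re-labeling."""
    return sorted(random.Random(seed).sample(range(n_items), k))


def top_method(labels, method_scores):
    """Name of the method with the highest AUC, plus all AUCs."""
    aucs = {name: auc(labels, scores) for name, scores in method_scores.items()}
    return max(aucs, key=aucs.get), aucs


def leader_survives_relabeling(original_labels, new_labels, method_scores):
    """True if the best method is the same under both label sets."""
    before, _ = top_method(original_labels, method_scores)
    after, _ = top_method(new_labels, method_scores)
    return before == after


if __name__ == "__main__":
    # Hypothetical detector scores on a re-labeled subset of four items.
    scores = {"detector_a": [0.1, 0.2, 0.8, 0.9], "detector_b": [0.9, 0.1, 0.2, 0.8]}
    automatic = [0, 0, 1, 1]   # labels from the automatic pipeline
    clinician = [0, 0, 1, 1]   # labels from independent clinicians
    print(leader_survives_relabeling(automatic, clinician, scores))
```

In practice the scores and both label lists would be restricted to the indices returned by `sample_subset` before the comparison is made.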

Figures

Figures reproduced from arXiv: 2606.24115 by Aminu Lawal, Binod Bhattarai, Maria Carmen Romano, Niyoj Oli, Prashnna Gyawali, Sachin Acharya.

**Figure 1.** Overview of the benchmark pipeline. An image-question pair from Gut-VLM is passed to five VLMs, which produce hidden states, token probabilities, and generated responses utilized by nine hallucination detection methods across three access categories. Actual hallucination labels are derived independently by the GREEN model, which compares the generated response against the expert-verified reference answer. …

**Figure 2.** Following the stratified partitioning established by [9], we utilize a 20% …

**Figure 3.** AUC (%) of hallucination detection methods across five VLMs on the Gut-VLM dataset. Higher values (green) indicate better detection performance, while lower values (red) indicate poor performance.

**Figure 4.** GREEN model-based hallucination labeling. The GREEN model scores generated answer and ground-truth answer pairs to produce hallucination labels. (a) A non-hallucinated example receiving a GREEN score of 1.0, indicating full factual alignment with the reference. (b) A hallucinated example receiving a GREEN score of 0.0, indicating a critical factual inconsistency with the ground truth.

**Figure 5.** Confident confabulation in Lingshu-32B. The model predicts Colon on 8 of 10 stochastic samples for an image depicting the Cecum. The high inter-sample consistency yields a low SelfCheckGPT-NLI score (0.1050), causing the hallucination to be misclassified as non-hallucinated. Despite this, the GREEN model correctly identifies the factual error and assigns a hallucination label, highlighting the failure of …
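The failure in Figure 5 can be illustrated with a simplified consistency check. This sketch replaces the NLI model used by SelfCheckGPT-NLI with exact string matching, so it does not reproduce the reported 0.1050 score; the function names, the 0.5 threshold, and the content of the two dissenting samples are assumptions.

```python
def disagreement_score(main_answer, sampled_answers):
    """Fraction of stochastic samples that differ from the main answer.

    Exact match after normalization stands in for an NLI contradiction
    probability. Higher values mean the model is less self-consistent.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")

    def norm(text):
        return text.strip().lower()

    differing = sum(1 for s in sampled_answers if norm(s) != norm(main_answer))
    return differing / len(sampled_answers)


def flag_hallucination(main_answer, sampled_answers, threshold=0.5):
    """Flag the answer when disagreement exceeds a (hypothetical) threshold."""
    return disagreement_score(main_answer, sampled_answers) > threshold


if __name__ == "__main__":
    # 8 of 10 samples repeat the wrong answer; the other two are assumed here.
    samples = ["Colon"] * 8 + ["Cecum"] * 2
    print(disagreement_score("Colon", samples))
    print(flag_hallucination("Colon", samples))
```

Because the model repeats the same wrong answer, the disagreement stays low and the detector does not fire, regardless of what the image actually shows. A consistency score measures agreement among samples, not agreement with the reference.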

Original abstract

Vision-language models (VLMs) are prone to hallucination, which remains a major barrier to their safe deployment in clinical practice. To date, most hallucination detection methods have been evaluated on radiology benchmarks such as MIMIC-CXR and VQA-RAD, while gastrointestinal (GI) endoscopy remains largely underexplored. In this paper, we benchmark nine hallucination detection methods on the Gut-VLM dataset, a GI diagnostic Visual Question Answering (VQA) dataset with 4,392 test VQA pairs, across five VLMs (MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B). The methods span three categories: black-box methods (RadFlag, SelfCheckGPT-NLI), gray-box methods (AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, and VASE), and a white-box method (ReXTrust). Our results show that ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average (range: 9.5–33.5), with ReXTrust maintaining strong performance even on LLaVA-v1.6-7B (AUC 79.9), where black-box methods and clustering-based gray-box methods collapse to near-chance performance. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives, outperforming both clustering-based gray-box methods (Semantic Entropy, VASE) and black-box approaches on average. We further identify confident confabulation, a failure mode in which models hallucinate with high inter-sample consistency or high token-level probability, as a systemic failure for both consistency and uncertainty-based methods.
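The clustering-based gray-box category in the abstract can be illustrated with a discrete form of semantic entropy: group sampled answers by meaning, then take the Shannon entropy of the cluster proportions. In this minimal sketch, normalized string equality stands in for the entailment-based clustering of the original method, and VASE's vision-specific weighting is not modeled; the function name is an assumption.

```python
import math
from collections import Counter


def cluster_entropy(sampled_answers):
    """Shannon entropy (in nats) over clusters of sampled answers.

    Answers are clustered by normalized string equality, a stand-in for
    grouping answers that entail each other. Higher entropy means the
    samples spread over more distinct meanings.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    clusters = Counter(a.strip().lower() for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())


if __name__ == "__main__":
    # A model that spreads over five different answers scores high.
    print(cluster_entropy(["Colon", "Cecum", "Ileum", "Rectum", "Stomach"]))
    # A model that repeats one answer scores zero, right or wrong.
    print(cluster_entropy(["Colon"] * 10))
```

The second call shows why such scores are blind to confident confabulation: entropy falls to zero whenever the samples agree, whether or not the shared answer is correct.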

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This benchmark on Gut-VLM is new and shows white-box detection ahead by about 19.5 AUC points on average, but the abstract gives no information on how the hallucination labels were produced or checked.

ReXTrust, the white-box method, leads on this new benchmark for hallucination detection in GI endoscopy VLMs, with the highest AUC on all five models tested and a consistent edge over the next best method.

The paper evaluates on Gut-VLM, with 4,392 test VQA pairs, and runs nine detectors spanning black-box, gray-box, and white-box categories across MedGemma, LLaVA-Med, LLaVA-v1.6, and Lingshu models. The concrete results are the 19.5-point average advantage for hidden-state access, the fact that it holds up on LLaVA-v1.6-7B where black-box and clustering-based gray-box methods drop to near chance, and the finding that token-level gray-box statistics beat clustering-based and black-box options. The paper also flags confident confabulation as a shared failure mode. The domain focus on GI endoscopy rather than radiology is the main addition.

The results line up across models, which strengthens the pattern. The statistical tests are reported as paired permutation tests with p<0.001.

The clear gap is the ground-truth labels. The abstract states the dataset exists but supplies no annotation protocol, rater details, definition of hallucination in this context, or agreement metrics. If label noise is present or correlates with model behavior, the AUC gaps and significance claims become harder to trust. Implementation details for the detectors are also not visible from the abstract.

This is for groups working on medical VLM safety and evaluation. It deserves peer review so the label construction and full pipeline can be examined.

Referee Report

1 major / 0 minor

Summary. The manuscript benchmarks nine hallucination detection methods (black-box: RadFlag, SelfCheckGPT-NLI; gray-box: AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, VASE; white-box: ReXTrust) on the Gut-VLM GI endoscopy VQA dataset (4,392 test pairs) across five VLMs (MedGemma-4B/27B, LLaVA-Med-7B, LLaVA-v1.6-7B, Lingshu-32B). It claims ReXTrust attains the highest AUC on all models (peak 93.0 on MedGemma-4B), outperforming the strongest non-white-box alternative on each VLM by a statistically significant margin via paired permutation tests (p < 0.001), with white-box access conferring a 19.5 AUC point average advantage (range 9.5–33.5). Token-level gray-box statistics outperform clustering-based gray-box and black-box methods on average, and the work identifies 'confident confabulation' as a systemic failure mode for consistency- and uncertainty-based detectors.

Significance. If the ground-truth labels are shown to be reliable, the results would be significant by delivering the first systematic hallucination-detection benchmark in the clinically important but underexplored GI endoscopy domain. The consistent, statistically tested superiority of white-box hidden-state methods and the identification of confident confabulation supply concrete guidance for detector selection in safety-critical settings. The multi-VLM evaluation and introduction of the Gut-VLM test set enhance reproducibility and domain coverage.

major comments (1)

[Dataset section] Description of Gut-VLM construction: The protocol for producing ground-truth hallucination labels on the 4,392 test VQA pairs is not described. No information is supplied on annotator qualifications, number of annotators, inter-rater agreement, or the operational definition of hallucination in the GI context. Because the reported AUC values, 19.5-point advantage, and p < 0.001 claims rest directly on label accuracy, this omission is load-bearing for the central empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting the importance of transparent dataset construction details. We agree that the current manuscript lacks sufficient description of the ground-truth labeling protocol for Gut-VLM, which is critical for supporting the reported AUC results and statistical claims. We will revise the manuscript to address this.

Referee: [Dataset section] Description of Gut-VLM construction: The protocol for producing ground-truth hallucination labels on the 4,392 test VQA pairs is not described. No information is supplied on annotator qualifications, number of annotators, inter-rater agreement, or the operational definition of hallucination in the GI context. Because the reported AUC values, 19.5-point advantage, and p < 0.001 claims rest directly on label accuracy, this omission is load-bearing for the central empirical claims.

Authors: We acknowledge that the Dataset section in the submitted manuscript does not provide the requested details on how the 4,392 ground-truth hallucination labels were generated. This information is necessary to allow readers to assess label reliability. In the revised version, we will expand the Dataset section with: the operational definition of hallucination applied in the GI endoscopy VQA setting; the number of annotators and their qualifications (e.g., clinical expertise in gastroenterology); the full annotation protocol; and quantitative inter-rater agreement statistics. These additions will directly support the validity of the benchmark results and the statistical comparisons.

revision: yes
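The rebuttal does not say which inter-rater agreement statistic would be reported. One standard choice for two annotators assigning binary hallucination labels is Cohen's kappa, sketched here with hypothetical labels; the function name and the data are illustrative only.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical labels: 1 = hallucinated, 0 = not hallucinated.
    clinician_1 = [1, 1, 0, 0, 1, 0, 0, 1]
    clinician_2 = [1, 1, 0, 0, 0, 0, 0, 1]
    print(cohens_kappa(clinician_1, clinician_2))
```

Kappa corrects raw agreement for chance, which matters when one class dominates: two raters who both label almost everything as non-hallucinated can agree often while kappa stays low.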

Circularity Check

0 steps flagged

Pure empirical benchmark; no derivations or self-referential predictions

The paper is a direct empirical comparison of nine hallucination detection methods (black-box, gray-box, white-box) on the Gut-VLM VQA dataset across five VLMs, reporting AUC values, paired permutation tests, and average advantages. No equations, fitted parameters presented as predictions, ansatzes, or derivation chains appear in the abstract or described content. Central claims rest on implementation and evaluation of external methods against ground-truth labels rather than any self-definition or self-citation reduction. This is the most common honest finding for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; no mathematical axioms, free parameters, or invented entities are introduced or required.

Reference graph

Works this paper leans on

[1] Arnold, M., Abnet, C.C., Neale, R.E., Vignat, J., Giovannucci, E.L., McGlynn, K.A., Bray, F.: Global burden of 5 major types of gastrointestinal cancer. Gastroenterology (2020). https://doi.org/10.1053/j.gastro.2020.02.068

[2] Azaria, A., Mitchell, T.M.: The internal state of an llm knows when it's lying. ArXiv abs/2304.13734 (2023). https://doi.org/10.18653/v1/2023.findings-emnlp.68

[3] Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large language models using semantic entropy. Nature 630(8017), 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0, https://www.nature.com/articles/s41586-024-07421-0

[4] Gu, C., Zhang, W., Huang, Z., et al.: Lens: Layers of evaluation of hallucination in genai systems. In: 2024 7th International Conference on Universal Village (UV). pp. 1–85 (2024). https://doi.org/10.1109/UV63228.2024.11189150

[5] Hardy, R., Kim, S.E., Ro, D.H., Rajpurkar, P.: Rextrust: A model for fine-grained hallucination detection in ai-generated radiology reports. In: Wu, J., Zhu, J., Xu, M., Jin, Y. (eds.) Proceedings of The First AAAI Bridge Program on AI for Medicine and Healthcare. Proceedings of Machine Learning Research, vol. 281, pp. 173–182. PMLR (25 Feb 2025), https:/...

[6] He, P., Gao, J., Chen, W.: Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2111.09543, arXiv:2111.09543

[7] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020), https://arxiv.org/abs/2003.10286

[8] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019), https://arxiv.org/abs/1901.07042

[9] Khanal, B., Pokhrel, S., Bhandari, S., et al.: Hallucination-aware multimodal benchmark for gastrointestinal image analysis with large vision-language models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 235–245. Springer (2025)

[10] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 180251 (2018). https://doi.org/10.1038/sdata.2018.251, https://www.nature.com/articles/sdata2018251

[11] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890 (2023), https://arxiv.org/abs/2306.00890

[12] Li, Q., Geng, J., Lyu, C., Zhu, D., Panov, M., Karray, F.: Reference-free hallucination detection for large vision-language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 4542–4551. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.262, https://aclanthology.org/2...

[13] Liao, Z., Hu, S., Zou, K., Fu, H., Zhen, L., Xia, Y.: Vision-amplified semantic entropy for hallucination detection in medical visual question answering. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. Lecture Notes in Computer Science, vol. 15964. Springer (2025). https://doi.org/10.1007/978-3-032-04971-1_63

[14] Liu, B., Zhan, L., Xu, L., Ma, L., Yang, Y., Wu, X.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–… IEEE, Nice, France (2021). https://doi.org/10.1109/ISBI48211.2021.9434010, https://ieeexplore.ieee.org/document/9434010

[15] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)

[16] Manakul, P., Liusie, A., Gales, M.: Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 9004–9017. Association for Computational Linguistics, Singapore (2023). https://doi.org/10.18653/v1/2023.emnlp-main.557...

[17] Mota, J., Almeida, M.J., Mendes, F., et al.: A comprehensive review of artificial intelligence and colon capsule endoscopy: Opportunities and challenges. Diagnostics 14 (2024). https://doi.org/10.3390/diagnostics14182072

[18] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Michalson, A.E., Moseley, M., Langlotz, C., Chaudhari, A.S., Delbrouck, J.: Green: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 374–390. Association for Computational Linguistics, Miami...

[19] Pogorelov, K., Randel, K.R., Griwodz, C., et al.: Kvasir: A multi-class image-dataset for computer aided gastrointestinal disease detection. In: Proceedings of the 8th ACM Multimedia Systems Conference. ACM, Taiwan (2017). https://doi.org/10.1145/3083187.3083212

[20] Sawczyn, A., Binkowski, J., Janiak, D., Gabrys, B., Kajdanowicz, T.: Factselfcheck: Fact-level black-box hallucination detection for llms (2025), https://arxiv.org/abs/2503.17229

[21] Sellergren, A., Kazemzadeh, S., Jaroensri, T., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025), https://arxiv.org/abs/2507.05201

[22] Shorinwa, O., Mei, Z., Lidard, J., Ren, A.Z., Majumdar, A.: A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. ACM Computing Surveys 57(4), 1–42 (2025). https://doi.org/10.1145/3744238, arXiv:2412.05563

[23] Xu, W., Chan, H.P., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[24] Zhang, S., Sambara, S., Banerjee, O., Acosta, J.N., Fahrner, L.J., Rajpurkar, P.: Radflag: A black-box hallucination detection method for medical vision language models. In: Proceedings of the 4th Machine Learning for Health Symposium. Proceedings of Machine Learning Research, vol. 259, pp. 1087–1103. PMLR (15–16 Dec 2025), https://proceedings.mlr.press/v...

[25] Zhang, Y., Xie, R., Sun, X., et al.: Dhcp: Detecting hallucinations by cross-modal attention pattern in large vision-language models. Proceedings of the 33rd ACM International Conference on Multimedia (2024)