Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

Muzammil Behzad; Omar Alshahrani

arxiv: 2606.13211 · v1 · pith:BV4YLVL6new · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords hallucination · medical imaging · foundation models · AI regulation · FDA · taxonomy · mitigation · cross-modality

The pith

General-purpose foundation models outperform medical-specialized models on hallucination benchmarks because narrow fine-tuning introduces overfitting-induced confabulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper synthesizes peer-reviewed studies, benchmarks, and FDA guidance across five imaging modalities to unify hallucination taxonomies, compare model types, and map mitigations to regulatory requirements. It reports that general-purpose foundation models outperform models specialized through narrow domain fine-tuning on hallucination-specific benchmarks, where hallucinations include fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements. A sympathetic reader would care because such errors directly affect biopsy decisions, staging, and treatment planning. The review also notes that a high percentage of AI-generated flags still require radiologist correction, and that physics-informed constraints, Chain-of-Thought prompting, and human-in-the-loop checks address distinct failure modes and remain effective when combined.

Core claim

Three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. General-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and remain effective when combined. All findings map to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

What carries the argument

A cross-modality analytical framework that unifies three existing taxonomic approaches to analyze hallucination taxonomy, etiology, detection, and mitigation across the full imaging pipeline.

If this is right

Unified taxonomies from the three frameworks together cover the entire imaging pipeline where no single framework suffices.
General-purpose models should be considered as base architectures to avoid the confabulation introduced by narrow fine-tuning.
Mitigation strategies must be combined because each targets distinct failure modes such as anatomical fabrication or incorrect laterality.
Hallucination management is a continuous obligation under FDA lifecycle frameworks rather than a one-time pre-deployment check.
Human oversight remains essential, as a high percentage of AI-generated flags require expert correction before clinical use.
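
To make the "distinct failure modes" point concrete, here is a minimal sketch of a human-in-the-loop gate that runs independent checks on a structured finding and sends anything suspicious to a radiologist. It is not taken from the paper. The `Finding` schema, the `PLAUSIBLE_SIZE_MM` bounds, and the `min_confidence` threshold are all hypothetical placeholders, and the sketch covers only rule-based checks plus review routing, not physics-informed architectures or Chain-of-Thought prompting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    """One structured finding from an AI-generated report (hypothetical schema)."""
    structure: str            # e.g. "nodule"
    laterality: str           # "left", "right", or "n/a"
    size_mm: Optional[float]  # reported measurement, if any
    confidence: float         # model-reported confidence in [0, 1]


# Illustrative numbers only; real bounds would come from clinical references.
PLAUSIBLE_SIZE_MM = {"nodule": (1.0, 30.0)}


def check_laterality(finding: Finding, imaged_side: str) -> Optional[str]:
    """Flag a finding whose side contradicts the side recorded for the study."""
    if finding.laterality == "n/a" or imaged_side == "bilateral":
        return None
    return "laterality mismatch" if finding.laterality != imaged_side else None


def check_measurement(finding: Finding) -> Optional[str]:
    """Flag a measurement outside the placeholder plausibility range."""
    bounds = PLAUSIBLE_SIZE_MM.get(finding.structure)
    if finding.size_mm is None or bounds is None:
        return None
    low, high = bounds
    return None if low <= finding.size_mm <= high else "implausible measurement"


def route(finding: Finding, imaged_side: str,
          min_confidence: float = 0.9) -> Tuple[bool, List[str]]:
    """Return (needs_radiologist_review, reasons) for one finding."""
    checks = (check_laterality(finding, imaged_side), check_measurement(finding))
    reasons = [reason for reason in checks if reason]
    if finding.confidence < min_confidence:
        reasons.append("low confidence")
    return bool(reasons), reasons
```

Each check targets one failure mode, so a finding can be flagged for several reasons at once, and adding a new safeguard means adding one more function to the tuple in `route`.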

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding on fine-tuning may extend to other high-stakes domains where broad pretraining could reduce domain-specific confabulation better than narrow specialization.
Regulators may need to require disclosure of training scale and protocol details to prevent confounded comparisons in future model approvals.
Future benchmarks could isolate the effect of fine-tuning scale by holding data volume constant while varying domain specificity.
The reliance on human-in-the-loop suggests that regulatory pathways may need to define required levels of radiologist review rather than aiming for full autonomy.

Load-bearing premise

The benchmarks and studies synthesized are representative across modalities and comparisons between general-purpose and medical-specialized models are not confounded by differences in training data scale or evaluation protocols.

What would settle it

A controlled study that equalizes training data volume and evaluation protocols across model types and still finds medical-specialized models producing fewer hallucinations than general-purpose ones would falsify the central performance claim.
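
A rough sketch of how such a matched comparison might be tabulated follows. It assumes benchmark results are already grouped into strata that share a training-data volume and evaluation protocol. The record fields (`stratum`, `model_type`, `hallucinated`, `total`) are hypothetical names, no counts from the paper are used, and the pooled two-proportion z statistic is one simple choice among many.

```python
import math
from collections import defaultdict


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: both groups share one hallucination rate (pooled SE)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se if se > 0 else 0.0


def matched_comparison(records):
    """Compare model types only inside strata where both are present.

    records: iterable of dicts with hypothetical keys
      stratum, model_type ("general" or "specialized"), hallucinated, total
    Returns {stratum: {"rate_diff": specialized - general, "z": ...}}.
    """
    by_stratum = defaultdict(dict)
    for r in records:
        by_stratum[r["stratum"]][r["model_type"]] = (r["hallucinated"], r["total"])

    results = {}
    for stratum, arms in by_stratum.items():
        if "general" in arms and "specialized" in arms:
            gx, gn = arms["general"]
            sx, sn = arms["specialized"]
            results[stratum] = {
                "rate_diff": sx / sn - gx / gn,
                "z": two_proportion_z(sx, sn, gx, gn),
            }
    return results
```

Under this convention, a consistently negative `rate_diff` across matched strata (specialized models hallucinating less) would count against the central claim, while unmatched strata are simply left out of the comparison.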

Figures

Figures reproduced from arXiv: 2606.13211 by Muzammil Behzad, Omar Alshahrani.

[Figures 2–8 and 11: images only, captions not recovered]

Original abstract

AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities?, (2) how do medical-specialized foundation models hallucinate less than general-purpose ones?, and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and is effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a narrative synthesis that unifies hallucination taxonomies across modalities and maps them to FDA rules, but the claim that general-purpose models outperform specialized ones lacks controls for scale and protocol differences.

The main thing to know is that this paper organizes existing literature on hallucinations in medical imaging AI rather than presenting new experiments or derivations. It unifies three taxonomic frameworks to cover the imaging pipeline and links mitigation approaches to FDA lifecycle requirements.

It does a solid job pulling together taxonomies, benchmark datasets, and regulatory guidance across five modalities. The emphasis on combined strategies like Chain-of-Thought prompting and human oversight, plus the point that radiologist review remains necessary, aligns with what the cited studies appear to show.

The soft spot is the central comparison: general-purpose models are said to hallucinate less than medical-specialized ones due to overfitting from narrow fine-tuning. The synthesis does not report matched controls for model size, training data volume, or evaluation protocols, so the causal attribution stays under-supported. That matches the stress-test concern.

This is for people who need a consolidated overview of hallucination issues tied to regulatory compliance in medical AI. A reader focused on taxonomy building or FDA mapping would find it useful. It deserves peer review as a review article because the topic matters clinically and the organization of prior work is coherent, though the authors should add explicit discussion of synthesis limitations and potential confounders.

Referee Report

2 major / 2 minor

Summary. This manuscript is a structured narrative synthesis of peer-reviewed studies, benchmarks, and FDA guidance on hallucination (clinically plausible but factually incorrect outputs) in medical imaging AI across five modalities. It unifies three taxonomic frameworks to cover the imaging pipeline, addresses three questions on taxonomy unification, model performance differences, and mitigation strategies compatible with regulatory oversight, claims that general-purpose foundation models outperform medical-specialized ones on hallucination benchmarks (suggesting overfitting from narrow fine-tuning), notes the necessity of radiologist oversight, and maps findings to FDA Total Product Lifecycle and Predetermined Change Control Plan frameworks.

Significance. If the synthesized comparisons and mappings hold after addressing confounders, the work would offer a useful cross-modality taxonomy unification and regulatory-aligned mitigation overview for a high-stakes clinical failure mode, potentially informing safer AI deployment in imaging.

major comments (2)

[Abstract] Abstract, research question (2): The question presupposes that 'medical-specialized foundation models hallucinate less than general-purpose ones,' yet the highlighted finding states the opposite (general-purpose models outperform specialized ones, implying specialized models hallucinate more due to overfitting). This internal contradiction between the framing question and the central claim requires explicit resolution.
[Abstract] Abstract, paragraph on model comparison: The claim that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks (with attribution to overfitting-induced confabulation) is stated without any quantitative synthesis, tables of results, error bars, or discussion of controls for confounders such as model scale, training data volume, or non-identical evaluation protocols. As a narrative synthesis, the manuscript does not demonstrate that the reviewed studies isolate domain specialization as the causal factor.
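
As an illustration of the kind of error bar this comment asks for, a hallucination rate reported as `errors` out of `total` cases can carry a Wilson score interval. The sketch below is a generic implementation of that standard formula, not something drawn from the manuscript. The function name and the default `z = 1.96` (roughly a 95% interval) are choices made here.

```python
import math


def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for a binomial proportion errors / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, 10 hallucinated outputs in 100 cases gives an interval of roughly 0.055 to 0.174, which makes clear how little a single small benchmark can say about a difference of a few percentage points.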

minor comments (2)

[Abstract] Abstract: 'a very high percentage of of AI-generated flags' contains a duplicated word.
[Abstract] Abstract, penultimate sentence: 'Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and is effective when combined' has a subject-verb agreement error ('is' should be 'are').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight an important inconsistency in the abstract phrasing and raise valid points about the evidentiary basis for the model comparison claim in a narrative synthesis. We address each below and commit to targeted revisions.

Referee: [Abstract] Abstract, research question (2): The question presupposes that 'medical-specialized foundation models hallucinate less than general-purpose ones,' yet the highlighted finding states the opposite (general-purpose models outperform specialized ones, implying specialized models hallucinate more due to overfitting). This internal contradiction between the framing question and the central claim requires explicit resolution.

Authors: We agree that the current wording of research question (2) creates an unintended presupposition that does not match the manuscript's central finding. The question was drafted to explore comparative hallucination behavior but was phrased in a way that assumes the direction of the effect. We will revise it to: '(2) how do hallucination rates in medical-specialized foundation models compare to those in general-purpose ones?' This removes the presupposition and aligns the framing directly with the reported observation that general-purpose models outperform specialized ones. The revised question will be reflected consistently in the abstract and introduction.

revision: yes

Referee: [Abstract] Abstract, paragraph on model comparison: The claim that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks (with attribution to overfitting-induced confabulation) is stated without any quantitative synthesis, tables of results, error bars, or discussion of controls for confounders such as model scale, training data volume, or non-identical evaluation protocols. As a narrative synthesis, the manuscript does not demonstrate that the reviewed studies isolate domain specialization as the causal factor.

Authors: As a structured narrative synthesis, the manuscript aggregates and interprets patterns reported in the existing peer-reviewed literature rather than conducting a new quantitative meta-analysis with statistical controls. We therefore cannot claim that the reviewed studies isolate domain specialization as the sole causal factor, and we acknowledge that confounders such as model scale, training data volume, and heterogeneous evaluation protocols limit causal inference. To address this, we will add an explicit limitations subsection discussing these confounders and the interpretive boundaries of narrative synthesis. We will also insert a summary table collating the key studies, their reported hallucination metrics, and noted methodological differences to increase transparency. These additions strengthen the presentation without altering the manuscript's primary scope as a cross-modality taxonomy and regulatory mapping exercise.

revision: partial

Circularity Check

0 steps flagged

No circularity: synthesis rests on external peer-reviewed studies and FDA guidance

full rationale

The paper is a structured narrative synthesis referencing external peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across modalities. The central claim that general-purpose foundation models outperform medical-specialized models on hallucination benchmarks is attributed to synthesized external sources rather than any derivation, equation, or fitted parameter internal to the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The work is self-contained against external benchmarks and does not reduce its conclusions to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature synthesis paper. No mathematical derivations, new empirical fits, or postulated entities are introduced in the abstract.

Reference graph

Works this paper leans on

45 extracted references · 8 canonical work pages

[1]

On hallucinations in AI-generated content for nuclear medicine imaging (the DREAM Report),

M. Xia, R. Bayerlein, Y . Chemli et al., “On hallucinations in AI-generated content for nuclear medicine imaging (the DREAM Report),”J. Nucl. Med., published online Nov. 6, 2025, PMCID: PMC11844622

2025
[2]

Medical hallucination in foundation models and their impact on healthcare

Y . Kim, H. Jeong, S. Chen et al., “Medical hallucination in foundation models and their impact on healthcare,”medRxiv2025.02.28.25323115 (also arXiv 2503.05777). DOI: 10.1101/2025.02.28.25323115

work page doi:10.1101/2025.02.28.25323115 2025
[3]

A taxonomy of machine hallucina- tion in radiology,

F. J. Brooks and M. A. Anastasio, “A taxonomy of machine hallucina- tion in radiology,”Radiology: Artif. Intell., DOI: 10.1148/ryai.250203

work page doi:10.1148/ryai.250203
[4]

Detecting and evaluating medical hallucinations in large vision language models,

J. Chen, D. Yang, T. Wu et al., “Detecting and evaluating medical hallucinations in large vision language models,” arXiv 2406.10185, 2024

Pith/arXiv arXiv 2024
[5]

J., Madotto, A., and Fung, P

Z. Ji, N. Lee, R. Frieske et al., “Survey of hallucination in natural language generation,”ACM Comput. Surv., vol. 55, no. 12, Art. no. 248, 2023. DOI: 10.1145/3571730

work page doi:10.1145/3571730 2023
[6]

On hallucinations in tomographic image reconstruction,

S. Bhadra, V . A. Kelkar, F. J. Brooks, and M. A. Anastasio, “On hallucinations in tomographic image reconstruction,”IEEE Trans. Med. Imag., vol. 40, no. 11, pp. 3249–3260, Nov. 2021. DOI: 10.1109/TMI.2021.3089456

work page doi:10.1109/tmi.2021.3089456 2021
[7]

A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions,

L. Huang, W. Yu, W. Ma et al., “A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions,”ACM Trans. Inf. Syst., 2024, arXiv 2311.05232

Pith/arXiv arXiv 2024
[8]

Deep learning-based algorithm for detection and characterization of COVID-19 pneumonia in chest X-rays and CT images,

Y . Sim, M. J. Chung, E. Kotter et al., “Deep learning-based algorithm for detection and characterization of COVID-19 pneumonia in chest X-rays and CT images,”JAMA Netw. Open, vol. 4, no. 12, Art. no. e2141096, 2021

2021
[9]

Radiologist diagnostic sensitivity with versus without AI assistance for chest radiograph findings,

J. S. Ahn, S. Ebrahimian, S. McDermott et al., “Radiologist diagnostic sensitivity with versus without AI assistance for chest radiograph findings,”JAMA Netw. Open, vol. 5, no. 8, Art. no. e2229289, 2022

2022
[10]

AI double-reading triage across 25,104 consecutive chest radiographs,

L. Topff, R. van der Sluijs, H. Laue et al., “AI double-reading triage across 25,104 consecutive chest radiographs,”Eur. Radiol., vol. 34, no. 9, pp. 5876–5885, 2024, PMC11364654

2024
[11]

AI improves nodule detection on chest radiographs in a health screening population: a randomized controlled trial,

J. G. Nam, E. J. Hwang, J. Kim et al., “AI improves nodule detection on chest radiographs in a health screening population: a randomized controlled trial,”Radiology, vol. 307, no. 2, Art. no. e221894, 2023. DOI: 10.1148/radiol.221894

work page doi:10.1148/radiol.221894 2023
[12]

A physics-informed deep learning model for MRI brain motion correction,

M. Safari, X. Yang, Z. Eidex et al., “A physics-informed deep learning model for MRI brain motion correction,” PMCID PMC11844622, arXiv 2502.09296, 2025

arXiv 2025
[13]

Hallucination of multimodal large language models: a survey,

Z. Bai, P. Wang, T. Xiao et al., “Hallucination of multimodal large language models: a survey,” arXiv 2404.18930, 2024

Pith/arXiv arXiv 2024
[14]

Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of cervical spine fractures,

A. F. V oter, M. E. Larson, J. W. Garrett, and J. P. Yu, “Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of cervical spine fractures,”Amer. J. Neuroradiol., vol. 42, no. 7, pp. 1298–1305, 2021. DOI: 10.3174/ajnr.A7179

work page doi:10.3174/ajnr.a7179 2021
[15]

Do as AI say: susceptibility in deployment of clinical decision-aids,

S. Gaube, H. Suresh, M. Raue et al., “Do as AI say: susceptibility in deployment of clinical decision-aids,”NPJ Digit. Med., vol. 4, Art. no. 31, 2021. DOI: 10.1038/s41746-021-00385-9

work page doi:10.1038/s41746-021-00385-9 2021
[16]

FDA perspective on the regulation of artificial intelligence in health care and biomedicine,

H. J. Warraich, T. Tazbaz, and R. M. Califf, “FDA perspective on the regulation of artificial intelligence in health care and biomedicine,” JAMA, 2024. DOI: 10.1001/jama.2024.21451

work page doi:10.1001/jama.2024.21451 2024
[17]

Total product lifecycle con- siderations for generative AI-enabled devices,

U.S. Food and Drug Administration, “Total product lifecycle con- siderations for generative AI-enabled devices,” Executive Summary for the Digital Health Advisory Committee, Nov. 20, 2024. [Online]. Available: https://www.fda.gov/media/184273/download

2024
[18]

Artificial intelligence-enabled device software functions: lifecycle management and marketing sub- mission recommendations,

U.S. Food and Drug Administration, “Artificial intelligence-enabled device software functions: lifecycle management and marketing sub- mission recommendations,” Draft Guidance, Jan. 7, 2025. Docket FDA-2025-D-0070. VOLUME , 11 Omar Alshahrani and Muzammil Behzad: Hallucination in Medical Imaging AI

2025
[19]

Predetermined change control plans (PCCP) for machine learning-enabled devices: final guidance,

U.S. Food and Drug Administration, “Predetermined change control plans (PCCP) for machine learning-enabled devices: final guidance,” Dec. 2024

2024
[20]

A survey on deep learning in medical image analysis,

G. Litjens, T. Kooi, B. E. Bejnordi et al., “A survey on deep learning in medical image analysis,”Med. Image Anal., vol. 42, pp. 60–88, 2017

2017
[21]

AI in health and medicine,

P. Rajpurkar, E. Chen, O. Banerjee, and E. J. Topol, “AI in health and medicine,”Nat. Med., vol. 28, pp. 31–38, 2022

2022
[22]

Predicting the future—big data, machine learning, and clinical medicine,

Z. Obermeyer and E. J. Emanuel, “Predicting the future—big data, machine learning, and clinical medicine,”N. Engl. J. Med., vol. 375, pp. 1216–1219, 2016

2016
[23]

High-performance medicine: the convergence of human and artificial intelligence,

E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,”Nat. Med., vol. 25, pp. 44–56, 2019

2019
[24]

On the interpretability of artificial intelligence in radiology: challenges and opportunities,

M. Reyes, R. Meier, S. Pereira et al., “On the interpretability of artificial intelligence in radiology: challenges and opportunities,”Ra- diology: Artif. Intell., vol. 2, Art. no. e190043, 2020

2020
[25]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,

C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,”Nat. Mach. Intell., vol. 1, pp. 206–215, 2019

2019
[26]

Implementing machine learning in health care—addressing ethical challenges,

D. S. Char, N. H. Shah, and D. Magnus, “Implementing machine learning in health care—addressing ethical challenges,”N. Engl. J. Med., vol. 378, pp. 981–983, 2018

2018
[27]

Key challenges for delivering clinical impact with artificial intelligence,

C. J. Kelly, A. Karthikesalingam, M. Suleyman et al., “Key challenges for delivering clinical impact with artificial intelligence,”BMC Med., vol. 17, Art. no. 195, 2019

2019
[28]

Deep learning in medical image analysis,

D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,”Annu. Rev. Biomed. Eng., vol. 19, pp. 221–248, 2017

2017
[29]

AI in medical imaging informatics: current challenges and future directions,

A. S. Panayides, A. Amini, N. D. Filipovic et al., “AI in medical imaging informatics: current challenges and future directions,”IEEE J. Biomed. Health Informat., vol. 24, no. 7, pp. 1837–1857, 2020

2020
[30]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” inAdv. Neural Inf. Process. Syst., vol. 30, 2017

2017
[31]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016

2016
[32]

Deep learning,

Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, pp. 436–444, 2015

2015
[33]

Generative adver- sarial networks,

I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adver- sarial networks,”Commun. ACM, vol. 63, pp. 139–144, 2020

2020
[34]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv 1409.1556, 2014

Pith/arXiv arXiv 2014
[35]

U-Net: convolutional net- works for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional net- works for biomedical image segmentation,” inProc. MICCAI, 2015

2015
[36]

An image is worth 16×16words: transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov et al., “An image is worth 16×16words: transformers for image recognition at scale,” arXiv 2010.11929, 2020

Pith/arXiv arXiv 2010
[37]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder et al., “Language models are few-shot learners,” inAdv. Neural Inf. Process. Syst., vol. 33, pp. 1877–1901, 2020

1901
[38]

Learning transferable visual models from natural language supervision (CLIP),

A. Radford, J. W. Kim, C. Hallacy et al., “Learning transferable visual models from natural language supervision (CLIP),” inProc. ICML, 2021

2021
[39]

On the opportunities and risks of foundation models,

R. Bommasani, D. A. Hudson et al., “On the opportunities and risks of foundation models,” arXiv 2108.07258, 2021

Pith/arXiv arXiv 2021
[40]

Review of deep learning algorithms and architectures,

A. Shrestha and A. Mahmood, “Review of deep learning algorithms and architectures,”IEEE Access, vol. 7, pp. 53040–53065, 2019

2019
[41]

A guide to deep learning in healthcare,

A. Esteva, A. Robicquet, B. Ramsundar et al., “A guide to deep learning in healthcare,”Nat. Med., vol. 25, pp. 24–29, 2019

2019
[42]

MRI motion artifact detection and correction using AI: systematic review and meta-analysis,

K. Pawar, J. Fripp, and J. Dowling, “MRI motion artifact detection and correction using AI: systematic review and meta-analysis,” PMC, 2025

2025
[43]

Uncertainty quantification for machine learning in healthcare: a survey,

M. Abdar, F. Pourpanah, S. Hussain et al., “Uncertainty quantification for machine learning in healthcare: a survey,”Neurocomputing, vol. 461, pp. 243–268, 2021

2021
[44]

Fusion of medical imaging and electronic health records using deep learning: systematic review and implementation guidelines,

S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, and M. P. Lungren, “Fusion of medical imaging and electronic health records using deep learning: systematic review and implementation guidelines,”NPJ Digit. Med., vol. 3, Art. no. 136, 2020

2020
[45]

MedHallBench: a benchmark for hallucination detection in med- ical VLMs with reinforcement learning-assisted annotation,

“MedHallBench: a benchmark for hallucination detection in med- ical VLMs with reinforcement learning-assisted annotation,” arXiv 2412.18947, 2024. 12 VOLUME ,

arXiv 2024

[1] [1]

On hallucinations in AI-generated content for nuclear medicine imaging (the DREAM Report),

M. Xia, R. Bayerlein, Y . Chemli et al., “On hallucinations in AI-generated content for nuclear medicine imaging (the DREAM Report),”J. Nucl. Med., published online Nov. 6, 2025, PMCID: PMC11844622

2025

[2] [2]

Medical hallucination in foundation models and their impact on healthcare

Y . Kim, H. Jeong, S. Chen et al., “Medical hallucination in foundation models and their impact on healthcare,”medRxiv2025.02.28.25323115 (also arXiv 2503.05777). DOI: 10.1101/2025.02.28.25323115

work page doi:10.1101/2025.02.28.25323115 2025

[3] [3]

A taxonomy of machine hallucina- tion in radiology,

F. J. Brooks and M. A. Anastasio, “A taxonomy of machine hallucina- tion in radiology,”Radiology: Artif. Intell., DOI: 10.1148/ryai.250203

work page doi:10.1148/ryai.250203

[4] [4]

Detecting and evaluating medical hallucinations in large vision language models,

J. Chen, D. Yang, T. Wu et al., “Detecting and evaluating medical hallucinations in large vision language models,” arXiv 2406.10185, 2024

Pith/arXiv arXiv 2024

[5] [5]

J., Madotto, A., and Fung, P

Z. Ji, N. Lee, R. Frieske et al., “Survey of hallucination in natural language generation,”ACM Comput. Surv., vol. 55, no. 12, Art. no. 248, 2023. DOI: 10.1145/3571730

work page doi:10.1145/3571730 2023

[6] [6]

On hallucinations in tomographic image reconstruction,

S. Bhadra, V . A. Kelkar, F. J. Brooks, and M. A. Anastasio, “On hallucinations in tomographic image reconstruction,”IEEE Trans. Med. Imag., vol. 40, no. 11, pp. 3249–3260, Nov. 2021. DOI: 10.1109/TMI.2021.3089456

work page doi:10.1109/tmi.2021.3089456 2021

[7] [7]

A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions,

L. Huang, W. Yu, W. Ma et al., “A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions,”ACM Trans. Inf. Syst., 2024, arXiv 2311.05232

Pith/arXiv arXiv 2024

[8] [8]

Deep learning-based algorithm for detection and characterization of COVID-19 pneumonia in chest X-rays and CT images,

Y . Sim, M. J. Chung, E. Kotter et al., “Deep learning-based algorithm for detection and characterization of COVID-19 pneumonia in chest X-rays and CT images,”JAMA Netw. Open, vol. 4, no. 12, Art. no. e2141096, 2021

2021

[9] [9]

Radiologist diagnostic sensitivity with versus without AI assistance for chest radiograph findings,

J. S. Ahn, S. Ebrahimian, S. McDermott et al., “Radiologist diagnostic sensitivity with versus without AI assistance for chest radiograph findings,”JAMA Netw. Open, vol. 5, no. 8, Art. no. e2229289, 2022

2022

[10] [10]

AI double-reading triage across 25,104 consecutive chest radiographs,

L. Topff, R. van der Sluijs, H. Laue et al., “AI double-reading triage across 25,104 consecutive chest radiographs,”Eur. Radiol., vol. 34, no. 9, pp. 5876–5885, 2024, PMC11364654

2024

[11] [11]

AI improves nodule detection on chest radiographs in a health screening population: a randomized controlled trial,

J. G. Nam, E. J. Hwang, J. Kim et al., “AI improves nodule detection on chest radiographs in a health screening population: a randomized controlled trial,”Radiology, vol. 307, no. 2, Art. no. e221894, 2023. DOI: 10.1148/radiol.221894

work page doi:10.1148/radiol.221894 2023

[12] [12]

A physics-informed deep learning model for MRI brain motion correction,

M. Safari, X. Yang, Z. Eidex et al., “A physics-informed deep learning model for MRI brain motion correction,” PMCID PMC11844622, arXiv 2502.09296, 2025

arXiv 2025

[13] [13]

Hallucination of multimodal large language models: a survey,

Z. Bai, P. Wang, T. Xiao et al., “Hallucination of multimodal large language models: a survey,” arXiv 2404.18930, 2024

Pith/arXiv arXiv 2024

[14] [14]

Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of cervical spine fractures,

A. F. V oter, M. E. Larson, J. W. Garrett, and J. P. Yu, “Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of cervical spine fractures,”Amer. J. Neuroradiol., vol. 42, no. 7, pp. 1298–1305, 2021. DOI: 10.3174/ajnr.A7179

work page doi:10.3174/ajnr.a7179 2021

[15] [15]

Do as AI say: susceptibility in deployment of clinical decision-aids,

S. Gaube, H. Suresh, M. Raue et al., “Do as AI say: susceptibility in deployment of clinical decision-aids,” NPJ Digit. Med., vol. 4, Art. no. 31, 2021. DOI: 10.1038/s41746-021-00385-9

work page doi:10.1038/s41746-021-00385-9 2021

[16] [16]

FDA perspective on the regulation of artificial intelligence in health care and biomedicine,

H. J. Warraich, T. Tazbaz, and R. M. Califf, “FDA perspective on the regulation of artificial intelligence in health care and biomedicine,” JAMA, 2024. DOI: 10.1001/jama.2024.21451

work page doi:10.1001/jama.2024.21451 2024

[17] [17]

Total product lifecycle considerations for generative AI-enabled devices,

U.S. Food and Drug Administration, “Total product lifecycle considerations for generative AI-enabled devices,” Executive Summary for the Digital Health Advisory Committee, Nov. 20, 2024. [Online]. Available: https://www.fda.gov/media/184273/download

2024

[18] [18]

Artificial intelligence-enabled device software functions: lifecycle management and marketing submission recommendations,

U.S. Food and Drug Administration, “Artificial intelligence-enabled device software functions: lifecycle management and marketing submission recommendations,” Draft Guidance, Jan. 7, 2025. Docket FDA-2025-D-0070

2025

[19] [19]

Predetermined change control plans (PCCP) for machine learning-enabled devices: final guidance,

U.S. Food and Drug Administration, “Predetermined change control plans (PCCP) for machine learning-enabled devices: final guidance,” Dec. 2024

2024

[20] [20]

A survey on deep learning in medical image analysis,

G. Litjens, T. Kooi, B. E. Bejnordi et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, pp. 60–88, 2017

2017

[21] [21]

AI in health and medicine,

P. Rajpurkar, E. Chen, O. Banerjee, and E. J. Topol, “AI in health and medicine,” Nat. Med., vol. 28, pp. 31–38, 2022

2022

[22] [22]

Predicting the future—big data, machine learning, and clinical medicine,

Z. Obermeyer and E. J. Emanuel, “Predicting the future—big data, machine learning, and clinical medicine,” N. Engl. J. Med., vol. 375, pp. 1216–1219, 2016

2016

[23] [23]

High-performance medicine: the convergence of human and artificial intelligence,

E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nat. Med., vol. 25, pp. 44–56, 2019

2019

[24] [24]

On the interpretability of artificial intelligence in radiology: challenges and opportunities,

M. Reyes, R. Meier, S. Pereira et al., “On the interpretability of artificial intelligence in radiology: challenges and opportunities,” Radiology: Artif. Intell., vol. 2, Art. no. e190043, 2020

2020

[25] [25]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,

C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nat. Mach. Intell., vol. 1, pp. 206–215, 2019

2019

[26] [26]

Implementing machine learning in health care—addressing ethical challenges,

D. S. Char, N. H. Shah, and D. Magnus, “Implementing machine learning in health care—addressing ethical challenges,” N. Engl. J. Med., vol. 378, pp. 981–983, 2018

2018

[27] [27]

Key challenges for delivering clinical impact with artificial intelligence,

C. J. Kelly, A. Karthikesalingam, M. Suleyman et al., “Key challenges for delivering clinical impact with artificial intelligence,” BMC Med., vol. 17, Art. no. 195, 2019

2019

[28] [28]

Deep learning in medical image analysis,

D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annu. Rev. Biomed. Eng., vol. 19, pp. 221–248, 2017

2017

[29] [29]

AI in medical imaging informatics: current challenges and future directions,

A. S. Panayides, A. Amini, N. D. Filipovic et al., “AI in medical imaging informatics: current challenges and future directions,” IEEE J. Biomed. Health Informat., vol. 24, no. 7, pp. 1837–1857, 2020

2020

[30] [30]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” in Adv. Neural Inf. Process. Syst., vol. 30, 2017

2017

[31] [31]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016

2016

[32] [32]

Deep learning,

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015

2015

[33] [33]

Generative adversarial networks,

I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial networks,” Commun. ACM, vol. 63, pp. 139–144, 2020

2020

[34] [34]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv 1409.1556, 2014

Pith/arXiv arXiv 2014

[35] [35]

U-Net: convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015

2015

[36] [36]

An image is worth 16×16 words: transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov et al., “An image is worth 16×16 words: transformers for image recognition at scale,” arXiv 2010.11929, 2020

Pith/arXiv arXiv 2020

[37] [37]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder et al., “Language models are few-shot learners,” in Adv. Neural Inf. Process. Syst., vol. 33, pp. 1877–1901, 2020

2020

[38] [38]

Learning transferable visual models from natural language supervision (CLIP),

A. Radford, J. W. Kim, C. Hallacy et al., “Learning transferable visual models from natural language supervision (CLIP),” in Proc. ICML, 2021

2021

[39] [39]

On the opportunities and risks of foundation models,

R. Bommasani, D. A. Hudson et al., “On the opportunities and risks of foundation models,” arXiv 2108.07258, 2021

Pith/arXiv arXiv 2021

[40] [40]

Review of deep learning algorithms and architectures,

A. Shrestha and A. Mahmood, “Review of deep learning algorithms and architectures,” IEEE Access, vol. 7, pp. 53040–53065, 2019

2019

[41] [41]

A guide to deep learning in healthcare,

A. Esteva, A. Robicquet, B. Ramsundar et al., “A guide to deep learning in healthcare,” Nat. Med., vol. 25, pp. 24–29, 2019

2019

[42] [42]

MRI motion artifact detection and correction using AI: systematic review and meta-analysis,

K. Pawar, J. Fripp, and J. Dowling, “MRI motion artifact detection and correction using AI: systematic review and meta-analysis,” PMC, 2025

2025

[43] [43]

Uncertainty quantification for machine learning in healthcare: a survey,

M. Abdar, F. Pourpanah, S. Hussain et al., “Uncertainty quantification for machine learning in healthcare: a survey,” Neurocomputing, vol. 461, pp. 243–268, 2021

2021

[44] [44]

Fusion of medical imaging and electronic health records using deep learning: systematic review and implementation guidelines,

S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, and M. P. Lungren, “Fusion of medical imaging and electronic health records using deep learning: systematic review and implementation guidelines,” NPJ Digit. Med., vol. 3, Art. no. 136, 2020

2020

[45] [45]

MedHallBench: a benchmark for hallucination detection in medical VLMs with reinforcement learning-assisted annotation,

“MedHallBench: a benchmark for hallucination detection in medical VLMs with reinforcement learning-assisted annotation,” arXiv 2412.18947, 2024

arXiv 2024