Can we trust AI doctors? A survey of medical hallucination in large language and large vision-language models
2 Pith papers cite this work, both from 2026.
citing papers explorer
- Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
Med-StepBench is the first large-scale step-wise hallucination benchmark for 3D oncological PET/CT that decomposes clinical reasoning into four stages and reveals systematic VLM failures hidden by aggregate metrics.
- Quantifying Hallucinations in Large Language Models on Medical Textbooks
LLMs hallucinate in 19.7% of textbook-grounded medical QA answers despite high plausibility scores, indicating they remain unfit for unsupervised clinical use.