LLM accuracy on medical questions drops from 71.1% to 38.0% when misleading context is injected, as measured on the new MedMisBench benchmark of 10,932 items.
Testing and evaluation of health care applications of large language models: A systematic review. JAMA, 2025
4 Pith papers cite this work.
Years: 2026 (4)
Verdicts: unverdicted (4)
Representative citing papers
Health AI benchmarks exhibit a validity gap, with only 42% referencing objective data (mostly wellness wearables), rare complex inputs like labs or imaging, and minimal coverage of vulnerable groups or chronic care.
A pre-response classifier predicts user rejection risk for clinical LLM outputs with AUROC 0.719 over 4.5 months of deployment data by incorporating deployment-specific context.
Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.
Opportunities and Risks of Generative AI through the Health Information Journey
Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.