Testing and evaluation of health care applications of large language models: A systematic review. JAMA, 2025

Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R Chaurasia, Nirav R Shah, Karandeep Singh, Troy · 2025 · arXiv 2024.21700

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

LLM accuracy on medical questions drops from 71.1% to 38.0% when misleading context is injected, as measured on the new MedMisBench benchmark of 10,932 items.

The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

cs.AI · 2026-03-18 · unverdicted · novelty 6.0

Health AI benchmarks exhibit a validity gap: only 42% reference objective data (mostly from wellness wearables), complex inputs such as labs or imaging are rare, and coverage of vulnerable groups and chronic care is minimal.

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

cs.AI · 2026-06-10 · unverdicted · novelty 5.0

A pre-response classifier predicts user rejection risk for clinical LLM outputs with AUROC 0.719 over 4.5 months of deployment data by incorporating deployment-specific context.

Opportunities and Risks of Generative AI through the Health Information Journey

cs.CY · 2026-05-21 · unverdicted · novelty 4.0

Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.

citing papers explorer

Opportunities and Risks of Generative AI through the Health Information Journey

cs.CY · 2026-05-21 · unverdicted · none · ref 116

Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.
