MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines
2 Pith papers cite this work.
Citing papers by year: 2026 (2)
Representative citing papers
Health AI benchmarks exhibit a validity gap, with only 42% referencing objective data (mostly wellness wearables), rare complex inputs like labs or imaging, and minimal coverage of vulnerable groups or chronic care.
citing papers explorer
When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG
Large-scale evaluation shows retrieval-augmented generation yields only marginal and inconsistent gains (1-2 points) over no-retrieval baselines in biomedical QA, with model choice dominating retriever or corpus effects.