EuropeMedQA is presented as the first comprehensive multilingual and multimodal medical examination dataset drawn from official regulatory exams in four European countries.
Denniston, Melanie J
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 4
verdicts
UNVERDICTED: 4
representative citing papers
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
In a blinded study, an LLM-based agent generated higher-rated responses than clinicians for explaining CGM data in diabetes counseling, with similar safety flags.
Registry analysis shows marked growth in AI-related clinical trials led by China and the US, with moderate human-AI agreement on interaction classification in a 100-record sample.
citing papers explorer
- EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
  EuropeMedQA is presented as the first comprehensive multilingual and multimodal medical examination dataset drawn from official regulatory exams in four European countries.
- The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
  AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
- Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
  In a blinded study, an LLM-based agent generated higher-rated responses than clinicians for explaining CGM data in diabetes counseling, with similar safety flags.
- Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration
  Registry analysis shows marked growth in AI-related clinical trials led by China and the US, with moderate human-AI agreement on interaction classification in a 100-record sample.