Physician oversight reveals high error rates in LLM-generated labels for a clinical benchmark and demonstrates that corrected labels improve both evaluation accuracy and downstream model training.
72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice in last 5 months.
Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight