arxiv: 2505.22919 · v3 · submitted 2025-05-28 · 💻 cs.CL

ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

Adam Rodman, Ahmed Alaa, Anu Ramachandran, Christopher J. Nash, David Bamman, Kathy T. LeSaint, Liam G. McCoy, Melanie Molina, Namrata Garg, Nikita Mehandru, Niloufar Golchini, Travis Zack

keywords: clinical, reasoning, er-reason, emergency, benchmarks, llms, tasks, workflow

Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department workflow: triage intake, treatment selection, disposition planning, and final diagnosis. Crucially, evaluation in ER-Reason extends beyond diagnostic accuracy to include stepwise Script Concordance Test (SCT)-style questions grounded in real patient cases, which assess whether LLMs update their diagnostic beliefs in the correct direction and magnitude as clinical evidence accumulates, scored against 2,555 emergency physician annotations. We evaluate reasoning and non-reasoning LLMs on ER-Reason, and show that our tasks provide a more nuanced view of how LLM reasoning fails on real patient cases than existing benchmarks allow.
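
The abstract does not state how the SCT-style questions are scored, so the following is only a minimal sketch of the conventional aggregate scoring rule used for Script Concordance Tests, not the authors' method. It assumes a 5-point scale from -2 (hypothesis much less likely) to +2 (much more likely); the function names `sct_credit` and `direction_correct` and the sample panel are hypothetical.

```python
from collections import Counter

# Assumed 5-point SCT scale: -2 (much less likely) ... +2 (much more likely).
SCALE = (-2, -1, 0, 1, 2)


def sct_credit(panel_ratings, model_rating):
    """Aggregate SCT credit for one question.

    Credit = (panelists who chose the model's rating)
             / (panelists who chose the most popular rating).
    The modal answer earns 1.0; an answer no panelist chose earns 0.0.
    """
    if not panel_ratings:
        raise ValueError("panel_ratings must not be empty")
    if model_rating not in SCALE:
        raise ValueError(f"rating must be one of {SCALE}")
    counts = Counter(panel_ratings)
    return counts[model_rating] / max(counts.values())


def direction_correct(panel_ratings, model_rating):
    """True when the model's belief update has the same sign as the panel's modal update."""
    modal = Counter(panel_ratings).most_common(1)[0][0]

    def sign(x):
        return (x > 0) - (x < 0)

    return sign(model_rating) == sign(modal)


# Hypothetical panel of five physician ratings for a single question.
panel = [1, 1, 1, 2, 0]
print(sct_credit(panel, 1))          # modal answer, full credit
print(sct_credit(panel, 2))          # minority answer, partial credit
print(direction_correct(panel, 2))   # right direction, larger magnitude
print(direction_correct(panel, -1))  # wrong direction
```

Separating the two functions mirrors the distinction the abstract draws between the direction and the magnitude of a belief update: a model can move in the right direction and still lose credit for overshooting.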

This paper has not been read by Pith yet.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

    stat.ML 2026-05 unverdicted novelty 7.0

    CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

  2. Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

    cs.AI 2026-04 accept novelty 7.0

    The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

  3. CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

    cs.CL 2026-05 unverdicted novelty 6.0

    CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

  4. ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

    cs.AI 2026-04 unverdicted novelty 6.0

    ProMedical builds a 50k preference dataset with fine-grained rubrics and a multi-dimensional reward model that disentangles safety from proficiency, yielding 22.3% accuracy and 21.7% safety gains on Qwen3-8B via GRPO ...

  5. Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.

  6. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

    cs.CL 2025-08