pith. machine review for the scientific record.

arxiv: 2605.11398 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:27 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords clinical acuity · language model evaluation · medical triage · uncertainty calibration · health AI safety · benchmark dataset

The pith

No tested language model matches the spread of physicians' urgency judgments on ambiguous medical cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AcuityBench as a new evaluation tool that merges five public medical datasets into one shared four-level scale of care urgency, from home care to immediate emergency. It tests models on both direct multiple-choice classification and open-ended conversational replies, using physician consensus for clear cases and physician-labeled ambiguous cases for uncertainty checks. Results indicate wide differences in how models handle straightforward cases, a consistent shift toward under-triage when models respond conversationally, and a clear mismatch where model outputs cluster more tightly than the varied judgments physicians actually give. A reader should care because these gaps could translate into models steering users toward the wrong level of care in real health queries. The work frames acuity identification itself as a separate, safety-relevant skill that current models have not yet mastered.

Core claim

AcuityBench shows substantial variation in clear-case accuracy across 12 frontier models; a systematic tradeoff in which free-form responses cut over-triage but raise under-triage, especially on higher-acuity items; and, on the 217 physician-confirmed ambiguous cases, that no model's prediction distribution approaches the spread of physician judgments, with model outputs remaining more concentrated than expert clinical uncertainty.

What carries the argument

AcuityBench, the harmonized collection of 914 cases under a shared four-level acuity framework that supports both explicit classification and rubric-anchored free-form evaluation.
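As an illustration only, a minimal sketch of what one harmonized case record might look like; the class and field names (AcuityCase, source_dataset, physician_labels) are assumptions made here for clarity, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional

class Acuity(IntEnum):
    # The four-level framework described in the paper, lowest to highest urgency.
    HOME_MONITORING = 1
    SEMIURGENT = 2
    URGENT_OUTPATIENT = 3
    EMERGENCY = 4

@dataclass
class AcuityCase:
    """One harmonized benchmark case (field names are illustrative assumptions)."""
    case_id: str
    source_dataset: str                  # one of the five public sources
    presentation: str                    # user-facing medical text
    consensus_label: Optional[Acuity]    # set for the 697 clear-consensus cases
    physician_labels: list[Acuity] = field(default_factory=list)  # per-rater labels on ambiguous cases

    @property
    def is_ambiguous(self) -> bool:
        # Ambiguous cases carry multiple, disagreeing physician labels rather than a single consensus.
        return self.consensus_label is None and len(set(self.physician_labels)) > 1
```

Under this reading, the same record feeds both task formats: the QA track asks for one of the four labels directly, while the conversational track lets the rubric judge map a free-form reply back onto the same scale.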

If this is right

  • Conversational response formats reduce over-triage errors relative to direct classification but increase under-triage, particularly on higher-acuity presentations.
  • Clear-case accuracy varies substantially across current proprietary and open-weight models.
  • Model predictions concentrate more than physician judgments on ambiguous cases, indicating poorer uncertainty calibration (see the sketch after this list).
  • Label disagreement on maximally ambiguous cases can be traced in part to clinical uncertainty when expert and model adjudications are compared.
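On the concentration point above: a minimal sketch of how "more concentrated than physician judgments" could be quantified, assuming per-case label sets from several physician raters and several model samples are available. The toy numbers are invented for illustration and are not results from the paper.

```python
import numpy as np

def label_distribution(labels, n_levels=4):
    """Normalized histogram over the four acuity levels (labels coded 1..4)."""
    counts = np.bincount(np.asarray(labels) - 1, minlength=n_levels)
    return counts / counts.sum()

def entropy(p, eps=1e-12):
    """Shannon entropy in bits; higher means the distribution is more spread out."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log2(p)).sum())

# Hypothetical ambiguous case: five physician raters split across two adjacent levels,
# while five sampled model answers all land on the same level.
physician = label_distribution([2, 3, 3, 2, 3])
model     = label_distribution([3, 3, 3, 3, 3])

print(entropy(physician))  # > 0: physicians spread their judgments
print(entropy(model))      # = 0: model mass concentrates on a single level
```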

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that explicitly reward distribution matching rather than single-label accuracy may be needed for better uncertainty alignment.
  • The benchmark format difference suggests that deployment choices between chat-style and structured interfaces carry measurable safety tradeoffs.
  • Extending the rubric evaluation to track how uncertainty is expressed in free text could reveal additional misalignment not captured by category counts alone.

Load-bearing premise

The four-level acuity scale can be applied uniformly and without serious distortion to all five source datasets while the rubric judge for open responses stays faithful to the same physician-defined categories.

What would settle it

A model whose predicted acuity distribution on the 217 ambiguous cases passes a statistical test for a close match to the physician distribution, such as a low KL divergence or an equivalent measure.
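A rough illustration of what such a check could look like, assuming per-case physician and model label distributions over the four acuity levels are available; the kl_divergence helper and the 0.1 closeness threshold are illustrative choices, not values taken from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two distributions over the four acuity levels."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def matches_physician_spread(physician_dists, model_dists, threshold=0.1):
    """Mean per-case KL(physician || model) across the ambiguous cases,
    compared against an (illustrative) closeness threshold."""
    kls = [kl_divergence(p, m) for p, m in zip(physician_dists, model_dists)]
    return float(np.mean(kls)), float(np.mean(kls)) < threshold
```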

Figures

Figures reproduced from arXiv:2605.11398. Authors: Amit Shembekar (3), Ashraf Hussain (3), Benjamin Hong (3), Bernard P. Chang (3), Dana L. Sacco (3), David Kessler (3), Di Coneybeare (3), Elizabeth Hartofilis (3), Erica Olsen (3), Eugene Y. Kim (3), Georgianna Lin (2), Janice Shin-Kim (3), Jason Chu (3), John K. Riggins Jr (3), Manish Garg (3), Miles Gordon (3), Mustafa N. Rasheed (3), Noémie Elhadad (1, …), Oluchi Iheagwara King (3), Osman R. Sayan (3), Robin Linzmayer (1, …), Ross McCormack (3), Trudi Cloyd (3), Vinay Saggar (3), Wendy W. Sun (3). Affiliations: (1) Department of Computer Science, (2) Department of Biomedical Informatics, (3) Department of Emergency Medicine; Columbia University and Columbia University Irving Medical Center.

Figure 1. Overview of AcuityBench construction and evaluation. Heterogeneous data sources were normalized into a four-level acuity framework, labeled through direct mapping or physician-panel annotation, and evaluated in QA and free-response conversation formats, yielding consensus and ambiguous subsets for downstream accuracy, uncertainty, and error analyses.
Figure 2. Error rate by true acuity level (QA format, clear consensus cases, mode of five samples).
Figure 3. QA vs. conversational exact-match accuracy, one point per model.
Figure 4. Prediction distribution on boundary-label cases by model (QA format, mode of five samples).
Figure 5. Per-model confusion matrices (QA format, clear consensus cases, mode of five samples).
original abstract

We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AcuityBench, a benchmark harmonizing five public datasets (user conversations, forum posts, clinical vignettes, patient portal messages) under a shared four-level acuity framework (home monitoring to immediate emergency care). It comprises 914 cases (697 consensus for accuracy evaluation, 217 physician-labeled ambiguous cases for uncertainty evaluation) and supports two formats: explicit four-way QA classification and free-form responses scored by a rubric-based judge. Across 12 frontier models, the work reports variation in clear-case accuracy and error patterns, a systematic tradeoff (conversational responses reduce over-triage but increase under-triage vs. QA), and that no model matches physician judgment distributions in ambiguous cases, with models producing more concentrated predictions than experts.

Significance. If the label harmonization proves robust, AcuityBench supplies a valuable unified, safety-critical benchmark for clinical acuity identification that spans real-world interaction styles. The emphasis on ambiguous cases and uncertainty alignment, together with the use of public datasets and independent physician labels, enables reproducible stress-testing of how models guide care-seeking decisions. This addresses a gap left by existing health QA and triage benchmarks.

major comments (2)
  1. [Methods (dataset harmonization and labeling)] The harmonization of the five heterogeneous source datasets into a shared four-level rubric (described in the abstract and Methods) reports no inter-rater reliability statistics, no cross-dataset label consistency checks, and no sensitivity analysis on how re-labeling ambiguous or edge cases would affect the reported distributions. This is load-bearing for the central claims that 'no model closely matches the distribution of physician judgments' and that 'model predictions are more concentrated than expert clinical uncertainty' in the 217 ambiguous cases. (An illustrative reliability computation is sketched after these comments.)
  2. [Evaluation methodology and results] The rubric-based judge for free-form responses is stated to be 'anchored to the same framework,' yet the manuscript provides no validation that this judge faithfully reproduces the four-level labels used for the consensus and ambiguous cases (e.g., agreement rates with physician labels on a held-out subset). Without this, the reported tradeoff between QA and conversational formats cannot be confidently attributed to model behavior rather than judge construction.
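On the first major comment's request for inter-rater reliability statistics: a minimal sketch of one standard statistic, Fleiss' kappa, assuming the per-case physician labels are available as a cases-by-levels count matrix with the same number of raters per case. The toy counts below are invented for illustration.

```python
import numpy as np

def fleiss_kappa(label_counts: np.ndarray) -> float:
    """Fleiss' kappa for a cases x categories matrix of label counts.

    label_counts[i, j] = number of physician raters who assigned case i
    to acuity level j. Assumes the same number of raters per case.
    """
    n_raters = label_counts.sum(axis=1)[0]
    # Per-case observed agreement.
    p_i = ((label_counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = label_counts.sum(axis=0) / label_counts.sum()
    p_e = (p_j ** 2).sum()
    return float((p_bar - p_e) / (1 - p_e))

# Hypothetical toy input: 3 cases, 4 acuity levels, 5 raters per case.
counts = np.array([
    [0, 5, 0, 0],   # unanimous
    [0, 2, 3, 0],   # split across adjacent levels
    [1, 2, 2, 0],   # three-way disagreement
])
print(fleiss_kappa(counts))
```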
minor comments (2)
  1. [Results] The table or figure presenting the 12 models and their accuracy/error breakdowns would benefit from explicit confidence intervals or statistical tests for the claimed 'substantial variation' and 'systematic tradeoff' (one such check is sketched after these comments).
  2. [Abstract] The abstract and introduction could more clearly distinguish the 697 consensus cases from the 217 ambiguous cases when stating overall findings, to avoid conflating clear-case accuracy with uncertainty alignment results.
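One way the requested uncertainty quantification could look: a minimal percentile-bootstrap sketch, assuming per-case 0/1 correctness indicators are available for each model. The two "models" below are random placeholders, not results from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_accuracy_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap CI for accuracy from a 0/1 per-case correctness vector."""
    n = len(correct)
    idx = rng.integers(0, n, size=(n_boot, n))
    accs = correct[idx].mean(axis=1)
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Illustrative per-case results for two hypothetical models on the same 697 consensus cases.
model_a = (rng.random(697) < 0.78).astype(float)
model_b = (rng.random(697) < 0.70).astype(float)
for name, res in [("model_a", model_a), ("model_b", model_b)]:
    acc, (lo, hi) = bootstrap_accuracy_ci(res)
    print(f"{name}: accuracy={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Non-overlapping intervals (or a paired test on per-case differences) would substantiate the "substantial variation" claim more directly than point estimates alone.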

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough and constructive review. We appreciate the focus on strengthening the methodological transparency around dataset harmonization and judge validation. Below we respond point-by-point to the major comments and describe the revisions we will make.

point-by-point responses
  1. Referee: [Methods (dataset harmonization and labeling)] The harmonization of the five heterogeneous source datasets into a shared four-level rubric (described in the abstract and Methods) reports no inter-rater reliability statistics, no cross-dataset label consistency checks, and no sensitivity analysis on how re-labeling ambiguous or edge cases would affect the reported distributions. This is load-bearing for the central claims that 'no model closely matches the distribution of physician judgments' and that 'model predictions are more concentrated than expert clinical uncertainty' in the 217 ambiguous cases.

    Authors: We agree that explicit reporting of inter-rater reliability, cross-dataset consistency, and sensitivity analyses would improve the manuscript. The 697 consensus cases were produced via multi-physician review requiring full agreement for inclusion, while the 217 ambiguous cases received independent physician labels to reflect clinical uncertainty. In the revision we will (1) detail the labeling protocol and report any available agreement metrics from the consensus process, (2) add per-source-dataset label distributions to demonstrate harmonization consistency, and (3) include a sensitivity analysis that varies the ambiguous-case inclusion threshold and re-computes the model-vs-physician distribution comparisons. These additions will directly buttress the claims about model concentration and mismatch with expert judgments. revision: yes

  2. Referee: [Evaluation methodology and results] The rubric-based judge for free-form responses is stated to be 'anchored to the same framework,' yet the manuscript provides no validation that this judge faithfully reproduces the four-level labels used for the consensus and ambiguous cases (e.g., agreement rates with physician labels on a held-out subset). Without this, the reported tradeoff between QA and conversational formats cannot be confidently attributed to model behavior rather than judge construction.

    Authors: We acknowledge that validation of the rubric-based judge against physician labels is necessary to confidently attribute format differences to model behavior. The judge rubric was constructed to mirror the identical four-level acuity framework used for the labeled cases. In the revised manuscript we will add a validation experiment: the judge will be applied to a held-out subset of consensus cases, and we will report agreement rates (and confusion matrices) with the original physician labels. This will quantify judge fidelity; if agreement is high, it supports the attribution of the QA-conversational tradeoff to model behavior. We will also discuss any residual limitations of the judge. revision: yes
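A minimal sketch of the validation the authors propose in the response above, assuming the rubric judge's label and the physician consensus label are both available for each held-out case; the function and label names here are illustrative, not the paper's implementation.

```python
import numpy as np

LEVELS = ["A", "B", "C", "D"]  # four acuity labels, home monitoring -> emergency

def judge_agreement(physician: list[str], judge: list[str]):
    """Exact-match agreement rate and 4x4 confusion matrix (rows: physician, cols: judge)."""
    idx = {lab: i for i, lab in enumerate(LEVELS)}
    cm = np.zeros((len(LEVELS), len(LEVELS)), dtype=int)
    for p, j in zip(physician, judge):
        cm[idx[p], idx[j]] += 1
    agreement = np.trace(cm) / cm.sum()
    return agreement, cm

# Hypothetical held-out consensus cases.
physician_labels = ["D", "C", "A", "B", "D", "C"]
judge_labels     = ["D", "C", "A", "C", "C", "C"]
agreement, cm = judge_agreement(physician_labels, judge_labels)
print(agreement)  # fraction of cases where the judge matches the physician label
print(cm)         # mass above the diagonal = judge over-triages; below = judge under-triages
```

High agreement on this held-out set would support attributing the QA-vs-conversational tradeoff to model behavior rather than to judge construction, as the rebuttal argues.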

Circularity Check

0 steps flagged

No circularity: empirical benchmark on external datasets with no derivations or self-referential fits

full rationale

The paper constructs AcuityBench by harmonizing five public external datasets (conversations, forum posts, vignettes, portal messages) under a shared four-level acuity rubric, then evaluates 12 models on 914 cases using accuracy metrics and distribution comparisons against physician labels. No equations, parameters, or derivations appear in the central claims; the results are direct empirical measurements on held-out data. The harmonization step is a preprocessing choice whose validity is external to any internal reduction, and the key findings (model over-concentration, format tradeoffs) follow from counting and comparing observed outputs rather than from any self-definition or fitted-input renaming. Self-citations, if present, are not load-bearing for the benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about data harmonization and expert labels; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Existing public datasets spanning conversations, forum posts, vignettes, and portal messages can be mapped to a shared four-level acuity framework without substantial loss of clinical meaning or introduction of systematic bias.
    Invoked to create the unified 914-case benchmark from five heterogeneous sources.
  • domain assumption Physician-confirmed labels on ambiguous cases constitute a reliable external reference distribution against which model uncertainty can be compared.
    Used for the 217-case uncertainty-aware evaluation track.

pith-pipeline@v0.9.0 · 5786 in / 1469 out tokens · 67969 ms · 2026-05-13T02:27:00.383309+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

  1. [1] OpenAI (2026) AI as a Healthcare Ally: How Americans are Navigating the System with ChatGPT. OpenAI, January 2026.

  2. [2] Choudhury, A. & Shamszare, H. (2023) Investigating the impact of user trust on the adoption and use of ChatGPT: Survey analysis. Journal of Medical Internet Research 25:e47184.

  3. [3] Shahsavar, Y. & Choudhury, A. (2023) User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Human Factors 10:e47564.

  4. [4] Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. (2024) Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nature Communications 15:2050.

  5. [5] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021) On the opportunities and risks of foundation models. arXiv:2108.07258.

  6. [6] Ramaswamy, A., Tyagi, A., Hugo, H., Jiang, J., Jayaraman, P., Jangda, M., Te, A.E., Kaplan, S.A., Lampert, J., Freeman, R., Gavin, N., Tewari, A.K., Sakhuja, A., Naved, B., Charney, A.W., Omar, M., Gorin, M.A., Klang, E., et al. (2026) ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine.

  7. [7] Semigran, H.L., Linder, J.A., Gidengil, C. & Mehrotra, A. (2015) Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ 351:h3480.

  8. [8] Arora, R.K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J. & Singhal, K. (2025) HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI.

  9. [9] Mowbray, H.A., Therriault, M., Bellolio, M.F., Fronheiser, T., Casey, M.F., Mohr, N.M. & Sun, B.C. (2025) Emergency department triage accuracy and delays in care for high-risk conditions. JAMA Network Open 8(4):e259068.

  10. [10] Morley, C., Unwin, M., Peterson, G.M., Stankovich, J. & Kinsman, L. (2018) Emergency department crowding: a systematic review of causes, consequences and solutions. PLOS ONE 13(8):e0203316.

  11. [11] Durand, A.C., Palazzolo, S., Tanti-Hardouin, N., Gerbeaux, P., Sambuc, R. & Gentile, S. (2012) Nonurgent patients in emergency departments: rational or irresponsible consumers? Perceptions of professionals and patients. BMC Research Notes 5:525.

  12. [12] Linzmayer, R., Ramaswamy, A., Hugo, H., Nadkarni, G. & Elhadad, N. (2026) Aggregate benchmark scores obscure patient safety implications of errors across frontier language models. medRxiv. doi:10.64898/2026.03.18.26348695.

  13. [13] Jin, D., Pan, E., Oufattole, N., Weng, W.H., Fang, H. & Szolovits, P. (2021) What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences 11(14):6421.

  14. [14] Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. (2019) PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2567–2577.

  15. [15] Pal, A., Umapathi, L.K. & Sankarasubbu, M. (2022) MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research 174:248–260.

  16. [16] Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J.M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., et al. (2026) Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine 32:943–951.

  17. [17] Molina, M., Mehandru, N., Golchini, N. & Alaa, A. (2025) ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room. PhysioNet, version 1.0.0. doi:10.13026/55s7-3c27.

  18. [18] OpenAI (2026) HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats. OpenAI.

  19. [19] Gatto, J., Seegmiller, P., Burdick, T., Resnik, P., Rahat, R., DeLozier, S. & Preum, S.M. (2026) Medical triage as pairwise ranking: A benchmark for urgency in patient portal messages. arXiv:2601.13178.

  20. [20] Plank, B. (2022) The “problem” of human label variation: on ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 10671–10682.

  21. [22] Raghu, M., Zhang, C., Kleinberg, J. & Bengio, S. (2019) Direct uncertainty prediction for medical second opinions. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97:5281–5290.

  22. [23] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E. & Stoica, I. (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track.

  23. [24] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. (2021) Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR).

  24. [25] Singhal, K., Azizi, S., Tu, T., Mahdavi, S.S., Wei, J., Chung, H.W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., Aguera y Arcas, B., Webster, D., Corrado, G.S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, ...

  25. [26] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., Neal, D., et al. (2025) Toward expert-level medical question answering with large language models. Nature Medicine 31:943–950.

  26. [27] Cocchieri, A., Ragazzi, L., Tagliavini, G. & Moro, G. (2026) ReMedQA: Are we done with medical multiple-choice benchmarks? In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2706–2738.

  27. [28] Yan, L.K.Q., Niu, Q., Li, M., Zhang, Y., Yin, C.H., Fei, C., Peng, B., Bi, Z., Feng, P., Chen, K., Wang, T., Wang, Y., Chen, S., Liu, M. & Liu, J. (2024) Large language model benchmarks in medical tasks. arXiv:2410.21348.

  28. [29] Van Allen, J., Daneshjou, R., Rajpurkar, P., et al. (2025) An evaluation framework for clinical use of large language models in healthcare. Nature Medicine 31:77–86.

  29. [30] Garcia, P., Ma, S.P., Shah, S., Smith, M., Jeong, Y., Devon-Sand, A., Tai-Seale, M., Takazawa, K., Clutter, D., Vogt, K., et al. (2024) Artificial intelligence-generated draft replies to patient inbox messages. JAMA Network Open 7(3):e243201.

  30. [31] Lee, P., Bubeck, S., Petro, J., Chandrasekaran, V., Chen, P., Zhu, Y., Koutra, D., Choi, Y., Kembhavi, A., Xie, Y., Xiong, C., Aljundi, R., Bansal, M., Bastani, H., Nori, H. & Zhang, Y. (2025) An evaluation framework for clinical use of large language models in healthcare. Nature Medicine 31:1163–1172.

  31. [32] Liu, S., Wright, A.P., Patterson, B.L., Wanderer, J.P., Turer, R.W., Nelson, S.D., McCoy, A.B., Sittig, D.F., Wright, A. & Chen, E.S. (2025) Application of large language models in medicine. Nature Reviews Bioengineering 3:197–216.

  32. [33] Qiu, P., Wu, C., Liu, S., Fan, Y., Zhao, W., Chen, Z., Gu, H., Peng, C., Zhang, Y. & Xie, W. (2025) Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications 16:9799.

  33. [34] Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J.A., Wornow, M., Swaminathan, A., Soleymani Lehmann, L., Hong, H.J., Kashyap, M., Chaurasia, A.R., Shah, N.R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M.A. & Shah, N.H. (2025) Testing and evaluation of health care applications of large language models: A systematic review ...

  34. [36] Gaber, F., Shaik, M., Allega, F., Bilecz, A.J., Busch, F., Goon, K., Franke, V. & Akalin, A. (2025) Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine 8(1):263.

  35. [37] Hinson, J.S., Martinez, D.A., Cabral, S., George, K., Whalen, M., Hansoti, B. & Levin, S. (2019) Triage performance in emergency medicine: a systematic review. Annals of Emergency Medicine 74(1):140–152.

  36. [38] Mistry, B., Stewart de Ramirez, S., Kelen, G., Schmitz, P.S.K., Balhara, K.S., Levin, S., Martinez, D., Anton, X. & Hinson, J.S. (2018) Accuracy and reliability of emergency department triage using the Emergency Severity Index: an international multicenter assessment. Annals of Emergency Medicine 71(5):581–587.

  37. [39] Hong, W.S., Haimovich, A.D. & Taylor, R.A. (2018) Predicting hospital admission at emergency department triage using machine learning. PLOS ONE 13(7):e0201016.

  38. [40] Lu, T., Wu, S., Cui, J., Umapathi, L.K., Cheng, F., Ide, T. & Liu, S. (2020) Classifying patient portal messages using convolutional neural networks. Journal of Healthcare Informatics Research.

  39. [41] Si, S., Wang, R., Wosik, J., Zhang, H., Dov, D., Wang, G. & Carin, L. (2020) Students need more attention: BERT-based attention model for small data with application to automatic patient message triage. In Proceedings of the 5th Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research 126:436–456.

  40. [42] Zhang, M., Shen, Y., Li, Z., Sha, H., Hu, B., Wang, Y., Huang, C., Liu, S., Tong, J., Jiang, C., Chai, M., Xi, Z., Dou, S., Gui, T., Zhang, Q. & Huang, X. (2025) LLMEval-Medicine: A real-world clinical benchmark for medical LLMs with physician validation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 5270–5293.

  41. [43] Omar, M., Agbareia, R., Glicksberg, B.S., Nadkarni, G.N. & Klang, E. (2025) Benchmarking the confidence of large language models in answering clinical questions: Cross-sectional evaluation study. JMIR Medical Informatics 13:e66917.

  42. [44] Lee, J., Park, S., Shin, J. & Cho, B. (2024) Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Medical Informatics and Decision Making 24:366.

  43. [45] Patterson, B.L. & Afshar, M. (2025) Evaluating clinical AI summaries with large language models as judges. npj Digital Medicine 8:640.

  44. [46] McCoy, L.G., Swamy, R., Sagar, N., Wang, M., Bacchi, S., Fong, J.M.N., Manrai, N.C., Humbert, A. & Rodman, A. (2025) Assessment of large language models in clinical reasoning: A novel benchmarking study. NEJM AI 2(10). doi:10.1056/AIdbp2500120.

  45. [47] Davani, A.M., Díaz, M. & Prabhakaran, V. (2022) Dealing with disagreements: looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10:92–110.

  46. [48] Mimori, T., Sasada, K., Matsui, H. & Sato, I. (2021) Diagnostic uncertainty calibration: Toward reliable machine predictions in medical domain. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 130:3664–3672.

  47. [49] Leonardelli, E., Abercrombie, G., Almanea, D., Basile, V., Fornaciari, T., Plank, B., Rieser, V., Uma, A. & Poesio, M. (2023) SemEval-2023 Task 11: Learning With Disagreements (LeWiDi). In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pp. 2304–2318.

  48. [50] Basile, V., Leonardelli, E., Hovy, D., Paun, S. & Uma, A. (2024) Soft metrics for evaluation with disagreements: an assessment. In Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @ LREC-COLING 2024, pp. 90–100.

  49. [51] Weerasooriya, T.C., Ororbia, A., Bhensadadia, R., KhudaBukhsh, A. & Homan, C. (2023) Disagreement matters: Preserving label diversity by jointly modeling item and annotator label distributions with DisCo. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 4679–4695.

  50. [52] Gao, R., Nanda, V., Choudhury, S.R., Kulkarni, C. & Zou, J. (2025) Arbiters of ambivalence: Challenges of using LLMs in no-consensus tasks. In Findings of the Association for Computational Linguistics: ACL 2025.
