
arxiv: 2606.03157 · v1 · pith:Y5SSZQCHnew · submitted 2026-06-02 · 💻 cs.AI

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Pith reviewed 2026-06-28 10:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords ClinicalMC · multi-course clinical decision-making · LLM benchmark · medical AI evaluation · multi-agent framework · healthcare decision support

The pith

ClinicalMC supplies 7,079 multi-course clinical samples to test LLMs on decisions that unfold across admission, treatment, and discharge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClinicalMC to fill the gap where existing benchmarks test LLMs only on single patient encounters. Real clinical work involves repeated examinations, treatment adjustments, and evolving conditions over several courses. The benchmark supplies English and Chinese samples spanning four stages and pairs them with a multi-agent setup that lets patient, examiner, and doctor roles interact. Two evaluation modes, static single-turn and dynamic multi-turn, are used to compare closed-source, open-source, and medical LLMs. The goal is to measure how well current models handle the longitudinal nature of actual patient care.

Core claim

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English samples that cover four stages of multi-course clinical decision-making from triage through final diagnosis, with English cases averaging 5.11 courses and Chinese cases averaging 3.42 courses; performance is measured through a multi-agent framework of patient, examiner, and doctor agents under both single-turn static and multi-turn dynamic settings.
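The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below reproduces the 7,079 total from the two per-language counts and also computes a sample-weighted average course count across both languages. That pooled average is a derived quantity, not a number the paper reports, and the dictionary keys `zh` and `en` are labels chosen here for illustration.

```python
# Sample counts and per-language average course counts as stated in the abstract.
SAMPLES = {"zh": 1275, "en": 5804}
AVG_COURSES = {"zh": 3.42, "en": 5.11}


def total_samples(samples):
    """Total number of benchmark samples across languages."""
    return sum(samples.values())


def pooled_average_courses(samples, avg_courses):
    """Sample-weighted mean course count (derived here, not reported in the paper)."""
    weighted = sum(samples[lang] * avg_courses[lang] for lang in samples)
    return weighted / total_samples(samples)


if __name__ == "__main__":
    print(total_samples(SAMPLES))
    print(round(pooled_average_courses(SAMPLES, AVG_COURSES), 2))
```

Because the English split is more than four times larger than the Chinese split, the pooled average sits much closer to 5.11 than to 3.42, which is worth keeping in mind when reading any aggregate score.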

What carries the argument

The ClinicalMC benchmark together with its multi-agent evaluation framework (patient, examiner, and doctor agents) that generates and scores trajectories across repeated clinical courses.
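To make the three-role setup concrete, here is a minimal sketch of how a doctor agent's turns might be routed to a patient agent or an examiner agent, and how the static and dynamic settings differ. Everything in it is an assumption made for illustration: the `Case` record, the action names `ask`, `exam`, and `decide`, the `max_turns` cap, and the scripted doctor are hypothetical and are not taken from the paper's framework. In a real run the doctor policy would be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Case:
    """Toy stand-in for one clinical record (hypothetical structure)."""
    chief_complaint: str
    exam_results: Dict[str, str]  # examination name -> recorded result


Turn = Tuple[str, str]  # (speaker, text)
# A doctor policy maps the dialogue so far to ("ask" | "exam" | "decide", content).
DoctorPolicy = Callable[[List[Turn]], Tuple[str, str]]


def patient_reply(case: Case, question: str) -> str:
    # Patient agent: answers from the chief complaint in the record.
    return case.chief_complaint


def examiner_reply(case: Case, exam: str) -> str:
    # Examiner agent: looks up the requested examination in the record.
    return case.exam_results.get(exam, "no result on record")


def run_dynamic(case: Case, doctor: DoctorPolicy, max_turns: int = 8) -> List[Turn]:
    """Multi-turn setting: the doctor gathers information step by step."""
    history: List[Turn] = []
    for _ in range(max_turns):
        kind, content = doctor(history)
        history.append(("doctor", f"{kind}: {content}"))
        if kind == "ask":
            history.append(("patient", patient_reply(case, content)))
        elif kind == "exam":
            history.append(("examiner", examiner_reply(case, content)))
        else:  # "decide" ends the course
            break
    return history


def run_static(case: Case, doctor: DoctorPolicy) -> str:
    """Single-turn setting: the full record is handed over at once."""
    history: List[Turn] = [("patient", case.chief_complaint)]
    history += [("examiner", f"{k}: {v}") for k, v in case.exam_results.items()]
    _, content = doctor(history)
    return content


def scripted_doctor(history: List[Turn]) -> Tuple[str, str]:
    # Placeholder policy used only to exercise the loop.
    speakers = [speaker for speaker, _ in history]
    if "patient" not in speakers:
        return ("ask", "What brings you in?")
    if "examiner" not in speakers:
        return ("exam", "chest X-ray")
    return ("decide", "treat per findings")
```

The point of the contrast is that the dynamic loop only reveals an examination result if the doctor asks for that examination, whereas the static call exposes everything up front.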

If this is right

  • LLM evaluation can shift from isolated single-encounter tests to repeated interaction across evolving patient states.
  • Differences in model performance between static and dynamic settings become measurable for closed-source, open-source, and medical LLMs.
  • Deployment decisions for LLMs in healthcare can be informed by results that track the full admission-to-discharge sequence rather than single decisions.
  • The benchmark supplies concrete data on how many clinical courses typical patients undergo, enabling more realistic simulation lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that optimize for multi-turn consistency may become a practical next step for medical LLMs.
  • The same multi-agent structure could be adapted to track performance on rare disease pathways or specific specialties once more samples are added.
  • If the benchmark scores correlate with real clinical outcomes, regulators could require multi-course testing before approving AI tools for longitudinal care.

Load-bearing premise

The collected samples and the simulated patient-examiner-doctor interactions faithfully represent real multi-course clinical processes without introducing large artificial biases or oversimplifications.

What would settle it

A panel of practicing clinicians reviews a random subset of the benchmark trajectories and reports that a substantial fraction contain medically implausible condition progressions or agent behaviors that do not match observed hospital practice.

Figures

Figures reproduced from arXiv: 2606.03157 by Chunming Wang, Guangya Yu, Ruihui Hou, Siyi Zhu, Tong Ruan, Yongqi Fan, Ziyue Huai.

Figure 1: The solid boxes highlight the distinctions…
Figure 2: The department distribution of the Chinese and English datasets.
Figure 3: The SimHospital framework includes a doctor agent, an examiner agent, and a patient agent. In different…
Figure 4: Distribution of error types.
    From Section 5.3 (Error Type): To guide future research in clinical decision-making for LLMs, we manually analyze and classify 200 error samples generated by LLMs on the Chinese and English datasets of ClinicalMC. These errors are categorized into five types: (a) Redundant Diagnostic and Treatment Plan (RDTP): The model generates an excessive number of unnecessary di…
Figure 5: Prompt Template of GPT-4 Evaluation
Figure 6: The performance of Examination Recall, Assessment Score, and Treatment Score for each course in the…
Figure 7: The performance of Examination Recall, Assessment Score, and Treatment Score for each course in the…
Figure 8: Examples of the three error types for Chinese data in ClinicalMC. The…
Figure 9: Examples of the three error types for English data in ClinicalMC. The…
Figure 10: Prompt of the doctor agent
Figure 11: Prompt of the examiner agent.
    Prompt text captured with the caption: Rule: You are a patient. Instruction: Here is your basic information: {Basic_Information}. A doctor will come to diagnose your physical condition. (1) Respond according to the chief complaint in the medical record. (2) When instructed or advised to undergo an examination, promptly send the examination details to the examiner. (3) After receiving the examination results from t…
Figure 12: Prompt of the patient agent
Figure 13: Prompt for data annotation
Figure 14: Prompts for evaluating differential diagnosis, diagnostic basis, and assessment.
Figure 15: Chinese EHR example
Figure 16: English EHR example
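The captions for Figures 6 and 7 name a per-course Examination Recall metric, but this page does not give its definition. The sketch below assumes the usual set-based reading of recall: the share of reference examinations that the model also ordered, computed separately for each course. The function names and the case-insensitive string matching are assumptions made here, and the paper's actual matching rule may differ.

```python
from typing import Iterable, List


def examination_recall(ordered: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of reference examinations that appear among the ordered ones.

    Assumes exact matching after trimming and lowercasing (an illustrative choice).
    """
    ref = {name.strip().lower() for name in reference}
    if not ref:
        return 0.0  # convention chosen here for a course with no reference exams
    got = {name.strip().lower() for name in ordered}
    return len(got & ref) / len(ref)


def per_course_recall(
    ordered_by_course: List[List[str]],
    reference_by_course: List[List[str]],
) -> List[float]:
    """Recall for each course, in course order."""
    return [
        examination_recall(ordered, reference)
        for ordered, reference in zip(ordered_by_course, reference_by_course)
    ]
```

Reporting the value per course rather than as one pooled number is what lets a plot like Figures 6 and 7 show whether performance drifts as the patient's stay gets longer.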
Original abstract

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ClinicalMC, a benchmark for multi-course clinical decision-making with LLMs. It consists of 1,275 Chinese and 5,804 English samples spanning four stages from admission to discharge (triage; first-course examination/diagnosis/treatment; subsequent multi-course examination/assessment/treatment; final diagnosis), with patients undergoing an average of 3.42 and 5.11 courses respectively. A multi-agent framework (patient, examiner, doctor agents) is proposed to simulate trajectories, and LLMs are evaluated under single-turn static and multi-turn dynamic settings across closed-source, open-source, and medical model categories.

Significance. If the dataset construction and multi-agent simulation prove faithful to real clinical processes, this benchmark would address a clear gap in single-course evaluations and enable more realistic assessment of LLM performance in evolving patient scenarios. The dual-language coverage and explicit multi-turn dynamic setting are strengths that could support safer deployment of LLMs in healthcare. The work also provides a concrete framework for future multi-agent medical evaluations.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: The manuscript states the sample counts and average course numbers but supplies no information on data sourcing (e.g., real EHR extraction vs. synthetic generation), curation criteria, expert validation, or inter-rater reliability. These details are load-bearing for the central claim that ClinicalMC constitutes a reliable benchmark for multi-course decision-making.
  2. [Multi-agent evaluation framework] Multi-agent evaluation framework section: The patient/examiner/doctor agent roles are introduced without describing interaction protocols, bias-mitigation steps, or any validation that the simulated trajectories avoid substantial artificial simplifications relative to actual clinical workflows. This directly affects the soundness of both the single-turn and multi-turn experimental settings.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'extensive evaluation' is made without any quantitative results or performance metrics, which would strengthen the reader's ability to assess the benchmark's utility.
  2. [Benchmark description] Notation: The four stages are described in prose but would benefit from an explicit table or diagram listing stage definitions and transition rules for clarity.
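As an illustration of what an explicit stage listing could look like, the sketch below encodes the four stages named in the abstract and expands them into an ordered sequence for a patient with a given number of courses. The transition rule it uses (one triage step, one first course, zero or more subsequent courses, then a final diagnosis) is an assumption inferred from the prose description, not a rule stated by the authors, and the names `Stage` and `stage_sequence` are hypothetical.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    TRIAGE = "triage"
    FIRST_COURSE = "first-course examination/diagnosis/treatment"
    SUBSEQUENT_COURSE = "subsequent-course examination/assessment/treatment"
    FINAL_DIAGNOSIS = "final diagnosis"


def stage_sequence(num_courses: int) -> List[Stage]:
    """Ordered stages for a stay with `num_courses` clinical courses (assumed rule)."""
    if num_courses < 1:
        raise ValueError("a stay needs at least one course")
    return (
        [Stage.TRIAGE, Stage.FIRST_COURSE]
        + [Stage.SUBSEQUENT_COURSE] * (num_courses - 1)
        + [Stage.FINAL_DIAGNOSIS]
    )
```

Under this reading, a five-course stay produces seven decision points, so the average English case (5.11 courses) asks noticeably more of a model than the average Chinese case (3.42 courses).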

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify genuine gaps in the current manuscript regarding the transparency of benchmark construction and the multi-agent simulation protocol. We address each point below and will incorporate the requested details in a revised version.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: The manuscript states the sample counts and average course numbers but supplies no information on data sourcing (e.g., real EHR extraction vs. synthetic generation), curation criteria, expert validation, or inter-rater reliability. These details are load-bearing for the central claim that ClinicalMC constitutes a reliable benchmark for multi-course decision-making.

    Authors: We agree that these methodological details are essential for establishing the benchmark's reliability and were insufficiently described. In the revised manuscript we will add a dedicated subsection under Benchmark Construction that specifies: (1) the exact data sources and whether samples were extracted from real EHRs or generated synthetically, (2) the curation criteria and filtering steps applied, (3) the expert validation protocol (including number of clinicians involved and their qualifications), and (4) inter-rater reliability statistics (e.g., Cohen's or Fleiss' kappa). revision: yes

  2. Referee: [Multi-agent evaluation framework] Multi-agent evaluation framework section: The patient/examiner/doctor agent roles are introduced without describing interaction protocols, bias-mitigation steps, or any validation that the simulated trajectories avoid substantial artificial simplifications relative to actual clinical workflows. This directly affects the soundness of both the single-turn and multi-turn experimental settings.

    Authors: We acknowledge that the current description of the multi-agent framework is high-level and lacks the requested operational details. In the revision we will expand the Multi-agent Evaluation Framework section to include: (1) the precise interaction protocols and turn-taking rules between the patient, examiner, and doctor agents, (2) any bias-mitigation techniques employed (e.g., prompt constraints or role-specific instructions), and (3) validation experiments or qualitative checks demonstrating that the generated trajectories remain faithful to real clinical workflows rather than introducing artificial simplifications. revision: yes
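The first response promises inter-rater reliability statistics such as Cohen's kappa. For readers who want to see what that statistic measures, here is a small self-contained sketch for two annotators. It implements the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The label values in any call are illustrative, and nothing here reflects the authors' actual annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, so two annotators who agree on 75% of items can still score only 0.5 when their label distributions make half of that agreement expected by luck. With more than two clinicians, Fleiss' kappa is the usual generalization.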

Circularity Check

0 steps flagged

No significant circularity in benchmark introduction

Full rationale

The paper introduces an external benchmark (ClinicalMC) consisting of curated Chinese and English clinical samples across admission-to-discharge stages, together with a multi-agent simulation framework (patient/examiner/doctor agents) and two evaluation settings (single-turn static, multi-turn dynamic). No equations, fitted parameters, or predictive derivations appear; the work does not claim to derive any quantity from its own outputs or from self-citations that would reduce the central claim to a tautology. The contribution is the construction of new test data and an evaluation protocol, which stands as an independent artifact rather than a self-referential computation. This matches the default expectation that benchmark papers are typically non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution depends on unverified assumptions about the realism of the constructed dataset and the multi-agent simulation; these are domain assumptions rather than derived results.

axioms (2)
  • domain assumption The multi-agent framework with patient, examiner, and doctor agents provides a valid proxy for real clinical interactions.
    The evaluation of LLMs in both static and dynamic settings rests on this modeling choice.
  • domain assumption The 1,275 Chinese and 5,804 English samples across four stages faithfully represent evolving multi-course patient conditions.
    This underpins the claim that the benchmark addresses the identified evaluation gap.

pith-pipeline@v0.9.1-grok · 5796 in / 1311 out tokens · 37616 ms · 2026-06-28T10:14:23.663713+00:00 · methodology

