Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Chaoyi Wu; Cheng Liang; Pengcheng Qiu; Weidi Xie; Yanfeng Wang; Ya Zhang

arxiv: 2606.05112 · v1 · pith:VRJCLOSPnew · submitted 2026-06-03 · 💻 cs.CL

Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3

keywords: large language models, clinical decision making, standardized patients, interactive benchmarks, medical AI evaluation, dynamic simulation, rubric scoring

The pith

Even the best-performing large language model completes only 60.4 percent of expert rubric items in dynamic clinical encounters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedSP1000, a benchmark of 1,638 standardized patient cases converted into executable interactive scenarios with 24,602 peer-reviewed rubrics. It runs clinical agents in closed-loop simulations against a patient agent and environment controller, scoring each trajectory against the original expert criteria. Results show that performance on existing static medical benchmarks does not carry over: the strongest general model reaches 60.4 percent rubric completion while the best medically specialized model reaches 40 percent, and extra test-time compute yields no gain. The work therefore claims that current LLMs remain too unreliable for safe clinical use because they miss the adaptive, multi-turn information gathering and longitudinal management required in real encounters.

Core claim

MedSP1000 converts peer-reviewed standardized patient teaching cases into closed-loop executable scenarios; when a range of LLMs is evaluated against the original expert rubrics, even the best model completes only 60.4 percent of required items and medically tuned models score lower (40.0 percent at best), indicating that static benchmarks miss clinically relevant failure modes.

What carries the argument

MedSP1000, an interactive benchmark that executes standardized patient cases in closed loop with a patient agent and environment controller while scoring against human-validated rubrics.

If this is right

Static single-turn medical benchmarks do not predict success in multi-turn clinical management.
Extra test-time compute does not improve rubric completion rates on these cases.
Current LLMs and medically specialized agents fall short of the reliability needed for actual clinical integration.
Process-level SP-style evaluation exposes failure modes that single-turn tests overlook.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark's validity holds, development of clinical agents should prioritize long-horizon adaptation and information-gathering consistency over single-answer accuracy.
The same simulation infrastructure could be reused to test whether human clinicians also drop below 100 percent on the same rubrics, providing a direct human baseline.
Extending the cases to include rare or ambiguous presentations would test whether the observed performance gap widens under greater uncertainty.

Load-bearing premise

The closed-loop simulation with a patient agent and the peer-reviewed rubrics produces trajectories that validly represent clinical decision-making quality.

What would settle it

A model that scores below 60 percent on MedSP1000 but still meets or exceeds human clinician performance in a controlled real-patient trial would falsify the claim that the benchmark reveals clinically relevant unreliability.

Original abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MedSP1000 offers a new large-scale interactive benchmark built from real SP cases, but the LLM patient agent is not shown to be calibrated against human SP behavior, so the performance numbers are hard to interpret.

The main point is that this paper builds MedSP1000 from 1,638 standardized patient cases, turns them into closed-loop simulations with a patient agent and environment controller, and scores trajectories against 24,602 expert rubrics. The top model reaches only 60.4 percent completion while medically tuned models sit lower, and extra test-time compute adds nothing.

What is new is the scale and the move to full-encounter rubrics instead of single-turn medical QA. The authors take existing peer-reviewed teaching cases, add scripts and contexts, and run agents through multi-turn interactions. That setup does expose gaps that static tests miss, and the human-validated rubrics are a concrete step forward.

The soft spot is the patient agent. It is itself an LLM, yet the paper supplies no calibration data, inter-rater agreement with human SPs, or checks on how the simulated responses match real variability and ambiguity. Without that, the 60 percent ceiling could be an artifact of the simulation rather than a direct measure of clinical readiness. The leap to “not yet reliable enough for actual practice” therefore rests on an assumption that is not yet backed by evidence in the text.

This work is aimed at groups building clinical agents who need better process-level tests. Readers focused on benchmark construction will find the rubric approach and case volume useful even if they disagree with the final claims.

It deserves a serious referee. The benchmark idea is fresh and the empirical contrast with static tests is worth checking in detail.

Referee Report

2 major / 0 minor

Summary. The paper introduces MedSP1000, a benchmark consisting of 1,638 standardized patient (SP) cases converted into executable interactive scenarios with 24,602 peer-reviewed rubrics. It evaluates a range of general and medically specialized LLMs as clinical agents in closed-loop simulations against a patient agent and environment controller, reporting that the best model (GPT-5.5) completes only 60.4% of rubric items while medically specialized models reach at most 40.0%, with no gains from increased test-time compute. The authors conclude that current LLMs are not reliable enough for safe integration into clinical practice and that static benchmarks miss clinically relevant failure modes.

Significance. If the simulation trajectories and rubric scores validly proxy real clinical encounters, the work would demonstrate that dynamic, multi-turn evaluation reveals important limitations not captured by single-turn benchmarks, providing a concrete path toward more realistic assessment of clinical agents.

major comments (2)

[Abstract] The central claim that LLMs are not yet reliable for clinical practice rests on rubric scores from closed-loop trajectories with an LLM-based patient agent. No calibration data, inter-rater agreement metrics with human SPs, or correlation with real SP outcomes are referenced, leaving open the possibility that the 60.4% ceiling reflects simulation artifacts rather than clinical capability (Abstract and evaluation description).
[Abstract] The manuscript provides no details on the patient-agent implementation, environment controller mechanics, rubric application process, or simulation validation steps. These omissions make it impossible to assess whether the reported performance differences are reproducible or clinically meaningful (evaluation run description).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation and reproducibility. We respond to each major point below and indicate where revisions will be made.

Referee: [Abstract] The central claim that LLMs are not yet reliable for clinical practice rests on rubric scores from closed-loop trajectories with an LLM-based patient agent. No calibration data, inter-rater agreement metrics with human SPs, or correlation with real SP outcomes are referenced, leaving open the possibility that the 60.4% ceiling reflects simulation artifacts rather than clinical capability (Abstract and evaluation description).

Authors: We agree that the absence of direct calibration data, inter-rater agreement with human SPs, or correlation to real clinical outcomes is a substantive limitation. The rubrics originate from peer-reviewed SP teaching cases with established use in medical education, but the closed-loop use of LLM patient agents may introduce artifacts. In revision we will add an explicit limitations subsection that acknowledges this gap, qualifies the language on clinical reliability, and outlines planned future validation against human SPs. The core empirical observation—that dynamic multi-turn evaluation surfaces failure modes missed by static benchmarks—remains supported by the reported results. revision: partial
Referee: [Abstract] The manuscript provides no details on the patient-agent implementation, environment controller mechanics, rubric application process, or simulation validation steps. These omissions make it impossible to assess whether the reported performance differences are reproducible or clinically meaningful (evaluation run description).

Authors: The initial submission did not provide sufficient implementation detail. The full manuscript contains a methods section describing the patient agent (LLM prompted with the fixed SP script and conversation history), the environment controller (rule-based state machine that updates clinical variables and terminates the encounter), and rubric scoring (hybrid string matching plus LLM-assisted classification with human adjudication on ambiguous cases). To improve reproducibility we will expand this section with pseudocode, exact prompting templates, and a description of the simulation-validation steps already performed on a held-out subset of trajectories. revision: yes
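The closed-loop structure described in this response can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `run_encounter`, `clinical_agent`, `patient_agent`, and `controller`, the turn cap, and the transcript format are hypothetical and are not taken from the paper, whose actual prompts and state machine are not shown on this page.

```python
from typing import Callable

# Hypothetical type aliases; the paper's real interfaces are not given here.
Transcript = list[tuple[str, str]]  # (role, text) pairs in encounter order


def run_encounter(
    sp_script: str,
    clinical_agent: Callable[[Transcript], str],
    patient_agent: Callable[[str, Transcript], str],
    controller: Callable[[str], tuple[str, bool]],
    max_turns: int = 20,  # assumed safety cap, not from the source
) -> Transcript:
    """Run one closed-loop encounter and return the full transcript."""
    transcript: Transcript = []
    for _ in range(max_turns):
        # 1. The clinical agent under test acts on the history so far.
        action = clinical_agent(transcript)
        transcript.append(("clinician", action))

        # 2. The patient agent replies from the fixed SP script plus history.
        reply = patient_agent(sp_script, transcript)
        transcript.append(("patient", reply))

        # 3. The environment controller returns feedback and may end the case.
        feedback, should_end = controller(action)
        transcript.append(("environment", feedback))
        if should_end:
            break
    return transcript
```

In this sketch the three roles are plain callables, so scripted stand-ins can replace LLM calls when testing the loop itself; rubric scoring would then run over the returned transcript.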

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

The paper conducts an empirical benchmark evaluation of LLMs on MedSP1000, which converts existing peer-reviewed SP cases into executable scenarios scored against human-validated rubrics. No derivations, equations, fitted parameters, or predictions are present that could reduce outputs to inputs by construction. Performance metrics such as the 60.4% rubric completion rate are direct measurements against external expert criteria, with no self-citation chains or ansatzes invoked to justify the central claim. The evaluation is self-contained against the provided rubrics and cases.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that LLMs are not clinically reliable rests on domain assumptions about simulation fidelity and rubric validity rather than fitted parameters or new entities.

axioms (1)

Domain assumption: Standardized patient cases and associated rubrics accurately capture the requirements of real clinical encounters.
Invoked when translating SP teaching cases into executable scenarios and claiming clinical relevance of the scores.

pith-pipeline@v0.9.1-grok · 5832 in / 1222 out tokens · 45526 ms · 2026-06-28T06:31:38.403339+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 10 canonical work pages · 7 internal anchors

[1]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
[2]

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, pages 1–8, 2025
[3]

Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, et al. Towards accurate differential diagnosis with large language models. Nature, pages 1–7, 2025
[4]

Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards evaluating and building versatile large language models for medicine. npj Digital Medicine, 8(1):58, 2025
[5]

Sarah Sandmann, Stefan Hegselmann, Michael Fujarski, Lucas Bickmann, Benjamin Wild, Roland Eils, and Julian Varghese. Benchmark evaluation of deepseek large language models in clinical decision-making. Nature Medicine, pages 1–1, 2025
[6]

Farieda Gaber, Maqsood Shaik, Fabio Allega, Agnes Julia Bilecz, Felix Busch, Kelsey Goon, Vedran Franke, and Altuna Akalin. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 8(1):263, 2025
[7]

Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Yanjie Fan, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, et al. Quantifying the reasoning abilities of llms on clinical cases. Nature Communications, 16(1):9799, 2025
[8]

Michael Moritz, Eric Topol, and Pranav Rajpurkar. Coordinated ai agents for advancing healthcare. Nature Biomedical Engineering, 9(4):432–438, 2025
[9]

Oscar Freyer, Sanddhya Jayabalan, Jakob N Kather, and Stephen Gilbert. Overcoming regulatory barriers to the implementation of ai agents in healthcare. Nature Medicine, 31(10):3239–3243, 2025
[10]

Georgianna Lin, Rencong Jiang, Noémie Elhadad, and Xuhai ‘Orson’ Xu. A framework for longitudinal health ai agents. Nature Health, pages 1–10, 2026
[11]

Zhen Ling Teo, Arun James Thirunavukarasu, Kabilan Elangovan, Haoran Cheng, Prasanth Moova, Brian Soetikno, Christopher Nielsen, Andreas Pollreisz, Darren Shu Jeng Ting, Robert JT Morris, et al. Generative artificial intelligence in medicine. Nature Medicine, 31(10):3270–3282, 2025
[12]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
[13]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022
[14]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China, Novembe...
[15]

Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025
[16]

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025
[17]

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Holistic evaluation of large language models for medical tasks with medhelm. Nature Medicine, pages 1–9, 2026
[18]

Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9):2613–2622, 2024
[19]

Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Leandra A Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer M Van Allen, David Kim, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine, 31(1):77–86, 2025
[20]

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024
[21]

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. Towards conversational diagnostic artificial intelligence. Nature, pages 1–9, 2025
[22]

Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024
[23]

Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, et al. Evolving diagnostic agents in a virtual clinical environment. arXiv preprint arXiv:2510.24654, 2025
[24]

Stanford Medicine. ACGME core competencies | graduate medical education, 2026. Accessed: 2026-05-29
[25]

Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, and Karan Singhal. Healthbench professional: Evaluating large language models on real clinician chats, 2026
[26]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
[27]

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724, 2024
[28]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
[29]

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023
[30]

Microsoft. Markitdown. https://github.com/microsoft/markitdown, 2026. GitHub repository. Version v0.1.5. Accessed April 13, 2026
[31]

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, et al. Mineru2.5-pro: Pushing the limits of data-centric document parsing at scale. arXiv preprint arXiv:2604.04771, 2026
[32]

Laura Edgar, Sydney Roberts, and Eric Holmboe. Milestones 2.0: a step forward. Journal of Graduate Medical Education, 10(3):367–369, 2018
[33]

Laura Edgar, Sydney McLean, Sean O Hogan, Stan Hamstra, and Eric S Holmboe. The milestones guidebook. Accreditation Council for Graduate Medical Education, 2024(24):154, 2020
[34]

Anthropic. Introducing claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7. Accessed: 2026-5-12
[35]

OpenAI. Introducing gpt-5.5. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-5/. Accessed: 2026-5-12
[36]

Google. Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/. Accessed: 2026-5-12
[37]

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
[38]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026
[39]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025
[40]

moderate

Baichuan M3 Team. Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making, 2025.

[Figure: Share of all attachment files by modality (n = 22,244); cases containing ≥1 file of this modality (n = 1,073). Modality labels: Archive / Other, Executable / Binary, Simulator program, Audio, Interactive cou...]
[41]

simulatable

1. Demonstrate effective communication skills when disclosing medical error
2. To be able to disclose medical error without blaming others
3. To assume the responsibility of the error
4. To be able to offer apology
5. To recommend current and future actions after the medical error event.

Target group: Residents all levels
Type of case: Communication/Assessment S...
[42]

If simulatable == false, **or** the scenarios field is the empty array [], execute the "skip branch":
- Do not create any scenario directory
- Do not copy any file
- Do not generate any material for any role
- Write a phase2_NOT_APPLICABLE.md in the current working directory containing: simulatable, simulatable_reason, case_shape from phase1, the count ...
[43]

If simulatable == true and scenarios is non-empty, continue with the "normal flow" below.

[Normal flow]
What should each simulated scenario expected by the source material look like? If we strictly enact the content of the source material, and assume the entire process of each scenario is simulated through text, I need to prepare instruction files for the...
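The branch condition in the two prompt fragments above can be written as a small predicate. This is a sketch under assumptions: the function name `choose_phase2_branch` and the dictionary shape are hypothetical, and treating a missing field like a false or empty one is a conservative choice made here, not something the prompt states.

```python
def choose_phase2_branch(phase1: dict) -> str:
    """Return "skip" or "normal" following the quoted branch rule.

    Assumption: `phase1` is a parsed mapping carrying the `simulatable`
    and `scenarios` fields; missing fields are treated as not simulatable
    and as an empty scenario list respectively.
    """
    simulatable = phase1.get("simulatable") is True
    scenarios = phase1.get("scenarios") or []
    # Skip when simulatable is false OR the scenarios array is empty.
    if not simulatable or len(scenarios) == 0:
        return "skip"
    # Otherwise simulatable is true and scenarios is non-empty.
    return "normal"
```

The sketch only decides the branch. It does not write phase2_NOT_APPLICABLE.md, because the full list of fields that file must contain is truncated in the extracted text.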
[44]

Read every readable text file (.md / .txt) directly under `evaluator/`. Recurse into sub-folders of `evaluator/` if any exist
[45]

Read ONLY `evaluator/`. Do NOT read examinee/ , sp_actor/ , environment_controller/ , or any pipeline product elsewhere (phase1_* / phase2_* / *_summary.md / *_packets_index.md / scenario*_NOT_APPLICABLE.md / CLAUDE.md / .codex_tmp_* / __MACOSX/ / files starting with ._ / .DS_Store / Thumbs.db / *.log / *.tmp)
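The reading rules above amount to a filtered recursive walk of `evaluator/`. The following is a minimal sketch, not the authors' pipeline: `collect_evaluator_files` and `is_excluded` are hypothetical names, and applying the exclusion patterns inside `evaluator/` as well as elsewhere is a conservative assumption made here.

```python
from fnmatch import fnmatch
from pathlib import Path

READABLE_SUFFIXES = {".md", ".txt"}

# File-name patterns copied from the quoted exclusion list.
EXCLUDED_PATTERNS = [
    "phase1_*", "phase2_*", "*_summary.md", "*_packets_index.md",
    "scenario*_NOT_APPLICABLE.md", "CLAUDE.md", ".codex_tmp_*",
    "._*", ".DS_Store", "Thumbs.db", "*.log", "*.tmp",
]


def is_excluded(relative_path: Path) -> bool:
    """True if the path sits under __MACOSX/ or matches an excluded pattern."""
    if "__MACOSX" in relative_path.parts:
        return True
    return any(fnmatch(relative_path.name, pat) for pat in EXCLUDED_PATTERNS)


def collect_evaluator_files(scenario_dir: Path) -> list[Path]:
    """List readable text files under <scenario_dir>/evaluator, recursively.

    Sibling folders such as examinee/ or sp_actor/ are never visited
    because the walk starts at evaluator/ itself.
    """
    root = scenario_dir / "evaluator"
    files = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in READABLE_SUFFIXES:
            continue
        if is_excluded(path.relative_to(root)):
            continue
        files.append(path)
    return files
```

Starting the walk at `evaluator/` rather than filtering a scan of the whole scenario directory makes the "read ONLY" constraint structural instead of pattern-dependent.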
[46]

Bilingual duplicates: if the same document exists as both an English file and a translated copy whose name only adds a language suffix (e.g. `Foo.md` and `Foo-zh.md`), use ONLY the English file and ignore the translated copy, so each concept is counted exactly once.

[Source of scoring items] (strong constraints; violating these pollutes downstream paper data)
[47]

A scoring item must be a decidable statement about the examinee's behavior or judgment --- a form on which one can ask, "Did the examinee complete / make this?"

The following content in the source text is NOT a scoring item and must not be extracted:
- Narrative facts (sentences that describe what happens in the case itself).
- Structural numbering or ste...
[48]

Each scoring item must have a matching original sentence (or one with identical semantics) somewhere in the evaluator materials
[49]

Scoring items must preserve the original wording; rewriting, merging, abbreviating, or paraphrasing is forbidden
[50]

Do not invent scoring items that are not explicitly present in the evaluator materials. Extracting the rubric is the ONLY task here; there is no transcript and no examinee behavior to consider
[51]

If a competency dimension has no corresponding scoring item in the materials, simply produce no item for that dimension; never fabricate an item just to fill a dimension
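One way to audit the "preserve the original wording" and "do not invent" constraints above is a verbatim containment check. This is a sketch under assumptions: `unsupported_items` is a hypothetical helper, and it checks only exact wording after whitespace normalization, so it is stricter than the rule that also accepts a source sentence with identical semantics.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause misses."""
    return re.sub(r"\s+", " ", text).strip()


def unsupported_items(items: list[str], evaluator_text: str) -> list[str]:
    """Return extracted scoring items that do not appear verbatim in the source.

    An empty result means every item can be traced to the evaluator
    materials word for word (up to whitespace).
    """
    source = _normalize(evaluator_text)
    return [item for item in items if _normalize(item) not in source]
```

Items flagged by this check are candidates for review rather than automatic rejection, since a paraphrase-free extraction can still differ from the source in punctuation or casing.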
[52]

which competency the scoring item's semantics points to

Granularity --- extract scoring items at the source's OWN granularity and prefer the coarser, self-contained form. Do NOT fragment.
- A checklist row / checkbox line, a single numbered or bulleted list entry, or a line ending with a colon together with the detail lines that follow it, counts as ONE scoring item; do not break it into several items.
- When ...
[53]

Objective changes naturally driven by the patient, the disease, or time: these must be proactively written into the in-world fields so that the examinee perceives the change.
[54]

Content that occurs only if triggered by the examinee's decision (that is, matters the examinee should reason about, choose, or request on their own, including condition-acceptance anchors phrased in the materials as "if the examinee requests X, accept X"): these must not be restated in the in-world fields and must not be rewritten as hints. Only when the...
[56]

The examinee explicitly calls that role in the current turn (consult request, paging a specialist, activating a code team, etc.)
[57]

explicitly described as on-site or standing roles); do not wait until they speak before adding them

A trigger condition specified in the reference materials requires that role to appear

When to remove a role: remove a role only when the reference materials explicitly state that the role leaves; otherwise keep the role present.

Opening turn: according to the reference materials, include the healthcare-team roles present at the opening (nurses, cons...
[58]

Semantically understand the physician's actions without relying on a fixed action inventory
[59]

Return only non-verbal feedback, test results, treatment feedback, and system events that can be executed according to the reference materials
[60]

Maintain the progression index `progress_index` and the current scenario-state label `state_label`
[61]

When receiving the signal eos=true, determine from the reference materials whether a next state exists:
- If the reference materials describe a next patient state (such as vital-sign changes or new symptoms), advance to the next state and return its initial events/feedback
- If the reference materials do not contain a next-state change, mark should_end=true
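The progression rule above can be modeled as a small state holder. This is an illustrative sketch only: the class `ControllerState`, the function `on_signal`, and the idea of an ordered list of state labels are assumptions, while `progress_index`, `state_label`, `eos`, and `should_end` are the names used in the quoted prompt.

```python
from dataclasses import dataclass


@dataclass
class ControllerState:
    # Assumption: patient states from the reference materials, in order.
    states: list[str]
    progress_index: int = 0
    should_end: bool = False

    @property
    def state_label(self) -> str:
        """Label of the current scenario state."""
        return self.states[self.progress_index]


def on_signal(state: ControllerState, eos: bool) -> ControllerState:
    """Apply one end-of-stage signal to the controller state.

    eos=False: routine feedback, the scenario does not advance.
    eos=True: advance if a next state exists, otherwise end the encounter.
    """
    if not eos:
        return state
    if state.progress_index + 1 < len(state.states):
        state.progress_index += 1
    else:
        state.should_end = True
    return state
```

Keeping the advance decision in one function makes it easy to check that routine turns never move the scenario forward, which is the property the prompt stresses.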
[62]

response permission

Handle uncertain content conservatively; do not fabricate results absent from the reference materials↪→ Important: during routine feedback (eos=false), return only clinical feedback for the current state and do not advance the scenario. Scenario progression occurs only when eos=true is received. ↪→ ↪→ [Scenario progression rules] - You will receive the cu...
[63]

Never add, remove, split, merge, rephrase, translate, or re-categorize an item

Judge exactly the scoring items listed in the frozen rubric, each under exactly the dimension it is listed. Never add, remove, split, merge, rephrase, translate, or re-categorize an item. ↪→ ↪→
[64]

Every supplied item must appear exactly once, in the output, under its given dimension.↪→

Every output key must be the verbatim original text of a supplied scoring item. Every supplied item must appear exactly once, in the output, under its given dimension.↪→
[65]

Do not move an item to a different dimension even if another dimension seems to fit better

Do not derive new scoring items from the transcript, the action log, or the environment feedback. Do not move an item to a different dimension even if another dimension seems to fit better. ↪→ ↪→
[66]

A dimension listed with no scoring items must be output as an empty object; never back-fill it.↪→ [Completion judgment (true/false)]
[67]

Mark`true`only when the transcript / action log / environment feedback contains explicit positive evidence that the examinee performed or achieved that item.↪→
[68]

If the evidence is missing, indirect, vague, or only verbally mentioned without follow-through, mark`false`.↪→
[69]

overall performance is good

Judge each item independently. Do not write overall summary verdicts such as "overall performance is good" or "essentially met".↪→ [Requirements on`reasoning`] - Keep it within 2-4 sentences total, tied to concrete behaviors / feedback / results observed in the simulation.↪→ - Do not enumerate every scoring item; do not turn it into a long summary. Output...
[70]

In`speak`, write only what you say to the patient
[71]

In`actions`, write only non-verbal operations, such as physical examination, monitoring, testing, medication administration, and management actions↪→
[72]

After receiving environment feedback, continue to advance the diagnosis and treatment process until the management loop is completed↪→
[73]

speak":

If the materials already support immediate initiation of monitoring, examination, key tests, or treatment, do not keep asking history questions repeatedly without taking action↪→ Output format: - Output only one JSON object - The format must be "speak": "...", "actions": ["...", "..."], "eos": false - When you believe everything that should be done in the...
[74]

The plot node described in the reference materials requires that role to appear at that time point↪→
[75]

A trigger condition specified in the reference materials requires that role to appear
[76]

simulation

The patient develops the lack of language ability described under [Default and extension rules], and that family/companion/guardian role has already been explicitly mentioned in the reference materials ↪→ ↪→ When to remove a role: remove a role only when the reference materials explicitly state that the role leaves; otherwise keep the role present.↪→ Open...

[1] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

[2] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, pages 1–8, 2025.

[3] Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, et al. Towards accurate differential diagnosis with large language models. Nature, pages 1–7, 2025.

[4] Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards evaluating and building versatile large language models for medicine. npj Digital Medicine, 8(1):58, 2025.

[5] Sarah Sandmann, Stefan Hegselmann, Michael Fujarski, Lucas Bickmann, Benjamin Wild, Roland Eils, and Julian Varghese. Benchmark evaluation of deepseek large language models in clinical decision-making. Nature Medicine, pages 1–1, 2025.

[6] Farieda Gaber, Maqsood Shaik, Fabio Allega, Agnes Julia Bilecz, Felix Busch, Kelsey Goon, Vedran Franke, and Altuna Akalin. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 8(1):263, 2025.

[7] Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Yanjie Fan, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, et al. Quantifying the reasoning abilities of llms on clinical cases. Nature Communications, 16(1):9799, 2025.

[8] Michael Moritz, Eric Topol, and Pranav Rajpurkar. Coordinated ai agents for advancing healthcare. Nature Biomedical Engineering, 9(4):432–438, 2025.

[9] Oscar Freyer, Sanddhya Jayabalan, Jakob N Kather, and Stephen Gilbert. Overcoming regulatory barriers to the implementation of ai agents in healthcare. Nature Medicine, 31(10):3239–3243, 2025.

[10] Georgianna Lin, Rencong Jiang, Noémie Elhadad, and Xuhai ‘Orson’ Xu. A framework for longitudinal health ai agents. Nature Health, pages 1–10, 2026.

[11] Zhen Ling Teo, Arun James Thirunavukarasu, Kabilan Elangovan, Haoran Cheng, Prasanth Moova, Brian Soetikno, Christopher Nielsen, Andreas Pollreisz, Darren Shu Jeng Ting, Robert JT Morris, et al. Generative artificial intelligence in medicine. Nature Medicine, 31(10):3270–3282, 2025.

[12] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

[13] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022.

[14] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China, Novembe...

[15] Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

[16] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.

[17] Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Holistic evaluation of large language models for medical tasks with medhelm. Nature Medicine, pages 1–9, 2026.

[18] Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9):2613–2622, 2024.

[19] Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Leandra A Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer M Van Allen, David Kim, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine, 31(1):77–86, 2025.

[20] Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024.

[21] Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. Towards conversational diagnostic artificial intelligence. Nature, pages 1–9, 2025.

[22] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024.

[23] Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, et al. Evolving diagnostic agents in a virtual clinical environment. arXiv preprint arXiv:2510.24654, 2025.

[24] Stanford Medicine. ACGME core competencies | graduate medical education, 2026. Accessed: 2026-05-29.

[25] Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, and Karan Singhal. Healthbench professional: Evaluating large language models on real clinician chats, 2026.

[26] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

[27] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724, 2024.

[28] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[29] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023.

[30] Microsoft. Markitdown. https://github.com/microsoft/markitdown, 2026. GitHub repository. Version v0.1.5. Accessed April 13, 2026.

[31] Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, et al. Mineru2.5-pro: Pushing the limits of data-centric document parsing at scale. arXiv preprint arXiv:2604.04771, 2026.

[32] Laura Edgar, Sydney Roberts, and Eric Holmboe. Milestones 2.0: a step forward. Journal of Graduate Medical Education, 10(3):367–369, 2018.

[33] Laura Edgar, Sydney McLean, Sean O Hogan, Stan Hamstra, and Eric S Holmboe. The milestones guidebook. Accreditation Council for Graduate Medical Education, 2024(24):154, 2020.

[34] Anthropic. Introducing claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7. Accessed: 2026-5-12.

[35] OpenAI. Introducing gpt-5.5. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-5/. Accessed: 2026-5-12.

[36] Google. Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/. Accessed: 2026-5-12.

[37] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.

[38] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.

[39] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.

[40] Baichuan M3 Team. Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making, 2025.

5 Supplementary

[Figure: (a) share of all attachment files by modality (n = 22,244); cases containing ≥1 file of each modality (n = 1,073). Modalities listed include Archive / Other, Executable / Binary, Simulator program, Audio, Interactive cou...]

[41] 1. Demonstrate effective communication skills when disclosing medical error. 2. To be able to disclose medical error without blaming others. 3. To assume the responsibility of the error. 4. To be able to offer apology. 5. To recommend current and future actions after the medical error event. Target group: Residents all levels. Type of case: Communication/Assessment S...

[42] If simulatable == false, **or** the scenarios field is the empty array [], execute the "skip branch":
- Do not create any scenario directory
- Do not copy any file
- Do not generate any material for any role
- Write a phase2_NOT_APPLICABLE.md in the current working directory containing: simulatable, simulatable_reason, case_shape from phase1, the count ...
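The skip-branch test above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the `phase1` dictionary layout and the function names `should_skip` and `run_skip_branch` are assumptions, and the note writes only the three fields that are legible in the excerpt (the remaining fields are elided in the source and are not guessed here).

```python
from pathlib import Path


def should_skip(phase1: dict) -> bool:
    """True when simulatable is false OR the scenarios field is the empty array."""
    return phase1.get("simulatable") is False or phase1.get("scenarios") == []


def run_skip_branch(phase1: dict, workdir: Path) -> Path:
    """Write the not-applicable note; create no scenario directory, copy no file."""
    lines = [
        f"simulatable: {phase1.get('simulatable')}",
        f"simulatable_reason: {phase1.get('simulatable_reason')}",
        f"case_shape: {phase1.get('case_shape')}",
        # Further fields are listed in the source prompt but elided in the excerpt.
    ]
    out = workdir / "phase2_NOT_APPLICABLE.md"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Either condition alone triggers the skip, so a case marked simulatable but with no scenarios still produces only the note file.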

[43] If simulatable == true and scenarios is non-empty, continue with the "normal flow" below.

[Normal flow] What should each simulated scenario expected by the source material look like? If we strictly enact the content of the source material, and assume the entire process of each scenario is simulated through text, I need to prepare instruction files for the...

[44] Read every readable text file (.md / .txt) directly under `evaluator/`. Recurse into sub-folders of `evaluator/` if any exist.

[45] Read ONLY `evaluator/`. Do NOT read examinee/, sp_actor/, environment_controller/, or any pipeline product elsewhere (phase1_* / phase2_* / *_summary.md / *_packets_index.md / scenario*_NOT_APPLICABLE.md / CLAUDE.md / .codex_tmp_* / __MACOSX/ / files starting with ._ / .DS_Store / Thumbs.db / *.log / *.tmp).
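A minimal Python sketch of these file-reading rules follows. It is an assumption-laden illustration rather than the authors' code: the function name `collect_evaluator_files` and the `case_dir` layout are hypothetical, and applying the exclusion patterns inside `evaluator/` as well is a defensive choice made here, since the excerpt only states that those pipeline products must not be read.

```python
from fnmatch import fnmatch
from pathlib import Path

TEXT_SUFFIXES = {".md", ".txt"}
# Patterns copied from the exclusion list in the prompt excerpt.
EXCLUDED = [
    "phase1_*", "phase2_*", "*_summary.md", "*_packets_index.md",
    "scenario*_NOT_APPLICABLE.md", "CLAUDE.md", ".codex_tmp_*",
    "__MACOSX", "._*", ".DS_Store", "Thumbs.db", "*.log", "*.tmp",
]


def collect_evaluator_files(case_dir: Path) -> list:
    """Return .md/.txt files under evaluator/ (recursively), skipping excluded names."""
    root = case_dir / "evaluator"
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        parts = path.relative_to(root).parts
        # Skip a file if its own name or any parent folder matches an excluded pattern.
        if any(fnmatch(part, pat) for part in parts for pat in EXCLUDED):
            continue
        found.append(path)
    return found
```

Because the walk starts at `evaluator/`, sibling folders such as examinee/ or sp_actor/ are never opened at all.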

[46] Bilingual duplicates: if the same document exists as both an English file and a translated copy whose name only adds a language suffix (e.g. `Foo.md` and `Foo-zh.md`), use ONLY the English file and ignore the translated copy, so each concept is counted exactly once.

[Source of scoring items] (strong constraints; violating these pollutes downstream paper data)

[47] A scoring item must be a decidable statement about the examinee's behavior or judgment --- a form on which one can ask, "Did the examinee complete / make this?" The following content in the source text is NOT a scoring item and must not be extracted:
- Narrative facts (sentences that describe what happens in the case itself).
- Structural numbering or ste...

[48] Each scoring item must have a matching original sentence (or one with identical semantics) somewhere in the evaluator materials.

[49] Scoring items must preserve the original wording; rewriting, merging, abbreviating, or paraphrasing is forbidden.

[50] Do not invent scoring items that are not explicitly present in the evaluator materials. Extracting the rubric is the ONLY task here; there is no transcript and no examinee behavior to consider.

[51] If a competency dimension has no corresponding scoring item in the materials, simply produce no item for that dimension; never fabricate an item just to fill a dimension.

[52] Granularity --- extract scoring items at the source's OWN granularity and prefer the coarser, self-contained form. Do NOT fragment.
- A checklist row / checkbox line, a single numbered or bulleted list entry, or a line ending with a colon together with the detail lines that follow it, counts as ONE scoring item; do not break it into several items.
- When ...

[53] Objective changes naturally driven by the patient, the disease, or time: these must be proactively written into the in-world fields so that the examinee perceives the change.

[54] Content that occurs only if triggered by the examinee's decision (that is, matters the examinee should reason about, choose, or request on their own, including condition-acceptance anchors phrased in the materials as "if the examinee requests X, accept X"): these must not be restated in the in-world fields and must not be rewritten as hints. Only when the...

[56] The examinee explicitly calls that role in the current turn (consult request, paging a specialist, activating a code team, etc.).

[57] A trigger condition specified in the reference materials requires that role to appear.

When to remove a role: remove a role only when the reference materials explicitly state that the role leaves; otherwise keep the role present.

Opening turn: according to the reference materials, include the healthcare-team roles present at the opening (nurses, cons...

[58] Semantically understand the physician's actions without relying on a fixed action inventory.

[59] Return only non-verbal feedback, test results, treatment feedback, and system events that can be executed according to the reference materials.

[60] Maintain the progression index `progress_index` and the current scenario-state label `state_label`.

[61] When receiving the signal eos=true, determine from the reference materials whether a next state exists:
- If the reference materials describe a next patient state (such as vital-sign changes or new symptoms), advance to the next state and return its initial events/feedback
- If the reference materials do not contain a next-state change, mark should_end=true
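The progression rule can be illustrated with a small deterministic stand-in, written in Python. In the paper the environment controller is a language model reading the reference materials; the class below is only a sketch of the state logic. The `ScenarioController` name, the list-of-dicts state layout, and the `label` / `initial_events` keys are hypothetical, while `progress_index`, `state_label`, `eos`, and `should_end` are the names used in the prompt.

```python
from dataclasses import dataclass


@dataclass
class ScenarioController:
    """Deterministic stand-in for the LLM environment controller (illustrative only)."""

    states: list  # hypothetical layout: [{"label": str, "initial_events": [str, ...]}, ...]
    progress_index: int = 0

    @property
    def state_label(self) -> str:
        return self.states[self.progress_index]["label"]

    def step(self, eos: bool) -> dict:
        events = []
        should_end = False
        if eos:
            if self.progress_index + 1 < len(self.states):
                # A next patient state exists: advance and emit its opening events.
                self.progress_index += 1
                events = list(self.states[self.progress_index]["initial_events"])
            else:
                # No further state is described: the scenario ends.
                should_end = True
        return {
            "progress_index": self.progress_index,
            "state_label": self.state_label,
            "events": events,
            "should_end": should_end,
        }
```

With two states, the first `step(eos=True)` moves to the second state and returns its initial events, and the next `step(eos=True)` returns `should_end` set to true without changing the index.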

[62] Handle uncertain content conservatively; do not fabricate results absent from the reference materials.

Important: during routine feedback (eos=false), return only clinical feedback for the current state and do not advance the scenario. Scenario progression occurs only when eos=true is received.

[Scenario progression rules]
- You will receive the cu...

[63] Judge exactly the scoring items listed in the frozen rubric, each under exactly the dimension it is listed. Never add, remove, split, merge, rephrase, translate, or re-categorize an item.

[64] Every output key must be the verbatim original text of a supplied scoring item. Every supplied item must appear exactly once, in the output, under its given dimension.
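These constraints lend themselves to a mechanical check on the judge's output. The Python sketch below assumes a particular shape that the excerpt implies but does not fully spell out: the frozen rubric as a mapping from dimension name to a list of item texts, and the judge output as a mapping from dimension name to an object keyed by item text with boolean values. The function name `validate_judgment` is hypothetical.

```python
def validate_judgment(rubric: dict, output: dict) -> list:
    """Return a list of violations; an empty list means the output matches the rubric."""
    problems = []
    for dim in output:
        if dim not in rubric:
            problems.append(f"unexpected dimension: {dim!r}")
    for dim, items in rubric.items():
        judged = output.get(dim)
        if judged is None:
            problems.append(f"missing dimension: {dim!r}")
            continue
        for item in items:
            # Keys must be the verbatim item text, so a plain membership test suffices.
            if item not in judged:
                problems.append(f"missing item under {dim!r}: {item!r}")
        for key, value in judged.items():
            if key not in items:
                problems.append(f"item not listed under {dim!r}: {key!r}")
            elif not isinstance(value, bool):
                problems.append(f"non-boolean verdict for {key!r}")
    return problems
```

An item moved to another dimension shows up twice, once as missing from its own dimension and once as unlisted in the other. Duplicate keys cannot be detected after JSON parsing, because most parsers silently keep only the last one.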

[65] Do not derive new scoring items from the transcript, the action log, or the environment feedback. Do not move an item to a different dimension even if another dimension seems to fit better.

[66] A dimension listed with no scoring items must be output as an empty object; never back-fill it.

[Completion judgment (true/false)]

[67] Mark `true` only when the transcript / action log / environment feedback contains explicit positive evidence that the examinee performed or achieved that item.

[68] If the evidence is missing, indirect, vague, or only verbally mentioned without follow-through, mark `false`.
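Given per-item true/false verdicts, a rubric completion rate can be computed as the share of items marked true. The Python sketch below pools all items of one judgment and is an assumption about the aggregation: the excerpt does not say whether the paper pools items across cases or averages per-case rates, and the function name `completion_rate` and the dimension-to-items mapping are hypothetical.

```python
def completion_rate(judgment: dict) -> float:
    """Fraction of scoring items marked True, pooled over all dimensions."""
    verdicts = [done for items in judgment.values() for done in items.values()]
    if not verdicts:
        raise ValueError("no scoring items to aggregate")
    # Only an explicit True counts; anything else is treated as not completed.
    return sum(1 for done in verdicts if done is True) / len(verdicts)
```

Treating every non-true value as incomplete mirrors the conservative rule that missing or vague evidence is marked false.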

[69] Judge each item independently. Do not write overall summary verdicts such as "overall performance is good" or "essentially met".

[Requirements on `reasoning`]
- Keep it within 2-4 sentences total, tied to concrete behaviors / feedback / results observed in the simulation.
- Do not enumerate every scoring item; do not turn it into a long summary.

Output...

[70] In `speak`, write only what you say to the patient.

[71] In `actions`, write only non-verbal operations, such as physical examination, monitoring, testing, medication administration, and management actions.

[72] After receiving environment feedback, continue to advance the diagnosis and treatment process until the management loop is completed.

[73] If the materials already support immediate initiation of monitoring, examination, key tests, or treatment, do not keep asking history questions repeatedly without taking action.

Output format:
- Output only one JSON object
- The format must be {"speak": "...", "actions": ["...", "..."], "eos": false}
- When you believe everything that should be done in the...
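A harness consuming the clinical agent's turns would need to parse and check this JSON object. The Python sketch below shows one way to do that; the function name `parse_agent_turn` is hypothetical, and rejecting any key other than `speak`, `actions`, and `eos` is a strictness assumption made here rather than something the excerpt states.

```python
import json


def parse_agent_turn(raw: str) -> dict:
    """Parse one agent turn and check the speak / actions / eos fields."""
    turn = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(turn, dict):
        raise ValueError("expected a single JSON object")
    if set(turn) != {"speak", "actions", "eos"}:
        raise ValueError(f"unexpected keys: {sorted(turn)}")
    if not isinstance(turn["speak"], str):
        raise ValueError("speak must be a string")
    actions = turn["actions"]
    if not isinstance(actions, list) or not all(isinstance(a, str) for a in actions):
        raise ValueError("actions must be a list of strings")
    if not isinstance(turn["eos"], bool):
        raise ValueError("eos must be true or false")
    return turn
```

Failing loudly on a malformed turn keeps spoken text and non-verbal actions from being mixed up before they are routed to the patient agent and the environment controller.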

[74] The plot node described in the reference materials requires that role to appear at that time point.

[75] A trigger condition specified in the reference materials requires that role to appear.

[76] The patient develops the lack of language ability described under [Default and extension rules], and that family/companion/guardian role has already been explicitly mentioned in the reference materials.

When to remove a role: remove a role only when the reference materials explicitly state that the role leaves; otherwise keep the role present.

Open...