Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3
The pith
Large language models complete only 60.4 percent of expert rubric items in dynamic clinical encounters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedSP1000 converts peer-reviewed standardized patient teaching cases into closed-loop executable scenarios. When a range of LLMs is evaluated against the original expert rubrics, the best model completes only 60.4 percent of required items and the strongest medically specialized model reaches 40.0 percent, indicating that static benchmarks miss clinically relevant failure modes.
What carries the argument
MedSP1000, an interactive benchmark that executes standardized patient cases in closed loop with a patient agent and environment controller while scoring against human-validated rubrics.
If this is right
- Static single-turn medical benchmarks do not predict success in multi-turn clinical management.
- Extra test-time compute does not improve rubric completion rates on these cases.
- Current LLMs and medically specialized agents fall short of the reliability needed for actual clinical integration.
- Process-level SP-style evaluation exposes failure modes that single-turn tests overlook.
Where Pith is reading between the lines
- If the benchmark's validity holds, development of clinical agents should prioritize long-horizon adaptation and information-gathering consistency over single-answer accuracy.
- The same simulation infrastructure could be reused to test whether human clinicians also drop below 100 percent on the same rubrics, providing a direct human baseline.
- Extending the cases to include rare or ambiguous presentations would test whether the observed performance gap widens under greater uncertainty.
Load-bearing premise
The closed-loop simulation with a patient agent and the peer-reviewed rubrics produces trajectories that validly represent clinical decision-making quality.
What would settle it
A model that scores below 60 percent on MedSP1000 but still meets or exceeds human clinician performance in a controlled real-patient trial would falsify the claim that the benchmark reveals clinically relevant unreliability.
Original abstract
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
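The headline figure is a share of expert-defined rubric items completed. The abstract does not say whether that share is pooled over all items or averaged per case, so the sketch below shows both readings. The data shape, function names, and example values are assumptions made for illustration and are not taken from the paper. For scale, 24,602 rubrics over 1,638 cases works out to roughly 15 items per case.

```python
from typing import Dict, List

# Hypothetical data shape: case id -> one boolean per rubric item
# (True = item judged completed). Not the paper's actual format.
Judgments = Dict[str, List[bool]]


def micro_completion(judgments: Judgments) -> float:
    """Share of all rubric items completed, pooled across cases."""
    total = sum(len(items) for items in judgments.values())
    done = sum(sum(items) for items in judgments.values())
    return done / total if total else 0.0


def macro_completion(judgments: Judgments) -> float:
    """Mean of per-case completion rates; cases with no items are skipped."""
    rates = [sum(items) / len(items) for items in judgments.values() if items]
    return sum(rates) / len(rates) if rates else 0.0


# Illustrative values only.
example = {
    "case_a": [True, True, False, False],  # 2 of 4 items
    "case_b": [True, False],               # 1 of 2 items
    "case_c": [True, True, True, True],    # 4 of 4 items
}
print(f"micro: {micro_completion(example):.3f}")
print(f"macro: {macro_completion(example):.3f}")
```

The two numbers diverge whenever cases carry different numbers of rubric items, so a reported percentage is only comparable across papers once the averaging convention is known.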
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedSP1000, a benchmark consisting of 1,638 standardized patient (SP) cases converted into executable interactive scenarios with 24,602 peer-reviewed rubrics. It evaluates a range of general and medically specialized LLMs as clinical agents in closed-loop simulations against a patient agent and environment controller, reporting that the best model (GPT-5.5) completes only 60.4% of rubric items while medically specialized models reach at most 40.0%, with no gains from increased test-time compute. The authors conclude that current LLMs are not reliable enough for safe integration into clinical practice and that static benchmarks miss clinically relevant failure modes.
Significance. If the simulation trajectories and rubric scores validly proxy real clinical encounters, the work would demonstrate that dynamic, multi-turn evaluation reveals important limitations not captured by single-turn benchmarks, providing a concrete path toward more realistic assessment of clinical agents.
major comments (2)
- [Abstract] The central claim that LLMs are not yet reliable for clinical practice rests on rubric scores from closed-loop trajectories with an LLM-based patient agent. No calibration data, inter-rater agreement metrics with human SPs, or correlation with real SP outcomes are referenced, leaving open the possibility that the 60.4% ceiling reflects simulation artifacts rather than clinical capability (Abstract and evaluation description).
- [Abstract] The manuscript provides no details on the patient-agent implementation, environment controller mechanics, rubric application process, or simulation validation steps. These omissions make it impossible to assess whether the reported performance differences are reproducible or clinically meaningful (evaluation run description).
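One form the agreement check requested in the first comment could take is Cohen's kappa between the automated rubric judgments and a human rater's judgments on the same items. The sketch below is a generic implementation for binary labels. The function name and the example labels are hypothetical, and nothing here reflects a validation the authors actually ran.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # rater A's rate of "completed"
    p_b = sum(b) / n  # rater B's rate of "completed"
    # Agreement expected by chance if the raters labeled independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined here (both raters used one identical label
        # throughout); returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only: automated judge vs. human rater on 8 rubric items.
auto_judge = [True, True, False, False, True, False, True, True]
human_rater = [True, False, False, False, True, False, True, True]
print(cohens_kappa(auto_judge, human_rater))
```

Kappa corrects raw agreement for chance, which matters when most rubric items share one label; raw agreement alone can look high even if the automated judge adds little.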
Simulated Author's Rebuttal
We thank the referee for the constructive comments on validation and reproducibility. We respond to each major point below and indicate where revisions will be made.
Point-by-point responses
- Referee: [Abstract] The central claim that LLMs are not yet reliable for clinical practice rests on rubric scores from closed-loop trajectories with an LLM-based patient agent. No calibration data, inter-rater agreement metrics with human SPs, or correlation with real SP outcomes are referenced, leaving open the possibility that the 60.4% ceiling reflects simulation artifacts rather than clinical capability (Abstract and evaluation description).
  Authors: We agree that the absence of direct calibration data, inter-rater agreement with human SPs, or correlation to real clinical outcomes is a substantive limitation. The rubrics originate from peer-reviewed SP teaching cases with established use in medical education, but the closed-loop use of LLM patient agents may introduce artifacts. In revision we will add an explicit limitations subsection that acknowledges this gap, qualifies the language on clinical reliability, and outlines planned future validation against human SPs. The core empirical observation, that dynamic multi-turn evaluation surfaces failure modes missed by static benchmarks, remains supported by the reported results.
  Revision: partial
- Referee: [Abstract] The manuscript provides no details on the patient-agent implementation, environment controller mechanics, rubric application process, or simulation validation steps. These omissions make it impossible to assess whether the reported performance differences are reproducible or clinically meaningful (evaluation run description).
  Authors: The initial submission did not provide sufficient implementation detail. The full manuscript contains a methods section describing the patient agent (LLM prompted with the fixed SP script and conversation history), the environment controller (rule-based state machine that updates clinical variables and terminates the encounter), and rubric scoring (hybrid string matching plus LLM-assisted classification with human adjudication on ambiguous cases). To improve reproducibility we will expand this section with pseudocode, exact prompting templates, and a description of the simulation-validation steps already performed on a held-out subset of trajectories.
  Revision: yes
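The three components named in the second response (patient agent, rule-based environment controller, hybrid rubric scoring) can be sketched as a small closed loop. This is a minimal illustration under stated assumptions, not the authors' implementation: every class, function, field, and example value is hypothetical, the LLM-backed agents are replaced by plain functions so the code runs without model access, and the fallback to `None` stands in for "send to human adjudication".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Log = List[Tuple[str, str]]  # (role, text) pairs in encounter order


@dataclass
class Turn:
    speak: str            # what the clinical agent says to the patient
    actions: List[str]    # non-verbal operations (exams, tests, treatment)
    done: bool = False    # agent signals it has finished the current state


@dataclass
class EnvironmentController:
    """Rule-based state machine; each state maps an action to its result."""
    states: List[Dict[str, str]]
    index: int = 0

    def feedback(self, actions: List[str]) -> List[str]:
        table = self.states[self.index]
        return [table[a] for a in actions if a in table]

    def advance(self) -> bool:
        """Step to the next patient state; True means the encounter is over."""
        if self.index + 1 < len(self.states):
            self.index += 1
            return False
        return True


def run_encounter(clinician: Callable[[str, Log], Turn],
                  patient: Callable[[str, Log], str],
                  env: EnvironmentController,
                  max_turns: int = 20) -> Log:
    log: Log = []
    observation = "Encounter begins."
    for _ in range(max_turns):
        turn = clinician(observation, log)
        reply = patient(turn.speak, log)   # a real agent would also see the SP script
        results = env.feedback(turn.actions)
        log.append(("clinician", turn.speak))
        log.extend(("action", a) for a in turn.actions)
        log.append(("patient", reply))
        log.extend(("result", r) for r in results)
        observation = " ".join([reply, *results])
        if turn.done and env.advance():
            break
    return log


def score_item(keywords: List[str], log: Log,
               classifier: Optional[Callable[[Log], Optional[bool]]] = None
               ) -> Optional[bool]:
    """String match first, then an optional classifier; None = needs a human."""
    text = " ".join(t.lower() for role, t in log if role in ("clinician", "action"))
    if all(k.lower() in text for k in keywords):
        return True
    return classifier(log) if classifier else None


# Stub agents with illustrative content only.
def demo_clinician(observation: str, log: Log) -> Turn:
    if not log:
        return Turn("What brings you in today?", ["check vital signs"])
    return Turn("I will order an ECG now.", ["order ecg"], done=True)


def demo_patient(utterance: str, log: Log) -> str:
    return "My chest has hurt for an hour."


env = EnvironmentController(states=[{"check vital signs": "HR 110, BP 150/90",
                                     "order ecg": "ST elevation in II, III, aVF"}])
log = run_encounter(demo_clinician, demo_patient, env)
print(score_item(["ecg"], log))      # keyword found in the clinician's turns
print(score_item(["aspirin"], log))  # no match and no classifier supplied
```

Keeping the controller deterministic while the two agents are model-driven is one way to make runs comparable across models; whether the paper's controller is fully deterministic is not stated in the material reviewed here.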
Circularity Check
No significant circularity; purely empirical evaluation
Full rationale
The paper conducts an empirical benchmark evaluation of LLMs on MedSP1000, which converts existing peer-reviewed SP cases into executable scenarios scored against human-validated rubrics. No derivations, equations, fitted parameters, or predictions are present that could reduce outputs to inputs by construction. Performance metrics such as the 60.4% rubric completion rate are direct measurements against external expert criteria, with no self-citation chains or ansatzes invoked to justify the central claim. The evaluation is self-contained against the provided rubrics and cases.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standardized patient cases and associated rubrics accurately capture the requirements of real clinical encounters.