Recognition: unknown
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
Pith reviewed 2026-05-09 16:33 UTC · model grok-4.3
The pith
PhysicianBench shows top LLM agents succeed on only 46% of long-horizon clinical tasks in real EHR systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhysicianBench consists of 100 long-horizon tasks drawn from real primary-care and subspecialty consultation cases, each instantiated with actual patient records and accessed through the same standard APIs used by commercial EHR vendors. The tasks span diagnosis interpretation, medication prescribing, treatment planning, and documentation, with an average of 27 tool calls per task and 670 structured checkpoints for execution-grounded grading. Evaluation of 13 proprietary and open-source LLM agents shows the best pass@1 success rate reaches only 46%, while open-source models reach at most 19%.
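As a rough sketch of what execution-grounded grading and pass@1 could look like in code, consider the following; the checkpoint representation, the all-checkpoints-must-pass success rule, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): how execution-grounded
# checkpoints could roll up into per-task success and benchmark-level pass@1.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Checkpoint:
    """One verifiable stage of a task, graded against post-run EHR state."""
    name: str
    verify: Callable[[dict], bool]  # task-specific verification script

@dataclass
class Task:
    task_id: str
    checkpoints: List[Checkpoint]

def grade_task(task: Task, ehr_state: dict) -> Dict[str, bool]:
    """Run every checkpoint's verification against the environment state."""
    return {cp.name: cp.verify(ehr_state) for cp in task.checkpoints}

def task_success(results: Dict[str, bool]) -> bool:
    """Assumed rule: a task counts as solved only if every checkpoint passes."""
    return all(results.values())

def pass_at_1(per_task_success: List[bool]) -> float:
    """pass@1: fraction of tasks solved on a single attempt per task."""
    return sum(per_task_success) / len(per_task_success)
```

Under this reading, solving 46 of the 100 tasks on a single attempt each gives pass@1 = 0.46.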
What carries the argument
PhysicianBench, a benchmark of 100 tasks with 670 checkpoints that requires agents to retrieve heterogeneous data across encounters, reason over clinical information, execute consequential actions, and produce documentation inside a real EHR environment.
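To make concrete what executing such a task through standard EHR APIs involves, the sketch below shows the kind of read-then-act loop an agent must sustain against a FHIR-style server (the paper's environment is built on FHIR and a HAPI FHIR server, per its references); the base URL, patient identifier, and call sequence here are assumptions for illustration.

```python
# Illustrative read/write loop against a FHIR-style EHR API (assumed setup).
import requests

BASE = "http://localhost:8080/fhir"   # e.g., a local HAPI FHIR test server (assumed)
PATIENT_ID = "example-patient-id"     # hypothetical identifier

def fhir_get(resource: str, **params) -> dict:
    """Retrieval step: pull heterogeneous clinical data across encounters."""
    r = requests.get(f"{BASE}/{resource}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()

def fhir_post(resource: str, body: dict) -> dict:
    """Consequential action: create a resource such as a MedicationRequest."""
    r = requests.post(f"{BASE}/{resource}", json=body, timeout=30)
    r.raise_for_status()
    return r.json()

# A single benchmark task averages ~27 such calls chained together, e.g.:
encounters = fhir_get("Encounter", patient=PATIENT_ID)
labs = fhir_get("Observation", patient=PATIENT_ID, category="laboratory")
# ...the agent reasons over the retrieved bundles, then acts, e.g.:
# fhir_post("MedicationRequest", {...})   # prescribing step
# fhir_post("DocumentReference", {...})   # documentation step
```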
If this is right
- Agents need substantial gains in sustained multi-step reasoning over clinical data before they can reliably support documentation or decision-making in live settings.
- Progress toward autonomous clinical agents can now be tracked with verifiable, execution-based metrics rather than static knowledge tests.
- The benchmark's checkpoint structure can help developers isolate and address specific failure modes in data retrieval and action execution.
- Deployment of such agents in clinical practice would require performance thresholds well above the levels demonstrated by current models.
Where Pith is reading between the lines
- Comparable benchmarks in other regulated domains such as legal case management or financial compliance could expose similar limits in long-horizon agent performance.
- The performance gap might shrink if future models incorporate targeted training on clinical tool-use sequences, though the paper does not test this possibility.
- The task decomposition into checkpoints offers a practical way to design hybrid human-agent systems that hand off shorter segments rather than full workflows.
Load-bearing premise
The 100 tasks, adapted from real consultation cases and reviewed by physician panels, accurately represent the long-horizon composite workflows that characterize real clinical systems when executed through standard EHR APIs.
What would settle it
A study in which practicing physicians perform the identical 100 tasks through the same EHR interface and achieve success rates near 100 percent while current agents remain below 50 percent would support the claim that the benchmark exposes a genuine capability gap.
read the original abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PhysicianBench, a benchmark of 100 long-horizon tasks adapted from real primary-care-to-subspecialty consultation cases, each reviewed by a separate physician panel. Tasks are instantiated in a live EHR environment using standard vendor APIs, require an average of 27 tool calls, and are decomposed into 670 checkpoints verified by task-specific execution scripts. Evaluation across 13 proprietary and open-source LLM agents reports a best-case pass@1 success rate of 46% (proprietary) and at most 19% (open-source), which the authors interpret as evidence of a substantial gap between current agent capabilities and real-world clinical workflow demands.
Significance. If the benchmark's task set and verification procedures faithfully reproduce the error-prone, multi-encounter, incomplete-data realities of live EHR use, the work supplies a rare execution-grounded evaluation framework that moves beyond static knowledge or single-step intent benchmarks. The explicit use of real patient records, standard APIs, and script-verified checkpoints is a methodological strength that could enable reproducible progress tracking and targeted diagnosis of failure modes in long-horizon clinical reasoning.
major comments (3)
- [Abstract / Task Construction] Abstract and Task Construction (implied Section 3): the statement that tasks were 'adapted from real consultation cases' and 'independently reviewed by a separate panel of physicians' provides no quantitative mapping (e.g., fraction of original steps retained, inter-rater agreement on checkpoint definitions, or handling of free-text notes and missing data). Without these details the 46% success rate cannot be confidently interpreted as measuring the true capability gap for representative long-horizon workflows.
- [Checkpoint Verification] Checkpoint Verification (implied Section 4): the 670 checkpoints are graded by 'task-specific scripts with execution-grounded verification,' yet no description is given of how scripts handle edge cases such as partial data retrieval, ambiguous API responses, or clinically incomplete but syntactically correct actions. This directly affects whether the reported pass@1 rates over- or under-state agent readiness.
- [Results] Results (implied Section 5): the headline gap (46% best vs. 19% open-source) is presented without per-task or per-specialty breakdowns or error analysis that would show whether failures cluster on particular workflow types (e.g., multi-encounter data retrieval vs. medication prescribing). Such analysis is load-bearing for the claim that the benchmark reveals 'the demands of real-world clinical workflows.'
minor comments (2)
- [Abstract] The abstract states '21 specialties' and 'diverse workflow types' but does not list the exact distribution; a table or appendix would improve clarity.
- [Abstract] Notation for 'pass@1' and '670 checkpoints' is introduced without an explicit definition or reference to the supplementary material on the first use.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and revised the manuscript to provide additional details and analyses as requested. Our point-by-point responses are below.
read point-by-point responses
-
Referee: [Abstract / Task Construction] Abstract and Task Construction (implied Section 3): the statement that tasks were 'adapted from real consultation cases' and 'independently reviewed by a separate panel of physicians' provides no quantitative mapping (e.g., fraction of original steps retained, inter-rater agreement on checkpoint definitions, or handling of free-text notes and missing data). Without these details the 46% success rate cannot be confidently interpreted as measuring the true capability gap for representative long-horizon workflows.
Authors: We agree that providing quantitative details on the task adaptation and review process would improve the interpretability of our results. In the revised manuscript, we have added a new paragraph in Section 3 describing the adaptation methodology. This includes the average proportion of original consultation steps retained in the benchmark tasks, the inter-rater agreement scores from the physician panel on task and checkpoint definitions, and explicit protocols used for managing free-text clinical notes and missing data fields in the EHR. These revisions allow readers to better assess how faithfully the benchmark represents real-world long-horizon workflows. revision: yes
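As one way the promised inter-rater agreement figures could be computed, a minimal Cohen's kappa sketch over two physicians' categorical ratings of checkpoint definitions follows; the rating labels and sample data are assumptions, not the authors' review protocol.

```python
# Minimal Cohen's kappa over two raters' categorical judgments (assumed labels).
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of five checkpoint definitions by two physicians.
ratings_a = ["Clear", "Clear", "Ambiguous", "Clear", "Missing context"]
ratings_b = ["Clear", "Ambiguous", "Ambiguous", "Clear", "Missing context"]
print(round(cohens_kappa(ratings_a, ratings_b), 3))  # 0.688
```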
-
Referee: [Checkpoint Verification] Checkpoint Verification (implied Section 4): the 670 checkpoints are graded by 'task-specific scripts with execution-grounded verification,' yet no description is given of how scripts handle edge cases such as partial data retrieval, ambiguous API responses, or clinically incomplete but syntactically correct actions. This directly affects whether the reported pass@1 rates over- or under-state agent readiness.
Authors: We appreciate this observation, as transparency in verification is crucial for benchmark validity. We have revised Section 4 to include a comprehensive description of the verification scripts' logic for edge cases. Specifically, we detail how partial data retrieval is handled (requiring a minimum set of fields for checkpoint passage), how ambiguous API responses are managed (by treating them as failures unless the agent explicitly queries for clarification), and how clinically incomplete actions are distinguished from syntactically correct ones (via separate checkpoints for clinical appropriateness). This addition clarifies that our pass@1 rates are conservative estimates of agent readiness. revision: yes
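A minimal sketch of the three rules as described in this response, with field names and flags assumed for illustration rather than taken from the authors' scripts:

```python
# Sketch of the edge-case rules described above (all names are assumptions).
REQUIRED_FIELDS = {"medication_code", "dose", "route", "frequency"}

def passes_retrieval_checkpoint(retrieved_fields: set) -> bool:
    # Partial retrieval passes only if a minimum set of fields was obtained.
    return REQUIRED_FIELDS <= retrieved_fields

def passes_ambiguity_rule(response_was_ambiguous: bool,
                          agent_queried_clarification: bool) -> bool:
    # Ambiguous API responses count as failures unless the agent explicitly
    # queried for clarification before acting.
    return (not response_was_ambiguous) or agent_queried_clarification

def grade_action(action: dict) -> dict:
    # Syntactic validity and clinical appropriateness are separate checkpoints,
    # so a well-formed but clinically inappropriate order can still fail.
    return {
        "syntax_ok": REQUIRED_FIELDS <= set(action),
        "clinically_appropriate": bool(action.get("within_guideline_dose")),
    }
```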
-
Referee: [Results] Results (implied Section 5): the headline gap (46% best vs. 19% open-source) is presented without per-task or per-specialty breakdowns or error analysis that would show whether failures cluster on particular workflow types (e.g., multi-encounter data retrieval vs. medication prescribing). Such analysis is load-bearing for the claim that the benchmark reveals 'the demands of real-world clinical workflows.'
Authors: We concur that granular analysis strengthens the interpretation of the performance gap. In the updated Section 5, we have incorporated per-task success rates, breakdowns by specialty (e.g., cardiology vs. psychiatry), and a categorized error analysis. Failures are classified into categories such as data retrieval across encounters, reasoning over incomplete information, action execution errors, and documentation issues. The analysis reveals that errors are distributed across workflow types rather than clustering in one area, thereby supporting our conclusion regarding the challenges of real-world clinical workflows. We have also included visualizations of these breakdowns. revision: yes
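The kind of breakdown described here could be aggregated as in the sketch below; the record layout and the specific failure-mode labels are assumptions based on the categories named in this response.

```python
# Sketch: rolling graded runs up into per-specialty and per-failure-mode counts.
from collections import defaultdict

runs = [  # hypothetical graded runs
    {"specialty": "cardiology", "success": False, "failure_mode": "data_retrieval"},
    {"specialty": "cardiology", "success": True,  "failure_mode": None},
    {"specialty": "psychiatry", "success": False, "failure_mode": "action_execution"},
]

by_specialty = defaultdict(lambda: {"n": 0, "solved": 0})
failure_counts = defaultdict(int)

for run in runs:
    bucket = by_specialty[run["specialty"]]
    bucket["n"] += 1
    bucket["solved"] += int(run["success"])
    if run["failure_mode"]:
        failure_counts[run["failure_mode"]] += 1

for spec, b in by_specialty.items():
    print(f"{spec}: {b['solved']}/{b['n']} solved")
print(dict(failure_counts))
```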
Circularity Check
No circularity: benchmark evaluation is externally grounded
full rationale
The paper presents an empirical benchmark of 100 tasks adapted from real consultation cases, independently reviewed by physician panels, instantiated in live EHR environments via standard APIs, and scored by 670 execution-grounded checkpoints. No mathematical derivations, parameter fittings, self-referential definitions, or load-bearing self-citations appear in the provided text. Performance metrics (e.g., 46% pass@1) are direct, independent measurements from running 13 agents; they do not reduce to the inputs by construction. The central claim of a capability gap follows straightforwardly from these external evaluations without tautology or circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tasks adapted from real consultation cases and reviewed by separate physician panels represent the composite workflows of actual clinical practice.
Forward citations
Cited by 1 Pith paper
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
Reference graph
Works this paper leans on
-
[1]
Introducing claude opus 4.6
Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, February 2026
2026
-
[2]
Introducing claude opus 4.7
Anthropic. Introducing claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, April 2026
2026
-
[3]
Introducing claude sonnet 4.6
Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026
2026
-
[4]
HealthBench: Evaluating Large Language Models Towards Improved Human Health
Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025
2025
-
[5]
Holistic Evaluation of Large Language Models for Medical Tasks with MedHELM
Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine, pages 1–9, 2026
2026
-
[6]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
2021
-
[7]
A new paradigm for accelerating clinical data science at Stanford Medicine, 2020
Somalee Datta, Jose Posada, Garrick Olson, Wencheng Li, Ciaran O’Reilly, Deepa Balraj, Joseph Mesterhazy, Joseph Pallas, Priyamvada Desai, and Nigam Shah. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv preprint arXiv:2003.10534, 2020
-
[8]
Deepseek-v4: Towards highly efficient million-token context intelligence
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. Technical Report, 2026
2026
-
[9]
Gemini 3.1 pro: A smarter model for your most complex tasks
Google DeepMind. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, February 2026
2026
-
[10]
FHIR: Fast healthcare interoperability resources
Health Level Seven International. FHIR: Fast healthcare interoperability resources. https://hl7.org/fhir/
-
[11]
HealthBench Professional: Evaluating large language models on real clinician chats
Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, et al. HealthBench Professional: Evaluating large language models on real clinician chats. Technical report, OpenAI, 2026
2026
-
[12]
National comparison of ambulatory physician electronic health record use across specialties. Journal of General Internal Medicine, 39(14):2868–2870, 2024
A Jay Holmgren, Christine A Sinsky, Lisa Rotenstein, and Nate C Apathy. National comparison of ambulatory physician electronic health record use across specialties. Journal of General Internal Medicine, 39(14):2868–2870, 2024
2024
-
[13]
MedAgentBench: a virtual EHR environment to benchmark medical LLM agents
Yixing Jiang, Kameron C Black, Gloria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2(9):AIdbp2500144, 2025
2025
-
[14]
SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024
2024
-
[15]
What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
2021
-
[16]
PubMedQA: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019
2019
-
[17]
AI-based clinical decision support for primary care: A real-world study
Robert Korom, Sarah Kiptinness, Najib Adan, Kassim Said, Catherine Ithuli, Oliver Rotich, Boniface Kimani, Irene King’ori, Stellah Kamau, Elizabeth Atemba, et al. AI-based clinical decision support for primary care: A real-world study. arXiv preprint arXiv:2507.16947, 2025
-
[18]
FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering
Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Jong Ha Lee, et al. FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering. In Machine Learning for Health, 2025
2025
-
[19]
The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025
-
[20]
AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization
Yusheng Liao, Chuan Xuan, Yutong Cai, Lina Yang, Zhe Chen, Yanfeng Wang, and Yu Wang. AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization. arXiv preprint arXiv:2601.13918, 2026
-
[21]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023
2023
-
[22]
Minimax m2.7: Early echoes of self-evolution
MiniMax. Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, March 2026. Accessed: 2026-04-22
2026
-
[23]
Kimi k2.6: Advancing open-source coding
Moonshot AI. Kimi k2.6: Advancing open-source coding. https://www.kimi.com/blog/kimi-k2-6, April 2026. Accessed: 2026-04-22
2026
-
[24]
What is a medical decision? A taxonomy based on physician statements in hospital encounters: a qualitative study. BMJ Open, 6(2):e010098, 2016
Eirik H Ofstad, Jan C Frich, Edvin Schei, Richard M Frankel, and Pål Gulbrandsen. What is a medical decision? A taxonomy based on physician statements in hospital encounters: a qualitative study. BMJ Open, 6(2):e010098, 2016
2016
-
[25]
Introducing gpt-5.4
OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026
2026
-
[26]
Introducing gpt-5.5
OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5/, April 2026
2026
-
[27]
MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022
2022
-
[28]
The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models
Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025
2025
-
[29]
Qwen3.6-plus: Towards real world agents
Qwen Team. Qwen3.6-plus: Towards real world agents. https://qwen.ai/blog?id=qwen3.6, April 2026
2026
-
[30]
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024
-
[31]
EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C Ho, Carl Yang, and May Dongmei Wang. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22315–22339, 2024
2024
-
[32]
Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
2023
-
[33]
Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Annals of Internal Medicine, 165(11):753–760, 2016
Christine Sinsky, Lacey Colligan, Ling Li, Mirela Prgomet, Sam Reynolds, Lindsey Goeders, Johanna Westbrook, Michael Tutty, and George Blike. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Annals of Internal Medicine, 165(11):753–760, 2016
2016
-
[34]
HAPI FHIR JPA Server Starter, 2024
Smile CDR. HAPI FHIR JPA Server Starter, 2024. Accessed: 2026-03-11
2024
-
[35]
MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453, 2025
-
[36]
Grok 4.20 model documentation. https://openrouter.ai/x-ai/grok-4.20, 2026
xAI. Grok 4.20 model documentation. https://openrouter.ai/x-ai/grok-4.20, 2026
2026
-
[37]
Mimo-v2.5-pro. https://mimo.xiaomi.com/mimo-v2-5-pro, 2026
Xiaomi. Mimo-v2.5-pro. https://mimo.xiaomi.com/mimo-v2-5-pro, 2026
2026
-
[38]
TheAgentCompany: Benchmarking LLM agents on consequential real world tasks
Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Zhiruo Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025
2025
-
[39]
τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[40]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022
2022
-
[41]
WebArena: A realistic web environment for building autonomous agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024
2024