pith · machine review for the scientific record

arxiv: 2605.07058 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

Yicheng Gao, Xiaolin Zhou, Yahan Li, Yue Zhao, Ruishan Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · clinical diagnosis · POMDP · noisy environments · medical exams · cost-efficient strategies · synthetic clinical data

The pith

An LLM agent trained on noisy simulated patient interactions matches larger models' diagnostic accuracy while ordering fewer and cheaper exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing medical AI benchmarks often reduce diagnosis to single-turn questions or noise-free sequences, missing the back-and-forth uncertainty of real clinics. This paper treats the full process as a partially observable decision problem in which an agent must question the patient, order medical exams as tool calls, and issue a final diagnosis, while seven types of patient noise and three types of exam noise can appear at any step. The agent is first fine-tuned on synthetic conversations generated to follow a standard clinical interview structure, then further trained to maximize a reward that scores diagnostic correctness, tool-use quality, and combined financial and discomfort cost. Experiments and ablations show the resulting agent matches the diagnostic performance of much larger models while following more economical examination strategies.

Core claim

Clinical diagnosis is formalized as a POMDP whose actions are patient questions, tool-call medical exams, and diagnosis issuance. A noise model adds seven patient noise types and three exam noise types. Synthetic data structured after the Calgary-Cambridge interview model is used for supervised fine-tuning, followed by DAPO optimization of a composite reward for accuracy, tool quality, and exam cost. The trained MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.
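The POMDP interaction described above can be sketched as a minimal environment loop. This is an illustrative reconstruction, not the authors' implementation: the noise-type names, the rates p_conv and p_exam, and the patient/exam stubs are all assumptions.

```python
import random
from dataclasses import dataclass

# Names of the 7 patient-noise and 3 exam-noise types are assumed for
# illustration; the paper specifies the counts but this sketch invents labels.
PATIENT_NOISE = ["vague answer", "omission", "contradiction", "irrelevant detail",
                 "recall error", "exaggeration", "refusal"]
EXAM_NOISE = ["false negative", "false positive", "missing result"]

@dataclass
class DiagnosisEnv:
    disease: str              # latent state d, hidden from the agent
    p_conv: float = 0.2       # patient-noise rate (value assumed)
    p_exam: float = 0.1       # exam-noise rate (value assumed)
    total_cost: float = 0.0   # accumulated financial + discomfort cost

    def step(self, action: dict) -> dict:
        """Advance one turn; the three action types mirror the POMDP's actions."""
        if action["type"] == "ask":
            obs = self._patient_reply(action["question"])
            if random.random() < self.p_conv:
                obs = self._corrupt(obs, random.choice(PATIENT_NOISE))
            return {"obs": obs, "done": False}
        if action["type"] == "exam":
            result, cost = self._run_exam(action["exam"])
            self.total_cost += cost
            if random.random() < self.p_exam:
                result = self._corrupt(result, random.choice(EXAM_NOISE))
            return {"obs": result, "done": False}
        # "diagnose" ends the episode; a reward can then score correctness vs. cost
        correct = action["diagnosis"] == self.disease
        return {"obs": None, "done": True, "correct": correct, "cost": self.total_cost}

    # Trivial stand-ins for the simulated patient and exam tools:
    def _patient_reply(self, question: str) -> str:
        return f"patient answers {question!r}"

    def _run_exam(self, name: str) -> tuple:
        return f"{name}: result unremarkable", 10.0   # (report, cost)

    def _corrupt(self, text: str, noise: str) -> str:
        return f"[{noise}] {text}"
```

The point of the sketch is the structure: noise is injected per step at fixed rates, so the agent's policy must stay useful under corrupted observations rather than memorize clean trajectories.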

What carries the argument

A POMDP environment equipped with explicit patient-noise and exam-noise types, plus a two-stage training pipeline of supervised fine-tuning on structured synthetic dialogues followed by reward optimization that trades diagnostic accuracy against exam cost and patient discomfort.
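One way the composite reward could combine the three terms named above. The functional form and the weights are assumptions for illustration; the paper's exact objective may differ.

```python
def composite_reward(correct: bool, valid_tool_calls: int, total_tool_calls: int,
                     financial_cost: float, discomfort_cost: float,
                     w_acc: float = 1.0, w_tool: float = 0.2,
                     w_cost: float = 0.01) -> float:
    """Illustrative accuracy + tool-quality - cost reward; weights assumed."""
    # Accuracy term: binary diagnostic correctness.
    acc_term = w_acc * (1.0 if correct else 0.0)
    # Tool-quality term: fraction of well-formed / appropriate exam calls.
    tool_term = w_tool * (valid_tool_calls / total_tool_calls if total_tool_calls else 0.0)
    # Cost term: financial expense plus patient discomfort, penalized.
    cost_term = w_cost * (financial_cost + discomfort_cost)
    return acc_term + tool_term - cost_term
```

Under any such form, the cost weight directly tunes the accuracy-versus-economy trade-off that the training is claimed to resolve.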

If this is right

  • Smaller LLMs can reach strong diagnostic performance without scaling model size, provided the training environment includes realistic interaction noise and cost penalties.
  • Diagnostic agents can be explicitly trained to reduce unnecessary tests, lowering both financial expense and patient discomfort.
  • The same POMDP-plus-noise setup supports ablation checks that isolate the contribution of the noise model versus the reward terms.
  • The approach yields agents that adapt questioning and exam ordering to incomplete or noisy information received mid-conversation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The work implies that high-fidelity simulation of clinical noise may reduce the need for large volumes of real patient data when training medical agents.
  • The method could transfer to other interactive, high-stakes domains that require balancing information gathering against action costs, such as legal intake or technical troubleshooting.
  • If the noise model generalizes, future agents might be deployed in live clinical support roles while still respecting cost and comfort constraints.

Load-bearing premise

The chosen set of seven patient noise types, three exam noise types, and Calgary-Cambridge-structured synthetic conversations sufficiently represents the interactive uncertainty of actual clinical diagnosis.

What would settle it

Running the trained agent on transcripts of real doctor-patient encounters that include genuine exam results and final diagnoses would show whether accuracy and cost patterns hold or degrade.
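Such a transfer check could be harnessed roughly as follows. The transcript fields and cost accounting are assumptions, not a description of any existing dataset or of the paper's evaluation code.

```python
def evaluate_on_transcripts(agent_diagnose, transcripts):
    """Compare an agent's diagnoses and ordered-exam costs against real records.

    Each transcript is assumed to carry a dialogue, per-exam costs for the
    exams actually taken, and the recorded final diagnosis (fields hypothetical).
    """
    hits, agent_cost, record_cost = 0, 0.0, 0.0
    for t in transcripts:
        pred, exams_ordered = agent_diagnose(t["dialogue"])
        hits += int(pred == t["diagnosis"])
        # Cost of the exams the agent would have ordered, priced from the record:
        agent_cost += sum(t["exam_costs"].get(e, 0.0) for e in exams_ordered)
        # Cost of the exams actually performed in the encounter:
        record_cost += sum(t["exam_costs"].values())
    n = len(transcripts)
    return {"accuracy": hits / n,
            "mean_agent_cost": agent_cost / n,
            "mean_record_cost": record_cost / n}
```

A degradation in accuracy, or an agent cost that no longer undercuts the record cost, would be the failure signal.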

Figures

Figures reproduced from arXiv: 2605.07058 by Ruishan Liu, Xiaolin Zhou, Yahan Li, Yicheng Gao, Yue Zhao.

Figure 1
Figure 1: MedExAgent overview. (a) Tool-augmented diagnostic reasoning enables correct diagnoses where conversation-only baselines fail. (b) On the OOD test set AgentClinic-MedQA [35], our 8B agent matches much larger baselines on diagnosis accuracy. (Disclaimer from the paper: MedExAgent is a research prototype and must not be used for medical advice or patient care.) view at source ↗
Figure 2
Figure 2: Method overview. Top left: the interactive diagnosis POMDP, with latent disease d generating noisy observations ω_t from agent actions a_t. Top right: two-stage finetuning (SFT, then DAPO RL). Bottom: doctor–patient conversations follow the five-stage Calgary–Cambridge model, with patient and exam noise injected at rates p_conv and p_exam. view at source ↗
read the original abstract

Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interaction with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, MedExAgent, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. It introduces a systematic noise model with seven patient noise types and three exam noise types. MedExAgent is trained in two stages—supervised fine-tuning on synthetic conversations structured after the Calgary-Cambridge model, followed by DAPO optimization of a composite reward for diagnostic accuracy, tool call quality, and exam costs (financial and discomfort). Through experiments and ablations, the authors claim MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

Significance. If the synthetic environment and results hold under real conditions, this work meaningfully advances interactive medical LLM agents by addressing multi-turn questioning, exam ordering, and noise handling that current single-turn or noise-free benchmarks overlook. The POMDP formulation and two-stage SFT+DAPO pipeline offer a principled way to optimize for both accuracy and efficiency, with potential for more realistic clinical AI training.

major comments (3)
  1. [Environment Definition and Experiments sections] The central performance claims (comparable diagnostic accuracy and cost-efficiency) rest on experiments conducted entirely within an author-defined synthetic POMDP whose noise distributions (seven patient noise types and three exam noise types) and Calgary-Cambridge-structured dialogues receive no external validation, clinician review, or transfer testing to real clinical transcripts; this is load-bearing because the reported gains over baselines could be artifacts of the closed simulation rather than robust handling of genuine uncertainty.
  2. [Abstract and Results] The abstract and results presentation provide no quantitative metrics (e.g., exact accuracy percentages, cost values, statistical significance, error bars, or dataset sizes), making it impossible to assess the strength of the claim that MedExAgent matches larger models; this gap prevents verification of the ablation studies and cross-model comparisons.
  3. [Training Pipeline (DAPO optimization)] The composite reward in the DAPO stage balances accuracy, tool quality, and costs, but the manuscript does not report sensitivity analysis on the weighting hyperparameters or how they interact with the specific noise model; without this, it is unclear whether the observed cost-efficiency is a general property or tuned to the synthetic environment.
minor comments (2)
  1. [Abstract] Define all acronyms (POMDP, SFT, DAPO, etc.) at first use in the abstract and introduction for accessibility.
  2. [Noise Model] The description of the seven patient and three exam noise types would benefit from a table summarizing each type with examples to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, proposing revisions where they strengthen the manuscript without misrepresenting our synthetic evaluation framework.

read point-by-point responses
  1. Referee: [Environment Definition and Experiments sections] The central performance claims (comparable diagnostic accuracy and cost-efficiency) rest on experiments conducted entirely within an author-defined synthetic POMDP whose noise distributions (seven patient noise types and three exam noise types) and Calgary-Cambridge-structured dialogues receive no external validation, clinician review, or transfer testing to real clinical transcripts; this is load-bearing because the reported gains over baselines could be artifacts of the closed simulation rather than robust handling of genuine uncertainty.

    Authors: We acknowledge that all experiments are performed in a controlled synthetic POMDP and that the current manuscript contains no external validation, clinician review, or transfer to real transcripts. The synthetic setting was chosen deliberately to enable systematic isolation of noise effects and cost trade-offs under the POMDP formulation. In the revision we will add an expanded limitations subsection that explicitly states this scope, provides further justification for the noise parameters drawn from clinical literature, and outlines future directions for real-world transfer testing. We do not claim the results generalize beyond the defined environment. revision: partial

  2. Referee: [Abstract and Results] The abstract and results presentation provide no quantitative metrics (e.g., exact accuracy percentages, cost values, statistical significance, error bars, or dataset sizes), making it impossible to assess the strength of the claim that MedExAgent matches larger models; this gap prevents verification of the ablation studies and cross-model comparisons.

    Authors: We agree that the absence of concrete numerical results in the abstract and results sections hinders evaluation. In the revised manuscript we will insert the missing quantitative details—including exact diagnostic accuracy percentages, mean and variance of exam costs, statistical significance tests, error bars across runs, and dataset sizes—into both the abstract and the main results tables and figures. revision: yes

  3. Referee: [Training Pipeline (DAPO optimization)] The composite reward in the DAPO stage balances accuracy, tool quality, and costs, but the manuscript does not report sensitivity analysis on the weighting hyperparameters or how they interact with the specific noise model; without this, it is unclear whether the observed cost-efficiency is a general property or tuned to the synthetic environment.

    Authors: We accept that reporting sensitivity to the reward weights is necessary for assessing robustness. We will conduct additional experiments varying the accuracy, tool-quality, and cost coefficients, analyze their interaction with the seven patient and three exam noise types, and include the results together with a discussion of stability in the revised manuscript. revision: yes
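The promised sensitivity analysis amounts to a grid sweep over the reward coefficients. A minimal sketch, with `train_and_eval` standing in for a full SFT+DAPO run and all weight grids assumed:

```python
from itertools import product

def reward_weight_sweep(train_and_eval, acc_weights=(0.5, 1.0, 2.0),
                        tool_weights=(0.1, 0.2), cost_weights=(0.005, 0.01, 0.02)):
    """Grid over (w_acc, w_tool, w_cost); weight values here are placeholders.

    `train_and_eval(w_acc, w_tool, w_cost)` is a hypothetical callable that
    trains with those reward weights and returns (accuracy, mean exam cost).
    """
    rows = []
    for w_acc, w_tool, w_cost in product(acc_weights, tool_weights, cost_weights):
        accuracy, mean_cost = train_and_eval(w_acc, w_tool, w_cost)
        rows.append({"w_acc": w_acc, "w_tool": w_tool, "w_cost": w_cost,
                     "accuracy": accuracy, "mean_cost": mean_cost})
    return rows
```

Stability of accuracy and cost across neighboring grid cells, under each noise configuration, is what would distinguish a general property from tuning to the synthetic environment.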

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core pipeline consists of defining a POMDP formulation for diagnosis, specifying a noise model with seven patient and three exam types, generating synthetic dialogues structured on the Calgary-Cambridge model, performing standard SFT, and then applying DAPO to optimize a composite reward. All performance claims are empirical evaluations inside this self-created simulation; no equations reduce diagnostic accuracy or cost metrics to fitted parameters defined by the same data, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming patterns collapse the claimed results to the inputs by construction. The derivation remains self-contained as an empirical training procedure rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified assumption that the synthetic environment and noise model are representative; no free parameters, axioms, or invented entities are explicitly introduced beyond standard LLM and RL components.

pith-pipeline@v0.9.0 · 5558 in / 1146 out tokens · 34582 ms · 2026-05-11T02:24:38.150352+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1] Association of American Medical Colleges. Physician workforce projections. AAMC Data and Reports, 2024.

  2. [2] A. Balachandran. MedEmbed: Medical-focused embedding models. https://github.com/abhinand5/MedEmbed, 2024.

  3. [3] E. P. Balogh. The diagnostic process. In Improving Diagnosis in Health Care. National Academies Press, Dec. 2015.

  4. [4] J. A. Baron, C. M. S.-B. Johnson, M. A. Schor, D. Olley, L. Nickel, V. Felix, S. M. Bello, C. Greene, R. Lichenstein, K. Bisordi, R. Koka, C. Bearer, R. Macatangay, N. Ada, K. Ballenger, E. Bliss, L. Colliver, G. Dobbins, H. Heitzig, S. Dixon, P. Semesky, J. Garth, M. Fairchild, P. Gaskin, S. Zahid, R. Castillo, S. Edwards, A. Widjaja, Y. Usui, E. Lynch...

  5. [5] Centers for Medicare & Medicaid Services. CY 2025 Q4 Clinical Laboratory Fee Schedule public use file (25CLABQ4). CMS Change Request 14211, Oct. 2025. https://www.cms.gov/medicare/payment/fee-schedules/clinical-laboratory-fee-schedule-clfs/files/25clabq4

  6. [6] Centers for Medicare & Medicaid Services. CY 2026 Physician Fee Schedule Relative Value Files (RVU26A), Jan. 2026. https://www.cms.gov/medicare/payment/fee-schedules/physician/pfs-relative-value-files/rvu26a

  7. [7] J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and B. Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs, 2024.

  8. [8] X. Chen, H. Zhou, H. Yi, M. You, W. Liu, L. Wang, Z. Qin, H. Li, X. Zhang, Y. Guo, S. Li, Y. Hu, Q. Xiong, R. Li, L. Fan, Q. Lao, W. Fu, J. Li, and K. Li. Grounding large language models in clinical diagnostics. Nature Communications, Mar. 2026.

  9. [9] A. V. Eriksen, S. Möller, and J. Ryg. Use of GPT-4 to diagnose complex clinical cases. NEJM AI, 1(1), Dec. 2023.

  10. [10] A. Fansi Tchango, R. Goel, Z. Wen, J. Martel, and J. Ghosn. DDxPlus: A new dataset for automatic medical diagnosis. Advances in Neural Information Processing Systems, 35:31306–31318, 2022.

  11. [11] D. Garcia-Gasulla, J. Bayarri-Planas, A. K. Gururajan, E. Lopez-Cuena, A. Tormos, D. Hinjos, P. Bernabeu-Perez, A. Arias-Duart, P. A. Martin-Torres, M. Gonzalez-Mallo, S. Alvarez-Napagao, E. Ayguadé-Parra, and U. Cortés. The Aloe family recipe for open and specialized healthcare LLMs, 2025.

  12. [12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  13. [13] T. Han, L. C. Adams, K. K. Bressem, F. Busch, S. Nebelung, and D. Truhn. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA, 331(15):1320, Apr. 2024.

  14. [14] A. Javed. Bridging the health care gap in rural populations: Challenges, innovations, and solutions. The American Journal of Medicine, 138(5):761–762, May 2025.

  15. [15] S. Ji and L. Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, May 2007.

  16. [16] Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents, 2025.

  17. [17] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv preprint arXiv:2009.13081, 2020.

  18. [18] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, and H. W. Park. MDAgents: An adaptive collaboration of LLMs for medical decision-making, 2024.

  19. [19] S. Kurtz, J. Silverman, J. Benson, and J. Draper. Marrying content and process in clinical method teaching: enhancing the Calgary-Cambridge guides. Acad. Med., 78(8):802–809, Aug. 2003.

  20. [20] S. M. Kurtz and J. D. Silverman. The Calgary–Cambridge referenced observation guides: an aid to defining the curriculum and organizing the teaching in communication training programmes. Medical Education, 30(2):83–89, 1996.

  21. [21] D. Kyung, H. Chung, S. Bae, J. Kim, J. H. Sohn, T. Kim, S. K. Kim, and E. Choi. PatientSim: A persona-driven simulator for realistic doctor-patient interactions, 2025.

  22. [22] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains, 2024.

  23. [23] J. M. Lackner, J. Jaccard, L. Keefer, R. Firth, A. M. Carosella, M. Sitrin, and D. Brenner. The accuracy of patient-reported measures for GI symptoms: A comparison of real time and retrospective reports. Neurogastroenterology & Motility, 26(12):1802–1811, Nov. 2014.

  24. [24] S. S. Li, V. Balachandran, S. Feng, J. Ilgen, E. Pierson, P. W. Koh, and Y. Tsvetkov. MediQ: Question-asking LLMs for adaptive and reliable clinical reasoning. In NeurIPS 2024, June 2024.

  25. [25] C. M. Lu. Benefits, risks, & costs of diagnostic tests. In M. A. Papadakis, S. J. McPhee, M. W. Rabow, K. R. McQuaid, and M. Gandhi, editors, Current Medical Diagnosis & Treatment 2024. McGraw-Hill Education, 2024.

  26. [26] H. Lyu, T. Xu, D. Brotman, B. Mayer-Blackwell, M. Cooper, M. Daniel, E. C. Wick, V. Saini, S. Brownlee, and M. A. Makary. Overtreatment in the United States. PLOS ONE, 12(9), Sep. 2017.

  27. [27] D. E. Newman-Toker, N. Nassery, A. C. Schaffer, C. W. Yu-Moe, G. D. Clemens, Z. Wang, Y. Zhu, A. S. Saber Tehrani, M. Fanai, A. Hassoon, et al. Burden of serious harms from diagnostic error in the USA. BMJ Quality & Safety, 33(2):109–120, Jul. 2023.

  28. [28] H. Nori, M. Daswani, C. Kelly, S. Lundberg, M. T. Ribeiro, M. Wilson, X. Liu, V. Sounderajah, J. Carlson, M. P. Lungren, B. Gross, P. Hames, M. Suleyman, D. King, and E. Horvitz. Sequential diagnosis with language models, 2025.

  29. [29] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In G. Flores, G. H. Chen, T. Pollard, J. C. Ho, and T. Naumann, editors, Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, ...

  30. [30] C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. ToolRL: Reward is all tool learning needs, 2025.

  31. [31] P. Qiu, C. Wu, J. Liu, Q. Zheng, Y. Liao, H. Wang, Y. Yue, Q. Fan, S. Zhen, J. Wang, J. Gu, Y. Wang, Y. Zhang, and W. Xie. Evolving diagnostic agents in a virtual clinical environment, 2025.

  32. [32] D. A. Redelmeier, J. V. Tu, M. J. Schull, L. E. Ferris, and J. E. Hux. Problems for clinical judgement: 2. Obtaining a reliable past medical history, Mar. 2001.

  33. [33] K. Saab, J. Freyberg, C. Park, T. Strother, Y. Cheng, W.-H. Weng, D. G. T. Barrett, D. Stutz, N. Tomasev, A. Palepu, V. Liévin, Y. Sharma, R. Ruparel, A. Ahmed, E. Vedadi, K. Kanada, C. Hughes, Y. Liu, G. Brown, Y. Gao, S. Li, S. S. Mahdavi, J. Manyika, K. Chou, Y. Matias, A. Hassidim, D. R. Webster, P. Kohli, S. M. A. Eslami, J. Barral, A. Rodman, ...

  34. [34] A. Sallinen, A.-J. Solergibert, M. Zhang, G. B. Boyé, M. Dupont-Roc, X. Theimer-Lienhard, E. Boisson, B. Bernath, H. Hadhri, A. Tran, T. Rabbani, T. Brokowski, M. M. D. W. Group, T. G. J. Rudner, and M.-A. Hartley. Llama-3-Meditron: An open-weight suite of medical LLMs based on Llama-3.1. In Workshop on Large Language Models and Generative AI for Health at...

  35. [35] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024.

  36. [36] A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.

  37. [37] J. Shreffler. Diagnostic testing accuracy: Sensitivity, specificity, predictive values and likelihood ratios, Mar. 2023.

  38. [38] R. T. Sutton, D. Pincock, D. C. Baumgart, D. C. Sadowski, R. N. Fedorak, and K. I. Kroeker. An overview of clinical decision support systems: Benefits, risks, and strategies for success. npj Digital Medicine, 3(1), Feb. 2020.

  39. [39] X. Tang, A. Zou, Z. Zhang, Z. Li, Y. Zhao, X. Zhang, A. Cohan, and M. Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning, 2024.

  40. [40] T. Tu, A. Palepu, M. Schaekermann, K. Saab, J. Freyberg, R. Tanno, A. Wang, B. Li, M. Amin, N. Tomasev, S. Azizi, K. Singhal, Y. Cheng, L. Hou, A. Webson, K. Kulkarni, S. S. Mahdavi, C. Semturs, J. Gottweis, J. Barral, K. Chou, G. S. Corrado, Y. Matias, A. Karthikesalingam, and V. Natarajan. Towards conversational diagnostic AI, 2024.

  41. [41] M. S. Whiteley, S. E. Davey, and G. M. Placzek. The access and invasiveness-based classification of medical procedures to clarify non-invasive from different forms of minimally invasive and open surgery. Journal of Minimal Access Surgery, 20(3):301–310, July 2024.

  42. [42] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  43. [43] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. DAPO: An open-source LLM reinforcement learn...

  44. [44] Z. Zhao, Q. Jin, F. Chen, T. Peng, and S. Yu. A large-scale dataset of patient summaries for retrieval-based clinical decision support systems. Scientific Data, 10(1):909, 2023.

  45. [45] Y. Zhu, Z. Huang, L. Mu, Y. Huang, W. Nie, J. Liu, S. Zhang, P. Liu, and X. Zhang. DiagnosisArena: Benchmarking diagnostic reasoning for large language models, 2025.

  46. [46] M. N. Zozus, A. Walden, and C. F. Pieper. Comparing the Accuracy of Health Record Data and Self-Reported Data. Patient-Centered Outcomes Research Institute (PCORI), Washington, DC, Mar. 2023. Available from NCBI Bookshelf.
