pith · machine review for the scientific record

arxiv: 2605.07058 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

Yicheng Gao, Xiaolin Zhou, Yahan Li, Yue Zhao, Ruishan Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · clinical diagnosis · POMDP · noisy environments · medical exams · cost-efficient strategies · synthetic clinical data

The pith

An LLM agent trained on noisy simulated patient interactions matches larger models' diagnostic accuracy while ordering fewer and cheaper exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing medical AI benchmarks often reduce diagnosis to single-turn questions or noise-free sequences, missing the back-and-forth uncertainty of real clinics. This paper treats the full process as a partially observable decision problem in which an agent must question the patient, order medical exams as tool calls, and issue a final diagnosis, while seven types of patient noise and three types of exam noise can appear at any step. The agent is first fine-tuned on synthetic conversations generated to follow a standard clinical interview structure, then further trained to maximize a reward that scores diagnostic correctness, tool-use quality, and combined financial and discomfort cost. Experiments and ablations show the resulting agent matches the diagnostic performance of much larger models while following more economical examination strategies.

Core claim

Clinical diagnosis is formalized as a POMDP whose actions are patient questions, tool-call medical exams, and diagnosis issuance. A noise model adds seven patient noise types and three exam noise types. Synthetic data structured after the Calgary-Cambridge interview model is used for supervised fine-tuning, followed by DAPO optimization of a composite reward for accuracy, tool quality, and exam cost. The trained MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.
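The POMDP interaction described above can be sketched as a minimal environment loop. This is an illustrative reconstruction, not the authors' implementation: the noise-type names, the rates p_conv and p_exam, and the patient/exam stubs are all assumptions.

```python
import random
from dataclasses import dataclass

# Names of the 7 patient-noise and 3 exam-noise types are assumed for
# illustration; the paper specifies the counts but this sketch invents labels.
PATIENT_NOISE = ["vague answer", "omission", "contradiction", "irrelevant detail",
                 "recall error", "exaggeration", "refusal"]
EXAM_NOISE = ["false negative", "false positive", "missing result"]

@dataclass
class DiagnosisEnv:
    disease: str              # latent state d, hidden from the agent
    p_conv: float = 0.2       # patient-noise rate (value assumed)
    p_exam: float = 0.1       # exam-noise rate (value assumed)
    total_cost: float = 0.0   # accumulated financial + discomfort cost

    def step(self, action: dict) -> dict:
        """Advance one turn; the three action types mirror the POMDP's actions."""
        if action["type"] == "ask":
            obs = self._patient_reply(action["question"])
            if random.random() < self.p_conv:
                obs = self._corrupt(obs, random.choice(PATIENT_NOISE))
            return {"obs": obs, "done": False}
        if action["type"] == "exam":
            result, cost = self._run_exam(action["exam"])
            self.total_cost += cost
            if random.random() < self.p_exam:
                result = self._corrupt(result, random.choice(EXAM_NOISE))
            return {"obs": result, "done": False}
        # "diagnose" ends the episode; a reward can then score correctness vs. cost
        correct = action["diagnosis"] == self.disease
        return {"obs": None, "done": True, "correct": correct, "cost": self.total_cost}

    # Trivial stand-ins for the simulated patient and exam tools:
    def _patient_reply(self, question: str) -> str:
        return f"patient answers {question!r}"

    def _run_exam(self, name: str) -> tuple:
        return f"{name}: result unremarkable", 10.0   # (report, cost)

    def _corrupt(self, text: str, noise: str) -> str:
        return f"[{noise}] {text}"
```

The point of the sketch is the structure: noise is injected per step at fixed rates, so the agent's policy must stay useful under corrupted observations rather than memorize clean trajectories.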

What carries the argument

A POMDP environment equipped with explicit patient-noise and exam-noise types, plus a two-stage training pipeline of supervised fine-tuning on structured synthetic dialogues followed by reward optimization that trades diagnostic accuracy against exam cost and patient discomfort.
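One way the composite reward could combine the three terms named above. The functional form and the weights are assumptions for illustration; the paper's exact objective may differ.

```python
def composite_reward(correct: bool, valid_tool_calls: int, total_tool_calls: int,
                     financial_cost: float, discomfort_cost: float,
                     w_acc: float = 1.0, w_tool: float = 0.2,
                     w_cost: float = 0.01) -> float:
    """Illustrative accuracy + tool-quality - cost reward; weights assumed."""
    # Accuracy term: binary diagnostic correctness.
    acc_term = w_acc * (1.0 if correct else 0.0)
    # Tool-quality term: fraction of well-formed / appropriate exam calls.
    tool_term = w_tool * (valid_tool_calls / total_tool_calls if total_tool_calls else 0.0)
    # Cost term: financial expense plus patient discomfort, penalized.
    cost_term = w_cost * (financial_cost + discomfort_cost)
    return acc_term + tool_term - cost_term
```

Under any such form, the cost weight directly tunes the accuracy-versus-economy trade-off that the training is claimed to resolve.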

If this is right

  • Smaller LLMs can reach strong diagnostic performance without scaling model size, provided the training environment includes realistic interaction noise and cost penalties.
  • Diagnostic agents can be explicitly trained to reduce unnecessary tests, lowering both financial expense and patient discomfort.
  • The same POMDP-plus-noise setup supports ablation checks that isolate the contribution of the noise model versus the reward terms.
  • The approach yields agents that adapt questioning and exam ordering to incomplete or noisy information received mid-conversation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The work implies that high-fidelity simulation of clinical noise may reduce the need for large volumes of real patient data when training medical agents.
  • The method could transfer to other interactive, high-stakes domains that require balancing information gathering against action costs, such as legal intake or technical troubleshooting.
  • If the noise model generalizes, future agents might be deployed in live clinical support roles while still respecting cost and comfort constraints.

Load-bearing premise

The chosen set of seven patient noise types, three exam noise types, and Calgary-Cambridge-structured synthetic conversations sufficiently represents the interactive uncertainty of actual clinical diagnosis.

What would settle it

Running the trained agent on transcripts of real doctor-patient encounters that include genuine exam results and final diagnoses would show whether accuracy and cost patterns hold or degrade.
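Such a transfer check could be harnessed roughly as follows. The transcript fields and cost accounting are assumptions, not a description of any existing dataset or of the paper's evaluation code.

```python
def evaluate_on_transcripts(agent_diagnose, transcripts):
    """Compare an agent's diagnoses and ordered-exam costs against real records.

    Each transcript is assumed to carry a dialogue, per-exam costs for the
    exams actually taken, and the recorded final diagnosis (fields hypothetical).
    """
    hits, agent_cost, record_cost = 0, 0.0, 0.0
    for t in transcripts:
        pred, exams_ordered = agent_diagnose(t["dialogue"])
        hits += int(pred == t["diagnosis"])
        # Cost of the exams the agent would have ordered, priced from the record:
        agent_cost += sum(t["exam_costs"].get(e, 0.0) for e in exams_ordered)
        # Cost of the exams actually performed in the encounter:
        record_cost += sum(t["exam_costs"].values())
    n = len(transcripts)
    return {"accuracy": hits / n,
            "mean_agent_cost": agent_cost / n,
            "mean_record_cost": record_cost / n}
```

A degradation in accuracy, or an agent cost that no longer undercuts the record cost, would be the failure signal.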

Figures

Figures reproduced from arXiv: 2605.07058 by Ruishan Liu, Xiaolin Zhou, Yahan Li, Yicheng Gao, Yue Zhao.

Figure 1
Figure 1: MedExAgent overview. (a) Tool-augmented diagnostic reasoning enables correct diagnoses where conversation-only baselines fail. (b) On the OOD test set AgentClinic-MedQA [35], our 8B agent matches much larger baselines on diagnosis accuracy. (Disclaimer from the paper: MedExAgent is a research prototype and must not be used for medical advice or patient care.) view at source ↗
Figure 2
Figure 2: Method overview. Top left: the interactive diagnosis POMDP, with latent disease d generating noisy observations ω_t from agent actions a_t. Top right: two-stage finetuning (SFT, then DAPO RL). Bottom: doctor–patient conversations follow the five-stage Calgary–Cambridge model, with patient and exam noise injected at rates p_conv and p_exam. view at source ↗
read the original abstract

Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interaction with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, MedExAgent, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. It introduces a systematic noise model with seven patient noise types and three exam noise types. MedExAgent is trained in two stages—supervised fine-tuning on synthetic conversations structured after the Calgary-Cambridge model, followed by DAPO optimization of a composite reward for diagnostic accuracy, tool call quality, and exam costs (financial and discomfort). Through experiments and ablations, the authors claim MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

Significance. If the synthetic environment and results hold under real conditions, this work meaningfully advances interactive medical LLM agents by addressing multi-turn questioning, exam ordering, and noise handling that current single-turn or noise-free benchmarks overlook. The POMDP formulation and two-stage SFT+DAPO pipeline offer a principled way to optimize for both accuracy and efficiency, with potential for more realistic clinical AI training.

major comments (3)
  1. [Environment Definition and Experiments sections] The central performance claims (comparable diagnostic accuracy and cost-efficiency) rest on experiments conducted entirely within an author-defined synthetic POMDP whose noise distributions (seven patient noise types and three exam noise types) and Calgary-Cambridge-structured dialogues receive no external validation, clinician review, or transfer testing to real clinical transcripts; this is load-bearing because the reported gains over baselines could be artifacts of the closed simulation rather than robust handling of genuine uncertainty.
  2. [Abstract and Results] The abstract and results presentation provide no quantitative metrics (e.g., exact accuracy percentages, cost values, statistical significance, error bars, or dataset sizes), making it impossible to assess the strength of the claim that MedExAgent matches larger models; this gap prevents verification of the ablation studies and cross-model comparisons.
  3. [Training Pipeline (DAPO optimization)] The composite reward in the DAPO stage balances accuracy, tool quality, and costs, but the manuscript does not report sensitivity analysis on the weighting hyperparameters or how they interact with the specific noise model; without this, it is unclear whether the observed cost-efficiency is a general property or tuned to the synthetic environment.
minor comments (2)
  1. [Abstract] Define all acronyms (POMDP, SFT, DAPO, etc.) at first use in the abstract and introduction for accessibility.
  2. [Noise Model] The description of the seven patient and three exam noise types would benefit from a table summarizing each type with examples to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, proposing revisions where they strengthen the manuscript without misrepresenting our synthetic evaluation framework.

read point-by-point responses
  1. Referee: [Environment Definition and Experiments sections] The central performance claims (comparable diagnostic accuracy and cost-efficiency) rest on experiments conducted entirely within an author-defined synthetic POMDP whose noise distributions (seven patient noise types and three exam noise types) and Calgary-Cambridge-structured dialogues receive no external validation, clinician review, or transfer testing to real clinical transcripts; this is load-bearing because the reported gains over baselines could be artifacts of the closed simulation rather than robust handling of genuine uncertainty.

    Authors: We acknowledge that all experiments are performed in a controlled synthetic POMDP and that the current manuscript contains no external validation, clinician review, or transfer to real transcripts. The synthetic setting was chosen deliberately to enable systematic isolation of noise effects and cost trade-offs under the POMDP formulation. In the revision we will add an expanded limitations subsection that explicitly states this scope, provides further justification for the noise parameters drawn from clinical literature, and outlines future directions for real-world transfer testing. We do not claim the results generalize beyond the defined environment. revision: partial

  2. Referee: [Abstract and Results] The abstract and results presentation provide no quantitative metrics (e.g., exact accuracy percentages, cost values, statistical significance, error bars, or dataset sizes), making it impossible to assess the strength of the claim that MedExAgent matches larger models; this gap prevents verification of the ablation studies and cross-model comparisons.

    Authors: We agree that the absence of concrete numerical results in the abstract and results sections hinders evaluation. In the revised manuscript we will insert the missing quantitative details—including exact diagnostic accuracy percentages, mean and variance of exam costs, statistical significance tests, error bars across runs, and dataset sizes—into both the abstract and the main results tables and figures. revision: yes

  3. Referee: [Training Pipeline (DAPO optimization)] The composite reward in the DAPO stage balances accuracy, tool quality, and costs, but the manuscript does not report sensitivity analysis on the weighting hyperparameters or how they interact with the specific noise model; without this, it is unclear whether the observed cost-efficiency is a general property or tuned to the synthetic environment.

    Authors: We accept that reporting sensitivity to the reward weights is necessary for assessing robustness. We will conduct additional experiments varying the accuracy, tool-quality, and cost coefficients, analyze their interaction with the seven patient and three exam noise types, and include the results together with a discussion of stability in the revised manuscript. revision: yes
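The promised sensitivity analysis amounts to a grid sweep over the reward coefficients. A minimal sketch, with `train_and_eval` standing in for a full SFT+DAPO run and all weight grids assumed:

```python
from itertools import product

def reward_weight_sweep(train_and_eval, acc_weights=(0.5, 1.0, 2.0),
                        tool_weights=(0.1, 0.2), cost_weights=(0.005, 0.01, 0.02)):
    """Grid over (w_acc, w_tool, w_cost); weight values here are placeholders.

    `train_and_eval(w_acc, w_tool, w_cost)` is a hypothetical callable that
    trains with those reward weights and returns (accuracy, mean exam cost).
    """
    rows = []
    for w_acc, w_tool, w_cost in product(acc_weights, tool_weights, cost_weights):
        accuracy, mean_cost = train_and_eval(w_acc, w_tool, w_cost)
        rows.append({"w_acc": w_acc, "w_tool": w_tool, "w_cost": w_cost,
                     "accuracy": accuracy, "mean_cost": mean_cost})
    return rows
```

Stability of accuracy and cost across neighboring grid cells, under each noise configuration, is what would distinguish a general property from tuning to the synthetic environment.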

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core pipeline consists of defining a POMDP formulation for diagnosis, specifying a noise model with seven patient and three exam types, generating synthetic dialogues structured on the Calgary-Cambridge model, performing standard SFT, and then applying DAPO to optimize a composite reward. All performance claims are empirical evaluations inside this self-created simulation; no equations reduce diagnostic accuracy or cost metrics to fitted parameters defined by the same data, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming patterns collapse the claimed results to the inputs by construction. The derivation remains self-contained as an empirical training procedure rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified assumption that the synthetic environment and noise model are representative; no free parameters, axioms, or invented entities are explicitly introduced beyond standard LLM and RL components.

pith-pipeline@v0.9.0 · 5558 in / 1146 out tokens · 34582 ms · 2026-05-11T02:24:38.150352+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1] Association of American Medical Colleges. Physician workforce projections. AAMC Data and Reports, 2024.

  2. [2] A. Balachandran. MedEmbed: Medical-focused embedding models. https://github.com/abhinand5/MedEmbed, 2024.

  3. [3] E. P. Balogh. The diagnostic process. In Improving Diagnosis in Health Care. National Academies Press, Dec. 2015.

  4. [4] J. A. Baron, C. M. S.-B. Johnson, M. A. Schor, D. Olley, L. Nickel, V. Felix, S. M. Bello, C. Greene, R. Lichenstein, K. Bisordi, R. Koka, C. Bearer, R. Macatangay, N. Ada, K. Ballenger, E. Bliss, L. Colliver, G. Dobbins, H. Heitzig, S. Dixon, P. Semesky, J. Garth, M. Fairchild, P. Gaskin, S. Zahid, R. Castillo, S. Edwards, A. Widjaja, Y. Usui, E. Lynch...

  5. [5] Centers for Medicare & Medicaid Services. CY 2025 Q4 Clinical Laboratory Fee Schedule public use file (25CLABQ4). CMS Change Request 14211, Oct. 2025. https://www.cms.gov/medicare/payment/fee-schedules/clinical-laboratory-fee-schedule-clfs/files/25clabq4

  6. [6] Centers for Medicare & Medicaid Services. CY 2026 Physician Fee Schedule Relative Value Files (RVU26A), Jan. 2026. https://www.cms.gov/medicare/payment/fee-schedules/physician/pfs-relative-value-files/rvu26a

  7. [7] J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and B. Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs, 2024.

  8. [8] X. Chen, H. Zhou, H. Yi, M. You, W. Liu, L. Wang, Z. Qin, H. Li, X. Zhang, Y. Guo, S. Li, Y. Hu, Q. Xiong, R. Li, L. Fan, Q. Lao, W. Fu, J. Li, and K. Li. Grounding large language models in clinical diagnostics. Nature Communications, Mar. 2026.

  9. [9] A. V. Eriksen, S. Möller, and J. Ryg. Use of GPT-4 to diagnose complex clinical cases. NEJM AI, 1(1), Dec. 2023.

  10. [10] A. Fansi Tchango, R. Goel, Z. Wen, J. Martel, and J. Ghosn. DDxPlus: A new dataset for automatic medical diagnosis. Advances in Neural Information Processing Systems, 35:31306–31318, 2022.

  11. [11] D. Garcia-Gasulla, J. Bayarri-Planas, A. K. Gururajan, E. Lopez-Cuena, A. Tormos, D. Hinjos, P. Bernabeu-Perez, A. Arias-Duart, P. A. Martin-Torres, M. Gonzalez-Mallo, S. Alvarez-Napagao, E. Ayguadé-Parra, and U. Cortés. The Aloe family recipe for open and specialized healthcare LLMs, 2025.

  12. [12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  13. [13] T. Han, L. C. Adams, K. K. Bressem, F. Busch, S. Nebelung, and D. Truhn. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA, 331(15):1320, Apr. 2024.

  14. [14] A. Javed. Bridging the health care gap in rural populations: Challenges, innovations, and solutions. The American Journal of Medicine, 138(5):761–762, May 2025.

  15. [15] S. Ji and L. Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, May 2007.

  16. [16] Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents, 2025.

  17. [17] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv preprint arXiv:2009.13081, 2020.

  18. [18] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, and H. W. Park. MDAgents: An adaptive collaboration of LLMs for medical decision-making, 2024.

  19. [19] S. Kurtz, J. Silverman, J. Benson, and J. Draper. Marrying content and process in clinical method teaching: enhancing the Calgary-Cambridge guides. Acad. Med., 78(8):802–809, Aug. 2003.

  20. [20] S. M. Kurtz and J. D. Silverman. The Calgary–Cambridge referenced observation guides: an aid to defining the curriculum and organizing the teaching in communication training programmes. Medical Education, 30(2):83–89, 1996.

  21. [21] D. Kyung, H. Chung, S. Bae, J. Kim, J. H. Sohn, T. Kim, S. K. Kim, and E. Choi. PatientSim: A persona-driven simulator for realistic doctor-patient interactions, 2025.

  22. [22] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains, 2024.

  23. [23] J. M. Lackner, J. Jaccard, L. Keefer, R. Firth, A. M. Carosella, M. Sitrin, and D. Brenner. The accuracy of patient-reported measures for GI symptoms: A comparison of real time and retrospective reports. Neurogastroenterology & Motility, 26(12):1802–1811, Nov. 2014.

  24. [24] S. S. Li, V. Balachandran, S. Feng, J. Ilgen, E. Pierson, P. W. Koh, and Y. Tsvetkov. MediQ: Question-asking LLMs for adaptive and reliable clinical reasoning. In NeurIPS 2024, June 2024.

  25. [25] C. M. Lu. Benefits, risks, & costs of diagnostic tests. In M. A. Papadakis, S. J. McPhee, M. W. Rabow, K. R. McQuaid, and M. Gandhi, editors, Current Medical Diagnosis & Treatment 2024. McGraw-Hill Education, 2024.

  26. [26] H. Lyu, T. Xu, D. Brotman, B. Mayer-Blackwell, M. Cooper, M. Daniel, E. C. Wick, V. Saini, S. Brownlee, and M. A. Makary. Overtreatment in the United States. PLOS ONE, 12(9), Sep. 2017.

  27. [27] D. E. Newman-Toker, N. Nassery, A. C. Schaffer, C. W. Yu-Moe, G. D. Clemens, Z. Wang, Y. Zhu, A. S. Saber Tehrani, M. Fanai, A. Hassoon, et al. Burden of serious harms from diagnostic error in the USA. BMJ Quality & Safety, 33(2):109–120, Jul. 2023.

  28. [28] H. Nori, M. Daswani, C. Kelly, S. Lundberg, M. T. Ribeiro, M. Wilson, X. Liu, V. Sounderajah, J. Carlson, M. P. Lungren, B. Gross, P. Hames, M. Suleyman, D. King, and E. Horvitz. Sequential diagnosis with language models, 2025.

  29. [29] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In G. Flores, G. H. Chen, T. Pollard, J. C. Ho, and T. Naumann, editors, Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, ...

  30. [30] C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. ToolRL: Reward is all tool learning needs, 2025.

  31. [31] P. Qiu, C. Wu, J. Liu, Q. Zheng, Y. Liao, H. Wang, Y. Yue, Q. Fan, S. Zhen, J. Wang, J. Gu, Y. Wang, Y. Zhang, and W. Xie. Evolving diagnostic agents in a virtual clinical environment, 2025.

  32. [32] D. A. Redelmeier, J. V. Tu, M. J. Schull, L. E. Ferris, and J. E. Hux. Problems for clinical judgement: 2. Obtaining a reliable past medical history, Mar. 2001.

  33. [33] K. Saab, J. Freyberg, C. Park, T. Strother, Y. Cheng, W.-H. Weng, D. G. T. Barrett, D. Stutz, N. Tomasev, A. Palepu, V. Liévin, Y. Sharma, R. Ruparel, A. Ahmed, E. Vedadi, K. Kanada, C. Hughes, Y. Liu, G. Brown, Y. Gao, S. Li, S. S. Mahdavi, J. Manyika, K. Chou, Y. Matias, A. Hassidim, D. R. Webster, P. Kohli, S. M. A. Eslami, J. Barral, A. Rodman, ...

  34. [34] A. Sallinen, A.-J. Solergibert, M. Zhang, G. B. Boyé, M. Dupont-Roc, X. Theimer-Lienhard, E. Boisson, B. Bernath, H. Hadhri, A. Tran, T. Rabbani, T. Brokowski, M. M. D. W. Group, T. G. J. Rudner, and M.-A. Hartley. Llama-3-Meditron: An open-weight suite of medical LLMs based on Llama-3.1. In Workshop on Large Language Models and Generative AI for Health at...

  35. [35] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024.

  36. [36] A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.

  37. [37] J. Shreffler. Diagnostic testing accuracy: Sensitivity, specificity, predictive values and likelihood ratios, Mar. 2023.

  38. [38] R. T. Sutton, D. Pincock, D. C. Baumgart, D. C. Sadowski, R. N. Fedorak, and K. I. Kroeker. An overview of clinical decision support systems: Benefits, risks, and strategies for success. npj Digital Medicine, 3(1), Feb. 2020.

  39. [39] X. Tang, A. Zou, Z. Zhang, Z. Li, Y. Zhao, X. Zhang, A. Cohan, and M. Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning, 2024.

  40. [40] T. Tu, A. Palepu, M. Schaekermann, K. Saab, J. Freyberg, R. Tanno, A. Wang, B. Li, M. Amin, N. Tomasev, S. Azizi, K. Singhal, Y. Cheng, L. Hou, A. Webson, K. Kulkarni, S. S. Mahdavi, C. Semturs, J. Gottweis, J. Barral, K. Chou, G. S. Corrado, Y. Matias, A. Karthikesalingam, and V. Natarajan. Towards conversational diagnostic AI, 2024.

  41. [41] M. S. Whiteley, S. E. Davey, and G. M. Placzek. The access and invasiveness-based classification of medical procedures to clarify non-invasive from different forms of minimally invasive and open surgery. Journal of Minimal Access Surgery, 20(3):301–310, July 2024.

  42. [42] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  43. [43] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. DAPO: An open-source LLM reinforcement learn...

  44. [44] Z. Zhao, Q. Jin, F. Chen, T. Peng, and S. Yu. A large-scale dataset of patient summaries for retrieval-based clinical decision support systems. Scientific Data, 10(1):909, 2023.

  45. [45] Y. Zhu, Z. Huang, L. Mu, Y. Huang, W. Nie, J. Liu, S. Zhang, P. Liu, and X. Zhang. DiagnosisArena: Benchmarking diagnostic reasoning for large language models, 2025.

  46. [46] M. N. Zozus, A. Walden, and C. F. Pieper. Comparing the Accuracy of Health Record Data and Self-Reported Data. Patient-Centered Outcomes Research Institute (PCORI), Washington, DC, Mar. 2023. Available from NCBI Bookshelf.
