Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning

Jiawei Wang; Lu Zhou; Yichun Feng; Yikai Zheng; Yixue Li; Zhen Lei

arxiv: 2505.19630 · v4 · submitted 2025-05-26 · 💻 cs.CL

Yichun Feng, Jiawei Wang, Lu Zhou, Yikai Zheng, Zhen Lei, Yixue Li

Pith reviewed 2026-05-19 13:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords medical dialogue systems · reinforcement learning · multi-agent collaboration · proactive consultation · multi-turn interaction · clinical decision support · AI for healthcare

The pith

A multi-agent reinforcement learning system trains a doctor agent to ask strategic questions over multiple turns and reach a 70 percent exact diagnostic match rate with real patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that large language models fall short in clinical settings because they expect patients to supply complete symptom lists in one go and because supervised training only copies surface patterns instead of building understanding step by step. By recasting the consultation as a sequential decision process under uncertainty, the authors train an agent to learn an active questioning policy that elicits the missing information needed for a reliable diagnosis. They support this training with a new multi-turn English medical dialogue dataset and then test the resulting agent in both blinded human reviews and actual patient encounters. A reader would care because the approach offers a concrete way for AI to manage routine initial screenings and thereby ease pressure on limited medical staff.

Core claim

DoctorAgent-RL is a reinforcement-learning multi-agent framework that trains the doctor agent to master a questioning methodology rather than to recall answers, so that key patient details emerge progressively through guided multi-turn dialogue and produce an optimal diagnosis, as measured by a 70 percent exact diagnostic match rate in real-patient trials.

What carries the argument

DoctorAgent-RL, the reinforcement-learning multi-agent framework that models consultation as dynamic decision-making under uncertainty and optimizes the doctor's questioning policy to maximize diagnostic information gain.
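To make "diagnostic information gain" concrete, here is a minimal sketch, assuming a discrete set of candidate diagnoses and yes/no questions with known answer likelihoods. The abstract does not specify the paper's reward, so the diagnoses, questions, and probabilities below are hypothetical toy values, and expected entropy reduction is used as a standard textbook definition rather than as the authors' objective.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {label: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, answer):
    """Bayes update: P(d | answer) is proportional to P(answer | d) * P(d)."""
    unnorm = {d: prior[d] * likelihood[d][answer] for d in prior}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction (bits) from asking one yes/no question."""
    h_before = entropy(prior)
    gain = 0.0
    for answer in ("yes", "no"):
        p_answer = sum(prior[d] * likelihood[d][answer] for d in prior)
        if p_answer == 0:
            continue  # this answer cannot occur, so it contributes nothing
        h_after = entropy(posterior(prior, likelihood, answer))
        gain += p_answer * (h_before - h_after)
    return gain

# Hypothetical toy values, not taken from the paper.
prior = {"flu": 0.5, "cold": 0.5}
# "Do you have a fever?" separates the two diagnoses well.
fever = {"flu": {"yes": 0.9, "no": 0.1}, "cold": {"yes": 0.1, "no": 0.9}}
# "Do you have a runny nose?" is equally likely under both diagnoses.
runny = {"flu": {"yes": 0.8, "no": 0.2}, "cold": {"yes": 0.8, "no": 0.2}}

questions = {"fever": fever, "runny_nose": runny}
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print("most informative question:", best)
```

In this toy setting the fever question is preferred because its answer shifts the posterior, while the runny-nose question leaves it unchanged. A learned policy would not have access to such likelihood tables; the sketch only illustrates the quantity being maximized.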

If this is right

The trained agent can perform initial screenings so that human clinicians can devote time to more difficult cases.
Wider use of the agent for routine intake could reduce misdiagnosis risk and ease overall strain on healthcare resources.
The same policy-learning approach can be applied to any domain in which an expert must gather information through successive questions rather than receive it all at once.
Real-patient validation already demonstrates that performance gains seen in simulation transfer to live interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The learned questioning policy could be inspected to see which sequences of questions most reliably surface critical information, offering a data-driven view of efficient clinical inquiry.
Pairing the agent with electronic health record access would let it condition questions on prior history and test whether accuracy rises further.
The framework might be adapted to non-medical information-gathering dialogues such as technical troubleshooting or legal intake interviews.

Load-bearing premise

The MTMedDialog dataset together with the chosen reinforcement-learning reward function faithfully reproduces the uncertainty and response variability of real clinical encounters.

What would settle it

A follow-up study that runs the agent on a fresh, larger cohort of real patients and records an exact diagnostic match rate well below 70 percent or no better than standard large language models.
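How far below 70 percent counts as "well below" depends on cohort size, which the abstract does not report. As a rough illustration, the sketch below computes a Wilson score interval for an observed 70% match rate. The cohort sizes are hypothetical and are not the paper's numbers.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical cohorts, each with exactly 70% exact matches.
cohorts = [(14, 20), (70, 100), (350, 500)]
intervals = {n: wilson_interval(k, n) for k, n in cohorts}
for n, (low, high) in intervals.items():
    print(f"n={n}: 95% interval for a 70% match rate is [{low:.3f}, {high:.3f}]")
```

With only 20 patients the interval is wide enough to include rates near 50%, so a follow-up cohort would need to be considerably larger before a lower observed rate could be read as a failure to replicate.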

Original abstract

Large language models (LLMs) struggle in real-world clinical consultations. Single-turn consultation systems require patients to describe all symptoms at once, which often leads to unclear complaints and vague diagnoses. Traditional dialogue models, constrained by static supervised learning, are limited to superficially imitating existing dialogue patterns and lack the ability to actively construct understanding in dynamic interactions, thus failing to achieve genuine clinical reasoning. To address these challenges, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework, and train a doctor agent on Qwen2.5-7B-Instruct using this framework. Within this framework, a medical consultation is modeled as a dynamic decision-making process under uncertainty. The core intelligence of the doctor agent is shifted from knowing the answer to learning and mastering a questioning methodology aimed at achieving an optimal diagnosis. Through strategic questioning, it guides the progressive emergence of key patient information in multi-turn dialogues. To support this high-fidelity simulation of the real diagnostic process, we constructed MTMedDialog, a novel English multi-turn medical consultation dataset designed for dynamic, interactive training. To validate its real-world effectiveness, rigorous evaluations including blinded human assessments and trials with real patients were conducted. DoctorAgent-RL outperformed frontier models and achieved a 70% exact diagnostic match rate, confirming its potential as a collaborative tool. By handling initial screenings, it can free clinicians to focus on complex cases, thereby addressing critical issues like physician shortages and misdiagnosis risks while alleviating the strain on healthcare resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper frames medical consultation as multi-agent RL for proactive questioning, adds a new multi-turn dataset, and reports 70% real-patient diagnostic match, but the methods leave the reward design and evaluation controls too opaque to judge robustness.

The main takeaway is that this work trains a doctor agent on Qwen2.5 using multi-agent RL to ask questions actively rather than waiting for complete patient descriptions. They built MTMedDialog for that purpose and ran real-patient trials that hit 70% exact diagnostic agreement while beating frontier models on the reported metrics. That combination of RL framing, a purpose-built dataset, and an attempt at live validation is the concrete step forward here. It directly targets the problem of vague or incomplete histories that single-turn systems run into. The multi-agent setup lets different components handle parts of the uncertainty, which fits the dynamic nature of actual consultations better than pure imitation learning. The real-patient piece is also more useful than staying inside simulation only. Those elements give the paper a practical hook for anyone working on medical dialogue tools.

The soft spots sit mostly in the experimental reporting. The abstract states the 70% match and the outperformance, yet gives almost no information on how the RL reward was constructed, what the patient simulator actually varied, or the controls and inter-rater numbers behind the human assessments. If the simulator responses are low-variance or overly cooperative, the learned policy could be picking up dataset artifacts instead of general clinical reasoning, exactly as the stress-test note flags. Without ablations on the reward terms or clear blinding details, the central claim stays hard to evaluate. The citation pattern looks standard and does not appear to over-claim prior results.

This paper is for researchers building dialogue systems for primary care or testing RL in uncertain domains. A reader who needs a new multi-turn medical dataset or ideas for proactive questioning would find usable material even before the methods are tightened.

It deserves a serious referee because the application is timely and the real-patient trial provides something concrete to examine, even if the current write-up needs expansion on the RL objective and evaluation protocol. I would send it out for review with requests for the reward function definition, simulator statistics, and full assessment details.

Referee Report

2 major / 2 minor

Summary. The paper introduces DoctorAgent-RL, a multi-agent reinforcement learning framework that trains a doctor agent based on Qwen2.5-7B-Instruct to perform proactive, multi-turn medical consultations. It constructs the MTMedDialog dataset to simulate dynamic diagnostic interactions under uncertainty and reports that the resulting system outperforms frontier models while achieving a 70% exact diagnostic match rate in blinded human assessments and real-patient trials.

Significance. If the performance claims are substantiated with adequate controls, the work could advance clinical AI by shifting from static imitation learning to RL-driven active questioning strategies, offering a practical path to improve initial screenings and alleviate physician workload. The inclusion of real-patient trials provides a stronger test of applicability than purely simulated benchmarks.

major comments (2)

Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.
RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.
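The concern about overly cooperative simulated patients can be made concrete with a small sketch. The `ToyPatient` class, its `evasive_rate` parameter, and the symptom names below are hypothetical illustrations and do not describe the MTMedDialog simulator. The sketch only shows how one might measure the fraction of informative answers a simulator gives.

```python
import random

class ToyPatient:
    """Hypothetical patient simulator; not the paper's simulator."""

    def __init__(self, symptoms, evasive_rate=0.0, seed=0):
        self.symptoms = set(symptoms)
        self.evasive_rate = evasive_rate  # probability of an uninformative reply
        self.rng = random.Random(seed)    # seeded for reproducible runs

    def answer(self, symptom):
        """Return 'yes', 'no', or 'not sure' for a question about one symptom."""
        if self.rng.random() < self.evasive_rate:
            return "not sure"
        return "yes" if symptom in self.symptoms else "no"

def informative_fraction(patient, questions):
    """Share of questions that received a definite yes/no answer."""
    answers = [patient.answer(q) for q in questions]
    return sum(a != "not sure" for a in answers) / len(answers)

questions = ["fever", "cough", "headache", "nausea"]
for rate in (0.0, 0.3, 1.0):
    patient = ToyPatient({"fever", "cough"}, evasive_rate=rate, seed=42)
    print(f"evasive_rate={rate}: informative fraction = "
          f"{informative_fraction(patient, questions):.2f}")
```

Reporting a statistic of this kind for the actual simulator, across its ambiguity and cooperativeness settings, would let readers judge how much response variability the policy saw during training.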

minor comments (2)

Introduction: The contrast between 'knowing the answer' and 'learning a questioning methodology' would be strengthened by a brief illustrative dialogue excerpt showing how the RL policy differs from supervised baselines.
Evaluation: Add a table summarizing inter-rater agreement statistics and baseline model scores to make the human assessment results easier to interpret.
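One common inter-rater agreement statistic is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal implementation for two raters; the labels are hypothetical and are not data from the paper, and the paper may use a different statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on whether each diagnosis matched the reference.
rater_a = ["match", "match", "no", "no"]
rater_b = ["match", "no", "no", "no"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on three of four items (0.75 raw agreement), but chance agreement is 0.5, so kappa is 0.5. Reporting kappa alongside raw agreement makes the human assessment easier to interpret.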

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.

Authors: We agree that the abstract would benefit from additional context on the evaluation to support the claims. The main text describes the blinded human assessments and real-patient trials, including sample sizes, demographics, comparisons to frontier models as controls, and expert verification of diagnoses. We will revise the abstract to concisely incorporate these details while maintaining brevity.
revision: yes

Referee: RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.

Authors: We acknowledge the need for greater specificity here. The manuscript presents the overall multi-agent RL setup and MTMedDialog construction, but we will expand the relevant sections with explicit descriptions of the reward function (including terms for information gain under uncertainty) and simulator mechanics for handling symptom ambiguity, incomplete patient histories, and non-cooperative or variable responses, supported by examples.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training and external validation on constructed dataset

The paper describes an empirical multi-agent RL framework trained on the newly introduced MTMedDialog dataset, with performance claims (70% exact match rate, outperformance of frontier models) resting on blinded human assessments and real-patient trials rather than any closed-form derivations, fitted parameters renamed as predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems appear in the abstract or description, and the evaluation uses external benchmarks and live patient interactions that are independent of the training loop. The claims therefore rest on evidence external to the training procedure, with no load-bearing self-citation chains or reductions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Assessment uses only the abstract; no explicit free parameters, axioms, or invented entities are stated.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
cs.CL · 2026-05 · unverdicted · novelty 7.0

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI · 2025-07 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 20

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi` ere, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

npj Digital Medicine8(1), 178 (2025)

Kopka, M., Kalckreuth, N., Feufel, M.A.: Accuracy of online symptom assessment applications, large language models, and laypeople for self–triage decisions. npj Digital Medicine8(1), 178 (2025)

work page 2025

[5] [5]

The Lancet Global Health11(8), 1162–1164 (2023)

Agyeman-Manu, K., Ghebreyesus, T.A., Maait, M., Rafila, A., Tom, L., Lima, N.T., Wangmo, D.: Prioritising the health and care workforce shortage: protect, invest, together. The Lancet Global Health11(8), 1162–1164 (2023)

work page 2023

[6] [6]

ALQahtani, D.A., Rotgans, J.I., Mamede, S., ALAlwan, I., Magzoub, M.E.M., Altayeb, F.M., Mohamedani, M.A., Schmidt, H.G.: Does time pressure have a negative effect on diagnostic accuracy? Academic Medicine91(5), 710–716 (2016)

work page 2016

[7] [7]

The Annals of Family Medicine22(1), 12–18 (2024)

Arndt, B.G., Micek, M.A., Rule, A., Shafer, C.M., Baltus, J.J., Sinsky, C.A.: More tethered to the ehr: Ehr workload trends among academic primary care physicians, 2019-2023. The Annals of Family Medicine22(1), 12–18 (2024)

work page 2019

[8] [8]

Journal of personalized medicine13(6), 951 (2023)

Al Kuwaiti, A., Nazer, K., Al-Reedy, A., Al-Shehri, S., Al-Muhanna, A., Sub- barayalu, A.V., Al Muhanna, D., Al-Muhanna, F.A.: A review of the role of artificial intelligence in healthcare. Journal of personalized medicine13(6), 951 (2023)

work page 2023

[9] [9]

Medalpaca–an open-source collection of medical conversational ai models and training data

Han, T., Adams, L.C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., L¨ oser, A., Truhn, D., Bressem, K.K.: Medalpaca–an open-source collec- tion of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247 (2023)

work page arXiv 2023

[10] Labrak, Y., Bazoge, A., Morin, E., Gourraud, P.-A., Rouvier, M., Dufour, R.: Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373 (2024)

[11] Wu, C., Qiu, P., Liu, J., Gu, H., Li, N., Zhang, Y., Wang, Y., Xie, W.: Towards evaluating and building versatile large language models for medicine. npj Digital Medicine 8(1), 58 (2025)

[12] Laban, P., Hayashi, H., Zhou, Y., Neville, J.: Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120 (2025)

[13] Auerbach, A.D., Lee, T.M., Hubbard, C.C., Ranji, S.R., Raffel, K., Valdes, G., Boscardin, J., Dalal, A.K., Harris, A., Flynn, E., et al.: Diagnostic errors in hospitalized adults who died or were transferred to intensive care. JAMA Internal Medicine 184(2), 164–173 (2024)

[14] Chen, Y., Wang, Z., Xing, X., Xu, Z., Fang, K., Wang, J., Li, S., Wu, J., Liu, Q., Xu, X., et al.: Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt. arXiv preprint arXiv:2310.15896 (2023)

[15] Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Gao, J., Liu, J., Dolan, B.: Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536 (2019)

[16] Liu, R., Xue, K., Zhang, X., Zhang, S.: Interactive evaluation for medical llms via task-oriented dialogue system. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 4871–4896 (2025)

[17] Wang, H., Liu, C., Xi, N., Qiang, Z., Zhao, S., Qin, B., Liu, T.: Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023)

[18] Zhang, K., Zeng, S., Hua, E., Ding, N., Chen, Z.-R., Ma, Z., Li, H., Cui, G., Qi, B., Zhu, X., et al.: Ultramedical: Building specialized generalists in biomedicine. Advances in Neural Information Processing Systems 37, 26045–26081 (2024)

[19] Feng, Y., Zhou, L., Ma, C., Zheng, Y., He, R., Li, Y.: Knowledge graph–based thought: a knowledge graph–enhanced llm framework for pan-cancer question answering. GigaScience 14, 082 (2025)

[20] Tu, T., Schaekermann, M., Palepu, A., Saab, K., Freyberg, J., Tanno, R., Wang, A., Li, B., Amin, M., Cheng, Y., et al.: Towards conversational diagnostic artificial intelligence. Nature, 1–9 (2025)

[21] Li, J., Lai, Y., Li, W., Ren, J., Zhang, M., Kang, X., Wang, S., Li, P., Zhang, Y.-Q., Ma, W., et al.: Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957 (2024)

[22] Fan, Z., Tang, J., Chen, W., Wang, S., Wei, Z., Xi, J., Huang, F., Zhou, J.: Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. arXiv preprint arXiv:2402.09742 (2024)

[23] Chen, X., Yi, H., You, M., Liu, W., Wang, L., Li, H., Zhang, X., Guo, Y., Fan, L., Chen, G., et al.: Enhancing diagnostic capability with multi-agents conversational large language models. npj Digital Medicine 8(1), 159 (2025)

[24] Liu, Q., Hu, Z., Huang, T., Niu, Y., Zhang, X., Ma, S., Lin, C., Huat, G.K., Kwon, H.E., Gao, F., et al.: Evomdt: a self-evolving multi-agent system for structured clinical decision-making in multi-cancer. npj Digital Medicine (2026)

[25] Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H.B., Chen, C., Ouyang, C., Rueckert, D.: Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. arXiv preprint arXiv:2502.19634 (2025)

[26] Lai, Y., Zhong, J., Li, M., Zhao, S., Yang, X.: Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939 (2025)

[27] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[28] Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., Wang, B.: Huatuogpt-o1, towards medical complex reasoning with llms. arXiv preprint arXiv:2412.18925 (2024)

[29] Zou, X., He, W., Huang, Y., Ouyang, Y., Zhang, Z., Wu, Y., Wu, Y., Feng, L., Wu, S., Yang, M., et al.: Ai-driven diagnostic assistance in medical inquiry: Reinforcement learning algorithm development and validation. Journal of Medical Internet Research 26, 54616 (2024)

[30] Sun, Z., Liu, Z., Luo, C., Chu, J., Huang, Z.: Improving interactive diagnostic ability of a large language model agent through clinical experience learning. arXiv preprint arXiv:2503.16463 (2025)

[31] Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., et al.: Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215 (2025)

[32] Feng, Y., Wang, J., Zhou, L., Lei, Z., Li, Y.: Doctoragent-rl: A multi-agent collaborative reinforcement learning system for multi-turn clinical dialogue. In: Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026). IEEE. Accepted for publication. DOI to be assigned. Available at: https://github.com/J...

[33] Arora, R.K., Wei, J., Hicks, R.S., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., et al.: Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775 (2025)

[34] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X.: Pubmedqa: A dataset for biomedical research question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2567–2577 (2019)

[35] Krithara, A., Nentidis, A., Bougiatiotis, K., Paliouras, G.: Bioasq-qa: A manually curated corpus for biomedical question answering. Scientific Data 10(1), 170 (2023)

[36] He, X., Chen, S., Ju, Z., Dong, X., Fang, H., Wang, S., Yang, Y., Zeng, J., Zhang, R., Zhang, R., et al.: Meddialog: Two large-scale medical dialogue datasets. arXiv preprint arXiv:2004.03329 (2020)

[37] Chen, W., Li, Z., Fang, H., Yao, Q., Zhong, C., Hao, J., Zhang, Q., Huang, X., Peng, J., Wei, Z.: A Benchmark for Automatic Medical Consultation System: Frameworks, Tasks and Datasets. Bioinformatics (2022)

[38] Liu, W., Tang, J., Cheng, Y., Li, W., Zheng, Y., Liang, X.: Meddg: an entity-centric medical consultation dataset for entity-aware medical dialogue generation. In: CCF International Conference on Natural Language Processing and Chinese Computing, pp. 447–459 (2022). Springer

[39] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Yin, K., Tan, C., Xu, J., Huang, F., et al.: Cblue: A chinese biomedical language understanding evaluation benchmark. arXiv preprint arXiv:2106.08087 (2021)

[40] Johri, S., Jeong, J., Tran, B.A., Schlessinger, D.I., Wongvibulsin, S., Barnes, L.A., Zhou, H.-Y., Cai, Z.R., et al.: An evaluation framework for conversational reasoning in clinical llms during patient interactions. Nature Medicine (2025)

[41] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

[42] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[43] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[44] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.-A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7b. arXiv preprint arXiv:2310.06825 (2023)

[45] Lai, Y., Liu, K., Wang, Z., Ma, W., Liu, Y.: Doctor-r1: Mastering clinical inquiry with experiential agentic reinforcement learning. arXiv preprint arXiv:2510.04284 (2025)

[46] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

[47] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)