Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning
Pith reviewed 2026-05-19 13:35 UTC · model grok-4.3
The pith
A multi-agent reinforcement learning system trains a doctor agent to ask strategic questions over multiple turns and reach a 70 percent exact diagnostic match rate with real patients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DoctorAgent-RL is a reinforcement-learning multi-agent framework that trains the doctor agent to master a questioning methodology rather than to recall answers, so that key patient details emerge progressively through guided multi-turn dialogue and produce an optimal diagnosis, as measured by a 70 percent exact diagnostic match rate in real-patient trials.
What carries the argument
DoctorAgent-RL, the reinforcement-learning multi-agent framework that models consultation as dynamic decision-making under uncertainty and optimizes the doctor's questioning policy to maximize diagnostic information gain.
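To make "maximize diagnostic information gain" concrete, the sketch below scores candidate yes/no questions by the expected reduction in entropy of a belief over diagnoses. It is an illustration only: the diagnoses, questions, and likelihood values are invented, and the abstract does not state that DoctorAgent-RL computes information gain this way.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a distribution given as {diagnosis: prob}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def posterior(prior, likelihood, answer):
    """Bayes update of the belief; likelihood[d] is P(answer == 'yes' | d)."""
    unnorm = {}
    for d, p in prior.items():
        p_yes = likelihood[d]
        unnorm[d] = p * (p_yes if answer == "yes" else 1 - p_yes)
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()} if z > 0 else dict(prior)

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one yes/no question."""
    p_yes = sum(prior[d] * likelihood[d] for d in prior)
    gain = entropy(prior)
    for answer, p_ans in (("yes", p_yes), ("no", 1 - p_yes)):
        if p_ans > 0:
            gain -= p_ans * entropy(posterior(prior, likelihood, answer))
    return gain

# Hypothetical toy values, chosen only for illustration.
prior = {"flu": 0.5, "cold": 0.5}
questions = {
    "fever?": {"flu": 0.9, "cold": 0.1},
    "runny nose?": {"flu": 0.5, "cold": 0.5},
}
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
```

In this toy setup the question whose answer differs most between the two diagnoses is preferred, while a question that both diagnoses answer identically yields no gain.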
If this is right
- The trained agent can perform initial screenings so that human clinicians can devote time to more difficult cases.
- Reduced misdiagnosis risk and lower overall strain on healthcare resources follow directly from wider use of the agent for routine intake.
- The same policy-learning approach can be applied to any domain in which an expert must gather information through successive questions rather than receive it all at once.
- Real-patient validation already demonstrates that performance gains seen in simulation transfer to live interactions.
Where Pith is reading between the lines
- The learned questioning policy could be inspected to see which sequences of questions most reliably surface critical information, offering a data-driven view of efficient clinical inquiry.
- Pairing the agent with electronic health record access would let it condition questions on prior history and test whether accuracy rises further.
- The framework might be adapted to non-medical information-gathering dialogues such as technical troubleshooting or legal intake interviews.
Load-bearing premise
The MTMedDialog dataset together with the chosen reinforcement-learning reward function faithfully reproduces the uncertainty and response variability of real clinical encounters.
What would settle it
A follow-up study that runs the agent on a fresh, larger cohort of real patients and records an exact diagnostic match rate well below 70 percent or no better than standard large language models.
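How far "well below 70 percent" must be depends on the cohort size, which the abstract does not report. As a rough sketch, the Wilson score interval below shows how the uncertainty around an observed 70 percent match rate shrinks with sample size. The cohort sizes in the loop are hypothetical and are not taken from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical cohort sizes; the real trial size is not stated in the abstract.
intervals = {n: wilson_interval(round(0.7 * n), n) for n in (20, 100, 500)}
for n, (low, high) in intervals.items():
    print(f"n={n}: 70% observed, interval {low:.3f} to {high:.3f}")
```

A replication whose interval sits entirely below the original study's interval would count as evidence against the claim; overlapping intervals would not settle it.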
Original abstract
Large language models (LLMs) struggle in real-world clinical consultations. Single-turn consultation systems require patients to describe all symptoms at once, which often leads to unclear complaints and vague diagnoses. Traditional dialogue models, constrained by static supervised learning, are limited to superficially imitating existing dialogue patterns and lack the ability to actively construct understanding in dynamic interactions, thus failing to achieve genuine clinical reasoning.
To address these challenges, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework, and train a doctor agent on Qwen2.5-7B-Instruct using this framework. Within this framework, a medical consultation is modeled as a dynamic decision-making process under uncertainty. The core intelligence of the doctor agent is shifted from knowing the answer to learning and mastering a questioning methodology aimed at achieving an optimal diagnosis. Through strategic questioning, it guides the progressive emergence of key patient information in multi-turn dialogues. To support this high-fidelity simulation of the real diagnostic process, we constructed MTMedDialog, a novel English multi-turn medical consultation dataset designed for dynamic, interactive training.
To validate its real-world effectiveness, rigorous evaluations including blinded human assessments and trials with real patients were conducted. DoctorAgent-RL outperformed frontier models and achieved a 70% exact diagnostic match rate, confirming its potential as a collaborative tool. By handling initial screenings, it can free clinicians to focus on complex cases, thereby addressing critical issues like physician shortages and misdiagnosis risks while alleviating the strain on healthcare resources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DoctorAgent-RL, a multi-agent reinforcement learning framework that trains a doctor agent based on Qwen2.5-7B-Instruct to perform proactive, multi-turn medical consultations. It constructs the MTMedDialog dataset to simulate dynamic diagnostic interactions under uncertainty and reports that the resulting system outperforms frontier models while achieving a 70% exact diagnostic match rate in blinded human assessments and real-patient trials.
Significance. If the performance claims are substantiated with adequate controls, the work could advance clinical AI by shifting from static imitation learning to RL-driven active questioning strategies, offering a practical path to improve initial screenings and alleviate physician workload. The inclusion of real-patient trials provides a stronger test of applicability than purely simulated benchmarks.
major comments (2)
- Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.
- RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.
minor comments (2)
- Introduction: The contrast between 'knowing the answer' and 'learning a questioning methodology' would be strengthened by a brief illustrative dialogue excerpt showing how the RL policy differs from supervised baselines.
- Evaluation: Add a table summarizing inter-rater agreement statistics and baseline model scores to make the human assessment results easier to interpret.
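One common inter-rater agreement statistic that such a table could report is Cohen's kappa, sketched below for two raters. The labels are made up for illustration, and nothing in the abstract says which agreement statistic the authors used.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of four consultations by two blinded assessors.
rater_a = ["match", "match", "miss", "miss"]
rater_b = ["match", "miss", "miss", "miss"]
kappa = cohens_kappa(rater_a, rater_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually more informative than percent agreement when one label dominates.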
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.
  Authors: We agree that the abstract would benefit from additional context on the evaluation to support the claims. The main text describes the blinded human assessments and real-patient trials, including sample sizes, demographics, comparisons to frontier models as controls, and expert verification of diagnoses. We will revise the abstract to concisely incorporate these details while maintaining brevity.
  Revision: yes
- Referee: RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.
  Authors: We acknowledge the need for greater specificity here. The manuscript presents the overall multi-agent RL setup and MTMedDialog construction, but we will expand the relevant sections with explicit descriptions of the reward function (including terms for information gain under uncertainty) and simulator mechanics for handling symptom ambiguity, incomplete patient histories, and non-cooperative or variable responses, supported by examples.
  Revision: yes
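The level of detail the referee asks for might look something like the toy specification below: a patient simulator with an explicit probability of vague replies and gaps in the history, plus an episode reward with a per-turn cost. Every name and value here (`ToyPatient`, `withhold_prob`, `turn_cost`) is hypothetical; this is not the paper's simulator or reward, and it omits the information-gain terms the authors mention.

```python
import random

class ToyPatient:
    """Hypothetical simulator: answers one symptom query at a time."""

    def __init__(self, symptoms, withhold_prob=0.2, seed=0):
        self.symptoms = dict(symptoms)      # symptom name -> True/False
        self.withhold_prob = withhold_prob  # chance of a vague, non-cooperative reply
        self.rng = random.Random(seed)      # seeded for reproducible episodes

    def answer(self, symptom):
        if self.rng.random() < self.withhold_prob:
            return "not sure"               # non-cooperative or ambiguous response
        if symptom not in self.symptoms:
            return "not sure"               # incomplete history: fact unknown to patient
        return "yes" if self.symptoms[symptom] else "no"

def episode_reward(final_diagnosis, true_diagnosis, n_turns, turn_cost=0.05):
    """Toy reward: 1 for an exact match, minus a small cost per question asked."""
    correct = 1.0 if final_diagnosis == true_diagnosis else 0.0
    return correct - turn_cost * n_turns

# Example episode with made-up values.
patient = ToyPatient({"fever": True, "cough": False}, withhold_prob=0.0)
replies = [patient.answer(s) for s in ("fever", "cough", "rash")]
reward = episode_reward("flu", "flu", n_turns=len(replies))
```

Stating parameters such as the withholding probability explicitly would let readers judge whether the training environment is harder or easier than a real clinic.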
Circularity Check
No circularity: empirical RL training and external validation on constructed dataset
The paper describes an empirical multi-agent RL framework trained on the newly introduced MTMedDialog dataset, with performance claims (70% exact match rate, outperformance of frontier models) resting on blinded human assessments and real-patient trials rather than any closed-form derivations, fitted parameters renamed as predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems appear in the abstract or description, and the evaluation uses external benchmarks and live patient interactions that are independent of the training loop. This structure is self-contained against external benchmarks with no load-bearing self-citation chains or reductions by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
  A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.