arxiv: 2505.19630 · v4 · submitted 2025-05-26 · 💻 cs.CL

Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning

Pith reviewed 2026-05-19 13:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords medical dialogue systems · reinforcement learning · multi-agent collaboration · proactive consultation · multi-turn interaction · clinical decision support · AI for healthcare

The pith

A multi-agent reinforcement learning system trains a doctor agent to ask strategic questions over multiple turns and reach a 70 percent exact diagnostic match rate with real patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that large language models fall short in clinical settings because they expect patients to supply complete symptom lists in one go and because supervised training only copies surface patterns instead of building understanding step by step. By recasting the consultation as a sequential decision process under uncertainty, the authors train an agent to learn an active questioning policy that elicits the missing information needed for a reliable diagnosis. They support this training with a new multi-turn English medical dialogue dataset and then test the resulting agent in both blinded human reviews and actual patient encounters. A reader would care because the approach offers a concrete way for AI to manage routine initial screenings and thereby ease pressure on limited medical staff.

Core claim

DoctorAgent-RL is a reinforcement-learning multi-agent framework that trains the doctor agent to master a questioning methodology rather than to recall answers, so that key patient details emerge progressively through guided multi-turn dialogue and produce an optimal diagnosis, as measured by a 70 percent exact diagnostic match rate in real-patient trials.

What carries the argument

DoctorAgent-RL, the reinforcement-learning multi-agent framework that models consultation as dynamic decision-making under uncertainty and optimizes the doctor's questioning policy to maximize diagnostic information gain.
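
To make "diagnostic information gain" concrete, here is a minimal sketch of how the value of a single yes/no question could be scored as the expected drop in entropy over candidate diagnoses. The abstract does not say how the paper computes this quantity, so the function names, the Bayesian update, and the toy probabilities are all illustrative assumptions rather than the authors' method.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping diagnosis -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, answer):
    """Bayes update, where likelihood[d][answer] is P(answer | diagnosis d)."""
    unnorm = {d: prior[d] * likelihood[d][answer] for d in prior}
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one yes/no question."""
    gain = entropy(prior)
    for answer in ("yes", "no"):
        p_answer = sum(prior[d] * likelihood[d][answer] for d in prior)
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood, answer))
    return gain

# Hypothetical toy numbers, not taken from the paper.
prior = {"flu": 0.5, "cold": 0.5}
fever = {"flu": {"yes": 0.9, "no": 0.1}, "cold": {"yes": 0.1, "no": 0.9}}
cough = {"flu": {"yes": 0.8, "no": 0.2}, "cold": {"yes": 0.8, "no": 0.2}}

# Pick the question with the larger expected information gain.
best = max([("fever", fever), ("cough", cough)],
           key=lambda q: expected_information_gain(prior, q[1]))
print(best[0])
```

In this toy setup the cough question is equally likely under both diagnoses, so its answer leaves the belief unchanged, while the fever question separates them. A learned policy would have to approximate this kind of ranking implicitly from dialogue rewards.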

If this is right

  • The trained agent can perform initial screenings so that human clinicians can devote time to more difficult cases.
  • Reduced misdiagnosis risk and lower overall strain on healthcare resources would follow from wider use of the agent for routine intake.
  • The same policy-learning approach can be applied to any domain in which an expert must gather information through successive questions rather than receive it all at once.
  • Real-patient validation already demonstrates that performance gains seen in simulation transfer to live interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The learned questioning policy could be inspected to see which sequences of questions most reliably surface critical information, offering a data-driven view of efficient clinical inquiry.
  • Pairing the agent with electronic health record access would let it condition questions on prior history and test whether accuracy rises further.
  • The framework might be adapted to non-medical information-gathering dialogues such as technical troubleshooting or legal intake interviews.

Load-bearing premise

The MTMedDialog dataset together with the chosen reinforcement-learning reward function faithfully reproduces the uncertainty and response variability of real clinical encounters.

What would settle it

A follow-up study that runs the agent on a fresh, larger cohort of real patients and records an exact diagnostic match rate well below 70 percent or no better than standard large language models.
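
Whether a follow-up result counts as "well below 70 percent" depends on cohort size, which the abstract does not report. The sketch below computes an exact match rate and a Wilson score interval for it. The string normalisation rule and the two cohort sizes (20 and 200) are hypothetical choices made for illustration, not figures from the paper.

```python
import math

def normalise(label):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(label.lower().split())

def exact_match_rate(predicted, reference):
    """Fraction of cases whose normalised diagnosis strings are identical."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predicted, reference))
    return hits / len(reference)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same 70% observed rate at two hypothetical cohort sizes.
for successes, n in ((14, 20), (140, 200)):
    low, high = wilson_interval(successes, n)
    print(f"n={n}: interval {low:.2f} to {high:.2f}")
```

The interval is much wider for the smaller cohort, which is why the sample size behind the 70 percent figure matters when judging any replication.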

Original abstract

Large language models (LLMs) struggle in real-world clinical consultations. Single-turn consultation systems require patients to describe all symptoms at once, which often leads to unclear complaints and vague diagnoses. Traditional dialogue models, constrained by static supervised learning, are limited to superficially imitating existing dialogue patterns and lack the ability to actively construct understanding in dynamic interactions, thus failing to achieve genuine clinical reasoning. To address these challenges, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework, and train a doctor agent on Qwen2.5-7B-Instruct using this framework. Within this framework, a medical consultation is modeled as a dynamic decision-making process under uncertainty. The core intelligence of the doctor agent is shifted from knowing the answer to learning and mastering a questioning methodology aimed at achieving an optimal diagnosis. Through strategic questioning, it guides the progressive emergence of key patient information in multi-turn dialogues. To support this high-fidelity simulation of the real diagnostic process, we constructed MTMedDialog, a novel English multi-turn medical consultation dataset designed for dynamic, interactive training. To validate its real-world effectiveness, rigorous evaluations including blinded human assessments and trials with real patients were conducted. DoctorAgent-RL outperformed frontier models and achieved a 70% exact diagnostic match rate, confirming its potential as a collaborative tool. By handling initial screenings, it can free clinicians to focus on complex cases, thereby addressing critical issues like physician shortages and misdiagnosis risks while alleviating the strain on healthcare resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DoctorAgent-RL, a multi-agent reinforcement learning framework that trains a doctor agent based on Qwen2.5-7B-Instruct to perform proactive, multi-turn medical consultations. It constructs the MTMedDialog dataset to simulate dynamic diagnostic interactions under uncertainty and reports that the resulting system outperforms frontier models while achieving a 70% exact diagnostic match rate in blinded human assessments and real-patient trials.

Significance. If the performance claims are substantiated with adequate controls, the work could advance clinical AI by shifting from static imitation learning to RL-driven active questioning strategies, offering a practical path to improve initial screenings and alleviate physician workload. The inclusion of real-patient trials provides a stronger test of applicability than purely simulated benchmarks.

major comments (2)
  1. Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.
  2. RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.
minor comments (2)
  1. Introduction: The contrast between 'knowing the answer' and 'learning a questioning methodology' would be strengthened by a brief illustrative dialogue excerpt showing how the RL policy differs from supervised baselines.
  2. Evaluation: Add a table summarizing inter-rater agreement statistics and baseline model scores to make the human assessment results easier to interpret.
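
One common inter-rater agreement statistic for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration. The page does not say which statistic the paper uses, and the rater labels in the example are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical blinded preferences from two reviewers over four dialogues.
reviewer_1 = ["agent", "agent", "baseline", "baseline"]
reviewer_2 = ["agent", "baseline", "baseline", "baseline"]
print(cohens_kappa(reviewer_1, reviewer_2))
```

Reporting a chance-corrected statistic alongside raw agreement would let readers judge how much weight the blinded human assessments can bear.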

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: Abstract: The central claim of a 70% exact diagnostic match rate and outperformance of frontier models in real-patient trials lacks any mention of sample size, patient demographics, control conditions, or verification procedures for the diagnoses; this information is load-bearing for assessing whether the result demonstrates genuine clinical reasoning rather than evaluation artifacts.

    Authors: We agree that the abstract would benefit from additional context on the evaluation to support the claims. The main text describes the blinded human assessments and real-patient trials, including sample sizes, demographics, comparisons to frontier models as controls, and expert verification of diagnoses. We will revise the abstract to concisely incorporate these details while maintaining brevity.

    Revision: yes

  2. Referee: RL Framework and Dataset sections: The reward structure and patient simulator in MTMedDialog are not specified in sufficient detail to determine whether they model variable symptom ambiguity, incomplete histories, and non-cooperative responses; without this, the learned policy may exploit low-variance patterns in the training data instead of acquiring robust diagnostic strategies.

    Authors: We acknowledge the need for greater specificity here. The manuscript presents the overall multi-agent RL setup and MTMedDialog construction, but we will expand the relevant sections with explicit descriptions of the reward function (including terms for information gain under uncertainty) and simulator mechanics for handling symptom ambiguity, incomplete patient histories, and non-cooperative or variable responses, supported by examples.

    Revision: yes
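
The kind of specification the referee asks for can be pictured with a toy episode loop. In the sketch below, a simulated patient sometimes gives an unhelpful reply, and the doctor policy is scored by diagnostic correctness minus a per-turn cost. Every name and number here (`PatientSimulator`, `evasive_prob`, `turn_cost`, the reward shape) is a hypothetical stand-in, since the page does not describe the paper's actual simulator or reward.

```python
import random
from dataclasses import dataclass, field

@dataclass
class PatientSimulator:
    symptoms: dict              # symptom name -> True/False (ground truth)
    diagnosis: str              # reference diagnosis for scoring
    evasive_prob: float = 0.2   # assumed chance of a "not sure" reply
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def answer(self, symptom):
        """Reply to one question, sometimes withholding the information."""
        if self.rng.random() < self.evasive_prob or symptom not in self.symptoms:
            return "not sure"
        return "yes" if self.symptoms[symptom] else "no"

def run_episode(patient, policy, max_turns=5, turn_cost=0.05):
    """Run one consultation and return (reward, transcript)."""
    transcript = []
    for _ in range(max_turns):
        action, value = policy(transcript)
        if action == "diagnose":
            correct = 1.0 if value == patient.diagnosis else 0.0
            return correct - turn_cost * len(transcript), transcript
        transcript.append((value, patient.answer(value)))
    # Turn budget exhausted without a diagnosis: only the turn costs remain.
    return -turn_cost * len(transcript), transcript

def toy_policy(transcript):
    """Ask about fever until a definite reply arrives, then diagnose."""
    for symptom, reply in transcript:
        if symptom == "fever" and reply != "not sure":
            return "diagnose", "flu" if reply == "yes" else "cold"
    return "ask", "fever"

patient = PatientSimulator({"fever": True}, "flu", evasive_prob=0.0)
reward, transcript = run_episode(patient, toy_policy)
print(reward, transcript)
```

Raising `evasive_prob` makes the same policy spend more turns or run out of budget, which is the sort of response variability the referee wants the simulator description to cover.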

Circularity Check

0 steps flagged

No circularity: empirical RL training and external validation on constructed dataset

The paper describes an empirical multi-agent RL framework trained on the newly introduced MTMedDialog dataset. Its performance claims (70% exact match rate, outperformance of frontier models) rest on blinded human assessments and real-patient trials rather than on closed-form derivations, fitted parameters renamed as predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems appear in the abstract or description, and the evaluation uses external benchmarks and live patient interactions that are independent of the training loop. The argument therefore rests on external evidence, with no load-bearing self-citation chains and no results that hold by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Assessment uses only the abstract; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5810 in / 1020 out tokens · 36852 ms · 2026-05-19T13:35:23.489711+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  2. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI · 2025-07 · accept · novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
