Medical Reasoning with Large Language Models: A Survey and MR-Bench
Pith reviewed 2026-05-15 10:17 UTC · model grok-4.3
The pith
A benchmark from real hospital data reveals that large language models perform markedly worse on genuine clinical decisions than on medical exams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Medical reasoning is defined as an iterative process of abduction, deduction, and induction; existing LLM approaches are grouped into seven major technical routes; and consistent cross-benchmark testing on MR-Bench, constructed from real-world hospital data, demonstrates a pronounced performance gap between exam-level accuracy and results on authentic clinical decision tasks.
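The iterative abduction-deduction-induction framing can be sketched as a simple loop. Everything below (the toy knowledge base, function names, and scoring rule) is a hypothetical illustration of the framing, not an implementation from the paper.

```python
# Illustrative sketch of the abduction-deduction-induction loop used to
# frame clinical reasoning. The knowledge base and all names are made up.

KB = {  # toy mapping: hypothesis -> findings it predicts
    "flu": {"fever", "cough", "myalgia"},
    "covid": {"fever", "cough", "anosmia"},
}

def abduce(obs):
    # abduction: propose any hypothesis that explains some observation
    return {h: 0.5 for h, findings in KB.items() if findings & obs}

def deduce(hyps):
    # deduction: findings we would expect if the current hypotheses hold
    return set().union(*(KB[h] for h in hyps))

def induce(hyps, obs):
    # induction: re-weight each hypothesis by its observed support
    return {h: len(KB[h] & obs) / len(KB[h]) for h in hyps}

def reason(obs, incoming_evidence, rounds=3):
    hyps = abduce(obs)
    for _ in range(rounds):
        expected = deduce(hyps)                  # what to look for next
        obs = obs | (incoming_evidence & expected)  # evidence arrives over time
        hyps = induce(hyps, obs)                 # update beliefs
    return max(hyps, key=hyps.get)

print(reason({"fever", "cough"}, {"anosmia"}))  # -> covid
```

The point of the sketch is the loop shape: hypotheses are generated, tested against newly gathered evidence, and re-weighted, rather than answered in a single exam-style pass.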
What carries the argument
MR-Bench, a dataset drawn from real hospital records that tests iterative medical reasoning under realistic clinical conditions rather than exam-style questions.
If this is right
- Future LLM development must target robust handling of evolving evidence and patient context rather than isolated factual recall.
- Unified evaluation settings across benchmarks allow clearer comparison of training-based versus training-free reasoning methods.
- Deployment in clinical environments requires new techniques that close the observed accuracy drop on authentic decision tasks.
- Existing methods grouped in the seven routes need targeted adaptation for safety-critical settings.
Where Pith is reading between the lines
- The iterative abduction-deduction-induction framing could guide development of hybrid systems that combine LLMs with external knowledge bases updated in real time.
- MR-Bench-style construction from local records might be replicated in other high-stakes domains such as legal case analysis or scientific hypothesis testing.
- Persistent gaps would imply that full clinical autonomy for LLMs remains distant and that human oversight remains essential for safety.
Load-bearing premise
That a benchmark assembled from existing hospital records adequately represents the full safety-critical, context-dependent, and evidence-evolving character of live clinical decisions.
What would settle it
A controlled experiment in which models reach accuracy on MR-Bench tasks that matches or exceeds their scores on standard medical exams would directly undermine the reported performance gap.
Original abstract
Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys medical reasoning with LLMs, grounding the discussion in cognitive theories of abduction, deduction, and induction to organize existing methods into seven technical routes (training-based and training-free). It performs a unified cross-benchmark evaluation of representative models under a consistent setting and introduces MR-Bench, a new benchmark derived from real-world hospital data. Evaluations on MR-Bench are used to claim a pronounced performance gap between exam-level tasks and authentic clinical decision-making.
Significance. If the central gap claim holds after validation, the work supplies a needed unified taxonomy and evaluation framework for the field while highlighting deployment risks for LLMs in safety-critical settings that require context-dependent, evolving-evidence reasoning. The MR-Bench contribution could serve as a more realistic testbed than existing exam-style benchmarks.
major comments (2)
- [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.
- [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.
minor comments (2)
- [Survey organization] The seven technical routes are introduced in the abstract but would benefit from an explicit enumeration or table early in the survey section to improve readability.
- [Evaluation tables] Ensure all cited benchmarks in the unified evaluation are accompanied by brief descriptions of their task formats so readers can interpret the reported gaps without external lookup.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.
Authors: We agree that additional detail on the MR-Bench construction is needed to support these claims. In the revised manuscript, we will expand the MR-Bench section to include a full protocol: case selection criteria from the hospital dataset (e.g., inclusion of multi-turn interactions and evolving evidence), how temporal context is encoded in the prompts, and the process for anchoring ground-truth labels to verified clinical outcomes and longitudinal patient records. This will clarify that the benchmark goes beyond single-encounter snapshots and better isolates reasoning failures. Revision planned: yes.
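A case record that carries ordered encounters and an outcome-anchored label is one way to make the promised protocol concrete. The schema below is a hypothetical sketch for illustration; the field names are not MR-Bench's actual format.

```python
# Hypothetical sketch of a benchmark case record that goes beyond a
# single-encounter snapshot: ordered encounters plus a label anchored to
# a verified outcome. Field names are illustrative, not MR-Bench's schema.
from dataclasses import dataclass, field

@dataclass
class Encounter:
    timestamp: str            # when the evidence became available
    findings: list[str]       # notes, labs, imaging impressions

@dataclass
class CaseRecord:
    case_id: str
    encounters: list[Encounter] = field(default_factory=list)
    gold_decision: str = ""   # anchored to the verified clinical outcome
    outcome_source: str = ""  # e.g. discharge diagnosis, longitudinal follow-up

    def prompt_at(self, step: int) -> str:
        # Reveal only evidence available up to `step`, preserving order,
        # so a model cannot exploit hindsight from later encounters.
        visible = self.encounters[: step + 1]
        return "\n".join(f"[{e.timestamp}] " + "; ".join(e.findings) for e in visible)

case = CaseRecord(
    case_id="demo-001",
    encounters=[
        Encounter("day 0", ["fever", "productive cough"]),
        Encounter("day 2", ["chest X-ray: right lower lobe consolidation"]),
    ],
    gold_decision="community-acquired pneumonia",
    outcome_source="discharge diagnosis",
)
print(case.prompt_at(0))  # prints only the day-0 evidence
```

Encoding the reveal order in the record itself is what would separate a genuine evolving-evidence task from a format-shifted snapshot, which is the referee's worry.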
Referee: [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.
Authors: We appreciate this observation. The unified evaluation was conducted under a fixed setting with the same prompts and decoding parameters across benchmarks. In the revision, we will add explicit data selection criteria (e.g., how subsets were chosen for comparability) and report statistical controls, including 95% confidence intervals and variance across multiple runs (e.g., 3-5 seeds). This will provide stronger support for the observed performance gap. Revision planned: yes.
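The promised controls are inexpensive to report. The sketch below computes a mean accuracy with a normal-approximation 95% confidence interval over per-seed scores; the seed count and accuracy values are made-up placeholders, not results from the paper.

```python
# Sketch of the promised statistical controls: mean accuracy with a
# normal-approximation 95% confidence interval across seeds. The scores
# below are placeholders, not numbers from the paper.
import statistics

def mean_ci95(scores):
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    half = 1.96 * sem                                    # normal approximation
    return mean, mean - half, mean + half

seed_accuracies = [0.61, 0.58, 0.63, 0.60, 0.59]  # e.g. 5 seeds on one benchmark
m, lo, hi = mean_ci95(seed_accuracies)
print(f"accuracy {m:.3f} (95% CI {lo:.3f}-{hi:.3f})")
# prints: accuracy 0.602 (95% CI 0.585-0.619)
```

Reporting intervals like this per benchmark would let readers judge whether the exam-versus-clinical gap exceeds run-to-run noise.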
Circularity Check
No significant circularity in survey organization or MR-Bench construction
full rationale
The paper is a survey that reviews existing LLM medical reasoning methods, grounds its conceptualization of reasoning (abduction/deduction/induction) in external cognitive theories, organizes methods into seven routes drawn from literature, performs cross-benchmark evaluations on independent datasets, and introduces MR-Bench as a new construction from real-world hospital data. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction. The performance gap claim rests on empirical results from the newly introduced benchmark rather than self-referential definitions or load-bearing self-citations. This is a standard non-circular survey structure.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Clinical reasoning can be usefully modeled as an iterative process of abduction, deduction, and induction.