Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

Jin Song Dong; Xiaoyang Fan; Yufan Cai; Zhe Hou

arxiv: 2605.25566 · v1 · pith:QAJ4ZVHTnew · submitted 2026-05-25 · 💻 cs.AI

Pith reviewed 2026-06-29 21:47 UTC · model grok-4.3

keywords: neuro-symbolic reasoning, fuzzy logic, explainable medical AI, large language models, logic programming, verifiable diagnosis, clinical decision support

The pith

A neuro-symbolic framework aligns LLMs with formal logic to produce explainable and verifiable disease diagnoses from patient narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that uses LLMs to pull medical details from patient stories and converts them into symbolic rules based on fuzzy logic. This setup allows two stages of reasoning: first generalizing patterns from the data, then checking the conclusions with a logic engine. The result is diagnoses whose steps can be checked and changed if needed, addressing the lack of transparency in standard LLMs. A reader would care because medical decisions need to be trustworthy and open to review by doctors. The approach aims to match the accuracy of top LLMs while adding formal verifiability.

Core claim

Patient descriptions and clinical guidelines are embedded into a neural knowledge base where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns. These are decoded into a symbolic knowledge base in fuzzy logic and declarative rules. Two-stage reasoning consists of inductive symbolic generalization to capture diagnostic patterns and inference verification via a logic programming engine. Symptoms are treated as fuzzy predicates with probabilistic weights, producing auditable, adjustable inference paths compatible with physician feedback and supporting iterative refinement through formal rules.
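To make the idea of symptoms as weighted fuzzy predicates concrete, here is a minimal Python sketch. It is not the paper's implementation: the symptom names, membership degrees, rule weights, and the choice of the minimum t-norm for conjunction are all assumptions made for illustration, since the abstract does not state which operators the authors use.

```python
from dataclasses import dataclass

# Hypothetical fuzzy facts an LLM might extract from one narrative.
# Keys are symptom predicates; values are membership degrees in [0, 1].
facts = {"fever": 0.8, "cough": 0.6, "chest_pain": 0.2}


@dataclass(frozen=True)
class Rule:
    diagnosis: str
    body: tuple    # symptom predicates that must all hold
    weight: float  # rule confidence in [0, 1] (assumed, not from the paper)


def fire(rule, facts):
    """Degree to which `rule` supports its diagnosis.

    Conjunction uses the minimum t-norm (an assumption); the rule weight
    scales the result. Symptoms missing from `facts` count as degree 0.
    """
    body_degree = min(facts.get(s, 0.0) for s in rule.body)
    return rule.weight * body_degree


# Hypothetical declarative rules.
rules = [
    Rule("influenza", ("fever", "cough"), 0.9),
    Rule("pneumonia", ("fever", "cough", "chest_pain"), 0.95),
]

scores = {r.diagnosis: fire(r, facts) for r in rules}
for diagnosis, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{diagnosis}: {score:.2f}")
```

With these toy numbers the influenza rule scores about 0.54 and the pneumonia rule about 0.19, because the weakest premise in each rule body bounds the whole conjunction. A physician who disagrees can change a weight or a rule body and rerun, which is the kind of adjustability the core claim describes.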

What carries the argument

Neuro-symbolic reasoning framework that decodes LLM outputs into fuzzy logic predicates and declarative rules for two-stage inductive and verificatory inference.

If this is right

Inference paths are auditable, adjustable, and compatible with physician feedback.
Misalignments between generated diagnoses and ground truth can be traced and corrected via formal rules.
The system achieves performance comparable to state-of-the-art LLMs on public benchmarks while adding interpretability.
It supports strong generalization and verifiable step-by-step reasoning chains.
The framework reconciles symbolic reasoning with LLMs for real-world clinical narratives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Doctors could modify the declarative rules to add new clinical insights directly.
The method could be tested on private hospital data to check robustness beyond public benchmarks.
Fuzzy weights might allow modeling uncertainty in multi-disease scenarios by combining multiple paths.
Similar alignment could improve LLM use in other regulated fields like legal reasoning.
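The multi-path idea in the list above can be sketched as follows. This is an editorial illustration only: the paper does not say how degrees from several inference paths would be combined, so both aggregation operators below (maximum and probabilistic sum) and the path degrees are assumptions.

```python
from functools import reduce


def combine_max(degrees):
    """Fuzzy disjunction with the maximum operator: the strongest path wins."""
    return max(degrees, default=0.0)


def combine_prob_sum(degrees):
    """Probabilistic sum, 1 - prod(1 - d): independent paths reinforce each other."""
    return 1.0 - reduce(lambda acc, d: acc * (1.0 - d), degrees, 1.0)


# Hypothetical degrees of support from separate inference paths per diagnosis.
paths = {"influenza": [0.54, 0.30], "pneumonia": [0.19]}

for diagnosis, degrees in paths.items():
    print(diagnosis, combine_max(degrees), round(combine_prob_sum(degrees), 3))
```

The maximum ignores corroborating evidence, while the probabilistic sum rewards it but presumes the paths are independent. Which behaviour is clinically appropriate is exactly the kind of choice a physician-editable rule base would need to expose.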

Load-bearing premise

LLMs can accurately extract structured medical entities, temporal relations, and fuzzy symptom patterns from natural language patient narratives without introducing critical errors or information loss.

What would settle it

Finding cases where the LLM extraction step produces incomplete or inaccurate symbolic representations that cause the logic engine to output wrong diagnoses not caught by verification.

Figures

Figures reproduced from arXiv: 2605.25566 by Jin Song Dong, Xiaoyang Fan, Yufan Cai, Zhe Hou.

**Figure 1.** The Neuro-symbolic Cycle. Context from §3 (Approach): "Our framework comprises three tightly coupled modules: (A) formal knowledge construction and reasoning toolchain, (B) a neuro-symbolic learning cycle that evolves the rule base through both data-driven updates and physician feedback, as shown in ..." [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

**Figure 2.** The Medical Diagnosis Framework. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

Original abstract

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.
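The two-stage reasoning described in the abstract can be pictured with a small Python sketch. Everything here is a stand-in: the cases, the `min_degree` cutoff, and the induction heuristic (keep a symptom for a diagnosis when its weakest degree across that diagnosis's cases stays above the cutoff) are assumptions, not the authors' generalization operator or logic engine.

```python
from collections import defaultdict

# Hypothetical encoded narratives: (fuzzy facts, ground-truth diagnosis).
cases = [
    ({"fever": 0.9, "cough": 0.7, "rash": 0.1}, "influenza"),
    ({"fever": 0.8, "cough": 0.6}, "influenza"),
    ({"rash": 0.9, "itching": 0.8, "fever": 0.2}, "dermatitis"),
    ({"rash": 0.7, "itching": 0.9}, "dermatitis"),
]


def induce_rules(cases, min_degree=0.5):
    """Stage 1 (inductive generalization, toy version).

    For each diagnosis, keep the symptoms whose weakest degree across all
    cases of that diagnosis is at least `min_degree`; that weakest degree
    becomes the symptom's threshold in the induced rule.
    """
    by_dx = defaultdict(list)
    for facts, dx in cases:
        by_dx[dx].append(facts)
    rules = {}
    for dx, group in by_dx.items():
        symptoms = set().union(*group)
        floor = {s: min(f.get(s, 0.0) for f in group) for s in symptoms}
        rules[dx] = {s: d for s, d in floor.items() if d >= min_degree}
    return rules


def verify(diagnosis, facts, rules):
    """Stage 2 (inference verification, toy version).

    Accept a proposed diagnosis only if every induced threshold is met.
    The trace lists (symptom, observed degree, required degree).
    """
    required = rules.get(diagnosis, {})
    trace = [(s, facts.get(s, 0.0), t) for s, t in sorted(required.items())]
    ok = bool(trace) and all(degree >= t for _, degree, t in trace)
    return ok, trace


rules = induce_rules(cases)
new_patient = {"fever": 0.85, "cough": 0.65}
print(verify("influenza", new_patient, rules))
print(verify("dermatitis", new_patient, rules))
```

The returned trace is the point of the exercise: each accepted or rejected diagnosis comes with the premises that were checked and the thresholds they were checked against, which is what makes the path auditable.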

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a neuro-symbolic setup that feeds LLM-extracted fuzzy predicates into a logic engine for auditable diagnosis, but supplies no numbers, no error analysis on the extraction step, and no details on how the benchmarks were run.

The main thing to know is that this is a high-level proposal for turning patient narratives into a fuzzy-logic knowledge base via LLMs, then running inductive generalization and logic-program verification on top. It claims the result is explainable, adjustable by physicians, and formally verifiable.

What the paper does is apply an existing neuro-symbolic pattern to clinical narratives, with fuzzy predicates for symptoms and a two-stage process that tries to reconcile statistical extraction with rule-based checking. The intent to support iterative correction when the output mismatches ground truth is sensible for a medical setting.

The soft spot is exactly where the stress-test note points: the whole claim of formal verifiability rests on the LLM-to-symbolic decoding being faithful, yet the abstract gives no mechanism, bounds, or measurements for extraction errors, hallucinated relations, or fuzzy-value misassignments. Once those errors are baked into the KB, the downstream logic engine just verifies whatever it receives. The validation claim is stated but not supported by any numbers, baselines, or error breakdowns, so there is no way to judge whether the added machinery improves anything or simply matches black-box performance at higher cost.
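The propagation concern can be shown in a few lines of Python. This is a toy illustration, not the paper's engine: the rule, the narrative, the membership degrees, and the 0.5 threshold are all hypothetical.

```python
def derive(facts, rule_body, threshold=0.5):
    """Return (holds, proof); proof lists each premise with its degree."""
    proof = [(s, facts.get(s, 0.0)) for s in rule_body]
    return all(d >= threshold for _, d in proof), proof


# Hypothetical rule: meningitis <- fever, neck_stiffness.
rule_body = ("fever", "neck_stiffness")

# Hypothetical narrative: "I have a fever, and my neck is a bit sore
# from sleeping badly."
faithful_kb = {"fever": 0.8, "neck_stiffness": 0.1}
hallucinated_kb = {"fever": 0.8, "neck_stiffness": 0.9}  # extraction error

faithful_ok, faithful_proof = derive(faithful_kb, rule_body)
halluc_ok, halluc_proof = derive(hallucinated_kb, rule_body)

print(faithful_ok, faithful_proof)
print(halluc_ok, halluc_proof)
```

Both derivations are formally valid with respect to the facts they were given. The second one produces a clean, auditable proof of a conclusion the narrative does not support, so verifiability of the inference says nothing about fidelity of the extraction.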

This is the kind of paper that might interest a reading group working on trustworthy medical AI. It deserves a serious referee only if the full manuscript actually contains the missing experiments and analysis; on the abstract alone the evidence is too thin to evaluate the central claim.

Referee Report

3 major / 1 minor

Summary. The paper proposes a neuro-symbolic framework that embeds patient narratives and clinical guidelines into a neural knowledge base, uses LLMs to extract structured entities, temporal relations, and fuzzy symptom patterns, decodes them into a symbolic KB in fuzzy logic and declarative rules, then applies two-stage reasoning (inductive symbolic generalization followed by logic programming engine verification) to produce auditable, adjustable, and physician-feedback-compatible diagnoses. It claims this yields explainable and formally verifiable medical diagnosis with performance comparable to SOTA LLMs on public benchmarks while adding interpretable reasoning paths.

Significance. If the extraction-to-symbolic step is shown to be reliable and the claimed benchmark results hold with proper controls, the work could meaningfully advance trustworthy clinical AI by combining LLM adaptability with formal verifiability and uncertainty handling via fuzzy predicates. The emphasis on traceable inference paths and iterative refinement addresses a recognized gap in purely neural medical systems. The absence of any quantitative results, error analysis, or benchmark details in the manuscript text prevents assessment of whether these benefits are realized.

major comments (3)

[Abstract / §3] Abstract and implied §3: The central claim of 'formally verifiable' and 'auditable' diagnoses rests on the decoding step producing a faithful symbolic KB from LLM outputs, yet no mechanism, error bounds, or verification procedure is described for detecting LLM hallucinations, temporal relation errors, or fuzzy membership misassignments before the logic engine runs. Any mismatch propagates undetected into the inference verification stage.
[Abstract] Abstract: The statement that the framework was 'validate[d] ... on public benchmarks' with 'performance comparable to state-of-the-art LLMs' is unsupported by any metrics, tables, baseline comparisons, or dataset descriptions, rendering the empirical contribution unevaluable and undermining the claim of effective reconciliation of symbolic reasoning and LLMs.
[Abstract] Abstract: The two-stage reasoning (inductive generalization then logic programming verification) is presented as load-bearing for the verifiability guarantee, but no formal definition of the fuzzy predicates, the inductive generalization operator, or the logic programming engine semantics is supplied, leaving the 'formally verifiable' property without a concrete foundation.

minor comments (1)

The abstract refers to 'Section 3 (implied by abstract)' for the extraction process; explicit section numbering and a high-level architecture diagram would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details where the current presentation is incomplete.

Referee: [Abstract / §3] Abstract and implied §3: The central claim of 'formally verifiable' and 'auditable' diagnoses rests on the decoding step producing a faithful symbolic KB from LLM outputs, yet no mechanism, error bounds, or verification procedure is described for detecting LLM hallucinations, temporal relation errors, or fuzzy membership misassignments before the logic engine runs. Any mismatch propagates undetected into the inference verification stage.

Authors: We agree that the abstract does not explicitly describe pre-inference detection mechanisms for hallucinations or misassignments. The two-stage process relies on the logic programming engine for verification, but to strengthen the claim we will add a dedicated paragraph in the revised abstract and §3 detailing consistency checks, temporal relation validation against guidelines, and probabilistic thresholds on fuzzy predicates. Revision: yes.
Referee: [Abstract] Abstract: The statement that the framework was 'validate[d] ... on public benchmarks' with 'performance comparable to state-of-the-art LLMs' is unsupported by any metrics, tables, baseline comparisons, or dataset descriptions, rendering the empirical contribution unevaluable and undermining the claim of effective reconciliation of symbolic reasoning and LLMs.

Authors: The abstract summarizes results that appear in the experiments section of the full manuscript. However, the referee is correct that the abstract itself provides no metrics or dataset details. We will revise the abstract to include key quantitative results, baseline comparisons, and dataset references. Revision: yes.
Referee: [Abstract] Abstract: The two-stage reasoning (inductive generalization then logic programming verification) is presented as load-bearing for the verifiability guarantee, but no formal definition of the fuzzy predicates, the inductive generalization operator, or the logic programming engine semantics is supplied, leaving the 'formally verifiable' property without a concrete foundation.

Authors: Section 3 supplies the formal definitions of fuzzy predicates, the inductive generalization operator, and the semantics of the logic programming engine. The abstract omits these details. We will revise to include a concise formal summary in the abstract to make the foundation explicit without lengthening the text excessively. Revision: yes.
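The pre-inference checks promised in the first response could look something like the sketch below. It is a guess at their shape, not the authors' procedure: the function name `validate_kb`, the 0.3 threshold, and the representation of temporal relations as (earlier, later) pairs are assumptions. It checks internal consistency only, not agreement with clinical guidelines, and it detects only direct two-way temporal contradictions, not longer cycles.

```python
def validate_kb(facts, before, min_degree=0.3):
    """Screen an extracted KB before it reaches the logic engine.

    facts:  {predicate: membership degree}
    before: list of (earlier_event, later_event) pairs
    Returns (clean_facts, issues).
    """
    issues = []
    clean = {}
    for name, degree in facts.items():
        if not 0.0 <= degree <= 1.0:
            issues.append(f"{name}: degree {degree} outside [0, 1]")
        elif degree < min_degree:
            issues.append(f"{name}: degree {degree} below threshold {min_degree}")
        else:
            clean[name] = degree

    pairs = set(before)
    for a, b in before:
        # Report each contradictory pair once (a <= b picks one ordering).
        if a <= b and (b, a) in pairs:
            issues.append(f"temporal conflict: {a} before {b} and {b} before {a}")
    return clean, issues


# Hypothetical extraction output with three problems.
facts = {"fever": 0.8, "cough": 1.4, "rash": 0.1}
before = [("fever", "cough"), ("cough", "fever")]

clean, issues = validate_kb(facts, before)
print(clean)
for issue in issues:
    print(issue)
```

Checks of this kind catch malformed output, but they cannot catch a well-formed fact that is simply wrong, which is why the referee's request for measured extraction error rates still stands.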

Circularity Check

0 steps flagged

No circularity: framework proposal is self-contained with external benchmark validation

The paper describes a neuro-symbolic pipeline (LLM entity/relation extraction into fuzzy symbolic KB, followed by inductive generalization and logic-program verification) without any equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The central claim of verifiability is presented as resting on the logic engine operating on the decoded KB, with performance evaluated on public benchmarks; this constitutes an independent check rather than a circular reduction. The extraction step is an assumption, not a circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM extraction can be losslessly mapped to fuzzy predicates and that fuzzy logic plus logic programming can faithfully represent clinical diagnostic standards.

axioms (2)

Domain assumption: Fuzzy logic predicates with probabilistic weights can represent uncertain medical symptoms and relations extracted from text.
Stated in the abstract, where symptoms are treated as fuzzy predicates.
Domain assumption: A logic programming engine can validate diagnoses against clinical standards once they are encoded symbolically.
Invoked in the two-stage reasoning description.


[1] [1]

Goldberger A. et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new re- search resource for complex physiologic signals.Circulation, 101(23):e215–e220, 2000

2000

[2] [2]

MIMIC-IV , a freely accessible electronic health record dataset.Scientific Data, 10(1), 2023

Johnson A., Bulgarelli L., Shen L., et al. MIMIC-IV , a freely accessible electronic health record dataset.Scientific Data, 10(1), 2023

2023

[3] [3]

MIMIC-IV (version 3.1).PhysioNet, 2024

Johnson A., Bulgarelli L., Pollard T., Gow B., Moody B., Horng S., Celi L.A., and Mark R. MIMIC-IV (version 3.1).PhysioNet, 2024

2024

[4] [4]

Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, and Vince I. Madai. Explain- ability for Artificial Intelligence in Healthcare: A Multidisciplinary Perspective.BMC Medical Informatics and Decision Making, 20(1):310, 2020

2020

[5] [5]

Rajan, Dean F

Viraj Bhise, Suja S. Rajan, Dean F. Sittig, Robert O. Morgan, Pooja Chaudhary, and Hardeep Singh. Defining and Measuring Diagnostic Uncertainty in Medicine: A Systematic Review. Journal of General Internal Medicine, 33:103–115, 2018

2018

[6] [6]

Felix Busch, Lena Hoffmann, Christopher Rueger, Elon H. C. van Dijk, Rawen Kader, Esteban Ortiz-Prado, Marcus R. Makowski, Luca Saba, Martin Hadamitzky, Jakob Nikolas Kather, Daniel Truhn, Renato Cuocolo, Lisa C. Adams, and Keno K. Bressem. Current applications and challenges in large language models for patient care: a systematic review.Communications Me...

2025

[7] [7]

Roentgen: vision-language foundation model for chest x-ray generation.arXiv preprint arXiv:2211.12737, 2022

Pierre Chambon, Christian Bluethgen, Jean-Benoit Delbrouck, Rogier Van der Sluijs, Małgorzata Połacin, Juan Manuel Zambrano Chaves, Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P Langlotz, and Akshay Chaudhari. Roentgen: vision-language foundation model for chest x-ray generation.arXiv preprint arXiv:2211.12737, 2022

work page arXiv 2022

[8] [8]

Stewart, and Jimeng Sun

Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F. Stewart, and Jimeng Sun. GRAM: Graph-based Attention Model for Healthcare Representation Learning. InProceed- ings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 787–795, New York, NY , USA, 2017. Association for Computing Machinery

2017

[9] [9]

d’Avila Garcez and Lu ´ıs C

Artur S. d’Avila Garcez and Lu ´ıs C. Lamb. Neurosymbolic AI: The 3rd Wave.Artificial Intelligence Review, 56(11):12387–12406, 2023

2023

[10] [10]

Bioethics in the era of artificial intelligence (AI).Revista Latinoamericana de Bio´etica, 22:8–10, 06 2022

Fabio Diaz. Bioethics in the era of artificial intelligence (AI).Revista Latinoamericana de Bio´etica, 22:8–10, 06 2022

2022

[11] [11]

Hugging Face: The AI community building the future.https:// huggingface.co, 2023

Hugging Face. Hugging Face: The AI community building the future.https:// huggingface.co, 2023

2023

[12] [12]

symptom to diagnosis on Hugging Face.https://huggingface.co/ datasets/gretelai/symptom_to_diagnosis, 2023

Gretel.ai. symptom to diagnosis on Hugging Face.https://huggingface.co/ datasets/gretelai/symptom_to_diagnosis, 2023

2023

[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.

[14] Paul K. J. Han, William M. P. Klein, and Neeraj K. Arora. Varieties of Uncertainty in Health Care: A Conceptual Taxonomy. Medical Decision Making, 31(6):828–838, 2011.

[15] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. Causability and explainability of artificial intelligence in medicine. WIREs Data Mining and Knowledge Discovery, 9(4):e1312, 2019.

[16] Zhenyu Huang, Xianlai Chen, Yunbo Wang, Jincai Huang, and Xing Zhao. A Survey on Biomedical Automatic Text Summarization with Large Language Models. Information Processing & Management, 62(5):104216, 2025.

[17] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12), March 2023.

[18] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 2021.

[19] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China, Novem...

[20] Yunsoo Kim, Jinge Wu, Yusuf Abdulle, and Honghan Wu. MedExQA: Medical Question Answering Benchmark with Multiple Explanations. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 167–181, Bangkok, Thailand, August 2024. Association for Computational Linguistics.

[21] Lavita AI. ChatDoctor-iCliniq on Hugging Face. https://huggingface.co/datasets/lavita/ChatDoctor-iCliniq, 2024.

[22] Chenqian Le, Ziheng Gong, Chihang Wang, Haowei Ni, Panfeng Li, and Xupeng Chen. Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs. arXiv preprint arXiv:2506.12182, 2025.

[23] Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine, 388(13):1233–1239, 2023.

[24] Carson K. Leung, Evan W.R. Madill, Joglas Souza, and Christine Y. Zhang. Towards Trustworthy Artificial Intelligence in Healthcare. In 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI), pages 626–632, 2022.

[25] Jianning Li, Amin Dada, Behrus Puladi, Jens Kleesiek, and Jan Egger. ChatGPT in healthcare: A taxonomy and systematic review. Computer Methods and Programs in Biomedicine, 245:108013, 2024.

[26] Yunxiang Li, Zihan Li, Kai Zhang, et al. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus, 15(6):e40895, 2023.

[27] Zhengliang Liu, Yue Huang, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Yiwei Li, Peng Shu, et al. DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4. arXiv preprint arXiv:2303.11032, 2023.

[28] Mary M Lucas, Justin Yang, Jon K Pomeroy, and Christopher C Yang. Reasoning with large language models for medical question answering. Journal of the American Medical Informatics Association, 31(9):1964–1975, 2024.

[29] Thomas Lukasiewicz and Umberto Straccia. Managing uncertainty and vagueness in description logics for the Semantic Web. Web Semantics, 6(4):291–308, November 2008.

[30] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 09 2022.

[31] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural Probabilistic Logic Programming. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[32] National Academies of Sciences, Engineering, and Medicine. Improving Diagnosis in Health Care. The National Academies Press, Washington, DC, 2015.

[33] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training Language Models to Follow Instructions with Human Feedb...

2022

[34] Tim Rocktäschel and Sebastian Riedel. End-to-end Differentiable Proving. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[35] Luciano Serafini and Artur d’Avila Garcez. Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge. In International Workshop on Neural-Symbolic Learning and Reasoning (NeSy), 2016.

[36] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Co...

2023

[37] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas...

2025

[38] Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. Temporal reasoning over clinical text: the state of the art. Journal of the American Medical Informatics Association, 20(5):814–819, 2013.

[39] Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-Augmented Language Models. In The Eleventh International Conference on Learning Representations, 2023.

[40] Sheng Wang, Zihao Zhao, Xi Ouyang, Tianming Liu, Qian Wang, and Dinggang Shen. Interactive computer-aided diagnosis on medical image using large language models. Communications Engineering, 3:133, 2024.

[41] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association, 31(9):1833–1843, 04 2024.

[42] Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E. Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B. Costa, Mona G. Flores, Ying Zhang, Tanja Magoc, Christopher A. Harle, Gloria Lipori, Duane A. Mitchell, William R. Hogan, Elizabeth A. Shenkman, Jiang Bian, and Yonghui Wu. A large language model for electronic health recor...

2022

[43] L.A. Zadeh. Fuzzy Logic = Computing with Words. IEEE Transactions on Fuzzy Systems, 4(2):103–111, 1996.