arxiv: 2605.01189 · v1 · submitted 2026-05-02 · 💻 cs.AI

Recognition: unknown

NEURON: A Neuro-symbolic System for Grounded Clinical Explainability

Anuradha Chandrasekaran , Dimitrios Zikos , Mutlu Mete , Alan Pang , Brady D. Lund , Kewei Sha

Authors on Pith no claims yet

Pith reviewed 2026-05-09 15:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords neuro-symbolic AIclinical explainabilitySNOMED CTSHAPretrieval-augmented generationMIMIC-IVheart failure predictionexplainable AI

0 comments

The pith

A neuro-symbolic system combines medical ontology with machine learning and language models to raise prediction accuracy while generating natural-language clinical explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NEURON to address the lack of transparency in high-performing clinical AI models. It builds structural representations from the SNOMED CT ontology, feeds them into machine learning models for improved predictions, and then applies a retrieval-augmented generation layer with a large language model to convert SHAP feature scores and patient notes into readable explanations. On MIMIC-IV data for acute heart failure mortality, this yields higher AUC scores than baseline models and far better alignment with human judgment than raw SHAP plots alone. The goal is to produce explanations that clinicians can trust because they are grounded in established medical terminology and patient context rather than opaque feature weights.

Core claim

NEURON integrates SNOMED CT ontology-informed structural representations with machine learning models to enhance predictive reliability. A Retrieval-Augmented Generation grounded LLM layer then synthesizes SHAP feature attributions and patient-specific clinical notes into coherent natural-language explanations, producing AUC scores of 0.84-0.88 on MIMIC-IV acute heart failure mortality prediction while scoring 0.85 on human-aligned metrics compared with 0.50 for raw SHAP visualizations.

What carries the argument

The neuro-symbolic pipeline that first maps raw clinical data to SNOMED CT ontology structures, runs them through an ML predictor, and finally uses RAG-augmented LLM synthesis to turn SHAP attributions plus notes into grounded natural-language text.

If this is right

Prediction AUC for acute heart failure mortality rises from the 0.74-0.77 range to 0.84-0.88.
Explanations align with human judgment at 0.85 rather than 0.50 for unprocessed SHAP visualizations.
The same architecture supplies both higher accuracy and narrative transparency for connected health applications.
Ontology grounding bridges raw numerical features and standard medical nomenclature.
The system scales as a deployable engineering solution for trustworthy clinical AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ontology-plus-RAG pattern could be tested on other high-stakes prediction tasks such as sepsis or readmission risk where note quality varies.
If the explanations remain faithful under conflicting notes, the method might reduce the need for separate post-hoc explanation tools in electronic health record systems.
Clinicians might adopt the outputs more readily because the language is tied to SNOMED terms they already use in documentation.
Future work could measure downstream effects on clinical decision time or error rates once the explanations are shown alongside the numeric risk score.

Load-bearing premise

The RAG-grounded LLM layer will continue to produce clinically accurate and faithful explanations even when the retrieved documents or patient notes contain incomplete or conflicting information.

What would settle it

A test set of heart failure cases in which patient notes contain directly contradictory statements about key risk factors, followed by blinded clinician ratings of whether the generated explanations correctly resolve or ignore the contradictions.

Figures

Figures reproduced from arXiv: 2605.01189 by Alan Pang, Anuradha Chandrasekaran, Brady D. Lund, Dimitrios Zikos, Kewei Sha, Mutlu Mete.

**Figure 1.** Figure 1: Overall design of the NEURON framework. Node2Vec embeddings are created based on this graph, and admission-level embeddings are further generated using TF-IDF weighted pooling, which results in one fixed 32-dimensional vector per admission. The first two levels of ontology are automatically selected, and a bag of ontology features mapped to these two levels is derived for each admission. The tabular featur… view at source ↗

**Figure 2.** Figure 2: Overview of the neuro-symbolic modeling and explanation pipeline for in-hospital mortality prediction in AHF. view at source ↗

**Figure 3.** Figure 3: SHAP results - Neuro-symbolic MLP configuration. view at source ↗

**Figure 4.** Figure 4: Excerpt of the constrained prompt template. view at source ↗

**Figure 5.** Figure 5: Representative integrated narrative explanation. view at source ↗

read the original abstract

Clinical AI adoption is hindered by the black-box/grey-box nature of high-performing models, which lack the ontological grounding and narrative transparency required for professional-level explainability. We present NEURON, a neuro-symbolic system designed to enhance both predictive reliability and clinical interpretability. NEURON integrates SNOMED CT ontology-informed structural representations with machine learning models to bridge the gap between raw data and medical nomenclature. To facilitate human-aligned interaction, the system utilizes a Retrieval-Augmented Generation (RAG) grounded LLM layer to synthesize SHAP feature attributions and patient-specific clinical notes into coherent, natural-language explanations. Validated on the MIMIC-IV dataset for Acute Heart Failure mortality prediction, NEURON improved the AUC from 0.74-0.77 to 0.84-0.88 and significantly outperformed raw SHAP visualizations in human-aligned metrics (0.85 vs. 0.50). Our results demonstrate that NEURON offers a robust, scalable engineering solution for deploying trustworthy, human-centered connected health applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NEURON puts SNOMED CT features and a RAG-LLM explainer around a mortality predictor on MIMIC-IV heart failure cases, claiming AUC gains to 0.84-0.88 plus better human scores than raw SHAP, but the methods section is too thin to judge if either result holds up.

read the letter

The core of the paper is a pipeline that feeds SNOMED CT structural representations into ML models for acute heart failure mortality prediction on MIMIC-IV, then routes SHAP attributions and patient notes through a retrieval-augmented LLM to produce natural-language explanations. It reports lifting AUC from the 0.74-0.77 range to 0.84-0.88 and human-aligned preference scores of 0.85 versus 0.50 for plain SHAP visualizations. That specific combination of ontology grounding, SHAP synthesis, and human evaluation on this dataset is not a routine extension of prior work, so the integration itself counts as new engineering effort. The human evaluation step is also a concrete attempt to measure clinical usability rather than just technical metrics. Those are the parts worth noting. The soft spots are straightforward. The abstract gives no train-test split details, no hyperparameter search description, no statistical significance tests on the AUC lift, and no independent check on whether the SNOMED mappings were accurate or complete. The RAG-LLM layer is presented as the source of faithful, grounded explanations, yet there is no mention of how it handles missing values, conflicting notes, or outdated entries—common in real EHR data—and no post-generation verification against the original model outputs. Without those controls, the human preference numbers could reflect fluent text rather than reliable grounding. This paper is for clinical AI researchers and health informatics groups who want a working example of neuro-symbolic plus LLM explanation on public data. It is not yet ready for direct deployment claims, but the empirical results and pipeline idea are concrete enough that a serious referee should see it. I would send it for peer review with a clear request for expanded methods, split information, and at least one experiment testing explanation faithfulness on deliberately incomplete notes.

Referee Report

3 major / 2 minor

Summary. The paper introduces NEURON, a neuro-symbolic system that integrates SNOMED CT ontology-informed structural representations with machine learning models and employs a Retrieval-Augmented Generation (RAG) grounded LLM layer to synthesize SHAP feature attributions and patient-specific clinical notes into coherent natural-language explanations. Validated on the MIMIC-IV dataset for acute heart failure mortality prediction, it reports AUC improvements from 0.74-0.77 to 0.84-0.88 and superior performance on human-aligned metrics (0.85 vs. 0.50) compared to raw SHAP visualizations.

Significance. If the reported gains hold under rigorous experimental controls, the work could meaningfully advance trustworthy clinical AI by bridging predictive performance with ontology-grounded, human-readable explanations, addressing a key barrier to adoption in connected health applications. The neuro-symbolic pipeline combining symbolic ontology, SHAP attributions, and RAG-LLM generation offers a concrete engineering approach worth further development.

major comments (3)

[Results] Results section: The AUC lift from 0.74-0.77 to 0.84-0.88 is presented without any description of train-test split methodology, hyperparameter search, cross-validation procedure, or statistical significance testing, which are required to substantiate the central performance claim.
[Methods] Methods (RAG-LLM component): No mechanisms are described for conflict detection, source attribution, or post-generation verification of the LLM output against the underlying predictive model when retrieved SNOMED documents or MIMIC notes contain gaps or contradictions, undermining the claim of faithful grounded explainability.
[Human Evaluation] Human evaluation: The reported superiority in human-aligned metrics (0.85 vs. 0.50) provides no information on blinding, randomization of presentation order, or controls for evaluator bias, leaving open the possibility that scores reflect presentation artifacts rather than true explanation quality.

minor comments (2)

[Abstract] Abstract: The baseline and improved AUC ranges (0.74-0.77 and 0.84-0.88) would benefit from explicit mapping to the specific baseline models or conditions that produced each endpoint.
Consider adding a dedicated limitations paragraph addressing potential failure modes of the RAG layer on incomplete EHR data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the reporting of our experimental and evaluation protocols. We address each major comment point by point below, with revisions incorporated where the original manuscript was incomplete.

read point-by-point responses

Referee: [Results] Results section: The AUC lift from 0.74-0.77 to 0.84-0.88 is presented without any description of train-test split methodology, hyperparameter search, cross-validation procedure, or statistical significance testing, which are required to substantiate the central performance claim.

Authors: We agree that these methodological details are necessary to substantiate the performance claims. The original manuscript omitted a full description of the protocol. In the revised version we have added a dedicated Experimental Setup subsection describing a patient-level stratified 70/30 train-test split (to prevent leakage), hyperparameter optimization via 5-fold cross-validation combined with grid search, and statistical significance assessed via bootstrap resampling (1,000 iterations) with paired tests confirming p < 0.01 for the AUC gains. These additions are now in Section 4.1. revision: yes
Referee: [Methods] Methods (RAG-LLM component): No mechanisms are described for conflict detection, source attribution, or post-generation verification of the LLM output against the underlying predictive model when retrieved SNOMED documents or MIMIC notes contain gaps or contradictions, undermining the claim of faithful grounded explainability.

Authors: We concur that explicit safeguards are required for trustworthy grounded explanations. The initial submission did not detail these components. We have expanded the RAG-LLM Methods subsection to describe (i) a conflict-detection step that computes semantic similarity between SNOMED CT concepts and SHAP attributions and flags low-consistency cases, (ii) mandatory source attribution that cites specific retrieved document IDs and note excerpts in every generated explanation, and (iii) a post-generation verification pass that re-queries the predictive model to ensure the narrative remains consistent with the original feature attributions and output probabilities. These mechanisms are now fully specified. revision: yes
Referee: [Human Evaluation] Human evaluation: The reported superiority in human-aligned metrics (0.85 vs. 0.50) provides no information on blinding, randomization of presentation order, or controls for evaluator bias, leaving open the possibility that scores reflect presentation artifacts rather than true explanation quality.

Authors: The referee correctly notes that the human-evaluation protocol was under-reported. We have revised the Human Evaluation section to specify that the study was conducted with double-blinding (evaluators unaware of explanation source), randomized presentation order across the 50 participating clinicians, and explicit bias controls including a pilot inter-rater reliability assessment (Krippendorff’s alpha = 0.81) and exclusion of any evaluator with prior model exposure. These details support the validity of the reported metric improvement. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external dataset with no derivations or self-referential fits

full rationale

The paper presents NEURON as a neuro-symbolic pipeline combining SNOMED CT ontology, ML models, SHAP attributions, and a RAG-grounded LLM layer. All reported performance numbers (AUC lift from 0.74-0.77 to 0.84-0.88 and human-aligned metrics 0.85 vs 0.50) are stated as outcomes of validation experiments on the external public MIMIC-IV dataset for acute heart failure mortality prediction. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are therefore not equivalent to their inputs by construction; they rest on independent empirical measurement rather than definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract does not introduce new mathematical axioms or free parameters; it relies on the pre-existing SNOMED CT ontology (standard domain resource) and standard ML practices. No invented entities are postulated.

axioms (1)

domain assumption SNOMED CT provides a reliable, complete structural representation of clinical concepts and relationships for feature grounding.
Invoked when the system maps raw data features to ontology-informed representations; correctness depends on SNOMED CT coverage and mapping quality.

pith-pipeline@v0.9.0 · 5499 in / 1431 out tokens · 33184 ms · 2026-05-09T15:22:12.223607+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

105 extracted references · 85 canonical work pages · 1 internal anchor

[1]

Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI).IEEE Access6 (2018), 52138–52160. doi:10.1109/ACCESS.2018.2870052

work page doi:10.1109/access.2018.2870052 2018
[2]

Sara Alhasan and Reem Alnanih. 2025. Enhancing AI explainability through the EXACT framework: A user-centric approach.IEEE Access13 (2025), 98208–98228. doi:10.1109/ACCESS.2025.3576234

work page doi:10.1109/access.2025.3576234 2025
[3]

Sajid Ali, Tamer Abuhmed, Shaker El-Sappagh, Khan Muhammad, Jose M Alonso-Moral, Roberto Confalonieri, Riccardo Guidotti, Javier Del Ser, Natalia Díaz-Rodríguez, and Francisco Herrera. 2023. Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence.Information Fusion99 (2023), 101805. doi:10.101...

work page doi:10.1016/j.inffus.2023.101805 2023
[4]

Shuroug A Alowais, Sahar S Alghamdi, Nada Alsuhebany, Tariq Alqahtani, Abdulrahman I Alshaya, Sumaya N Almohareb, Atheer Aldairem, Mohammed Alrashed, Khalid Bin Saleh, Hisham A Badreldin, et al. 2023. Revolutionizing healthcare: the role of artificial intelligence in clinical practice.BMC Medical Education23, 1 (2023), 689. doi:10.1186/s12909-023-04698-z

work page doi:10.1186/s12909-023-04698-z 2023
[5]

David Alvarez-Melis and Tommi S Jaakkola. 2018. On the robustness of interpretability methods.arXiv:1806.08049(2018)

work page Pith review arXiv 2018
[6]

Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. 2020. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.BMC Medical Informatics and Decision Making20, 1 (2020), 310. doi:10.1186/s12911-020- 01332-6

work page doi:10.1186/s12911-020- 2020
[7]

Lars W Andersen, Won Young Kim, Maureen Chase, Sharri J Mortensen, Ari Moskowitz, Victor Novack, Michael N Cocchi, Michael W Donnino, et al. 2016. The prevalence and significance of abnormal vital signs prior to in-hospital cardiac arrest.Resuscitation98 (2016), 112–117. doi:10.1016/j.resuscitation.2015.08.016

work page doi:10.1016/j.resuscitation.2015.08.016 2016
[8]

Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al . 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.Information Fusion58 (2020), 82–115. doi:10....

work page doi:10.1016/j.inffus.2019.12.012 2020
[9]

Sujatha Baddam and Bracken Burns. 2025. Systemic inflammatory response syndrome. InStatPearls [Internet]. StatPearls Publishing, Treasure Island, FL, USA. https://www.ncbi.nlm.nih.gov/books/NBK547669/

2025
[10]

Sher Badshah and Hassan Sajjad. 2025. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA. InProceedings of the 9th Widening NLP Workshop. 251–267. doi:10.18653/v1/2025.winlp-main.37

work page doi:10.18653/v1/2025.winlp-main.37 2025
[11]

TT Bauer, S Ewig, R Marre, N Suttorp, T Welte, and CAPNETZ Study Group. 2006. CRB-65 predicts death from community-acquired pneumonia. Journal of Internal Medicine260, 1 (2006), 93–101. doi:10.1111/j.1365-2796.2006.01657.x

work page doi:10.1111/j.1365-2796.2006.01657.x 2006
[12]

Wobo Bekwelem, Pamela L Lutsey, Laura R Loehr, Sunil K Agarwal, Brad C Astor, Cameron Guild, Christie M Ballantyne, and Aaron R Folsom
[13]

doi:10.1016/j.annepidem.2011.06.005

White blood cell count, C-reactive protein, and incident heart failure in the Atherosclerosis Risk in Communities (ARIC) Study.Annals of Epidemiology21, 10 (2011), 739–748. doi:10.1016/j.annepidem.2011.06.005

work page doi:10.1016/j.annepidem.2011.06.005 2011
[14]

Subrato Bharati, M Rubaiyat Hossain Mondal, and Prajoy Podder. 2023. A review on explainable artificial intelligence for healthcare: Why, how, and when?IEEE Transactions on Artificial Intelligence5, 4 (2023), 1429–1442. doi:10.1109/TAI.2023.3266418

work page doi:10.1109/tai.2023.3266418 2023
[15]

Mikołaj Błaziak, Szymon Urban, Weronika Wietrzyk, Maksym Jura, Gracjan Iwanek, Bartłomiej Stańczykiewicz, Wiktor Kuliczkowski, Robert Zymliński, Maciej Pondel, Petr Berka, et al. 2022. An artificial intelligence approach to guiding the management of heart failure patients using predictive models: A systematic review.Biomedicines10, 9 (2022), 2188. doi:10....

work page doi:10.3390/biomedicines10092188 2022
[16]

Michael Böhm, Karl Swedberg, Michel Komajda, Jeffrey S Borer, Ian Ford, Ariane Dubost-Brama, Guy Lerebours, and Luigi Tavazzi. 2010. Heart rate as a risk factor in chronic heart failure (SHIFT): the association between heart rate and outcomes in a randomised placebo-controlled trial.The 16 Anuradha Chandrasekaran, Dimitrios Zikos, Mutlu Mete, Alan Pang, B...

work page doi:10.1016/s0140-6736(10)61259-7 2010
[17]

Giovanni Briganti and Olivier Le Moine. 2020. Artificial intelligence in medicine: Today and tomorrow.Frontiers in Medicine7 (2020), 509744. doi:10.3389/fmed.2020.00027

work page doi:10.3389/fmed.2020.00027 2020
[18]

Simone Britsch, Markward Britsch, Simon Lindner, Leonie Hahn, Verena Schneider-Lindner, Thomas Helbing, Manfred Thiel, Daniel Duerschmied, and Tobias Becher. 2025. An interpretable machine learning algorithm enables dynamic 48-hour mortality prediction during an ICU stay. Communications Medicine5, 1 (2025), 426. doi:10.1038/s43856-025-01192-z

work page doi:10.1038/s43856-025-01192-z 2025
[19]

Federico Cabitza, Raffaele Rasoini, and Gian Franco Gensini. 2017. Unintended consequences of machine learning in medicine.JAMA318, 6 (2017), 517–518. doi:10.1001/jama.2017.7797

work page doi:10.1001/jama.2017.7797 2017
[20]

Zijun Chen, Tingming Li, Sheng Guo, Deli Zeng, and Kai Wang. 2023. Machine learning-based in-hospital mortality risk prediction tool for intensive care unit patients with heart failure.Frontiers in Cardiovascular Medicine10 (2023), 1119699. doi:10.3389/fcvm.2023.1119699

work page doi:10.3389/fcvm.2023.1119699 2023
[21]

Komala Subramanyam Cherukuri and Haihua Chen. 2024. Investigating the applications of AI in oncology from NIH funded projects: Case study of lung cancer and pancreatic cancer.Proceedings of the Association for Information Science and Technology61, 1 (2024), 871–873. doi:10.1002/pra2.1124

work page doi:10.1002/pra2.1124 2024
[22]

Ovidiu Chioncel, Sean P Collins, Andrew P Ambrosy, Mihai Gheorghiade, and Gerasimos Filippatos. 2015. Pulmonary oedema—therapeutic targets. Cardiac Failure Review1, 1 (2015), 38. doi:10.15420/CFR.2015.01.01.38

work page doi:10.15420/cfr.2015.01.01.38 2015
[23]

Carlo Combi, Beatrice Amico, Riccardo Bellazzi, Andreas Holzinger, Jason H Moore, Marinka Zitnik, and John H Holmes. 2022. A manifesto on explainability for artificial intelligence in medicine.Artificial Intelligence in Medicine133 (2022), 102423. doi:10.1016/j.artmed.2022.102423

work page doi:10.1016/j.artmed.2022.102423 2022
[24]

Roberto Confalonieri and Giancarlo Guizzardi. 2025. On the multiple roles of ontologies in explanations for neuro-symbolic AI.Neurosymbolic Artificial Intelligence1 (2025), NAI–240754. doi:10.3233/NAI-240754

work page doi:10.3233/nai-240754 2025
[25]

Daniel De Backer, Patrick Biston, Jacques Devriendt, Christian Madl, Didier Chochrad, Cesar Aldecoa, Alexandre Brasseur, Pierre Defrance, Philippe Gottignies, and Jean-Louis Vincent. 2010. Comparison of dopamine and norepinephrine in the treatment of shock.New England Journal of Medicine362, 9 (2010), 779–789. doi:10.1056/NEJMoa0907118

work page doi:10.1056/nejmoa0907118 2010
[26]

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4443–4458. doi:10.18653/v1/2020.acl-main.408

work page doi:10.18653/v1/2020.acl-main.408 2020
[27]

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning.arXiv:1702.08608(2017)

work page internal anchor Pith review arXiv 2017
[28]

Rudresh Dwivedi, Devam Dave, Het Naik, Smiti Singhal, Rana Omer, Pankesh Patel, Bin Qian, Zhenyu Wen, Tejal Shah, Graham Morgan, et al
[29]

Surv.55, 9, Article 194 (Jan

Explainable AI (XAI): Core ideas, techniques, and solutions.Comput. Surveys55, 9 (2023), 1–33. doi:10.1145/3561048

work page doi:10.1145/3561048 2023
[30]

Hui Feng, Yuntzu Yin, Emiliano Reynares, and Jay Nanavati. 2025. OntologyRAG: Better and faster biomedical code mapping with retrieval- augmented generation (RAG) leveraging ontology knowledge graphs and large language models. InInternational Workshop on Knowledge-Enhanced Information Retrieval. Springer, 71–86. doi:10.1007/978-3-032-02899-0_4

work page doi:10.1007/978-3-032-02899-0_4 2025
[31]

Gregg C Fonarow, Kirkwood F Adams, William T Abraham, Clyde W Yancy, W John Boscardin, ADHERE Scientific Advisory Committee, et al
[32]

doi:10.1001/jama.293.5.572

Risk stratification for in-hospital mortality in acutely decompensated heart failure: classification and regression tree analysis.JAMA293, 5 (2005), 572–580. doi:10.1001/jama.293.5.572

work page doi:10.1001/jama.293.5.572 2005
[33]

Artur d’Avila Garcez and Luis C Lamb. 2023. Neurosymbolic AI: the 3rd wave.Artificial Intelligence Review56, 11 (2023), 12387–12406. doi:10.1007/s10462-023-10448-w

work page doi:10.1007/s10462-023-10448-w 2023
[34]

Manas Gaur and Amit Sheth. 2024. Building trustworthy NeuroSymbolic AI Systems: Consistency, reliability, explainability, and safety.AI Magazine45, 1 (2024), 139–155. doi:10.1002/aaai.12149

work page doi:10.1002/aaai.12149 2024
[35]

Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals.Circulation101, 23 (2000), e215–e220. doi:10.1161/01.CIR.101.23.e215

work page doi:10.1161/01.cir.101.23.e215 2000
[36]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on LLM-as-a-judge.The Innovation(2024). doi:10.1016/j.xinn.2025.101253

work page doi:10.1016/j.xinn.2025.101253 2024
[37]

Rosita Guido, Stefania Ferrisi, Danilo Lofaro, and Domenico Conforti. 2024. An overview on the advancements of support vector machine models in healthcare applications: A review.Information15, 4 (2024), 235

2024
[38]

Hajar Hakkoum, Ali Idri, and Ibtissam Abnane. 2024. Global and local interpretability techniques of supervised machine learning black box models for numerical medical data.Engineering Applications of Artificial Intelligence131 (2024), 107829. doi:10.1016/j.engappai.2023.107829

work page doi:10.1016/j.engappai.2023.107829 2024
[39]

Peter Hase and Mohit Bansal. 2021. When can models learn from explanations? A formal framework for understanding the roles of explanation data.arXiv:2102.02201(2021)

work page arXiv 2021
[40]

2002.Accuracy and stability of numerical algorithms

Nicholas J Higham. 2002.Accuracy and stability of numerical algorithms. SIAM

2002
[41]

Scorecard

Robert R Hoffman, Mohammadreza Jalaeian, Connor Tate, Gary Klein, and Shane T Mueller. 2023. Evaluating machine-generated explanations: a “Scorecard” method for XAI measurement science.Frontiers in Computer Science5 (2023), 1114806. doi:10.3389/fcomp.2023.1114806

work page doi:10.3389/fcomp.2023.1114806 2023
[42]

Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. 2019. Causability and explainability of artificial intelligence in medicine.WIREs Data Mining and Knowledge Discovery9, 4 (2019), e1312. doi:10.1002/widm.1312

work page doi:10.1002/widm.1312 2019
[43]

Delower Hossain and Jake Y Chen. 2025. A study on neuro-symbolic artificial intelligence: Healthcare perspectives.arXiv:2503.18213(2025)

work page arXiv 2025
[44]

Jicheng Huang, Yufeng Cai, Xusheng Wu, Xin Huang, Jianwei Liu, and Dehua Hu. 2024. Prediction of mortality events of patients with acute heart failure in intensive care unit based on deep neural network.Computer Methods and Programs in Biomedicine256 (2024), 108403. NEURON: A Neuro-symbolic System for Grounded Clinical Explainability 17 doi:10.1016/j.cmpb...

work page doi:10.1016/j.cmpb.2024.108403 2024
[45]

Shaoyan Huang, Qiuwang Zhang, Lei Liu, Michael JB Kutryk, and Jianzhong Zhang. 2025. Relationship between albumin-corrected anion gap and short-and medium-term all-cause mortality in heart failure patients with a single ICU admission.Frontiers in Cardiovascular Medicine12 (2025), 1608383. doi:10.3389/fcvm.2025.1608383

work page doi:10.3389/fcvm.2025.1608383 2025
[46]

Sujeong Hur, Yura Lee, Joongheum Park, Yeong Jeong Jeon, Jong Ho Cho, Duck Cho, Dobin Lim, Wonil Hwang, Won Chul Cha, and Junsang Yoo
[47]

doi:10.1038/s41746-025-01958-8

Comparison of SHAP and clinician friendly explanations reveals effects on clinical decision behaviour.NPJ Digital Medicine8, 1 (2025), 578. doi:10.1038/s41746-025-01958-8

work page doi:10.1038/s41746-025-01958-8 2025
[48]

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics. 4198–4205. doi:10.18653/v1/2020.acl-main.386

work page doi:10.18653/v1/2020.acl-main.386 2020
[49]

Pengcheng Jiang, Cao (Danica) Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. 2025. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. InProceedings of the Thirteenth International Conference on Learning Represen- tations (ICLR)

2025
[50]

Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, and Roger Mark. 2024. MIMIC-IV. PhysioNet. doi:10.13026/kpb9-mt58 Version 3.1

work page doi:10.13026/kpb9-mt58 2024
[51]

Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. 2023. MIMIC-IV, a freely accessible electronic health record dataset.Scientific Data10, 1 (2023), 1. doi:10.1038/s41597-022-01899-x

work page doi:10.1038/s41597-022-01899-x 2023
[52]

Md Abdul Kadir, Amir Mosavi, and Daniel Sonntag. 2023. Evaluation metrics for XAI: A review, taxonomy, and practical applications. In2023 IEEE 27th International Conference on Intelligent Engineering Systems (INES). IEEE, 000111–000124. doi:10.1109/INES59282.2023.10297629

work page doi:10.1109/ines59282.2023.10297629 2023
[53]

Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible AI: Re-imagining interpretability and explainability using sensemaking theory. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 702–714. doi:10.1145/3531146.3533135

work page doi:10.1145/3531146.3533135 2022
[54]

Gwladys Kelodjou, Laurence Rozé, Véronique Masson, Luis Galárraga, Romaric Gaudel, Maurice Tchuente, and Alexandre Termier. 2024. Shaping up SHAP: enhancing stability through layer-wise neighbor selection. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 13094–13103. doi:10.1609/aaai.v38i12.29208

work page doi:10.1609/aaai.v38i12.29208 2024
[55]

Arif Khwaja. 2012. KDIGO clinical practice guidelines for acute kidney injury.Nephron Clinical Practice120, 4 (2012), c179–c184. doi:10.1159/ 000339789

2012
[56]

Sujung Lee, Won Ik Cho, Youngrong Lee, Duck Ju Kim, Kyeng Hyun Nam, Sangmin Lee, Jungyo Suh, and Taehoon Ko. 2025. A prompt framework for enhancing LLM-based explainability of medical machine learning models: an intensive care unit application.BMC Medical Informatics and Decision Making25, 1 (2025), 430. doi:10.1186/s12911-025-03239-6

work page doi:10.1186/s12911-025-03239-6 2025
[57]

Marcel Levi and Steven M Opal. 2006. Coagulation abnormalities in critically ill patients.Critical Care10, 4 (2006), 222. doi:10.1186/cc4975

work page doi:10.1186/cc4975 2006
[58]

Avivit Levy, B Riva Shalom, and Michal Chalamish. 2025. A guide to similarity measures and their data science applications.Journal of Big Data 12, 1 (2025), 188. doi:10.1186/s40537-025-01227-1

work page doi:10.1186/s40537-025-01227-1 2025
[59]

Jun Li, Yiwu Sun, Jie Ren, Yifan Wu, and Zhaoyi He. 2025. Machine learning for in-hospital mortality prediction in critically ill patients with acute heart failure: A retrospective analysis based on the MIMIC-IV database.Journal of Cardiothoracic and Vascular Anesthesia39, 3 (2025), 666–674. doi:10.1053/j.jvca.2024.12.016

work page doi:10.1053/j.jvca.2024.12.016 2025
[60]

Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, et al. 2024. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion106 (2024), 102301. doi:10.1016/j.inff...

work page doi:10.1016/j.inffus.2024.102301 2024
[61]

Zeying Lou, Fanghua Zeng, Wenbao Huang, Li Xiao, Kang Zou, and Huasheng Zhou. 2024. Association between the anion-gap and 28-day mortality in critically ill adult patients with sepsis: A retrospective cohort study.Medicine103, 30 (2024), e39029. doi:10.1097/MD.0000000000039029

work page doi:10.1097/md.0000000000039029 2024
[62]

Scott M Lundberg, Gabriel G Erion, and Su-In Lee. 2018. Consistent individualized feature attribution for tree ensembles.arXiv:1802.03888(2018)

work page arXiv 2018
[63]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. InProceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). 4768–4777

2017
[64]

Alexandre Mebazaa, Markku S Nieminen, Milton Packer, Alain Cohen-Solal, Franz X Kleber, Stuart J Pocock, Roopal Thakkar, Robert J Padley, Pentti Põder, Matti Kivikko, et al. 2007. Levosimendan vs dobutamine for patients with acute decompensated heart failure: the SURVIVE Randomized Trial.JAMA297, 17 (2007), 1883–1891. doi:10.1001/jama.297.17.1883

work page doi:10.1001/jama.297.17.1883 2007
[65]

Munib Mesinovic, Peter Watkinson, and Tingting Zhu. 2025. Explainable machine learning for predicting ICU mortality in myocardial infarction patients using pseudo-dynamic data.Scientific Reports15, 1 (2025), 27887. doi:10.1038/s41598-025-13299-3

work page doi:10.1038/s41598-025-13299-3 2025
[66]

Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. 2016. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records.Scientific Reports6, 1 (2016), 26094. doi:10.1038/srep26094

work page doi:10.1038/srep26094 2016
[67]

Kayo Misumi, Yuya Matsue, Kazutaka Nogi, Yudai Fujimoto, Nobuyuki Kagiyama, Takatoshi Kasai, Takeshi Kitai, Shogo Oishi, Eiichi Akiyama, Satoshi Suzuki, et al. 2023. Derivation and validation of a machine learning-based risk prediction model in patients with acute heart failure. Journal of Cardiology81, 6 (2023), 531–536. doi:10.1016/j.jjcc.2023.02.006

work page doi:10.1016/j.jjcc.2023.02.006 2023
[68]

Sina Mohseni, Jeremy E Block, and Eric Ragan. 2021. Quantitative evaluation of machine learning explanations: A human-grounded benchmark. InProceedings of the 26th International Conference on Intelligent User Interfaces. 22–31. doi:10.1145/3397481.3450689

work page doi:10.1145/3397481.3450689 2021
[69]

2020.Interpretable machine learning

Christoph Molnar. 2020.Interpretable machine learning. Lulu.com. 18 Anuradha Chandrasekaran, Dimitrios Zikos, Mutlu Mete, Alan Pang, Brady D. Lund, and Kewei Sha

2020
[70]

Ramesh Nadarajah, Tanina Younsi, Elizabeth Romer, Keerthenan Raveendra, Yoko M Nakao, Kazuhiro Nakao, Farag Shuweidhi, David C Hogg, Ronen Arbel, Doron Zahger, et al. 2023. Prediction models for heart failure in the community: A systematic review and meta-analysis.European Journal of Heart Failure25, 10 (2023), 1724–1738. doi:10.1002/ejhf.2970

work page doi:10.1002/ejhf.2970 2023
[71]

National Library of Medicine. 2025. SNOMED CT United States Edition. https://www.nlm.nih.gov/healthit/snomedct/index.html Accessed: Dec. 8, 2025

2025
[72]

Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice Van Keulen, and Christin Seifert. 2023. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI.Comput. Surveys55, 13s (2023), 1–42. doi:10.1145/3583558

work page doi:10.1145/3583558 2023
[73]

Sidra Naveed, Gunnar Stevens, and Dean Robin-Kern. 2024. An overview of the empirical evaluation of explainable AI (XAI): A comprehensive guideline for user-centered evaluation in XAI.Applied Sciences14, 23 (2024), 11288. doi:10.3390/app142311288

work page doi:10.3390/app142311288 2024
[74]

Fnu Neha, Deepshikha Bhati, and Deepak Kumar Shukla. 2025. Retrieval-augmented generation (RAG) in healthcare: A comprehensive review.AI 6, 9 (2025), 226. doi:10.3390/ai6090226

work page doi:10.3390/ai6090226 2025
[75]

Ke Niu, You Lu, Xueping Peng, and Jingni Zeng. 2022. Fusion of sequential visits and medical ontology for mortality prediction.Journal of Biomedical Informatics127 (2022), 104012. doi:10.1016/j.jbi.2022.104012

work page doi:10.1016/j.jbi.2022.104012 2022
[76]

Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. 2018. On the stability of feature selection algorithms.Journal of Machine Learning Research18, 174 (2018), 1–54

2018
[77]

Atsushi Ohgushi, Takayuki Ohtani, Natsumi Nakayama, Shigeo Asai, Yoshiyuki Ishii, Atsuo Namiki, Manabu Akazawa, and Hirotoshi Echizen
[78]

doi:10.1186/s40780-015-0036-1

Risk of major bleeding at different PT-INR ranges in elderly Japanese patients with non-valvular atrial fibrillation receiving warfarin: a nested case-control study.Journal of Pharmaceutical Health Care and Sciences2, 1 (2016), 2. doi:10.1186/s40780-015-0036-1

work page doi:10.1186/s40780-015-0036-1 2016
[79]

Evandro S Ortigossa, Thales Gonçalves, and Luis Gustavo Nonato. 2024. Explainable artificial intelligence (XAI)—From theory to methods and applications.IEEE Access12 (2024), 80799–80846

2024
[80]

Ciyuan Peng, Feng Xia, Mehdi Naseriparsa, and Francesco Osborne. 2023. Knowledge graphs: Opportunities and challenges.Artificial Intelligence Review56, 11 (2023), 13071–13102. doi:10.1007/s10462-023-10465-9

work page doi:10.1007/s10462-023-10465-9 2023

Showing first 80 references.