pith. machine review for the scientific record.

arxiv: 2605.04085 · v1 · submitted 2026-04-23 · 💻 cs.CY · cs.AI · cs.CL · stat.ME

Recognition: unknown

Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · stat.ME

keywords patient safety · generative AI · clinical summaries · FMECA · risk assessment · LLM · discharge summaries · inter-rater reliability

The pith

A new FMECA framework gives a structured method to identify patient safety risks in LLM-generated clinical summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and validates the first FMECA-based framework for assessing patient safety risks in LLM-generated clinical summaries. An interdisciplinary panel created a taxonomy of 14 failure modes through literature review and brainstorming, then adapted standard FMECA scales for occurrence, severity, and detectability. The framework was tested by applying it to 36 discharge summaries generated by an open LLM (GPT-OSS 120B) from real hospital data, with reviewers scoring independently across two rounds. Inter-rater agreement reached moderate-to-substantial levels for identifying failure modes and good agreement for the severity and detectability scores. Usability feedback was positive, supporting the framework as a reproducible tool for spotting clinically relevant risks before deployment.

Core claim

The central claim is that a novel FMECA framework, built around 14 failure modes organized into categories and scored on adapted 5-point ordinal scales, offers a systematic and reproducible way to prospectively evaluate patient safety risks arising from LLM-generated clinical summaries, as demonstrated by its application to real-world discharge summaries with improving inter-rater reliability and good usability scores.

What carries the argument

The FMECA framework, which organizes risk assessment around a taxonomy of 14 failure modes together with adapted scales for occurrence, severity, and detectability to calculate criticality.
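In standard FMECA practice, the three ordinal scores are combined multiplicatively into a criticality (risk priority) number. The abstract does not state the paper's exact formula, so the multiplicative form, the score semantics, and the example failure modes below are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class FailureModeRating:
    """One reviewer's FMECA rating of a single failure mode (illustrative)."""
    name: str
    occurrence: int     # 1 (rare) .. 5 (frequent)
    severity: int       # 1 (negligible) .. 5 (catastrophic)
    detectability: int  # 1 (easily caught) .. 5 (likely missed)

    def __post_init__(self):
        for score in (self.occurrence, self.severity, self.detectability):
            if not 1 <= score <= 5:
                raise ValueError("FMECA scores must lie on the 1-5 scale")

    @property
    def criticality(self) -> int:
        # Classic multiplicative criticality index: ranges 1..125 on 5-point scales.
        return self.occurrence * self.severity * self.detectability

# Hypothetical ratings, not taken from the paper:
ratings = [
    FailureModeRating("omitted allergy", occurrence=2, severity=5, detectability=4),
    FailureModeRating("wrong date format", occurrence=4, severity=1, detectability=2),
]

# Rank failure modes for mitigation, highest criticality first.
ranked = sorted(ratings, key=lambda r: r.criticality, reverse=True)
```

The multiplicative index makes a rare but catastrophic and hard-to-detect error (2 × 5 × 4 = 40) outrank a frequent but trivial one (4 × 1 × 2 = 8), which is the prioritization behavior FMECA is designed to produce.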

If this is right

  • The framework supplies a proactive, standardized process for identifying clinically relevant risks in AI-generated clinical text before such tools enter routine use.
  • Application to real discharge summaries shows the method can be used by reviewers to annotate outputs consistently across rounds.
  • Good inter-rater agreement on severity and detectability scores supports its use for comparing risks across different LLMs or prompts.
  • High usability ratings indicate the framework can be adopted by interdisciplinary teams without extensive additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could embed the framework into pre-deployment checks for any generative AI tool that produces clinical text.
  • The same failure-mode approach might be extended to other LLM outputs such as diagnostic reasoning or treatment recommendations.
  • Periodic re-application of the framework could track how risks evolve as newer LLM versions are released.
  • Automating parts of the failure-mode detection step could make the method scalable for large volumes of generated content.

Load-bearing premise

The 14 failure modes identified by the expert panel capture all relevant patient safety risks in LLM-generated clinical content, and the adapted 5-point scales are valid and reliable for this new domain.

What would settle it

A follow-up study in which the framework is applied to additional LLM summaries yet misses failure modes that later lead to documented patient harm, or in which separate expert panels produce substantially different sets of failure modes.

read the original abstract

Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
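The abstract's mean SUS of 79.2/100 follows Brooke's standard scoring rule (reference [28]): ten items rated 1-5, odd-numbered items contribute (rating − 1), even-numbered items contribute (5 − rating), and the sum is scaled by 2.5. A minimal sketch; the example ratings are invented, not the paper's data:

```python
def sus_score(responses: list[int]) -> float:
    """Compute a single respondent's SUS score (0-100) from ten 1-5 ratings."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten ratings on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # items 1,3,.. positive-keyed; 2,4,.. negative-keyed
        for i, r in enumerate(responses)
    )
    return total * 2.5

# A hypothetical evaluator who is broadly positive about the framework:
example = [4, 2, 4, 1, 5, 2, 4, 2, 4, 2]
print(sus_score(example))  # prints 80.0
```

Scores above roughly 68 are conventionally read as above-average usability, which is why the paper's 79.2 is reported as "good".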

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper develops and validates a novel FMECA framework for prospective assessment of patient safety risks in LLM-generated clinical summaries. An n=8 interdisciplinary expert panel created a taxonomy of 14 failure modes via literature review and brainstorming, adapted 5-point ordinal scales for occurrence/severity/detectability, applied the framework to 36 GPT-OSS-generated discharge summaries from real Geneva University Hospitals data across two annotation rounds, and reported improved inter-rater agreement plus good usability (mean SUS 79.2/100).

Significance. If the framework's validity and completeness hold, this provides the first structured, reproducible FMECA-based method for identifying clinically relevant risks from generative AI in clinical summarization, addressing a gap as LLMs see wider healthcare use. Strengths include adaptation of an established risk-analysis technique, use of real-world data, expert input, and evidence of practical usability and reliability via inter-rater metrics and SUS scores.

major comments (2)
  1. [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that 'comprehensively' identifies clinically relevant risks, as the framework may miss LLM-specific issues such as hallucinated contraindications or context drift without such checks.
  2. [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.
minor comments (1)
  1. [Abstract] Consider adding a brief limitations paragraph or explicit statement on generalizability (e.g., beyond discharge summaries or the specific open LLM used) to better contextualize the findings.
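The referee's request for exact kappa coefficients refers to a standard computation. For binary presence/absence annotations of one failure mode by two reviewers, Cohen's kappa discounts observed agreement by the agreement expected from each rater's marginal label frequencies. A self-contained sketch with invented annotation vectors:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters' categorical labels on the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired labels"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Illustrative binary annotations (1 = failure mode present) for one failure
# mode across eight summaries; values are made up for this sketch.
a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 0, 0, 0, 1, 1]
```

Here observed agreement is 6/8 = 0.75, chance agreement is 0.5, giving kappa = 0.5, which lands in the "moderate" band of the Landis-Koch convention the paper's "moderate-to-substantial" language invokes.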

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that 'comprehensively' identifies clinically relevant risks, as the framework may miss LLM-specific issues such as hallucinated contraindications or context drift without such checks.

    Authors: The taxonomy was developed following established FMECA practices, which prioritize multidisciplinary expert consensus informed by literature review for novel domains where comprehensive external databases of LLM-specific adverse events do not yet exist. The n=8 panel included clinicians, informaticians, and patient safety experts, and the literature review explicitly covered documented risks in clinical summarization and generative AI outputs. We acknowledge that this internal process does not constitute full external validation against real-world incident databases, which represents a genuine limitation for claims of absolute comprehensiveness. In the revised manuscript, we will expand the Materials and Methods to detail the literature sources consulted and the iterative brainstorming protocol. We will also add a Limitations section that explicitly notes the potential for missed failure modes (such as certain hallucination subtypes) and recommends future retrospective mapping to clinical databases as a validation step. This strengthens transparency without overstating the current evidence. revision: partial

  2. Referee: [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.

    Authors: We agree that the absence of specific quantitative metrics in the abstract limits readers' ability to assess the validation strength. The full Results section reports the detailed statistics, including round-by-round improvements in agreement for failure mode identification and the scoring dimensions. To address this, we will revise the abstract to include key quantitative values (such as the achieved kappa coefficients for failure mode identification and agreement levels for severity/detectability) along with a brief note on the distribution across the 36 summaries. This change directly supports the validation claims with greater precision. revision: yes

Circularity Check

0 steps flagged

No circularity: framework derived from external literature and independent expert panel, then applied to separate data

full rationale

The paper's derivation chain begins with an external literature review plus brainstorming by an n=8 interdisciplinary panel to produce the 14 failure modes and adapted 5-point scales; these are then applied to 36 independently generated summaries for annotation, inter-rater reliability measurement, and usability scoring. No step reduces by construction to its own inputs, no self-citation is load-bearing, no parameter is fitted and renamed as prediction, and no uniqueness theorem or ansatz is smuggled in. The process avoids self-reference because the taxonomy originates outside the validation dataset and the agreement/usability metrics are measured on held-out summaries.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on expert-derived taxonomy and adapted standard FMECA scales applied to LLM outputs; no data-fitted parameters or new postulated entities beyond the framework components.

axioms (1)
  • domain assumption An interdisciplinary expert panel can reliably identify and categorize relevant failure modes for LLM-generated clinical summaries through literature review and brainstorming.
    The taxonomy of 14 failure modes was developed by n=8 experts as described in the materials and methods section of the abstract.

pith-pipeline@v0.9.0 · 5684 in / 1275 out tokens · 48602 ms · 2026-05-09T20:51:17.510097+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

    Cohen R, Elhadad M, Elhadad N. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinformatics 2013;14:10. https://doi.org/10.1186/1471-2105-14-10

  2. [2]

    Summarization of clinical information: A conceptual model

    Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of clinical information: A conceptual model. J Biomed Inform 2011;44:688–99. https://doi.org/10.1016/j.jbi.2011.03.008

  3. [3]

    Large language models in medicine

    Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023;29:1930–40. https://doi.org/10.1038/s41591-023-02448-8

  4. [4]

    Large language models encode clinical knowledge

    Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–80. https://doi.org/10.1038/s41586-023-06291-2

  5. [5]

    The future landscape of large language models in medicine

    Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J-N, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med 2023;3:141. https://doi.org/10.1038/s43856-023-00370-1

  6. [6]

    Science in the age of large language models

    Birhane A, Kasirzadeh A, Leslie D, Wachter S. Science in the age of large language models. Nat Rev Phys 2023;5:277–80. https://doi.org/10.1038/s42254-023-00581-4

  8. [8]

    The imperative for regulatory oversight of large language models (or generative AI) in healthcare

    Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. Npj Digit Med 2023;6:120. https://doi.org/10.1038/s41746-023-00873-0

  9. [9]

    Scientific evidence for clinical text summarization using large language models: scoping review

    Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, et al. Scientific evidence for clinical text summarization using large language models: scoping review. J Med Internet Res 2025;27:e68998

  10. [10]

    A comprehensive evaluation of large language models for information extraction from unstructured electronic health records in residential aged care

    Vithanage D, Yu P, Xie Q, Xu H, Wang L, Deng C. A comprehensive evaluation of large language models for information extraction from unstructured electronic health records in residential aged care. Comput Biol Med 2025;197:111013. https://doi.org/10.1016/j.compbiomed.2025.111013

  11. [11]

    Advancing Knowledge in Evaluating the Clinical Impact of Large Language Models for Clinical Text Summarization: A Narrative Review

    Bednarczyk L, Bjelogrlic M, Zaghir J, Tcherepanova M, Ehrsam J, Bensahla A, et al. Advancing Knowledge in Evaluating the Clinical Impact of Large Language Models for Clinical Text Summarization: A Narrative Review. Stud Health Technol Inform 2026

  12. [12]

    Global Regulatory Frameworks for the Use of Artificial Intelligence (AI) in the Healthcare Services Sector

    Palaniappan K, Lin EYT, Vogel S. Global Regulatory Frameworks for the Use of Artificial Intelligence (AI) in the Healthcare Services Sector. Healthcare 2024;12:562. https://doi.org/10.3390/healthcare12050562

  13. [13]

    Unregulated large language models produce medical device-like output

    Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. Npj Digit Med 2025;8:1–5. https://doi.org/10.1038/s41746-025-01544-y

  14. [14]

    MDCG 2019-11 rev.1 - Qualification and classification of software - Regulation (EU) 2017/745 and Regulation (EU) 2017/746

    MDCG 2019-11 rev.1 - Qualification and classification of software - Regulation (EU) 2017/745 and Regulation (EU) 2017/746 (June 2025) - Public Health n.d. https://health.ec.europa.eu/latest-updates/update-mdcg-2019-11-rev1-qualification-and-classification-software-regulation-eu-2017745-and-2025-06-17_en (accessed March 27, 2026)

  15. [15]

    Medical Device Regulation (MDR)

    Medical Device Regulation (MDR). Med Device Regul n.d. https://www.medical-device-regulation.eu/download-mdr/ (accessed March 27, 2026)

  16. [16]

    Use of a systematic risk analysis method to improve safety in the production of paediatric parenteral nutrition solutions

    Bonnabry P, Cingria L, Sadeghipour F, Ing H, Fonzo-Christe C, Pfister RE. Use of a systematic risk analysis method to improve safety in the production of paediatric parenteral nutrition solutions. BMJ Qual Saf 2005;14:93–8. https://doi.org/10.1136/qshc.2003.007914

  17. [17]

    Use of a prospective risk analysis method to improve the safety of the cancer chemotherapy process

    Bonnabry P, Cingria L, Ackermann M, Sadeghipour F, Bigler L, Mach N. Use of a prospective risk analysis method to improve the safety of the cancer chemotherapy process. Int J Qual Health Care 2006;18:9–16. https://doi.org/10.1093/intqhc/mzi082

  18. [18]

    Is failure mode and effect analysis reliable?

    Shebl NA, Franklin BD, Barber N. Is failure mode and effect analysis reliable? J Patient Saf 2009;5:86–94. https://doi.org/10.1097/PTS.0b013e3181a6f040

  19. [19]

    FMECA Process Analysis for Managing the Failures of 16-Slice CT Scanner

    El Mansouri M, Sekkat H, Talbi M, Tahiri Z, Nhila O. FMECA Process Analysis for Managing the Failures of 16-Slice CT Scanner. J Fail Anal Prev 2024;24:436–42. https://doi.org/10.1007/s11668-023-01853-y

  20. [20]

    A Risk Analysis Method to Evaluate the Impact of a Computerized Provider Order Entry System on Patient Safety

    Bonnabry P, Despont-Gros C, Grauser D, Casez P, Despond M, Pugin D, et al. A Risk Analysis Method to Evaluate the Impact of a Computerized Provider Order Entry System on Patient Safety. J Am Med Inform Assoc 2008;15:453–60. https://doi.org/10.1197/jamia.M2677

  21. [21]

    Failure Mode, Effects and Criticality Analysis (FMECA) for Medical Devices: Does Standardization Foster Improvements in the Practice?

    Onofrio R, Piccagli F, Segato F. Failure Mode, Effects and Criticality Analysis (FMECA) for Medical Devices: Does Standardization Foster Improvements in the Practice? Procedia Manuf 2015;3:43–50. https://doi.org/10.1016/j.promfg.2015.07.106

  22. [22]

    ISO 14971:2019

    ISO 14971:2019. Int Organ Stand n.d. https://www.iso.org/standard/72704.html (accessed March 27, 2026)

  23. [23]

    Risk Analysis in Healthcare Organizations: Methodological Framework and Critical Variables

    Pascarella G, Rossi M, Montella E, Capasso A, De Feo G, Botti G, et al. Risk Analysis in Healthcare Organizations: Methodological Framework and Critical Variables. Risk Manag Healthc Policy 2021;14:2897–911. https://doi.org/10.2147/RMHP.S309098

  24. [24]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b & gpt-oss-20b Model Card 2025. https://doi.org/10.48550/arXiv.2508.10925

  25. [25]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Wang Y, Ma X, Zhang G, Ni Y, Chandra A, Guo S, et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark n.d

  26. [26]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2024

    Rein D, Hou BL, Stickland AC, Petty J, Pang RY, Dirani J, et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2024

  27. [27]

    Prompt Engineering Paradigms for Medical Applications: Scoping Review

    Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J Med Internet Res 2024;26:e60501. https://doi.org/10.2196/60501

  28. [28]

    SUS - A quick and dirty usability scale n.d

    Brooke J. SUS - A quick and dirty usability scale n.d

  29. [29]

    PSOPPC: Common Formats Hospital 2.0

    PSOPPC: Common Formats Hospital 2.0 n.d. https://www.psoppc.org/psoppc_web/publicpages/commonFormatsHV2.0 (accessed October 13, 2025)

  30. [30]

    A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation

    Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit Med 2025;8:1–15. https://doi.org/10.1038/s41746-025-01670-7

  31. [31]

    Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam

    Altermatt FR, Neyem A, Sumonte NI, Villagrán I, Mendoza M, Lacassie HJ, et al. Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam. BMC Med Educ 2025;25:1499. https://doi.org/10.1186/s12909-025-08084-9

  32. [32]

    Failure mode and effects analysis outputs: are they valid?

    Shebl NA, Franklin BD, Barber N. Failure mode and effects analysis outputs: are they valid? BMC Health Serv Res 2012;12:150. https://doi.org/10.1186/1472-6963-12-150

  33. [33]

    Failure mode and effect analysis improvement: A systematic literature review and future research agenda

    Huang J, You J-X, Liu H-C, Song M-S. Failure mode and effect analysis improvement: A systematic literature review and future research agenda. Reliab Eng Syst Saf 2020;199:106885. https://doi.org/10.1016/j.ress.2020.106885

  34. [34]

    MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications

    Kanithi P, Christophe C, Pimentel MA, Raha T, Munjal P, Saadi N, et al. MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications. arXivOrg 2024. https://arxiv.org/abs/2409.07314v2 (accessed March 30, 2026)

  35. [35]

    Development and validation of the provider documentation summarization quality instrument for large language models

    Croxford E, Gao Y, Pellegrino N, Wong K, Wills G, First E, et al. Development and validation of the provider documentation summarization quality instrument for large language models. J Am Med Inform Assoc 2025;32:1050–60. https://doi.org/10.1093/jamia/ocaf068

  36. [36]

    A survey of hallucination in large foundation models

    Rawte V, Sheth A, Das A. A Survey of Hallucination in Large Foundation Models. arXivOrg 2023. https://arxiv.org/abs/2309.05922v1 (accessed March 30, 2026)

  37. [37]

    Identify relevant medical history (active or resolved diagnoses)

    Carefully analyze the text. Identify relevant medical history (active or resolved diagnoses). Note any allergies and reactions. Summarize the current clinical episode (reason, diagnosis, plan)

  38. [38]

    If there are none, indicate "Not mentioned"

    Present the result in the EXACT format below: Medical History  [History 1]: [status: active / resolved / uncertain] — [relevant treatment or comment]  [History 2]: [status: active / resolved / uncertain] — [relevant treatment or comment]  … Present active medical histories first, then resolved ones. If there are none, i...

  39. [39]

     Identify relevant medical history (active or resolved diagnoses)

    Carefully analyze the text.  Identify relevant medical history (active or resolved diagnoses).  Note any allergies and reactions.  Summarize the current clinical episode (reason, diagnosis, plan)

  40. [40]

    Not mentioned

    Present the result in the EXACT format below: Medical History  [History 1]: [status: active / resolved / uncertain] — [relevant treatment or comment]  [History 2]: [status: active / resolved / uncertain] — [relevant treatment or comment]  … List active medical histories first, followed by resolved ones. If there are none, indicate “Not mentioned”. Alle...