Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content
Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3
The pith
A new FMECA framework gives a structured method to identify patient safety risks in LLM-generated clinical summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a novel FMECA framework, built around 14 failure modes organized into categories and using adapted 5-point ordinal scales, offers a systematic and reproducible way to prospectively evaluate patient safety risks in LLM-generated clinical summaries, as demonstrated by its application to real-world discharge summaries with improving inter-rater reliability and good usability scores.
What carries the argument
The FMECA framework, which organizes risk assessment around a taxonomy of 14 failure modes, together with adapted scales for occurrence, severity, and detectability that are combined into a criticality score.
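The abstract does not spell out how the three scales are aggregated into criticality; a common FMECA convention multiplies the three scores into a single risk priority index. A minimal sketch under that assumption, with hypothetical failure-mode names and ratings:

```python
from dataclasses import dataclass

@dataclass
class FailureModeRating:
    """One reviewer's rating of a failure mode on the 5-point ordinal scales."""
    name: str
    occurrence: int     # 1 (rare) .. 5 (frequent)
    severity: int       # 1 (negligible) .. 5 (catastrophic)
    detectability: int  # 1 (almost always caught) .. 5 (rarely caught)

    def criticality(self) -> int:
        # Conventional FMECA criticality index: product of the three scores
        # (an assumption here; the paper's exact aggregation is not stated).
        for score in (self.occurrence, self.severity, self.detectability):
            if not 1 <= score <= 5:
                raise ValueError("scores must lie on the 5-point scale")
        return self.occurrence * self.severity * self.detectability

# Hypothetical ratings for two failure modes in a generated summary.
ratings = [
    FailureModeRating("omitted allergy", 2, 5, 4),
    FailureModeRating("hallucinated medication", 3, 4, 3),
]
# Rank failure modes by criticality, highest risk first.
ranked = sorted(ratings, key=lambda r: r.criticality(), reverse=True)
for r in ranked:
    print(r.name, r.criticality())
```

Ranking by the product index is what lets a team prioritize mitigation: a rare but catastrophic and hard-to-detect omission can outrank a more frequent but visible error.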
If this is right
- The framework supplies a proactive, standardized process for identifying clinically relevant risks in AI-generated clinical text before such tools enter routine use.
- Application to real discharge summaries shows the method can be used by reviewers to annotate outputs consistently across rounds.
- Good inter-rater agreement on severity and detectability scores supports its use for comparing risks across different LLMs or prompts.
- High usability ratings indicate the framework can be adopted by interdisciplinary teams without extensive additional training.
Where Pith is reading between the lines
- Hospitals could embed the framework into pre-deployment checks for any generative AI tool that produces clinical text.
- The same failure-mode approach might be extended to other LLM outputs such as diagnostic reasoning or treatment recommendations.
- Periodic re-application of the framework could track how risks evolve as newer LLM versions are released.
- Automating parts of the failure-mode detection step could make the method scalable for large volumes of generated content.
Load-bearing premise
The 14 failure modes identified by the expert panel capture all relevant patient safety risks in LLM-generated clinical content, and the adapted 5-point scales are valid and reliable for this new domain.
What would settle it
A follow-up study in which the framework is applied to additional LLM summaries yet misses failure modes that later lead to documented patient harm, or in which separate expert panels produce substantially different sets of failure modes.
Original abstract
Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops and validates a novel FMECA framework for prospective assessment of patient safety risks in LLM-generated clinical summaries. An n=8 interdisciplinary expert panel created a taxonomy of 14 failure modes via literature review and brainstorming, adapted 5-point ordinal scales for occurrence/severity/detectability, applied the framework to 36 GPT-OSS-generated discharge summaries from real Geneva University Hospitals data across two annotation rounds, and reported improved inter-rater agreement plus good usability (mean SUS 79.2/100).
Significance. If the framework's validity and completeness hold, this provides the first structured, reproducible FMECA-based method for identifying clinically relevant risks from generative AI in clinical summarization, addressing a gap as LLMs see wider healthcare use. Strengths include adaptation of an established risk-analysis technique, use of real-world data, expert input, and evidence of practical usability and reliability via inter-rater metrics and SUS scores.
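The reported usability figure (mean SUS 79.2/100) can be grounded with the standard SUS scoring rule, which the adapted scale presumably inherits: ten items rated 1-5, where odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is scaled by 2.5 onto a 0-100 range. A minimal sketch with a hypothetical evaluator's responses:

```python
def sus_score(responses):
    """Standard SUS scoring: 10 items rated 1-5; odd-numbered items
    contribute (rating - 1), even-numbered items (5 - rating); the
    sum is scaled by 2.5 to yield a 0-100 score."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected 10 ratings on a 1-5 scale")
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5

# A hypothetical evaluator's responses to items 1..10.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0
```

Scores around 80 sit in the commonly cited "good" band, consistent with how the paper characterizes its 79.2 mean.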
major comments (2)
- [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical incident databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that comprehensively identifies clinically relevant risks: without such checks, the framework may miss LLM-specific issues such as hallucinated contraindications or context drift.
- [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.
minor comments (1)
- [Abstract] Consider adding a brief limitations paragraph or an explicit statement on generalizability (e.g., beyond discharge summaries or the specific open LLM used) to better contextualize the findings.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted to improve clarity and rigor.
Point-by-point responses
Referee: [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical incident databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that comprehensively identifies clinically relevant risks: without such checks, the framework may miss LLM-specific issues such as hallucinated contraindications or context drift.
Authors: The taxonomy was developed following established FMECA practices, which prioritize multidisciplinary expert consensus informed by literature review for novel domains where comprehensive external databases of LLM-specific adverse events do not yet exist. The n=8 panel included clinicians, informaticians, and patient safety experts, and the literature review explicitly covered documented risks in clinical summarization and generative AI outputs. We acknowledge that this internal process does not constitute full external validation against real-world incident databases, which represents a genuine limitation for claims of absolute comprehensiveness. In the revised manuscript, we will expand the Materials and Methods to detail the literature sources consulted and the iterative brainstorming protocol. We will also add a Limitations section that explicitly notes the potential for missed failure modes (such as certain hallucination subtypes) and recommends future retrospective mapping to clinical databases as a validation step. This strengthens transparency without overstating the current evidence. revision: partial
Referee: [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.
Authors: We agree that the absence of specific quantitative metrics in the abstract limits readers' ability to assess the validation strength. The full Results section reports the detailed statistics, including round-by-round improvements in agreement for failure mode identification and the scoring dimensions. To address this, we will revise the abstract to include key quantitative values (such as the achieved kappa coefficients for failure mode identification and agreement levels for severity/detectability) along with a brief note on the distribution across the 36 summaries. This change directly supports the validation claims with greater precision. revision: yes
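Cohen's kappa is the usual chance-corrected statistic behind labels like "moderate-to-substantial", so the missing numbers matter. A minimal sketch of the computation on hypothetical binary annotations (did each of two raters flag a given failure mode in each summary), not the paper's actual data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must label the same items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    chance = sum(counts_a[l] * counts_b[l] for l in labels) / n**2
    return (observed - chance) / (1 - chance)

# Hypothetical flags (1 = failure mode present) across ten summaries.
a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))  # 0.58
```

On the common Landis-Koch bands, 0.41-0.60 is "moderate" and 0.61-0.80 "substantial", which is why exact coefficients, not just the band labels, are needed to judge the validation claim.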
Circularity Check
No circularity: framework derived from external literature and independent expert panel, then applied to separate data
Full rationale
The paper's derivation chain begins with an external literature review plus brainstorming by an n=8 interdisciplinary panel to produce the 14 failure modes and adapted 5-point scales; these are then applied to 36 independently generated summaries for annotation, inter-rater reliability measurement, and usability scoring. No step reduces by construction to its own inputs, no self-citation is load-bearing, no parameter is fitted and renamed as prediction, and no uniqueness theorem or ansatz is smuggled in. The derivation is not self-referential: the taxonomy originates outside the validation dataset, and the agreement and usability metrics are measured on held-out summaries.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: An interdisciplinary expert panel can reliably identify and categorize relevant failure modes for LLM-generated clinical summaries through literature review and brainstorming.
Reference graph
Works this paper leans on
- [1] Cohen R, Elhadad M, Elhadad N. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinformatics 2013;14:10. https://doi.org/10.1186/1471-2105-14-10
- [2] Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of clinical information: A conceptual model. J Biomed Inform 2011;44:688–99. https://doi.org/10.1016/j.jbi.2011.03.008
- [3] Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023;29:1930–40. https://doi.org/10.1038/s41591-023-02448-8
- [4] Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–80. https://doi.org/10.1038/s41586-023-06291-2
- [5] Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J-N, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med 2023;3:141. https://doi.org/10.1038/s43856-023-00370-1
- [6] Birhane A, Kasirzadeh A, Leslie D, Wachter S. Science in the age of large language models. Nat Rev Phys 2023;5:277–. https://doi.org/10.1038/s42254-023-00581-4
- [8] Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. Npj Digit Med 2023;6:120. https://doi.org/10.1038/s41746-023-00873-0
- [9] Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, et al. Scientific evidence for clinical text summarization using large language models: scoping review. J Med Internet Res 2025;27:e68998
- [10] Vithanage D, Yu P, Xie Q, Xu H, Wang L, Deng C. A comprehensive evaluation of large language models for information extraction from unstructured electronic health records in residential aged care. Comput Biol Med 2025;197:111013. https://doi.org/10.1016/j.compbiomed.2025.111013
- [11] Bednarczyk L, Bjelogrlic M, Zaghir J, Tcherepanova M, Ehrsam J, Bensahla A, et al. Advancing Knowledge in Evaluating the Clinical Impact of Large Language Models for Clinical Text Summarization: A Narrative Review. Stud Health Technol Inform 2026
- [12] Palaniappan K, Lin EYT, Vogel S. Global Regulatory Frameworks for the Use of Artificial Intelligence (AI) in the Healthcare Services Sector. Healthcare 2024;12:562. https://doi.org/10.3390/healthcare12050562
- [13] Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. Npj Digit Med 2025;8:1–5. https://doi.org/10.1038/s41746-025-01544-y
- [14] MDCG 2019-11 rev.1 - Qualification and classification of software - Regulation (EU) 2017/745 and Regulation (EU) 2017/746 (June 2025). https://health.ec.europa.eu/latest-updates/update-mdcg-2019-11-rev1-qualification-and-classification-software-regulation-eu-2017745-and-2025-06-17_en (accessed March 27, 2026)
- [15] Medical Device Regulation (MDR). https://www.medical-device-regulation.eu/download-mdr/ (accessed March 27, 2026)
- [16] Bonnabry P, Cingria L, Sadeghipour F, Ing H, Fonzo-Christe C, Pfister RE. Use of a systematic risk analysis method to improve safety in the production of paediatric parenteral nutrition solutions. BMJ Qual Saf 2005;14:93–8. https://doi.org/10.1136/qshc.2003.007914
- [17] Bonnabry P, Cingria L, Ackermann M, Sadeghipour F, Bigler L, Mach N. Use of a prospective risk analysis method to improve the safety of the cancer chemotherapy process. Int J Qual Health Care 2006;18:9–16. https://doi.org/10.1093/intqhc/mzi082
- [18] Shebl NA, Franklin BD, Barber N. Is failure mode and effect analysis reliable? J Patient Saf 2009;5:86–94. https://doi.org/10.1097/PTS.0b013e3181a6f040
- [19] El Mansouri M, Sekkat H, Talbi M, Tahiri Z, Nhila O. FMECA Process Analysis for Managing the Failures of 16-Slice CT Scanner. J Fail Anal Prev 2024;24:436–42. https://doi.org/10.1007/s11668-023-01853-y
- [20] Bonnabry P, Despont-Gros C, Grauser D, Casez P, Despond M, Pugin D, et al. A Risk Analysis Method to Evaluate the Impact of a Computerized Provider Order Entry System on Patient Safety. J Am Med Inform Assoc 2008;15:453–60. https://doi.org/10.1197/jamia.M2677
- [21] Onofrio R, Piccagli F, Segato F. Failure Mode, Effects and Criticality Analysis (FMECA) for Medical Devices: Does Standardization Foster Improvements in the Practice? Procedia Manuf 2015;3:43–50. https://doi.org/10.1016/j.promfg.2015.07.106
- [22] ISO 14971:2019. International Organization for Standardization. https://www.iso.org/standard/72704.html (accessed March 27, 2026)
- [23] Pascarella G, Rossi M, Montella E, Capasso A, De Feo G, Botti G, et al. Risk Analysis in Healthcare Organizations: Methodological Framework and Critical Variables. Risk Manag Healthc Policy 2021;14:2897–911. https://doi.org/10.2147/RMHP.S309098
- [24] OpenAI, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b & gpt-oss-20b Model Card. 2025. https://doi.org/10.48550/arXiv.2508.10925
- [25] Wang Y, Ma X, Zhang G, Ni Y, Chandra A, Guo S, et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. n.d.
- [26] Rein D, Hou BL, Stickland AC, Petty J, Pang RY, Dirani J, et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2024
- [27] Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J Med Internet Res 2024;26:e60501. https://doi.org/10.2196/60501
- [28] Brooke J. SUS - A quick and dirty usability scale. n.d.
- [29] PSOPPC: Common Formats Hospital 2.0. https://www.psoppc.org/psoppc_web/publicpages/commonFormatsHV2.0 (accessed October 13, 2025)
- [30] Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit Med 2025;8:1–15. https://doi.org/10.1038/s41746-025-01670-7
- [31] Altermatt FR, Neyem A, Sumonte NI, Villagrán I, Mendoza M, Lacassie HJ, et al. Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam. BMC Med Educ 2025;25:1499. https://doi.org/10.1186/s12909-025-08084-9
- [32] Shebl NA, Franklin BD, Barber N. Failure mode and effects analysis outputs: are they valid? BMC Health Serv Res 2012;12:150. https://doi.org/10.1186/1472-6963-12-150
- [33] Huang J, You J-X, Liu H-C, Song M-S. Failure mode and effect analysis improvement: A systematic literature review and future research agenda. Reliab Eng Syst Saf 2020;199:106885. https://doi.org/10.1016/j.ress.2020.106885
- [34] Kanithi P, Christophe C, Pimentel MA, Raha T, Munjal P, Saadi N, et al. MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications. arXiv 2024. https://arxiv.org/abs/2409.07314v2 (accessed March 30, 2026)
- [35] Croxford E, Gao Y, Pellegrino N, Wong K, Wills G, First E, et al. Development and validation of the provider documentation summarization quality instrument for large language models. J Am Med Inform Assoc 2025;32:1050–60. https://doi.org/10.1093/jamia/ocaf068
- [36] Rawte V, Sheth A, Das A. A Survey of Hallucination in Large Foundation Models. arXiv 2023. https://arxiv.org/abs/2309.05922v1 (accessed March 30, 2026)
[37]
Identifie les antécédents médicaux pertinents (diagnostics actifs ou résolus)
Analyse soigneusement le texte. Identifie les antécédents médicaux pertinents (diagnostics actifs ou résolus). Repère les allergies et réactions éventuelles. Résume l’épisode clinique actuel (motif, diagnostic, plan)
-
[38]
S’il n’y en a pas, indiquer « Non mentionné »
Présente le résultat dans le format EXACT suivant : Antécédents médicaux [Antécédents 1] : [statut : actif / résolu / incertain] — [traitement ou commentaire pertinent] [Antécédents 2] : [statut : actif / résolu / incertain] — [traitement ou commentaire pertinent] … Présente d’abord les antécédents médicaux actifs, puis résolus. S’il n’y en a pas, i...
-
[39]
Identify relevant medical history (active or resolved diagnoses)
Carefully analyze the text. Identify relevant medical history (active or resolved diagnoses). Note any allergies and reactions. Summarize the current clinical episode (reason, diagnosis, plan)
-
[40]
Not mentioned
Present the result in the EXACT format below: Medical History [History 1]: [status: active / resolved / uncertain] — [relevant treatment or comment] [History 2]: [status: active / resolved / uncertain] — [relevant treatment or comment] … List active medical histories first, followed by resolved ones. If there are none, indicate “Not mentioned”. Alle...