Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content
Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3
The pith
A new FMECA framework gives a structured method to identify patient safety risks in LLM-generated clinical summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a novel FMECA framework, built around 14 failure modes organized into categories and using adapted 5-point ordinal scales, offers a systematic and reproducible way to prospectively evaluate patient safety risks in LLM-generated clinical summaries, as demonstrated by its application to real-world discharge summaries with improving inter-rater reliability and good usability scores.
What carries the argument
The FMECA framework, which organizes risk assessment around a taxonomy of 14 failure modes, together with adapted scales for occurrence, severity, and detectability that are combined into a criticality score.
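The abstract does not spell out how the three scales are aggregated into criticality; a common FMECA convention multiplies the three scores into a single risk priority index. A minimal sketch under that assumption, with hypothetical failure-mode names and ratings:

```python
from dataclasses import dataclass

@dataclass
class FailureModeRating:
    """One reviewer's rating of a failure mode on the 5-point ordinal scales."""
    name: str
    occurrence: int     # 1 (rare) .. 5 (frequent)
    severity: int       # 1 (negligible) .. 5 (catastrophic)
    detectability: int  # 1 (almost always caught) .. 5 (rarely caught)

    def criticality(self) -> int:
        # Conventional FMECA criticality index: product of the three scores
        # (an assumption here; the paper's exact aggregation is not stated).
        for score in (self.occurrence, self.severity, self.detectability):
            if not 1 <= score <= 5:
                raise ValueError("scores must lie on the 5-point scale")
        return self.occurrence * self.severity * self.detectability

# Hypothetical ratings for two failure modes in a generated summary.
ratings = [
    FailureModeRating("omitted allergy", 2, 5, 4),
    FailureModeRating("hallucinated medication", 3, 4, 3),
]
# Rank failure modes by criticality, highest risk first.
ranked = sorted(ratings, key=lambda r: r.criticality(), reverse=True)
for r in ranked:
    print(r.name, r.criticality())
```

Ranking by the product index is what lets a team prioritize mitigation: a rare but catastrophic and hard-to-detect omission can outrank a more frequent but visible error.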
If this is right
- The framework supplies a proactive, standardized process for identifying clinically relevant risks in AI-generated clinical text before such tools enter routine use.
- Application to real discharge summaries shows the method can be used by reviewers to annotate outputs consistently across rounds.
- Good inter-rater agreement on severity and detectability scores supports its use for comparing risks across different LLMs or prompts.
- High usability ratings indicate the framework can be adopted by interdisciplinary teams without extensive additional training.
Where Pith is reading between the lines
- Hospitals could embed the framework into pre-deployment checks for any generative AI tool that produces clinical text.
- The same failure-mode approach might be extended to other LLM outputs such as diagnostic reasoning or treatment recommendations.
- Periodic re-application of the framework could track how risks evolve as newer LLM versions are released.
- Automating parts of the failure-mode detection step could make the method scalable for large volumes of generated content.
Load-bearing premise
The 14 failure modes identified by the expert panel capture all relevant patient safety risks in LLM-generated clinical content, and the adapted 5-point scales are valid and reliable for this new domain.
What would settle it
A follow-up study in which the framework is applied to additional LLM summaries yet misses failure modes that later lead to documented patient harm, or in which separate expert panels produce substantially different sets of failure modes.
Original abstract
Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops and validates a novel FMECA framework for prospective assessment of patient safety risks in LLM-generated clinical summaries. An n=8 interdisciplinary expert panel created a taxonomy of 14 failure modes via literature review and brainstorming, adapted 5-point ordinal scales for occurrence/severity/detectability, applied the framework to 36 GPT-OSS-generated discharge summaries from real Geneva University Hospitals data across two annotation rounds, and reported improved inter-rater agreement plus good usability (mean SUS 79.2/100).
Significance. If the framework's validity and completeness hold, this provides the first structured, reproducible FMECA-based method for identifying clinically relevant risks from generative AI in clinical summarization, addressing a gap as LLMs see wider healthcare use. Strengths include adaptation of an established risk-analysis technique, use of real-world data, expert input, and evidence of practical usability and reliability via inter-rater metrics and SUS scores.
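The reported usability figure (mean SUS 79.2/100) can be grounded with the standard SUS scoring rule, which the adapted scale presumably inherits: ten items rated 1-5, where odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is scaled by 2.5 onto a 0-100 range. A minimal sketch with a hypothetical evaluator's responses:

```python
def sus_score(responses):
    """Standard SUS scoring: 10 items rated 1-5; odd-numbered items
    contribute (rating - 1), even-numbered items (5 - rating); the
    sum is scaled by 2.5 to yield a 0-100 score."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected 10 ratings on a 1-5 scale")
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5

# A hypothetical evaluator's responses to items 1..10.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0
```

Scores around 80 sit in the commonly cited "good" band, consistent with how the paper characterizes its 79.2 mean.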
major comments (2)
- [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical incident databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that comprehensively identifies clinically relevant risks: without such checks, the framework may miss LLM-specific issues such as hallucinated contraindications or context drift.
- [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.
minor comments (1)
- [Abstract] Consider adding a brief limitations paragraph or an explicit statement on generalizability (e.g., beyond discharge summaries or the specific open LLM used) to better contextualize the findings.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted to improve clarity and rigor.
Point-by-point responses
Referee: [Materials and Methods] The taxonomy of 14 failure modes originates from an n=8 panel's literature review plus brainstorming, with no reported external validation step (e.g., mapping to documented adverse events from clinical incident databases or comparison against alternative risk taxonomies). This is load-bearing for the central claim of a 'systematic' method that comprehensively identifies clinically relevant risks: without such checks, the framework may miss LLM-specific issues such as hallucinated contraindications or context drift.
Authors: The taxonomy was developed following established FMECA practices, which prioritize multidisciplinary expert consensus informed by literature review for novel domains where comprehensive external databases of LLM-specific adverse events do not yet exist. The n=8 panel included clinicians, informaticians, and patient safety experts, and the literature review explicitly covered documented risks in clinical summarization and generative AI outputs. We acknowledge that this internal process does not constitute full external validation against real-world incident databases, which represents a genuine limitation for claims of absolute comprehensiveness. In the revised manuscript, we will expand the Materials and Methods to detail the literature sources consulted and the iterative brainstorming protocol. We will also add a Limitations section that explicitly notes the potential for missed failure modes (such as certain hallucination subtypes) and recommends future retrospective mapping to clinical databases as a validation step. This strengthens transparency without overstating the current evidence. revision: partial
Referee: [Results] While improved inter-rater agreement is reported (moderate-to-substantial for failure mode identification, good for severity/detectability), the abstract and summary provide no specific quantitative metrics (e.g., exact kappa coefficients, percentage agreement, or per-mode distributions across the 36 summaries), limiting assessment of whether the data fully support the validation claims.
Authors: We agree that the absence of specific quantitative metrics in the abstract limits readers' ability to assess the validation strength. The full Results section reports the detailed statistics, including round-by-round improvements in agreement for failure mode identification and the scoring dimensions. To address this, we will revise the abstract to include key quantitative values (such as the achieved kappa coefficients for failure mode identification and agreement levels for severity/detectability) along with a brief note on the distribution across the 36 summaries. This change directly supports the validation claims with greater precision. revision: yes
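Cohen's kappa is the usual chance-corrected statistic behind labels like "moderate-to-substantial", so the missing numbers matter. A minimal sketch of the computation on hypothetical binary annotations (did each of two raters flag a given failure mode in each summary), not the paper's actual data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must label the same items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    chance = sum(counts_a[l] * counts_b[l] for l in labels) / n**2
    return (observed - chance) / (1 - chance)

# Hypothetical flags (1 = failure mode present) across ten summaries.
a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))  # 0.58
```

On the common Landis-Koch bands, 0.41-0.60 is "moderate" and 0.61-0.80 "substantial", which is why exact coefficients, not just the band labels, are needed to judge the validation claim.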
Circularity Check
No circularity: framework derived from external literature and independent expert panel, then applied to separate data
Full rationale
The paper's derivation chain begins with an external literature review plus brainstorming by an n=8 interdisciplinary panel to produce the 14 failure modes and adapted 5-point scales; these are then applied to 36 independently generated summaries for annotation, inter-rater reliability measurement, and usability scoring. No step reduces by construction to its own inputs, no self-citation is load-bearing, no parameter is fitted and renamed as prediction, and no uniqueness theorem or ansatz is smuggled in. The derivation is not self-referential: the taxonomy originates outside the validation dataset, and the agreement and usability metrics are measured on held-out summaries.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: An interdisciplinary expert panel can reliably identify and categorize relevant failure modes for LLM-generated clinical summaries through literature review and brainstorming.
Reference graph
Works this paper leans on
- [1] Cohen R, Elhadad M, Elhadad N. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinformatics 2013;14:10. https://doi.org/10.1186/1471-2105-14-10
- [2] Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of clinical information: A conceptual model. J Biomed Inform 2011;44:688–99. https://doi.org/10.1016/j.jbi.2011.03.008
- [3] Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023;29:1930–40. https://doi.org/10.1038/s41591-023-02448-8
- [4] Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–80. https://doi.org/10.1038/s41586-023-06291-2
- [5] Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J-N, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med 2023;3:141. https://doi.org/10.1038/s43856-023-00370-1
- [6] Birhane A, Kasirzadeh A, Leslie D, Wachter S. Science in the age of large language models. Nat Rev Phys 2023;5:277–. https://doi.org/10.1038/s42254-023-00581-4
- [8] Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. Npj Digit Med 2023;6:120. https://doi.org/10.1038/s41746-023-00873-0
- [9] Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, et al. Scientific evidence for clinical text summarization using large language models: scoping review. J Med Internet Res 2025;27:e68998
- [10] Vithanage D, Yu P, Xie Q, Xu H, Wang L, Deng C. A comprehensive evaluation of large language models for information extraction from unstructured electronic health records in residential aged care. Comput Biol Med 2025;197:111013. https://doi.org/10.1016/j.compbiomed.2025.111013
- [11] Bednarczyk L, Bjelogrlic M, Zaghir J, Tcherepanova M, Ehrsam J, Bensahla A, et al. Advancing Knowledge in Evaluating the Clinical Impact of Large Language Models for Clinical Text Summarization: A Narrative Review. Stud Health Technol Inform 2026
- [12] Palaniappan K, Lin EYT, Vogel S. Global Regulatory Frameworks for the Use of Artificial Intelligence (AI) in the Healthcare Services Sector. Healthcare 2024;12:562. https://doi.org/10.3390/healthcare12050562
- [13] Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. Npj Digit Med 2025;8:1–5. https://doi.org/10.1038/s41746-025-01544-y
- [14] MDCG 2019-11 rev.1 - Qualification and classification of software - Regulation (EU) 2017/745 and Regulation (EU) 2017/746 (June 2025). https://health.ec.europa.eu/latest-updates/update-mdcg-2019-11-rev1-qualification-and-classification-software-regulation-eu-2017745-and-2025-06-17_en (accessed March 27, 2026)
- [15] Medical Device Regulation (MDR). https://www.medical-device-regulation.eu/download-mdr/ (accessed March 27, 2026)
- [16] Bonnabry P, Cingria L, Sadeghipour F, Ing H, Fonzo-Christe C, Pfister RE. Use of a systematic risk analysis method to improve safety in the production of paediatric parenteral nutrition solutions. BMJ Qual Saf 2005;14:93–8. https://doi.org/10.1136/qshc.2003.007914
- [17] Bonnabry P, Cingria L, Ackermann M, Sadeghipour F, Bigler L, Mach N. Use of a prospective risk analysis method to improve the safety of the cancer chemotherapy process. Int J Qual Health Care 2006;18:9–16. https://doi.org/10.1093/intqhc/mzi082
- [18] Shebl NA, Franklin BD, Barber N. Is failure mode and effect analysis reliable? J Patient Saf 2009;5:86–94. https://doi.org/10.1097/PTS.0b013e3181a6f040
- [19] El Mansouri M, Sekkat H, Talbi M, Tahiri Z, Nhila O. FMECA Process Analysis for Managing the Failures of 16-Slice CT Scanner. J Fail Anal Prev 2024;24:436–42. https://doi.org/10.1007/s11668-023-01853-y
- [20] Bonnabry P, Despont-Gros C, Grauser D, Casez P, Despond M, Pugin D, et al. A Risk Analysis Method to Evaluate the Impact of a Computerized Provider Order Entry System on Patient Safety. J Am Med Inform Assoc 2008;15:453–60. https://doi.org/10.1197/jamia.M2677
- [21] Onofrio R, Piccagli F, Segato F. Failure Mode, Effects and Criticality Analysis (FMECA) for Medical Devices: Does Standardization Foster Improvements in the Practice? Procedia Manuf 2015;3:43–50. https://doi.org/10.1016/j.promfg.2015.07.106
- [22] ISO 14971:2019. International Organization for Standardization. https://www.iso.org/standard/72704.html (accessed March 27, 2026)
- [23] Pascarella G, Rossi M, Montella E, Capasso A, De Feo G, Botti G, et al. Risk Analysis in Healthcare Organizations: Methodological Framework and Critical Variables. Risk Manag Healthc Policy 2021;14:2897–911. https://doi.org/10.2147/RMHP.S309098
- [24] OpenAI, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b & gpt-oss-20b Model Card. 2025. https://doi.org/10.48550/arXiv.2508.10925
- [25] Wang Y, Ma X, Zhang G, Ni Y, Chandra A, Guo S, et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. n.d.
- [26] Rein D, Hou BL, Stickland AC, Petty J, Pang RY, Dirani J, et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2024
- [27] Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J Med Internet Res 2024;26:e60501. https://doi.org/10.2196/60501
- [28] Brooke J. SUS - A quick and dirty usability scale. n.d.
- [29] PSOPPC: Common Formats Hospital 2.0. https://www.psoppc.org/psoppc_web/publicpages/commonFormatsHV2.0 (accessed October 13, 2025)
- [30] Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit Med 2025;8:1–15. https://doi.org/10.1038/s41746-025-01670-7
- [31] Altermatt FR, Neyem A, Sumonte NI, Villagrán I, Mendoza M, Lacassie HJ, et al. Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam. BMC Med Educ 2025;25:1499. https://doi.org/10.1186/s12909-025-08084-9
- [32] Shebl NA, Franklin BD, Barber N. Failure mode and effects analysis outputs: are they valid? BMC Health Serv Res 2012;12:150. https://doi.org/10.1186/1472-6963-12-150
- [33] Huang J, You J-X, Liu H-C, Song M-S. Failure mode and effect analysis improvement: A systematic literature review and future research agenda. Reliab Eng Syst Saf 2020;199:106885. https://doi.org/10.1016/j.ress.2020.106885
- [34] Kanithi P, Christophe C, Pimentel MA, Raha T, Munjal P, Saadi N, et al. MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications. arXiv 2024. https://arxiv.org/abs/2409.07314v2 (accessed March 30, 2026)
- [35] Croxford E, Gao Y, Pellegrino N, Wong K, Wills G, First E, et al. Development and validation of the provider documentation summarization quality instrument for large language models. J Am Med Inform Assoc 2025;32:1050–60. https://doi.org/10.1093/jamia/ocaf068
- [36] Rawte V, Sheth A, Das A. A Survey of Hallucination in Large Foundation Models. arXiv 2023. https://arxiv.org/abs/2309.05922v1 (accessed March 30, 2026)
[37]
Identifie les antécédents médicaux pertinents (diagnostics actifs ou résolus)
Analyse soigneusement le texte. Identifie les antécédents médicaux pertinents (diagnostics actifs ou résolus). Repère les allergies et réactions éventuelles. Résume l’épisode clinique actuel (motif, diagnostic, plan)
-
[38]
S’il n’y en a pas, indiquer « Non mentionné »
Présente le résultat dans le format EXACT suivant : Antécédents médicaux [Antécédents 1] : [statut : actif / résolu / incertain] — [traitement ou commentaire pertinent] [Antécédents 2] : [statut : actif / résolu / incertain] — [traitement ou commentaire pertinent] … Présente d’abord les antécédents médicaux actifs, puis résolus. S’il n’y en a pas, i...
-
[39]
Identify relevant medical history (active or resolved diagnoses)
Carefully analyze the text. Identify relevant medical history (active or resolved diagnoses). Note any allergies and reactions. Summarize the current clinical episode (reason, diagnosis, plan)
-
[40]
Not mentioned
Present the result in the EXACT format below: Medical History [History 1]: [status: active / resolved / uncertain] — [relevant treatment or comment] [History 2]: [status: active / resolved / uncertain] — [relevant treatment or comment] … List active medical histories first, followed by resolved ones. If there are none, indicate “Not mentioned”. Alle...