Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Linghua Yu; Xiaojun Wu; Yuan Shen

arxiv: 2606.07929 · v1 · pith:3KCLQBMYnew · submitted 2026-06-06 · 💻 cs.AI

Pith reviewed 2026-06-27 20:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords medical LLMs · stress testing · safety evaluation · narrative perturbation · quantized models · supervised fine-tuning · clinical AI · AI safety

The pith

Narrative stress testing uncovers latent safety failures in medical LLMs missed by accuracy benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AI-MASLD, a stress-audit framework adapted from metabolic stress testing in hepatology, to evaluate clinical LLMs. It applies six narrative perturbation probes to 240 clinical cases across seven models and tracks results through three indices: metabolic index, perturbation flip rate, and counterfactual fairness index. All models perform similarly under clean baseline conditions, but under realistic narrative stress their results diverge into distinct phenotypes. Quantized models display pseudonormalization where low flip rates conceal functional collapse, and medical supervised fine-tuning degrades logical stability, fairness, and information extraction. One open-weight model matches or exceeds proprietary models on every safety dimension.

Core claim

Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension.

What carries the argument

AI-MASLD, the stress-audit framework that applies six narrative perturbation probes and three indices (metabolic index, perturbation flip rate, counterfactual fairness index) to detect safety-relevant failures in clinical LLMs.
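The source does not state the formula for the perturbation flip rate, so what follows is only a minimal Python sketch under an assumed definition: the share of cases whose answer changes between the clean baseline and the stressed variant, reported per probe. The function name `flip_rate_by_probe`, the probe labels, and the toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def flip_rate_by_probe(records):
    """Return an assumed per-probe flip rate.

    records: iterable of (probe, baseline_answer, stressed_answer) tuples.
    A "flip" here simply means the two answers differ (an assumption;
    the paper's own definition is not given in the source).
    """
    totals = defaultdict(int)
    flips = defaultdict(int)
    for probe, baseline, stressed in records:
        totals[probe] += 1
        if baseline != stressed:
            flips[probe] += 1
    return {probe: flips[probe] / totals[probe] for probe in totals}


# Toy data: two hypothetical probes with two cases each.
records = [
    ("P1", "A", "A"),
    ("P1", "B", "C"),  # answer changed under stress
    ("P2", "A", "A"),
    ("P2", "D", "D"),
]
rates = flip_rate_by_probe(records)
print(rates)
```

Reporting the rate per probe rather than as one pooled number keeps a single fragile probe from being averaged away, which may or may not match how the authors aggregate.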

If this is right

Clean baseline accuracy does not predict performance under narrative stress.
Quantized models can exhibit low flip rates while undergoing functional collapse.
Medical supervised fine-tuning reduces logical stability, fairness, and information extraction.
An open-weight model can match or exceed proprietary models on all measured safety indices.
Narrative stress auditing detects safety issues invisible to accuracy-only evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment decisions for clinical LLMs should require stress-audit results in addition to benchmark scores.
The same narrative stress approach could be adapted to evaluate LLMs in other regulated domains such as law or finance.
Fine-tuning methods for medical LLMs may need redesign to avoid trading stability and fairness for domain accuracy.
Perturbation flip rate alone is insufficient as a safety metric because it can mask collapse in quantized models.
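One way a low flip rate could coexist with degraded performance can be illustrated with a toy Python example. It assumes, purely for illustration, that some perturbations add information that should change the correct answer, and that the model ignores it. This is not necessarily the mechanism the paper reports; the helper names and the data are hypothetical.

```python
def flip_rate(baseline, stressed):
    """Fraction of cases whose answer differs between conditions (assumed definition)."""
    changed = sum(b != s for b, s in zip(baseline, stressed))
    return changed / len(baseline)


def accuracy(answers, gold):
    """Fraction of answers that match the reference answers."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


# Hypothetical reference answers: the stressed variants of cases 1 and 3
# carry new information, so their correct answers differ from baseline.
gold_baseline = ["A", "B", "C", "D"]
gold_stressed = ["B", "B", "D", "D"]

# A hypothetical model that repeats its baseline answers under stress.
model_baseline = ["A", "B", "C", "D"]
model_stressed = ["A", "B", "C", "D"]

pfr = flip_rate(model_baseline, model_stressed)
acc_clean = accuracy(model_baseline, gold_baseline)
acc_stress = accuracy(model_stressed, gold_stressed)
print(pfr, acc_clean, acc_stress)
```

In this toy case the answers never flip, yet half of the stressed answers are wrong, so a stability metric read on its own would look reassuring.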

Load-bearing premise

The six narrative perturbation probes and the three indices validly measure safety-relevant failure modes that would appear in actual clinical deployment.

What would settle it

Direct observation in real clinical use showing that models with high stress-test failure rates produce no more errors or harms than models with low failure rates.

Original abstract

Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Narrative stress testing for medical LLMs uncovers potential safety issues missed by accuracy benchmarks, but the new indices lack demonstrated links to clinical outcomes.

The paper's main point is that standard accuracy benchmarks fail to catch safety-relevant failures in medical LLMs. Its new stress-audit framework shows clear divergence under narrative perturbations, with quantized models hiding collapse and medical supervised fine-tuning (SFT) hurting stability.

What stands out as new is the adaptation of metabolic stress testing logic to this domain, using 240 cases and six probes to define the metabolic index, perturbation flip rate, and counterfactual fairness index. The finding that an open-weight model held up well against proprietary ones on these measures is also worth noting. The work does a good job of making the case that clean benchmarks are not enough and that we need tests that simulate varied clinical narratives.

The soft spots are around validation. The probes and indices are presented without evidence that they correspond to real-world clinical risks or errors that have occurred in practice. No clinician input on case realism or potential for harm is described, so the phenotypes might not translate to deployment concerns. The abstract gives no equations or raw numbers, leaving the strength of the divergence claims open to question until the full methods are seen.

This is aimed at people building or regulating medical AI systems who need better ways to test for robustness. A reader working on LLM evaluation would get some useful ideas from the framework, even if the specific results require more support. It should go to peer review so the details can be examined properly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the AI-MASLD stress-audit framework for evaluating clinical large language models by adapting metabolic stress testing from hepatology. Using 240 clinical cases and six narrative perturbation probes, seven models are subjected to double-stress testing and evaluated on three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). The paper reports that while all models perform well under clean conditions, under realistic narrative stress, performance diverges into two phenotypes: quantized models exhibit pseudonormalization where low flip rates mask functional collapse, and medical supervised fine-tuning systematically degrades logical stability, fairness, and information extraction. An open-weight model matches or exceeds proprietary models on safety dimensions.

Significance. If the narrative stress probes and derived indices validly identify safety-relevant failure modes that standard accuracy benchmarks miss, this work would be significant for the responsible deployment of LLMs in medicine by establishing stress auditing as a necessary evaluation complement. The identification of pseudonormalization in quantized models and degradation from medical SFT could inform model selection and fine-tuning practices.

major comments (3)

[Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.
[Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.
[Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.
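The kind of uncertainty estimate the third comment asks for could be sketched as a paired percentile bootstrap over cases. The Python below is an illustration only: it assumes each model yields a 0/1 flip flag per case in the same case order, and the function name `bootstrap_diff_ci`, its parameters, and the toy flags are hypothetical. The source does not say which statistical tests the authors intend to use.

```python
import random


def bootstrap_diff_ci(flips_a, flips_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in flip rate (A minus B).

    flips_a, flips_b: lists of 0/1 flags, one per case, same case order.
    Cases are resampled with replacement and kept paired across models.
    """
    rng = random.Random(seed)
    n = len(flips_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(flips_a[i] - flips_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = (sum(flips_a) - sum(flips_b)) / n
    return point, lo, hi


# Toy flags for ten cases and two hypothetical models.
flips_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
flips_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
point, lo, hi = bootstrap_diff_ci(flips_a, flips_b)
print(f"difference = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Resampling whole cases keeps the pairing between models intact, which matters when both models are run on the same 240 cases.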

minor comments (2)

[Abstract] Specify the exact seven models tested and name the open-weight model that performed competitively.
[Methods] Clarify selection criteria and domain coverage for the 240 clinical cases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions that will be incorporated.

Referee: [Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.

Authors: We acknowledge the omission of these details in the submitted version. The revised manuscript will include the complete mathematical definitions, formulas, and step-by-step operational procedures for MI, PFR, and CFI within the Methods section, explicitly linking each index to the constructs of logical stability, fairness, and information extraction.
revision: yes

Referee: [Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.

Authors: The perturbation probes were derived from established patterns of narrative variation in clinical documentation. While the study did not include new clinician validation, the reported phenotypes are directly evidenced by the divergence between clean and stressed conditions across models. We will add a Discussion subsection that maps the probes and phenotypes to documented clinical error modes drawn from the existing literature on LLM deployment risks in medicine.
revision: partial

Referee: [Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.

Authors: We agree that quantitative support should be strengthened. The revised Results section will report the appropriate statistical tests, include error bars and effect sizes for all phenotype comparisons, and provide access to the underlying raw data via supplementary materials.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation without derivations or self-referential definitions.

The paper presents an empirical stress-audit framework (AI-MASLD) applied to 240 clinical cases across six narrative perturbation probes and three indices (MI, PFR, CFI). All claims rest on observed performance divergences under baseline vs. stressed conditions for seven models, with no equations, derivations, fitted parameters renamed as predictions, or self-citation chains supporting uniqueness theorems. The abstract and description contain no self-definitional steps, ansatzes smuggled via citation, or renaming of known results. This is a standard empirical comparison study whose central claims do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only; the framework introduces three new indices whose computation rules are not stated, implying unlisted choices in how perturbations are generated and scores aggregated. Experimental design parameters are chosen by hand.

free parameters (2)

Number of clinical cases = 240
240 cases selected for the study; no derivation given.
Number of narrative perturbation probes = 6
Six probes chosen to create stress conditions; selection criteria not stated.

axioms (1)

domain assumption: Narrative perturbations on clinical cases simulate realistic stresses that would affect LLM safety in deployment
This premise underpins the entire stress-audit claim and is invoked when interpreting the divergence in performance.

pith-pipeline@v0.9.1-grok · 5696 in / 1405 out tokens · 30936 ms · 2026-06-27T20:08:50.969945+00:00 · methodology

