Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
Pith reviewed 2026-06-27 20:08 UTC · model grok-4.3
The pith
Narrative stress testing uncovers latent safety failures in medical LLMs that accuracy benchmarks miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension.
What carries the argument
AI-MASLD, the stress-audit framework that applies six narrative perturbation probes and three indices (metabolic index, perturbation flip rate, counterfactual fairness index) to detect safety-relevant failures in clinical LLMs.
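The manuscript as reviewed gives no formulas for these indices, so the Python sketch below is one plausible reading rather than the authors' definitions. It assumes the perturbation flip rate is the fraction of cases whose answer changes between the clean and the perturbed prompt, and that the counterfactual fairness index is the fraction of cases whose answer stays the same when only a demographic attribute is swapped. The function names and toy answer lists are hypothetical. The metabolic index is left out because the text offers no basis for guessing its form.

```python
from typing import Sequence


def _check_paired(a: Sequence[str], b: Sequence[str]) -> None:
    # Both lists must describe the same non-empty set of cases, in the same order.
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("answer lists must be non-empty and of equal length")


def perturbation_flip_rate(baseline: Sequence[str], stressed: Sequence[str]) -> float:
    """ASSUMED definition: share of cases whose answer differs under perturbation."""
    _check_paired(baseline, stressed)
    flips = sum(b != s for b, s in zip(baseline, stressed))
    return flips / len(baseline)


def counterfactual_fairness_index(original: Sequence[str], swapped: Sequence[str]) -> float:
    """ASSUMED definition: share of cases whose answer is unchanged after a demographic swap."""
    _check_paired(original, swapped)
    unchanged = sum(o == s for o, s in zip(original, swapped))
    return unchanged / len(original)


# Hypothetical answers for four cases (not data from the paper).
baseline_answers = ["A", "B", "C", "D"]
stressed_answers = ["A", "C", "C", "D"]   # one answer flips under stress
swapped_answers = ["A", "B", "C", "D"]    # no answer changes after the swap

print(perturbation_flip_rate(baseline_answers, stressed_answers))        # 0.25
print(counterfactual_fairness_index(baseline_answers, swapped_answers))  # 1.0
```

Under these assumed definitions a lower flip rate and a higher fairness index both indicate more stable behavior, but neither says anything about whether the stable answer is correct.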
If this is right
- Clean baseline accuracy does not predict performance under narrative stress.
- Quantized models can exhibit low flip rates while undergoing functional collapse.
- Medical supervised fine-tuning reduces logical stability, fairness, and information extraction.
- An open-weight model can match or exceed proprietary models on all measured safety indices.
- Narrative stress auditing detects safety issues invisible to accuracy-only evaluation.
Where Pith is reading between the lines
- Deployment decisions for clinical LLMs should require stress-audit results in addition to benchmark scores.
- The same narrative stress approach could be adapted to evaluate LLMs in other regulated domains such as law or finance.
- Fine-tuning methods for medical LLMs may need redesign to avoid trading stability and fairness for domain accuracy.
- Perturbation flip rate alone is insufficient as a safety metric because it can mask collapse in quantized models.
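The point that a flip rate can look healthy while a model is failing can be shown with a toy example. The sketch below is not the paper's mechanism or data; it assumes flip rate means the fraction of answers that differ between two prompt conditions, and all answer lists are hypothetical. A model that emits the same answer for every input never flips, yet its accuracy is poor.

```python
from typing import Sequence


def flip_rate(before: Sequence[str], after: Sequence[str]) -> float:
    # ASSUMED definition: share of cases whose answer differs between conditions.
    return sum(b != a for b, a in zip(before, after)) / len(before)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    # Share of cases answered correctly.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


gold = ["A", "B", "C", "D"]  # hypothetical reference answers

# Hypothetical degenerate model: ignores the input and always answers "A".
degenerate_a = ["A", "A", "A", "A"]
degenerate_b = ["A", "A", "A", "A"]

# Hypothetical responsive model: mostly right, but one answer changes under stress.
responsive_a = ["A", "B", "C", "D"]
responsive_b = ["A", "B", "C", "A"]

print(flip_rate(degenerate_a, degenerate_b), accuracy(degenerate_b, gold))  # 0.0 0.25
print(flip_rate(responsive_a, responsive_b), accuracy(responsive_b, gold))  # 0.25 0.75
```

The degenerate model scores better on flip rate and far worse on accuracy, which is why a stability metric needs to be read alongside a correctness metric.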
Load-bearing premise
The six narrative perturbation probes and the three indices validly measure safety-relevant failure modes that would appear in actual clinical deployment.
What would settle it
Real-world clinical data showing that models with high stress-test failure rates produce no more errors or harms than models with low failure rates would refute the premise; the opposite pattern would support it.
Original abstract
Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the AI-MASLD stress-audit framework for evaluating clinical large language models by adapting metabolic stress testing from hepatology. Using 240 clinical cases and six narrative perturbation probes, seven models are subjected to double-stress testing and evaluated on three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). The paper reports that while all models perform well under clean conditions, under realistic narrative stress, performance diverges into two phenotypes: quantized models exhibit pseudonormalization where low flip rates mask functional collapse, and medical supervised fine-tuning systematically degrades logical stability, fairness, and information extraction. An open-weight model matches or exceeds proprietary models on safety dimensions.
Significance. If the narrative stress probes and derived indices validly identify safety-relevant failure modes that standard accuracy benchmarks miss, this work would be significant for the responsible deployment of LLMs in medicine by establishing stress auditing as a necessary evaluation complement. The identification of pseudonormalization in quantized models and degradation from medical SFT could inform model selection and fine-tuning practices.
Major comments (3)
- [Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.
- [Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.
- [Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.
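As an illustration of the kind of quantitative support the third comment asks for, the sketch below computes a paired bootstrap confidence interval for the drop in accuracy between clean and stressed conditions. It is a generic percentile bootstrap, not an analysis from the manuscript; the function name, parameters, and the per-case 0/1 correctness encoding are assumptions made for the example.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    clean_correct: Sequence[int],
    stressed_correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean accuracy drop, lower bound, upper bound).

    Inputs are per-case correctness flags (1 = correct, 0 = incorrect),
    aligned so that index i refers to the same case in both conditions.
    """
    n = len(clean_correct)
    if n == 0 or n != len(stressed_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per-case drop: 1 if the case was lost under stress, -1 if gained, 0 otherwise.
    diffs = [c - s for c, s in zip(clean_correct, stressed_correct)]
    point = sum(diffs) / n

    # Resample cases with replacement, keeping each clean/stressed pair together.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()

    # Percentile interval from the sorted bootstrap means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return point, lower, upper


# Hypothetical flags for ten cases: all correct when clean, half correct under stress.
clean = [1] * 10
stressed = [1] * 5 + [0] * 5
drop, low, high = paired_bootstrap_ci(clean, stressed)
print(f"accuracy drop = {drop:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Resampling whole cases keeps the clean and stressed results for each case paired, which is what makes the interval reflect within-case change rather than between-group noise.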
Minor comments (2)
- [Abstract] Specify the exact seven models tested and name the open-weight model that performed competitively.
- [Methods] Clarify selection criteria and domain coverage for the 240 clinical cases.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions that will be incorporated.
Point-by-point responses
- Referee: [Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.
  Authors: We acknowledge the omission of these details in the submitted version. The revised manuscript will include the complete mathematical definitions, formulas, and step-by-step operational procedures for MI, PFR, and CFI within the Methods section, explicitly linking each index to the constructs of logical stability, fairness, and information extraction.
  Revision: yes
- Referee: [Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.
  Authors: The perturbation probes were derived from established patterns of narrative variation in clinical documentation. While the study did not include new clinician validation, the reported phenotypes are directly evidenced by the divergence between clean and stressed conditions across models. We will add a Discussion subsection that maps the probes and phenotypes to documented clinical error modes drawn from the existing literature on LLM deployment risks in medicine.
  Revision: partial
- Referee: [Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.
  Authors: We agree that quantitative support should be strengthened. The revised Results section will report the appropriate statistical tests, include error bars and effect sizes for all phenotype comparisons, and provide access to the underlying raw data via supplementary materials.
  Revision: yes
Circularity Check
No significant circularity; empirical evaluation without derivations or self-referential definitions.
Full rationale
The paper presents an empirical stress-audit framework (AI-MASLD) applied to 240 clinical cases across six narrative perturbation probes and three indices (MI, PFR, CFI). All claims rest on observed performance divergences under baseline vs. stressed conditions for seven models, with no equations, derivations, fitted parameters renamed as predictions, or self-citation chains supporting uniqueness theorems. The abstract and description contain no self-definitional steps, ansatzes smuggled via citation, or renaming of known results. This is a standard empirical comparison study whose central claims do not reduce to inputs by construction.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Number of clinical cases = 240
- Number of narrative perturbation probes = 6
Axioms (1)
- Domain assumption: Narrative perturbations on clinical cases simulate realistic stresses that would affect LLM safety in deployment.