
arxiv: 2606.07929 · v1 · pith:3KCLQBMYnew · submitted 2026-06-06 · 💻 cs.AI

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Pith reviewed 2026-06-27 20:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords medical LLMs · stress testing · safety evaluation · narrative perturbation · quantized models · supervised fine-tuning · clinical AI · AI safety

The pith

Narrative stress testing uncovers latent safety failures in medical LLMs missed by accuracy benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AI-MASLD, a stress-audit framework adapted from metabolic stress testing in hepatology, to evaluate clinical LLMs. It applies six narrative perturbation probes to 240 clinical cases across seven models and tracks results through three indices: metabolic index, perturbation flip rate, and counterfactual fairness index. All models perform similarly under clean baseline conditions, but under realistic narrative stress their results diverge into distinct phenotypes. Quantized models display pseudonormalization where low flip rates conceal functional collapse, and medical supervised fine-tuning degrades logical stability, fairness, and information extraction. One open-weight model matches or exceeds proprietary models on every safety dimension.

Core claim

Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension.

What carries the argument

AI-MASLD, the stress-audit framework that applies six narrative perturbation probes and three indices (metabolic index, perturbation flip rate, counterfactual fairness index) to detect safety-relevant failures in clinical LLMs.
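The abstract does not say how the perturbation flip rate is computed. As a minimal sketch, assuming it is the fraction of cases whose final answer changes between the clean and the perturbed version of the same case, it could look like the following. The function name, the string-valued answers, and the example data are all hypothetical and not taken from the paper.

```python
from typing import Sequence


def perturbation_flip_rate(baseline: Sequence[str], stressed: Sequence[str]) -> float:
    """Fraction of cases whose answer differs between baseline and stressed prompts.

    Assumed definition for illustration only; the paper's formula is not given.
    """
    if len(baseline) != len(stressed):
        raise ValueError("baseline and stressed must cover the same cases")
    if not baseline:
        raise ValueError("at least one case is required")
    flips = sum(b != s for b, s in zip(baseline, stressed))
    return flips / len(baseline)


# Hypothetical answers for four cases, before and after narrative perturbation.
baseline_answers = ["admit", "discharge", "admit", "refer"]
stressed_answers = ["admit", "admit", "admit", "discharge"]

pfr = perturbation_flip_rate(baseline_answers, stressed_answers)
print(pfr)  # 0.5 (cases 2 and 4 changed)
```

Under this reading the index only records whether an answer changed, not whether either answer was correct or complete, which is why it has to be read alongside the other two indices.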

If this is right

  • Clean baseline accuracy does not predict performance under narrative stress.
  • Quantized models can exhibit low flip rates while undergoing functional collapse.
  • Medical supervised fine-tuning reduces logical stability, fairness, and information extraction.
  • An open-weight model can match or exceed proprietary models on all measured safety indices.
  • Narrative stress auditing detects safety issues invisible to accuracy-only evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment decisions for clinical LLMs should require stress-audit results in addition to benchmark scores.
  • The same narrative stress approach could be adapted to evaluate LLMs in other regulated domains such as law or finance.
  • Fine-tuning methods for medical LLMs may need redesign to avoid trading stability and fairness for domain accuracy.
  • Perturbation flip rate alone is insufficient as a safety metric because it can mask collapse in quantized models.
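One way to picture how a low flip rate can coexist with collapse is a toy example in which the final decision never changes but the model stops extracting the facts that should support it. The sketch below assumes a flip rate defined as the share of changed decisions and a hypothetical extraction recall score; neither definition, nor any of the data, comes from the paper.

```python
def flip_rate(baseline, stressed):
    """Share of cases whose decision changed (assumed definition)."""
    return sum(b != s for b, s in zip(baseline, stressed)) / len(baseline)


def mean_recall(extracted, required):
    """Average fraction of required facts recovered per case (hypothetical score)."""
    per_case = [len(e & r) / len(r) for e, r in zip(extracted, required)]
    return sum(per_case) / len(per_case)


# Hypothetical facts each case requires the model to pick up.
required = [{"age", "creatinine", "allergy"}, {"dose", "weight"}, {"inr", "bleeding"}]

# Decisions are identical under stress, so the flip rate is zero.
baseline_decisions = ["admit", "adjust", "hold"]
stressed_decisions = ["admit", "adjust", "hold"]

# Extraction is complete at baseline and mostly lost under stress.
baseline_extracted = [{"age", "creatinine", "allergy"}, {"dose", "weight"}, {"inr", "bleeding"}]
stressed_extracted = [{"age"}, set(), {"inr"}]

rate = flip_rate(baseline_decisions, stressed_decisions)
recall_clean = mean_recall(baseline_extracted, required)
recall_stressed = mean_recall(stressed_extracted, required)

print(rate, recall_clean, round(recall_stressed, 3))
```

In this toy case the flip rate alone would rate the model as perfectly stable, while the second score shows that most of the supporting information is no longer being used.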

Load-bearing premise

The six narrative perturbation probes and the three indices validly measure safety-relevant failure modes that would appear in actual clinical deployment.

What would settle it

Direct observation in real clinical use showing that models with high stress-test failure rates produce no more errors or harms than models with low failure rates.

Original abstract

Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.
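The abstract names a counterfactual fairness index without defining it. A common way to operationalize counterfactual fairness for text models is to present pairs of prompts that differ only in a demographic attribute and check whether the decision stays the same. The sketch below assumes that reading; the function name, the pair format, and the data are hypothetical, and the paper's actual CFI may be computed differently.

```python
from typing import Iterable, Tuple


def counterfactual_consistency(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of counterfactual pairs that receive the same decision.

    Each pair holds the decision for the original case and for a variant that
    differs only in a demographic attribute (assumed setup, not the paper's).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one counterfactual pair is required")
    agree = sum(original == variant for original, variant in pairs)
    return agree / len(pairs)


# Hypothetical decisions for four counterfactual pairs.
decisions = [
    ("refer", "refer"),
    ("opioid", "nsaid"),  # decision changed when only the demographic changed
    ("admit", "admit"),
    ("admit", "admit"),
]

cfi = counterfactual_consistency(decisions)
print(cfi)  # 0.75
```

A value of 1.0 would mean no decision changed when only the demographic attribute was swapped; lower values indicate that the swap alone altered the recommendation.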

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AI-MASLD, a stress-audit framework that evaluates clinical large language models by adapting metabolic stress testing from hepatology. Using 240 clinical cases across six narrative perturbation probes, seven models are subjected to double-stress testing and scored on three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). The paper reports that all models perform well under clean conditions but diverge into two phenotypes under realistic narrative stress: quantized models exhibit pseudonormalization, in which low flip rates mask functional collapse, and medical supervised fine-tuning systematically degrades logical stability, fairness, and information extraction. An open-weight model matches or exceeds proprietary models on every safety dimension.

Significance. If the narrative stress probes and derived indices validly identify safety-relevant failure modes that standard accuracy benchmarks miss, this work would be significant for the responsible deployment of LLMs in medicine by establishing stress auditing as a necessary evaluation complement. The identification of pseudonormalization in quantized models and degradation from medical SFT could inform model selection and fine-tuning practices.

major comments (3)
  1. [Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.
  2. [Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.
  3. [Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.
minor comments (2)
  1. [Abstract] Specify the exact seven models tested and name the open-weight model that performed competitively.
  2. [Methods] Clarify selection criteria and domain coverage for the 240 clinical cases.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions that will be incorporated.

Point-by-point responses
  1. Referee: [Methods] The definitions, formulas, and operational details for the metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI) are absent from the manuscript, preventing verification of whether these quantities measure logical stability, fairness, or information extraction as claimed.

    Authors: We acknowledge the omission of these details in the submitted version. The revised manuscript will include the complete mathematical definitions, formulas, and step-by-step operational procedures for MI, PFR, and CFI within the Methods section, explicitly linking each index to the constructs of logical stability, fairness, and information extraction. revision: yes

  2. Referee: [Results] No mapping, clinician validation, or comparison to documented clinical error modes is provided to anchor the six narrative perturbation probes or the reported phenotypes to actual deployment safety failures; the central claim that these detect latent pathology therefore rests on an untested premise.

    Authors: The perturbation probes were derived from established patterns of narrative variation in clinical documentation. While the study did not include new clinician validation, the reported phenotypes are directly evidenced by the divergence between clean and stressed conditions across models. We will add a Discussion subsection that maps the probes and phenotypes to documented clinical error modes drawn from the existing literature on LLM deployment risks in medicine. revision: partial

  3. Referee: [Results] The statements that medical SFT 'systematically degraded' performance and that quantized models exhibit 'functional collapse' are presented without statistical tests, error bars, raw data, or effect sizes, undermining the quantitative support for the two stress-response phenotypes.

    Authors: We agree that quantitative support should be strengthened. The revised Results section will report the appropriate statistical tests, include error bars and effect sizes for all phenotype comparisons, and provide access to the underlying raw data via supplementary materials. revision: yes
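The kind of quantitative support requested in major comment 3 can be illustrated with a paired bootstrap. The sketch below assumes each model has a per-case flip indicator (1 if the answer changed under stress, 0 otherwise) on the same 240 cases, and estimates a percentile confidence interval for the difference in flip rate between two models. The function name, the choice of a percentile bootstrap, and the data are all hypothetical; the paper does not state which tests the authors will use.

```python
import random


def paired_bootstrap_ci(flips_a, flips_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in flip rate (A minus B).

    Cases are resampled jointly so the pairing between models is preserved.
    """
    if len(flips_a) != len(flips_b) or not flips_a:
        raise ValueError("both models need flip indicators for the same cases")
    rng = random.Random(seed)
    n = len(flips_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(flips_a[i] - flips_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = (sum(flips_a) - sum(flips_b)) / n
    return observed, lower, upper


# Hypothetical flip indicators on 240 shared cases:
# model A flips on 60 cases, model B on 24.
model_a = [1] * 60 + [0] * 180
model_b = [1] * 24 + [0] * 216

observed, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(f"difference = {observed:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support a claim that one model flips more often than the other; an interval that spans zero would not, whatever the point estimate.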

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation without derivations or self-referential definitions.

full rationale

The paper presents an empirical stress-audit framework (AI-MASLD) applied to 240 clinical cases across six narrative perturbation probes and three indices (MI, PFR, CFI). All claims rest on observed performance divergences under baseline vs. stressed conditions for seven models, with no equations, derivations, fitted parameters renamed as predictions, or self-citation chains supporting uniqueness theorems. The abstract and description contain no self-definitional steps, ansatzes smuggled via citation, or renaming of known results. This is a standard empirical comparison study whose central claims do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only; the framework introduces three new indices whose computation rules are not stated, implying unlisted choices in how perturbations are generated and scores aggregated. Experimental design parameters are chosen by hand.

free parameters (2)
  • Number of clinical cases = 240
    240 cases selected for the study; no derivation given.
  • Number of narrative perturbation probes = 6
    Six probes chosen to create stress conditions; selection criteria not stated.
axioms (1)
  • domain assumption: Narrative perturbations on clinical cases simulate realistic stresses that would affect LLM safety in deployment
    This premise underpins the entire stress-audit claim and is invoked when interpreting the divergence in performance.


