Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings
Pith reviewed 2026-05-19 19:14 UTC · model grok-4.3
The pith
A chest X-ray classifier's accuracy falls from 89 percent to 62 percent under tiny, invisible image changes, and clinical language models separately lose accuracy when cases are written in Nigerian dialects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned DenseNet121 on the COVID-QU-Ex chest X-ray set shows diagnostic accuracy falling from 89.3 percent to 62.0 percent under Fast Gradient Method perturbations at epsilon equal to 0.021, a level invisible to human observers. Common defenses such as Gaussian smoothing and ensemble voting do not restore safety. In separate tests, Llama3.1:8b and the Africa-focused NatLAS model lose accuracy on twenty clinical cases when switched from Standard English to Nigerian Pidgin or Yoruba-inflected English, with the latter model falling from 85.0 percent to 55.0 percent and consistency reaching only 50 percent.
What carries the argument
Dual audit that pairs Fast Gradient Method image perturbations with cross-lingual testing of language models on Pidgin and Yoruba-inflected clinical cases.
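For readers unfamiliar with the attack, the following is a minimal sketch of a Fast Gradient Method step in its common L-infinity (sign) form. It is not the authors' code: the paper attacks a fine-tuned DenseNet121, whereas this example uses a hypothetical four-feature logistic model so that it runs with the standard library alone. The weights, the input, the assumed [0, 1] pixel range, and all function names are illustrative assumptions; only epsilon = 0.021 comes from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a toy logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def loss_gradient_wrt_input(weights, bias, x, label):
    """Gradient of binary cross-entropy with respect to the input x.

    For a logistic model this is (p - y) * w for each input feature.
    """
    p = predict(weights, bias, x)
    return [(p - label) * w for w in weights]


def sign(v):
    return (v > 0) - (v < 0)


def fgm_linf(x, grad, epsilon, lo=0.0, hi=1.0):
    """One sign-gradient step of size epsilon, clipped to the valid range."""
    return [min(hi, max(lo, xi + epsilon * sign(g))) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    weights = [10.0, -10.0, 10.0, -10.0]  # hypothetical, deliberately steep
    bias = 0.0
    x_clean = [0.51, 0.50, 0.51, 0.50]    # hypothetical "pixels" in [0, 1]
    label = 1

    grad = loss_gradient_wrt_input(weights, bias, x_clean, label)
    x_adv = fgm_linf(x_clean, grad, epsilon=0.021)

    print("clean probability:      ", predict(weights, bias, x_clean))
    print("adversarial probability:", predict(weights, bias, x_adv))
```

In this toy setup no feature moves by more than epsilon, yet the predicted class changes. That is the mechanism behind the reported accuracy drop, although the size of the effect on a real network depends on the model and data.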
If this is right
- Standard defensive techniques such as Gaussian smoothing fail to restore reliable performance.
- The measured accuracy drops define a failure envelope relevant to Primary Health Centre use in Nigeria.
- Current evaluation practices that rely on clean English inputs do not predict real-world behavior.
- New model designs must incorporate adversarial hardening and dialect coverage to be clinically safe.
Where Pith is reading between the lines
- The same fragility patterns likely appear in other imaging tasks and languages, pointing to a need for wider robustness benchmarks.
- Adding simulated noise and dialect examples during training may reduce these drops in later models.
- Without fixes, rollout of current clinical AI in diverse settings carries measurable risk of misdiagnosis.
Load-bearing premise
The twenty chosen clinical cases and the fixed perturbation size of epsilon equal to 0.021 stand in for the actual noisy images and spoken dialects found in Nigerian Primary Health Centres.
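To make the perturbation size concrete, the short sketch below converts epsilon into gray levels. It assumes pixel intensities are normalised to the range [0, 1] before the attack, which the abstract does not state. The function name and the bit depths are illustrative.

```python
def epsilon_in_gray_levels(epsilon, bit_depth=8):
    """Convert a perturbation bound on [0, 1]-scaled pixels into gray levels."""
    max_level = 2 ** bit_depth - 1
    return epsilon * max_level


if __name__ == "__main__":
    # Assumption: images are scaled to [0, 1]; the paper does not confirm this.
    print(epsilon_in_gray_levels(0.021, bit_depth=8))
    print(epsilon_in_gray_levels(0.021, bit_depth=12))
```

Under that assumption, epsilon = 0.021 corresponds to a shift of a little over five gray levels per pixel in an 8-bit image. Whether that is below typical scanner noise in Nigerian clinics is exactly what the premise leaves open.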
What would settle it
Apply the same models to chest X-ray images taken with typical low-resource equipment noise in Nigerian clinics or to real patient transcripts in local dialects and check whether accuracy remains above 80 percent.
Original abstract
Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits clinical AI for adversarial image fragility and cross-lingual diagnostic drift in low-resource healthcare settings. Using DenseNet121 on the COVID-QU-Ex dataset, it reports diagnostic accuracy collapsing from 89.3% to 62.0% under FGM perturbation with epsilon=0.021. In parallel, the language models Llama3.1:8b and NatLAS show accuracy drops on Nigerian Pidgin and Yoruba-inflected English inputs for 20 COVID-19 cases, with NatLAS falling from 85.0% to 55.0%. The study concludes with calls for hardened and inclusive clinical AI architectures.
Significance. If substantiated, these results would be significant for AI safety in healthcare, particularly in low-resource environments such as Nigerian Primary Health Centres. The empirical approach using public benchmarks and standard models provides a quantitative failure envelope. However, the lack of detailed methodology limits immediate impact. Strengths include addressing orthogonal vulnerabilities and focusing on underrepresented settings.
major comments (3)
- Abstract: The assertion that epsilon=0.021 produces changes 'imperceptible to the human eye' is central to the fragility claim but lacks supporting evidence such as pixel-value histograms, comparisons to typical scanner noise in Nigerian PHCs, or results from a blinded clinician review. Without this, the transferability to real deployment conditions is not established.
- Abstract (language experiment): The language fragility results are based on only 20 clinical cases without reported confidence intervals, details on case selection criteria, prompt templates used, or controls for selection bias. This small sample size and lack of statistical reporting undermine the reliability of the reported drops (e.g., to 55.0% for NatLAS) and the claim of representativeness for cross-lingual drift.
- Methods (implied from abstract): The manuscript provides no details on statistical significance testing for the accuracy drops, exact dataset splits for the COVID-QU-Ex fine-tuning, or the code for perturbation generation, which are necessary to verify the central quantitative claims of collapse from 89.3% to 62.0%.
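The sample-size concern in the second major comment can be illustrated with a confidence interval for a proportion. The sketch below uses the Wilson score interval, a standard closed-form method chosen here for illustration rather than taken from the paper. The correct-case counts 17/20 and 11/20 are inferred from the reported 85.0% and 55.0% on 20 cases. The function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Counts inferred from the reported NatLAS accuracies on 20 cases.
    for name, correct in [("Standard English", 17), ("Dialect variant", 11)]:
        lo, hi = wilson_interval(correct, 20)
        print(f"{name}: {correct}/20, 95% CI approx. [{lo:.3f}, {hi:.3f}]")
```

With n = 20 the intervals span tens of percentage points, which is why the referee asks for interval estimates before the drop is treated as established.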
minor comments (2)
- Abstract: Clarify the exact definition of 'diagnosis consistency' that fell to 50% in the language experiment.
- Abstract: Provide more context on why Gaussian smoothing and ensemble voting were chosen as defensive strategies and their specific implementation details.
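The abstract reports that diagnosis consistency fell to 50% but does not define the metric. One possible reading, assumed here purely for illustration, is the fraction of cases for which a model returns the same diagnosis in every language variant. The function name and the toy labels below are hypothetical and are not taken from the paper.

```python
def diagnosis_consistency(predictions_by_variant):
    """Fraction of cases with an identical diagnosis across all variants.

    predictions_by_variant maps a variant name to a list of predicted
    diagnoses, with one entry per case and the same case order throughout.
    """
    columns = list(predictions_by_variant.values())
    if not columns:
        raise ValueError("at least one language variant is required")
    n_cases = len(columns[0])
    if n_cases == 0 or any(len(col) != n_cases for col in columns):
        raise ValueError("each variant needs the same non-zero number of cases")
    agreeing = sum(1 for case in zip(*columns) if len(set(case)) == 1)
    return agreeing / n_cases


if __name__ == "__main__":
    # Toy data for four hypothetical cases; not results from the paper.
    toy = {
        "english": ["covid", "pneumonia", "normal", "covid"],
        "pidgin":  ["covid", "normal",    "normal", "pneumonia"],
        "yoruba":  ["covid", "pneumonia", "normal", "covid"],
    }
    print(diagnosis_consistency(toy))
```

Other definitions, such as pairwise agreement with the Standard English output, would give different numbers, which is why the referee asks for the exact definition.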
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have prompted us to clarify and strengthen several aspects of our manuscript. We respond to each major comment below, indicating the revisions made.
Point-by-point responses
- Referee: Abstract: The assertion that epsilon=0.021 produces changes 'imperceptible to the human eye' is central to the fragility claim but lacks supporting evidence such as pixel-value histograms, comparisons to typical scanner noise in Nigerian PHCs, or results from a blinded clinician review. Without this, the transferability to real deployment conditions is not established.
  Authors: We concur that empirical support for the imperceptibility of the perturbations is important for the claim's validity. In the revised manuscript, we have added pixel-value difference histograms and comparisons to typical noise levels reported in medical imaging literature for low-resource scanners. We also cite established perceptual thresholds from adversarial example research indicating that epsilon values below 0.03 are generally imperceptible. A blinded clinician review was not performed in this study due to resource constraints but is acknowledged as a valuable direction for future validation.
  Revision: partial
- Referee: Abstract (language experiment): The language fragility results are based on only 20 clinical cases without reported confidence intervals, details on case selection criteria, prompt templates used, or controls for selection bias. This small sample size and lack of statistical reporting undermine the reliability of the reported drops (e.g., to 55.0% for NatLAS) and the claim of representativeness for cross-lingual drift.
  Authors: We appreciate this observation on the language experiment's limitations. We have expanded the Methods and Results sections to include bootstrap-derived 95% confidence intervals for all reported accuracies. Case selection criteria (random sampling from the available clinical cases), the exact prompt templates used for each language variant, and bias controls (such as averaging over three independent prompt phrasings) are now detailed. While the sample remains modest and we have moderated our language regarding broad representativeness, these additions improve transparency and allow readers to better assess the findings.
  Revision: yes
- Referee: Methods (implied from abstract): The manuscript provides no details on statistical significance testing for the accuracy drops, exact dataset splits for the COVID-QU-Ex fine-tuning, or the code for perturbation generation, which are necessary to verify the central quantitative claims of collapse from 89.3% to 62.0%.
  Authors: We thank the referee for highlighting these methodological gaps. The revised Methods section now specifies the dataset splits (70% training, 15% validation, 15% test) for the COVID-QU-Ex fine-tuning, includes statistical significance testing via McNemar's test for paired accuracy comparisons (with p-values reported), and provides a reference to the publicly available code repository containing the FGM perturbation generation scripts and model fine-tuning details. These changes enable full reproducibility of the reported accuracy collapses.
  Revision: yes
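The rebuttal names McNemar's test for paired accuracy comparisons. As a rough sketch of what that involves, the code below implements the exact (binomial) form of the test on the discordant pairs. The function names and the example counts are hypothetical; the paper's per-case outcomes are not available here, so nothing below reproduces a reported p-value.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count cases correct only under condition A (b) and only under B (c)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired lists must have equal length")
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if b_ok and not a_ok)
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Two hypothetical splits, both consistent with a net loss of six cases.
    print(mcnemar_exact_p(6, 0))
    print(mcnemar_exact_p(7, 1))
```

The two hypothetical splits give different p-values for the same net accuracy change, which shows why per-case paired outcomes, not only aggregate accuracies, are needed to verify the significance claim.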
Circularity Check
Empirical audit with external benchmarks; no derivations or self-referential reductions
Full rationale
The manuscript reports experimental results from fine-tuning DenseNet121 on the public COVID-QU-Ex dataset (85,318 images) and applying standard FGM perturbations plus language tests on 20 cases. No equations, uniqueness theorems, ansatzes, or predictions are derived; accuracy drops (89.3% to 62.0%, 85.0% to 55.0%) are direct empirical measurements against external data and models. No self-citations are load-bearing, no fitted parameters are renamed as predictions, and the study is self-contained against public benchmarks without reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard assumptions of supervised learning on labeled medical images hold for the COVID-QU-Ex dataset.
- domain assumption: The 20 selected COVID-19 cases are representative of clinical presentation in low-resource Nigerian settings.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS collapsed from 85.0% to 55.0%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.