arXiv: 2605.16993 · v1 · submitted 2026-05-16 · 💻 cs.CY · cs.AI · cs.LG

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

Pith reviewed 2026-05-19 19:14 UTC · model grok-4.3

keywords clinical AI · adversarial robustness · cross-lingual drift · low-resource healthcare · diagnostic accuracy · chest X-ray · language models · Nigeria

The pith

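A chest X-ray classifier drops from 89.3 percent to 62.0 percent accuracy under imperceptibly small image perturbations, and clinical language models lose up to 30 points of accuracy when text cases are rewritten in Nigerian Pidgin or Yoruba-inflected English.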

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits two safety problems in clinical AI that current tests overlook. It applies small adversarial perturbations to chest X-ray images and, in a separate experiment, presents the same 20 clinical text cases in Standard English, Nigerian Pidgin, and Yoruba-inflected English. Both the image model and the language models lose substantial diagnostic performance under these conditions. A reader would care because these setups are meant to mirror the noisy equipment and mixed-language reality of many primary clinics. The results give a concrete failure range for deployment outside clean research data.

Core claim

A DenseNet121 fine-tuned on the COVID-QU-Ex chest X-ray set drops from 89.3 percent to 62.0 percent diagnostic accuracy under Fast Gradient Method (FGM) perturbations at epsilon = 0.021, a magnitude the paper describes as imperceptible to the human eye. Common defenses such as Gaussian smoothing and ensemble voting do not restore clinical safety. In a separate experiment, Llama3.1:8b and the Africa-focused NatLAS (N-ATLAS) model lose accuracy on twenty COVID-19 clinical cases when the text is switched from Standard English to Nigerian Pidgin or Yoruba-inflected English: Llama3.1:8b falls from 80.0 percent to 65.0 percent on Pidgin, and NatLAS falls from 85.0 percent to 55.0 percent, with diagnosis consistency dropping to 50 percent.
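
To make the perturbation step concrete, here is a minimal sketch of a single-step, L-infinity Fast Gradient Method update on a toy logistic classifier. It is not the paper's DenseNet121 pipeline: the weights, bias, and input below are made up for illustration, and the sketch assumes inputs are scaled to [0, 1] and that epsilon bounds the per-pixel change.

```python
import math

def logit(w, b, x):
    """Linear score of a toy logistic classifier (hypothetical weights)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_proba(w, b, x):
    """Probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit(w, b, x)))

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgm_linf(w, b, x, y, eps):
    """One gradient-sign step of size eps, clipped to the valid range [0, 1]."""
    grad = input_gradient(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Illustrative values only; none of these come from the paper.
w = [4.0, -3.0, 2.5, -1.5]
b = -0.4
x = [0.50, 0.40, 0.30, 0.20]   # four "pixels" in [0, 1]
y = 1                          # true label
eps = 0.021                    # the perturbation budget quoted in the paper

x_adv = fgm_linf(w, b, x, y, eps)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
print(f"clean p={p_clean:.3f}, adversarial p={p_adv:.3f}")
```

For a linear score and no clipping, the step lowers the logit by exactly eps times the sum of absolute weights, which is why a per-pixel change this small can still move a prediction when many pixels contribute.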

What carries the argument

Dual audit that pairs Fast Gradient Method image perturbations with cross-lingual testing of language models on Pidgin and Yoruba-inflected clinical cases.

If this is right

  • Standard defensive techniques such as Gaussian smoothing fail to restore reliable performance.
  • The measured accuracy drops define a failure envelope relevant to Primary Health Centre use in Nigeria.
  • Current evaluation practices that rely on clean English inputs do not predict real-world behavior.
  • New model designs must incorporate adversarial hardening and dialect coverage to be clinically safe.
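
The Gaussian smoothing defense mentioned in the first bullet can be sketched as a small linear filter. The example below uses a 3x3 binomial approximation of a Gaussian kernel with edge replication, which is an assumption: the paper's kernel size and sigma are not given here. Because the filter is linear, smoothing a perturbed image equals the smoothed image plus the smoothed perturbation, so the sketch filters two hypothetical perturbation patterns on their own.

```python
def gaussian_smooth_3x3(img):
    """Apply a 3x3 binomial (approximately Gaussian) blur with edge replication."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(h - 1, max(0, r + dr))  # replicate border rows
                    cc = min(w - 1, max(0, c + dc))  # replicate border columns
                    acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc / 16.0
    return out

eps = 0.021
size = 6
# Hypothetical perturbations: alternating sign (high frequency) and constant sign (low frequency).
checker = [[eps if (r + c) % 2 == 0 else -eps for c in range(size)] for r in range(size)]
constant = [[eps] * size for _ in range(size)]

smoothed_checker = gaussian_smooth_3x3(checker)
smoothed_constant = gaussian_smooth_3x3(constant)
print(smoothed_checker[2][2], smoothed_constant[2][2])
```

In this toy setting the blur cancels the alternating pattern at interior pixels but passes the constant-sign pattern through unchanged. That is one way a smoothing defense can fall short; whether it explains the failure reported in the paper is not established here.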

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fragility patterns likely appear in other imaging tasks and languages, pointing to a need for wider robustness benchmarks.
  • Adding simulated noise and dialect examples during training may reduce these drops in later models.
  • Without fixes, rollout of current clinical AI in diverse settings carries measurable risk of misdiagnosis.

Load-bearing premise

The twenty chosen clinical cases and the fixed perturbation size of epsilon equal to 0.021 stand in for the actual noisy images and spoken dialects found in Nigerian Primary Health Centres.
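
A quick way to see how much weight twenty cases can bear is to put a confidence interval around the reported accuracies. The sketch below computes 95 percent Wilson score intervals, assuming that 85.0 percent and 55.0 percent correspond to 17 and 11 correct answers out of 20 single-run cases. That mapping is an inference from the abstract, not a figure reported by the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

n_cases = 20
# Assumed counts: 17/20 = 85.0% (Standard English), 11/20 = 55.0% (dialect variant).
lo_en, hi_en = wilson_interval(17, n_cases)
lo_dv, hi_dv = wilson_interval(11, n_cases)
print(f"85.0%: [{lo_en:.3f}, {hi_en:.3f}]")
print(f"55.0%: [{lo_dv:.3f}, {hi_dv:.3f}]")
```

Under these assumptions the two intervals are wide and overlap. Because the same cases are scored in each language, a paired test is the more appropriate comparison, but the width alone shows how coarse a 20-case estimate is.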

What would settle it

Apply the same models to chest X-ray images taken with typical low-resource equipment noise in Nigerian clinics or to real patient transcripts in local dialects and check whether accuracy remains above 80 percent.

Original abstract

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript audits clinical AI for adversarial image fragility and cross-lingual diagnostic drift in low-resource healthcare settings. Using DenseNet121 on the COVID-QU-Ex dataset, it reports diagnostic accuracy collapsing from 89.3% to 62.0% under FGM perturbation with epsilon=0.021. In parallel, language models like Llama3.1:8b and NatLAS show accuracy drops on Nigerian Pidgin and Yoruba-inflected English inputs for 20 COVID-19 cases, with NatLAS dropping to 55.0%. The study concludes with calls for hardened and inclusive clinical AI architectures.

Significance. If substantiated, these results would be significant for AI safety in healthcare, particularly in low-resource environments such as Nigerian Primary Health Centres. The empirical approach using public benchmarks and standard models provides a quantitative failure envelope. However, the lack of detailed methodology limits immediate impact. Strengths include addressing orthogonal vulnerabilities and focusing on underrepresented settings.

major comments (3)
  1. Abstract: The assertion that epsilon=0.021 produces changes 'imperceptible to the human eye' is central to the fragility claim but lacks supporting evidence such as pixel-value histograms, comparisons to typical scanner noise in Nigerian PHCs, or results from a blinded clinician review. Without this, the transferability to real deployment conditions is not established.
  2. Abstract (language experiment): The language fragility results are based on only 20 clinical cases without reported confidence intervals, details on case selection criteria, prompt templates used, or controls for selection bias. This small sample size and lack of statistical reporting undermine the reliability of the reported drops (e.g., to 55.0% for NatLAS) and the claim of representativeness for cross-lingual drift.
  3. Methods (implied from abstract): The manuscript provides no details on statistical significance testing for the accuracy drops, exact dataset splits for the COVID-QU-Ex fine-tuning, or the code for perturbation generation, which are necessary to verify the central quantitative claims of collapse from 89.3% to 62.0%.
minor comments (2)
  1. Abstract: Clarify the exact definition of 'diagnosis consistency' that fell to 50% in the language experiment.
  2. Abstract: Provide more context on why Gaussian smoothing and ensemble voting were chosen as defensive strategies and their specific implementation details.
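
The first minor comment turns on what 'diagnosis consistency' measures. One plausible reading, offered as an assumption because the abstract does not define the term, is the fraction of cases for which a model returns the same diagnosis in every language variant, regardless of whether that diagnosis is correct. The function name and the toy labels below are hypothetical.

```python
def diagnosis_consistency(predictions_by_variant):
    """Fraction of cases with an identical predicted label across all variants.

    predictions_by_variant maps a variant name to a list of predicted labels,
    with the same case order in every list. This definition is an assumption.
    """
    variants = list(predictions_by_variant.values())
    n_cases = len(variants[0])
    if any(len(v) != n_cases for v in variants):
        raise ValueError("every variant needs one prediction per case")
    agree = sum(1 for case_preds in zip(*variants) if len(set(case_preds)) == 1)
    return agree / n_cases

# Made-up predictions for four cases; not data from the paper.
toy_predictions = {
    "standard_english": ["covid", "covid", "normal", "pneumonia"],
    "nigerian_pidgin": ["covid", "normal", "normal", "covid"],
    "yoruba_inflected": ["covid", "covid", "normal", "pneumonia"],
}
consistency = diagnosis_consistency(toy_predictions)
print(consistency)
```

Under this reading, consistency and accuracy are separate quantities: a model can be consistent and wrong, so the two would need to be reported side by side.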

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have prompted us to clarify and strengthen several aspects of our manuscript. We respond to each major comment below, indicating the revisions made.

Point-by-point responses
  1. Referee: Abstract: The assertion that epsilon=0.021 produces changes 'imperceptible to the human eye' is central to the fragility claim but lacks supporting evidence such as pixel-value histograms, comparisons to typical scanner noise in Nigerian PHCs, or results from a blinded clinician review. Without this, the transferability to real deployment conditions is not established.

    Authors: We concur that empirical support for the imperceptibility of the perturbations is important for the claim's validity. In the revised manuscript, we have added pixel-value difference histograms and comparisons to typical noise levels reported in medical imaging literature for low-resource scanners. We also cite established perceptual thresholds from adversarial example research indicating that epsilon values below 0.03 are generally imperceptible. A blinded clinician review was not performed in this study due to resource constraints but is acknowledged as a valuable direction for future validation.
    revision: partial

  2. Referee: Abstract (language experiment): The language fragility results are based on only 20 clinical cases without reported confidence intervals, details on case selection criteria, prompt templates used, or controls for selection bias. This small sample size and lack of statistical reporting undermine the reliability of the reported drops (e.g., to 55.0% for NatLAS) and the claim of representativeness for cross-lingual drift.

    Authors: We appreciate this observation on the language experiment's limitations. We have expanded the Methods and Results sections to include bootstrap-derived 95% confidence intervals for all reported accuracies. Case selection criteria (random sampling from the available clinical cases), the exact prompt templates used for each language variant, and bias controls (such as averaging over three independent prompt phrasings) are now detailed. While the sample remains modest and we have moderated our language regarding broad representativeness, these additions improve transparency and allow readers to better assess the findings.
    revision: yes

  3. Referee: Methods (implied from abstract): The manuscript provides no details on statistical significance testing for the accuracy drops, exact dataset splits for the COVID-QU-Ex fine-tuning, or the code for perturbation generation, which are necessary to verify the central quantitative claims of collapse from 89.3% to 62.0%.

    Authors: We thank the referee for highlighting these methodological gaps. The revised Methods section now specifies the dataset splits (70% training, 15% validation, 15% test) for the COVID-QU-Ex fine-tuning, includes statistical significance testing via McNemar's test for paired accuracy comparisons (with p-values reported), and provides a reference to the publicly available code repository containing the FGM perturbation generation scripts and model fine-tuning details. These changes enable full reproducibility of the reported accuracy collapses.
    revision: yes
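
The two statistical tools named in the responses above, McNemar's test for paired comparisons and bootstrap confidence intervals, can be sketched as follows. This is a minimal illustration, not the authors' code. The per-case outcomes are made up so that the marginals equal 17/20 and 11/20; the real pairing of correct and incorrect cases is not given here, so the resulting p-value and interval are illustrative only.

```python
import math
import random

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired per-case correctness flags (1/0)."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b          # discordant pairs carry all the information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, seed=0):
    """Percentile 95% interval for the paired accuracy difference (a minus b)."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample cases, keeping pairs intact
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[n_boot * 25 // 1000], diffs[n_boot * 975 // 1000 - 1]

# Hypothetical pairing: 10 cases right in both, 7 right only in English,
# 1 right only in the dialect variant, 2 wrong in both.
english = [1] * 17 + [0] * 3
variant = [1] * 10 + [0] * 7 + [1] * 1 + [0] * 2

p_value = mcnemar_exact(english, variant)
ci_low, ci_high = bootstrap_diff_ci(english, variant)
print(f"McNemar exact p = {p_value:.4f}")
print(f"bootstrap 95% CI for accuracy drop: [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole cases keeps each English and dialect result paired, which matches the design in which the same cases are presented in every language.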

Circularity Check

0 steps flagged

Empirical audit with external benchmarks; no derivations or self-referential reductions

Full rationale

The manuscript reports experimental results from fine-tuning DenseNet121 on the public COVID-QU-Ex dataset (85,318 images) and applying standard FGM perturbations plus language tests on 20 cases. No equations, uniqueness theorems, ansatzes, or predictions are derived; accuracy drops (89.3% to 62.0%, 85.0% to 55.0%) are direct empirical measurements against external data and models. No self-citations are load-bearing, no fitted parameters are renamed as predictions, and the study is self-contained against public benchmarks without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical audit relying on standard machine learning evaluation practices rather than new derivations; no free parameters are fitted to produce the headline claims, and no new entities are postulated.

axioms (2)
  • domain assumption: Standard assumptions of supervised learning on labeled medical images hold for the COVID-QU-Ex dataset.
    Invoked implicitly when reporting accuracy on the fine-tuned DenseNet121.
  • domain assumption: The 20 selected COVID-19 cases are representative of clinical presentation in low-resource Nigerian settings.
    Required for generalizing the language fragility results.

pith-pipeline@v0.9.0 · 5849 in / 1286 out tokens · 34821 ms · 2026-05-19T19:14:59.408701+00:00 · methodology
