
arxiv: 2605.24806 · v1 · pith:GZ4YKI23new · submitted 2026-05-24 · 💻 cs.SD · cs.AI · eess.AS

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Pith reviewed 2026-06-30 00:13 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords Parkinson's disease detection, zero-shot learning, speech analysis, large language models, audio models, input modalities, multilingual evaluation, acoustic features

The pith

Experiments show handcrafted acoustic features deliver steadier zero-shot Parkinson's detection from speech than raw audio in low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the form of speech input changes how well large models can detect Parkinson's disease without any training examples. It pits handcrafted acoustic features fed into a general language model against raw audio waveforms fed into audio-specialized models. Results across four languages indicate that accuracy shifts with the input type, the speaking task, and the language. Handcrafted features give more consistent results in a low-resource language like Bengali, while raw audio produces gains only on certain datasets. This distinction matters because reliable zero-shot methods could support diagnosis in languages where labeled medical speech data is scarce.

Core claim

The paper establishes that zero-shot Parkinson's disease detection from speech yields performance that depends on input modality, with handcrafted acoustic features analyzed by a general-purpose LLM providing more stable results in low-resource languages such as Bengali, while direct waveform input to audio models produces dataset-dependent improvements.

What carries the argument

The systematic comparison of two input modalities for zero-shot inference: handcrafted acoustic features processed by a general-purpose LLM versus raw audio waveforms processed by audio-capable large models.
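The feature-based arm of this comparison can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the feature names (`jitter_local`, `hnr_db`), the prompt wording, and the `PD`/`HC` label set are hypothetical and are not taken from the paper. The waveform arm would skip this step and pass the recording directly to an audio-capable model.

```python
from typing import Mapping, Optional


def build_feature_prompt(features: Mapping[str, float], task: str) -> str:
    """Render handcrafted acoustic features as a zero-shot text prompt.

    Feature names and wording are illustrative placeholders.
    """
    # Sort by name so the same recording always yields the same prompt.
    lines = [f"- {name}: {value:.4f}" for name, value in sorted(features.items())]
    return (
        "You are assisting with speech-based screening research.\n"
        f"Speech task: {task}\n"
        "Acoustic features extracted from one recording:\n"
        + "\n".join(lines)
        + "\nAnswer with exactly one label: PD or HC."
    )


def parse_label(response: str) -> Optional[str]:
    """Map a free-text model reply to 'PD', 'HC', or None if ambiguous."""
    tokens = {tok.strip(".,:;!?()\"'").upper() for tok in response.split()}
    has_pd, has_hc = "PD" in tokens, "HC" in tokens
    if has_pd == has_hc:  # both labels or neither: refuse to guess
        return None
    return "PD" if has_pd else "HC"
```

Returning `None` for ambiguous replies keeps unparseable outputs visible instead of silently counting them as one class, which matters when comparing pipelines whose models differ in how verbosely they answer.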

If this is right

  • Zero-shot detection accuracy is not fixed but changes with the choice between handcrafted features and raw audio.
  • In low-resource languages, handcrafted acoustic features produce more reliable outcomes than raw waveforms.
  • The benefit of each modality also depends on which speech task is recorded.
  • Cross-lingual evaluation is required to determine when zero-shot methods can be applied safely.
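One way to make "more stable" concrete is to summarise how much a pipeline's score moves across languages. The sketch below is an assumption about how stability could be quantified (mean, population standard deviation, and range); the paper's own definition is not given on this page, and the numbers in the usage example are placeholders, not reported results.

```python
from statistics import mean, pstdev
from typing import Dict


def stability_summary(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Summarise per-language scores for each pipeline.

    `scores` maps pipeline name -> {language: metric value}.
    A smaller spread or range means steadier behaviour across languages.
    """
    summary = {}
    for pipeline, by_language in scores.items():
        values = list(by_language.values())
        summary[pipeline] = {
            "mean": mean(values),
            "spread": pstdev(values),  # population standard deviation
            "range": max(values) - min(values),
        }
    return summary


# Placeholder values for illustration only; these are not results from the paper.
example = stability_summary({
    "features+LLM": {"lang_a": 0.60, "lang_b": 0.60},
    "waveform+LALM": {"lang_a": 0.50, "lang_b": 0.70},
})
```

Two pipelines can share the same mean while differing sharply in spread, which is the pattern the stability claim is about.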

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar modality comparisons could be tested on other neurological conditions detectable from speech.
  • Preprocessing pipelines that convert speech to features may be the safer starting point for multilingual medical screening tools.
  • Dataset-specific tuning of audio models might reduce the observed variability if applied consistently.

Load-bearing premise

The measured performance gaps between modalities stem solely from the input format itself rather than from differences in the underlying models, dataset biases, or recording conditions.

What would settle it

Re-running the four-language experiments using the exact same model architecture for both feature-based and waveform inputs on matched datasets would falsify the claim if performance differences disappear.
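If both pipelines are scored on the same recordings, whether a gap "disappears" can be checked with a paired test instead of by comparing two accuracy figures. The sketch below uses an exact McNemar test; this is a suggested analysis, not one the paper is known to report, and the function name is hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-recording correctness.

    Returns (a_only, b_only, p_value), where a_only counts recordings that
    only pipeline A classified correctly, and b_only the reverse.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both pipelines must be scored on the same recordings")
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:  # the pipelines never disagree
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail over the discordant pairs, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)
```

Only the discordant recordings carry information here, so a large dataset on which the two pipelines mostly agree can still yield an inconclusive result.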

Figures

Figures reproduced from arXiv: 2605.24806 by Muhammad Ashad Kabir, Sirajam Munira.

Figure 1: Schematic overview of the zero-shot pipeline for PD detection using LLMs and LALMs.
Original abstract

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that zero-shot PD detection from speech using large models shows performance varying across input modalities (handcrafted acoustic features fed to a general-purpose LLM versus raw waveform input to audio-capable models), speech tasks, and languages. Handcrafted features yield more stable results in low-resource languages such as Bengali, while audio input produces dataset-dependent gains, based on experiments across four languages.

Significance. If the central empirical claims hold after controlling for confounds, the work would contribute to understanding modality selection for zero-shot speech-based clinical detection tasks, with particular value for low-resource language stability. The study is an empirical comparison without fitted parameters or derivations, allowing direct falsification via replication.

major comments (1)
  1. [Experimental Setup] The experimental design routes handcrafted features through a general-purpose LLM while routing waveforms through separate audio models, without an ablation that holds the underlying model fixed and varies only the input representation. This confounds attribution of differences (including Bengali stability) to modality alone rather than model family, pretraining, or tokenization. This is load-bearing for the abstract claim that performance varies across input modalities.
minor comments (2)
  1. [Abstract] The abstract states findings without details on sample sizes, statistical tests, recording condition controls, or exact model versions used.
  2. Add a summary table of all metrics, tasks, and languages with confidence intervals to improve clarity of the cross-condition comparisons.
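The confidence intervals requested in the second minor comment could be obtained with a percentile bootstrap over per-recording correctness. This is a minimal sketch under stated assumptions: the function name, the 2000 resamples, and the 95% level are illustrative defaults, not choices made in the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for accuracy.

    `correct` holds 1 for a correct prediction and 0 otherwise.
    Returns (accuracy, lower, upper).
    """
    if not correct:
        raise ValueError("need at least one prediction")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(correct)
    # Resample recordings with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper
```

With the small cohorts typical of clinical speech datasets the resulting intervals are often wide, which is useful context when reading per-language differences.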

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important aspect of our experimental design. Below we respond directly to the major comment and outline the revisions we will make.

  1. Referee: [Experimental Setup] The experimental design routes handcrafted features through a general-purpose LLM while routing waveforms through separate audio models, without an ablation that holds the underlying model fixed and varies only the input representation. This confounds attribution of differences (including Bengali stability) to modality alone rather than model family, pretraining, or tokenization. This is load-bearing for the abstract claim that performance varies across input modalities.

    Authors: We agree that the design compares two distinct practical pipelines rather than isolating input representation while holding the model constant. Handcrafted acoustic features are conventionally processed by text-based LLMs, while raw waveforms require audio-specific models; a direct swap is not straightforward without additional engineering that would itself introduce new variables. Our goal was to evaluate these commonly deployed approaches for zero-shot PD detection. In the revised manuscript we will (1) rephrase the abstract and introduction to describe the comparison as being between the two pipelines, (2) add an explicit limitations subsection noting that model family, pretraining data, and tokenization are confounded with modality, and (3) qualify the Bengali stability result as an observation within the handcrafted-feature + LLM pipeline rather than a pure modality effect. These textual changes will be made; no new experiments are planned for this revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical modality comparison with no derivations or self-referential reductions


The paper is an empirical study that reports experimental results from comparing handcrafted acoustic features fed to an LLM versus raw waveforms fed to audio models on PD speech datasets in four languages. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described claims. Performance differences are presented as observed outcomes rather than derived quantities, so no step reduces to its own inputs by construction. The central claim remains independent of any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning study with no free parameters, axioms, or invented entities in a mathematical sense; all claims are based on experimental observations.

pith-pipeline@v0.9.1-grok · 5680 in / 1197 out tokens · 41897 ms · 2026-06-30T00:13:50.780520+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

    Introduction Parkinson’s disease (PD) is a progressive neurodegenerative disorder characterized by both motor and non-motor impairments, including bradykinesia, rigidity, tremor, cognitive decline, mood disorders, and autonomic dysfunction [1]. Globally, PD affects more than 10 million people [2] and represents the fastest-growing neurological disor...

  2. [2]

    The workflow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference

    Methodology Figure 1 provides an overview of the proposed zero-shot pipeline for PD Screening using LLMs and LALMs. The workflow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference. 2.1. Datasets To investigate how input modality influences the zero-shot in...

  3. [3]

    Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework

    Experiments and Results 3.1. Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework. LLaMA 3 (8B) and Qwen2-Audio (7B-Instruct) were obtained from the Hugging Face repository. Pengi and Audio-Reasoner were implemented from their official repositories with default inference configurations. All ex...

  4. [4]

    handcrafted acoustic features) influence how zero-shot LLM systems process and interpret speech-based clinical signals

    Discussion The results from this study suggest that input modalities (direct audio waveform vs. handcrafted acoustic features) influence how zero-shot LLM systems process and interpret speech-based clinical signals. Rather than reflecting a uniform performance hierarchy, the observed patterns indicate that different input modalities interact differ...

  5. [5]

    Conclusion This study examines the impacts of zero-shot speech-based PD detection, comparing handcrafted acoustic features analyzed by a text-based LLM with raw waveform input processed by LALMs. Across four datasets in four different languages under a unified evaluation protocol, we observed that model performance is modality-dependent: feature-based p...

  6. [6]

    Generative AI Use Disclosure ChatGPT (version 5.2, OpenAI) was used for language editing and refinement of the manuscript

  7. [7]

    Parkinson’s disease,

    B. R. Bloem, M. S. Okun, and C. Klein, “Parkinson’s disease,” The Lancet, vol. 397, no. 10291, pp. 2284–2303, 2021

  8. [8]

    Statistics on parkinson’s disease,

    P. Foundation, “Statistics on parkinson’s disease,” 2022. [Online]. Available: https://www.parkinson.org/understanding-parkinsons/statistics

  9. [9]

    Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021,

    J. D. Steinmetz, K. M. Seeher, N. Schiess, E. Nichols, B. Cao, C. Servili, V. Cavallera, E. Cousin, H. Hagins, M. E. Moberg et al., “Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021,” The Lancet Neurology, vol. 23, no. 4, pp. 344–381, 2024

  10. [10]

    Mds clinical diagnostic criteria for parkinson’s disease,

    R. B. Postuma, D. Berg, M. Stern, W. Poewe, C. W. Olanow, W. Oertel, J. Obeso, K. Marek, I. Litvan, A. E. Lang et al., “Mds clinical diagnostic criteria for parkinson’s disease,” Movement disorders, vol. 30, no. 12, pp. 1591–1601, 2015

  11. [11]

    Speech treatment for parkinson’s disease,

    L. O. Ramig, C. Fox, and S. Sapir, “Speech treatment for parkinson’s disease,” Expert review of neurotherapeutics, vol. 8, no. 2, pp. 297–309, 2008

  12. [12]

    Speech impairment in a large sample of patients with parkinson’s disease,

    A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, and S. Gates, “Speech impairment in a large sample of patients with parkinson’s disease,” Behavioural neurology, vol. 11, no. 3, pp. 131–137, 1999

  13. [13]

    Suitability of dysphonia measurements for telemonitoring of parkinson’s disease,

    M. Little, P. McSharry, E. Hunter, J. Spielman, and L. Ramig, “Suitability of dysphonia measurements for telemonitoring of parkinson’s disease,” Nature Precedings, pp. 1–1, 2008

  14. [14]

    Speech rate and rhythm in parkinson’s disease,

    S. Skodda and U. Schlegel, “Speech rate and rhythm in parkinson’s disease,” Movement disorders: official journal of the Movement Disorder Society, vol. 23, no. 7, pp. 985–992, 2008

  15. [15]

    Novel speech signal processing algorithms for high-accuracy classification of parkinson’s disease,

    A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech signal processing algorithms for high-accuracy classification of parkinson’s disease,” IEEE transactions on biomedical engineering, vol. 59, no. 5, pp. 1264–1271, 2012

  16. [16]

    Machine learning for the diagnosis of parkinson’s disease: a review of literature,

    J. Mei, C. Desrosiers, and J. Frasnelli, “Machine learning for the diagnosis of parkinson’s disease: a review of literature,” Frontiers in aging neuroscience, vol. 13, p. 633752, 2021

  17. [17]

    Machine learning models for parkinson disease: systematic review,

    T. Tabashum, R. C. Snyder, M. K. O’Brien, and M. V. Albert, “Machine learning models for parkinson disease: systematic review,” JMIR medical informatics, vol. 12, no. 1, p. e50117, 2024

  18. [18]

    A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform,

    C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin, “A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform,” Applied Soft Computing, vol. 74, pp. 255–263, 2019

  19. [19]

    Automatic detection of parkinson’s disease in running speech spoken in three different languages,

    J. R. Orozco-Arroyave, F. Hönig, J. Arias-Londoño, J. Vargas-Bonilla, K. Daqrouq, S. Skodda, J. Rusz, and E. Nöth, “Automatic detection of parkinson’s disease in running speech spoken in three different languages,” The Journal of the Acoustical Society of America, vol. 139, no. 1, pp. 481–500, 2016

  20. [20]

    Cnn-based identification of parkinson’s disease from continuous speech in noisy environments,

    P. Faragó, S.-A. Ștefănigă, C.-G. Cordoș, L.-I. Mihăilă, S. Hintea, A.-S. Peștean, M. Beyer, L. Perju-Dumbravă, and R. R. Ileșan, “Cnn-based identification of parkinson’s disease from continuous speech in noisy environments,” Bioengineering, vol. 10, no. 5, p. 531, 2023

  21. [21]

    A multitask learning approach to assess the dysarthria severity in patients with parkinson’s disease

    J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, and E. Nöth, “A multitask learning approach to assess the dysarthria severity in patients with parkinson’s disease.” in Interspeech, 2018, pp. 456–460

  22. [22]

    Multimodal assessment of parkinson’s disease: a deep learning approach,

    J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, B. Eskofier, J. Klucken, and E. Nöth, “Multimodal assessment of parkinson’s disease: a deep learning approach,” IEEE journal of biomedical and health informatics, vol. 23, no. 4, pp. 1618–1630, 2018

  23. [23]

    Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,

    J. C. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Noeth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of communication disorders, vol. 76, pp. 21–36, 2018

  24. [24]

    Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease,

    J. Rusz, R. Cmejla, H. Ruzickova, and E. Ruzicka, “Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease,” The journal of the Acoustical Society of America, vol. 129, no. 1, pp. 350–367, 2011

  25. [25]

    Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder,

    J. Hlavnicka, R. Cmejla, T. Tykalova, K. Sonka, E. Ruzicka, and J. Rusz, “Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder,” Scientific reports, vol. 7, no. 1, p. 12, 2017

  26. [26]

    New spanish speech corpus database for the analysis of people suffering from parkinson’s disease

    J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. Gonzalez-Rátiva, and E. Nöth, “New spanish speech corpus database for the analysis of people suffering from parkinson’s disease.” in Lrec, vol. 14, 2014, pp. 342–347

  27. [27]

    BenSParX: A robust explainable machine learning framework for parkinson’s disease detection from bengali conversational speech,

    R. Hossain, M. A. Kabir, A. I. G. Mowla, A. C. Roy, and R. K. Ghosh, “BenSParX: A robust explainable machine learning framework for parkinson’s disease detection from bengali conversational speech,” arXiv preprint arXiv:2505.12192, 2025

  28. [28]

    Prompting and fine-tuning large language models for parkinson disease diagnosis: Comparative evaluation study using the ppmi structured dataset,

    H.-J. Shin, Y. J. Jeong, S. Jun, and D.-Y. Kang, “Prompting and fine-tuning large language models for parkinson disease diagnosis: Comparative evaluation study using the ppmi structured dataset,” JMIR Medical Informatics, vol. 14, p. e77561, 2026

  29. [29]

    A llms-assisted framework for parkinson’s disease assessment based on ppmi dataset,

    Z. Gao, Q. Ni, W. Liu, and L. Zhang, “A llms-assisted framework for parkinson’s disease assessment based on ppmi dataset,” in 2024 7th International conference on algorithms, computing and artificial intelligence (ACAI). IEEE, 2024, pp. 1–5

  30. [30]

    Detecting neuropsychiatric fluctuations in parkinson’s disease using patients’ own words: the potential of large language models,

    M. Castelli, M. Sousa, I. Vojtech, M. Single, D. Amstutz, M. E. Maradan-Gachet, A. D. Magalhães, I. Debove, J. Rusz, P. Martinez-Martin et al., “Detecting neuropsychiatric fluctuations in parkinson’s disease using patients’ own words: the potential of large language models,” npj Parkinson’s Disease, vol. 11, no. 1, p. 79, 2025

  31. [31]

    Leveraging large language models for personalized parkinson’s disease treatment,

    R. Zhang, G. Xie, J. Ying, and Z. Hua, “Leveraging large language models for personalized parkinson’s disease treatment,” IEEE journal of biomedical and health informatics, 2025

  32. [32]

    Parka ai: A sensor-integrated mobile application for parkinson’s disease monitoring and self-management,

    K. S. Bhalala and H. Mansoor, “Parka ai: A sensor-integrated mobile application for parkinson’s disease monitoring and self-management,” Bioengineering, vol. 12, no. 10, p. 1059, 2025

  33. [33]

    Autohealth: Advanced llm-empowered wearable personalized medical butler for parkinson’s disease management,

    L. Cardenas, K. Parajes, M. Zhu, and S. Zhai, “Autohealth: Advanced llm-empowered wearable personalized medical butler for parkinson’s disease management,” in 2024 IEEE 14th annual computing and communication workshop and conference (CCWC). IEEE, 2024, pp. 0375–0379

  34. [34]

    Llms for the engineering of a parkinson disease monitoring and alerting ontology

    G. Bouchouras, P. Bitilis, K. Kotis, and G. A. Vouros, “Llms for the engineering of a parkinson disease monitoring and alerting ontology.” in ESWC workshops, 2024

  35. [35]

    Zero-shot cognitive impairment detection from speech using audiollm,

    M. Shahin, B. Ahmed, and J. Epps, “Zero-shot cognitive impairment detection from speech using audiollm,” arXiv preprint arXiv:2506.17351, 2025

  36. [36]

    A survey on speech large language models for understanding,

    J. Peng, Y. Wang, B. Li, Y. Guo, H. Wang, Y. Fang, Y. Xi, H. Li, X. Li, K. Zhang et al., “A survey on speech large language models for understanding,” IEEE Journal of Selected Topics in Signal Processing, 2025

  37. [37]

    Opensmile: the munich versatile and fast open-source audio feature extractor,

    F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462

  38. [38]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

    G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012

  39. [39]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

  40. [40]

    Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls,

    H. Jaeger, D. Trivedi, and M. Stadtschnitzer, “Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls,” Zenodo, 2019

  41. [41]

    Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system,

    G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system,” Ieee Access, vol. 5, pp. 22 199–22 208, 2017

  42. [42]

    Neurovoz: a castillian spanish corpus of parkinsonian speech,

    J. Mendes-Laureano, J. A. Gómez-García, A. Guerrero-López, E. Luque-Buzo, J. D. Arias-Londoño, F. J. Grandas-Pérez, and J. I. Godino-Llorente, “Neurovoz: a castillian spanish corpus of parkinsonian speech,” Scientific Data, vol. 11, no. 1, p. 1367, 2024

  43. [43]

    Introducing meta llama 3: The most capable openly available llm to date,

    Meta AI, “Introducing meta llama 3: The most capable openly available llm to date,” 2024, technical report. [Online]. Available: https://ai.meta.com/blog/meta-llama-3/

  44. [44]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  45. [45]

    Pengi: An audio language model for audio tasks,

    S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023

  46. [46]

    Audio-reasoner: Improving reasoning capability in large audio language models,

    Z. Xie, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025

  47. [47]

    Ethics of large language models in medicine and medical research,

    H. Li, J. T. Moon, S. Purkayastha, L. A. Celi, H. Trivedi, and J. W. Gichoya, “Ethics of large language models in medicine and medical research,” The Lancet Digital Health, vol. 5, no. 6, pp. e333–e335, 2023

  48. [48]

    A future role for health applications of large language models depends on regulators enforcing safety standards,

    O. Freyer, I. C. Wiest, J. N. Kather, and S. Gilbert, “A future role for health applications of large language models depends on regulators enforcing safety standards,” The Lancet Digital Health, vol. 6, no. 9, pp. e662–e672, 2024

  49. [49]

    From text to treatment: the crucial role of validation for generative large language models in health care,

    A. de Hond, T. Leeuwenberg, R. Bartels, M. van Buchem, I. Kant, K. G. Moons, and M. van Smeden, “From text to treatment: the crucial role of validation for generative large language models in health care,” The Lancet Digital Health, vol. 6, no. 7, pp. e441–e443, 2024

  50. [50]

    Ethical and regulatory challenges of large language models in medicine,

    J. C. L. Ong, S. Y.-H. Chang, W. William, A. J. Butte, N. H. Shah, L. S. T. Chew, N. Liu, F. Doshi-Velez, W. Lu, J. Savulescu, and D. S. W. Ting, “Ethical and regulatory challenges of large language models in medicine,” The Lancet Digital Health, vol. 6, no. 6, pp. e428–e432, 2024