Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Muhammad Ashad Kabir; Sirajam Munira

arXiv: 2605.24806 · v1 · pith:GZ4YKI23new · submitted 2026-05-24 · 💻 cs.SD · cs.AI · eess.AS

Pith reviewed 2026-06-30 00:13 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS

keywords Parkinson's disease detection · zero-shot learning · speech analysis · large language models · audio models · input modalities · multilingual evaluation · acoustic features

The pith

Experiments show handcrafted acoustic features deliver steadier zero-shot Parkinson's detection from speech than raw audio in low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the form of speech input changes how well large models can detect Parkinson's disease without any training examples. It pits handcrafted acoustic features fed into a general language model against raw audio waveforms fed into audio-specialized models. Results across four languages indicate that accuracy shifts with the input type, the speaking task, and the language. Handcrafted features give more consistent results in a low-resource language like Bengali, while raw audio produces gains only on certain datasets. This distinction matters because reliable zero-shot methods could support diagnosis in languages where labeled medical speech data is scarce.

Core claim

The paper establishes that zero-shot Parkinson's disease detection from speech yields performance that depends on input modality, with handcrafted acoustic features analyzed by a general-purpose LLM providing more stable results in low-resource languages such as Bengali, while direct waveform input to audio models produces dataset-dependent improvements.

What carries the argument

The systematic comparison of two input modalities for zero-shot inference: handcrafted acoustic features processed by a general-purpose LLM versus raw audio waveforms processed by audio-capable large models.
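To make the feature-based route concrete, here is a minimal sketch of how a few handcrafted acoustic measurements might be serialized into a zero-shot text prompt and how a free-text reply might be mapped back to a label. The feature names (`jitter_local`, `shimmer_local`, `hnr_db`), the prompt wording, the PD/HC label set, and both helper functions are assumptions made for illustration; this summary does not give the paper's actual feature set or prompt.

```python
import re
from typing import Mapping


def build_prompt(features: Mapping[str, float], task: str) -> str:
    """Serialize acoustic features into a zero-shot text prompt.

    The wording and the PD/HC label set are hypothetical, not the paper's.
    """
    # Sort by name so the same features always yield the same prompt.
    lines = [f"- {name}: {value:.4f}" for name, value in sorted(features.items())]
    return (
        f"Acoustic features extracted from a {task} recording:\n"
        + "\n".join(lines)
        + "\nAnswer with exactly one label: PD or HC."
    )


def parse_label(response: str) -> str:
    """Map a free-text model reply to 'PD', 'HC', or 'UNKNOWN'."""
    found = set(re.findall(r"\b(PD|HC)\b", response.upper()))
    # Ambiguous or empty replies are not forced into a class.
    return found.pop() if len(found) == 1 else "UNKNOWN"


# Hypothetical feature names and values, for illustration only.
example = {"jitter_local": 0.0123, "shimmer_local": 0.0456, "hnr_db": 18.5}
prompt = build_prompt(example, "sustained vowel")
```

The raw-audio route has no equivalent serialization step, since the waveform goes to the audio model directly. That asymmetry is part of why the two routes are hard to compare on equal terms.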

If this is right

Zero-shot detection accuracy is not fixed but changes with the choice between handcrafted features and raw audio.
In low-resource languages, handcrafted acoustic features produce more reliable outcomes than raw waveforms.
The benefit of each modality also depends on which speech task is recorded.
Cross-lingual evaluation is required to determine when zero-shot methods can be applied safely.
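One way to make "more reliable" or "more stable" measurable is to compare the spread of a metric across conditions (tasks or datasets) for each pipeline. The sketch below reports the mean, population standard deviation, and range. The choice of these statistics is an assumption of this example, not a definition taken from the paper, and the scores are placeholders rather than reported results.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def stability_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize how much a metric varies across evaluation conditions."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std: the conditions are the full set
        "range": max(scores) - min(scores),
    }


# Placeholder accuracies per condition, NOT values from the paper.
placeholder_features = [0.60, 0.62, 0.58]
placeholder_audio = [0.45, 0.70, 0.55]

features_summary = stability_summary(placeholder_features)
audio_summary = stability_summary(placeholder_audio)
```

A lower `std` or `range` at a similar `mean` is what a claim of stability would look like under this definition. A pipeline that is stable but near chance level would still score well on spread alone, so the mean has to be read alongside it.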

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar modality comparisons could be tested on other neurological conditions detectable from speech.
Preprocessing pipelines that convert speech to features may be the safer starting point for multilingual medical screening tools.
Dataset-specific tuning of audio models might reduce the observed variability if applied consistently.

Load-bearing premise

The measured performance gaps between modalities stem solely from the input format itself rather than from differences in the underlying models, dataset biases, or recording conditions.

What would settle it

Re-running the four-language experiments using the exact same model architecture for both feature-based and waveform inputs on matched datasets would falsify the claim if performance differences disappear.
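If both input routes were scored on the same recordings, a paired test would show whether their disagreements are larger than chance. The following is a minimal sketch of an exact McNemar test on paired predictions. Using McNemar here is a suggestion of this write-up, not a procedure described in the paper, and the function name is hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same samples."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    # Discordant pairs: exactly one of the two pipelines is correct.
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the pipelines never disagree on correctness
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

The test only uses recordings where the two pipelines disagree, so it needs per-sample predictions rather than aggregate accuracies.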

Figures

Figures reproduced from arXiv: 2605.24806 by Muhammad Ashad Kabir, Sirajam Munira.

**Figure 1.** Schematic overview of the zero-shot pipeline for PD detection using LLMs and LALMs.

Original abstract

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper compares handcrafted features via LLM against raw audio via audio models for zero-shot PD detection across languages, but the setup mixes modality with model family differences.

The main thing to know is that this paper runs an empirical comparison of two input modalities for zero-shot Parkinson's detection from speech in four languages, reporting that handcrafted acoustic features give more stable results in Bengali while raw audio shows dataset-dependent gains.

What is new is the systematic cross-language look at these specific input choices for this task, including the low-resource language angle. The work does a straightforward job of testing the modalities on multiple datasets and speech tasks, and that empirical angle on modality stability is a modest but real addition.

The soft spot is exactly the one in the stress-test note. Features go through a general LLM while waveforms go through separate audio models, so any performance differences cannot be cleanly pinned on input type alone. Model pretraining, tokenization, and capabilities are confounded with the modality variable. Cross-language recording conditions and dataset biases add further unseparated factors. The abstract also gives no sample sizes, statistical tests, or setup details, which leaves the claims hard to evaluate.

This is for researchers working on zero-shot speech diagnostics or modality choices in audio ML. Someone already focused on PD detection or low-resource languages might pull a useful data point from it, but the confounding limits how much to rely on the modality conclusions.

I would send it to peer review because the topic is practical and the experiments are accessible, provided the full paper supplies the missing controls and stats.

Referee Report

1 major / 2 minor

Summary. The paper claims that zero-shot PD detection from speech using large models shows performance varying across input modalities (handcrafted acoustic features fed to a general-purpose LLM versus raw waveform input to audio-capable models), speech tasks, and languages. Handcrafted features yield more stable results in low-resource languages such as Bengali, while audio input produces dataset-dependent gains, based on experiments across four languages.

Significance. If the central empirical claims hold after controlling for confounds, the work would contribute to understanding modality selection for zero-shot speech-based clinical detection tasks, with particular value for low-resource language stability. The study is an empirical comparison without fitted parameters or derivations, allowing direct falsification via replication.

major comments (1)

[Experimental Setup] The experimental design routes handcrafted features through a general-purpose LLM while routing waveforms through separate audio models, without an ablation that holds the underlying model fixed and varies only the input representation. This confounds attribution of differences (including Bengali stability) to modality alone rather than model family, pretraining, or tokenization. This is load-bearing for the abstract claim that performance varies across input modalities.

minor comments (2)

[Abstract] The abstract states findings without details on sample sizes, statistical tests, recording condition controls, or exact model versions used.
Add a summary table of all metrics, tasks, and languages with confidence intervals to improve clarity of the cross-condition comparisons.
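As an illustration of the confidence intervals requested above, here is a minimal percentile-bootstrap sketch for accuracy that uses only the standard library. The function name, the default of 2000 resamples, and the fixed seed are assumptions of this example; the paper's actual evaluation procedure is not described in this summary.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    point = sum(correct) / n
    # Resample per-recording correctness with replacement.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

With small clinical datasets the resulting intervals tend to be wide, which is useful context when judging whether one pipeline really beats another. If a speaker contributes several recordings, resampling by speaker rather than by recording would be the safer variant.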

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important aspect of our experimental design. Below we respond directly to the major comment and outline the revisions we will make.

Referee: [Experimental Setup] The experimental design routes handcrafted features through a general-purpose LLM while routing waveforms through separate audio models, without an ablation that holds the underlying model fixed and varies only the input representation. This confounds attribution of differences (including Bengali stability) to modality alone rather than model family, pretraining, or tokenization. This is load-bearing for the abstract claim that performance varies across input modalities.

Authors: We agree that the design compares two distinct practical pipelines rather than isolating input representation while holding the model constant. Handcrafted acoustic features are conventionally processed by text-based LLMs, while raw waveforms require audio-specific models; a direct swap is not straightforward without additional engineering that would itself introduce new variables. Our goal was to evaluate these commonly deployed approaches for zero-shot PD detection. In the revised manuscript we will (1) rephrase the abstract and introduction to describe the comparison as being between the two pipelines, (2) add an explicit limitations subsection noting that model family, pretraining data, and tokenization are confounded with modality, and (3) qualify the Bengali stability result as an observation within the handcrafted-feature + LLM pipeline rather than a pure modality effect. These textual changes will be made; no new experiments are planned for this revision.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical modality comparison with no derivations or self-referential reductions

full rationale

The paper is an empirical study that reports experimental results from comparing handcrafted acoustic features fed to an LLM versus raw waveforms fed to audio models on PD speech datasets in four languages. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described claims. Performance differences are presented as observed outcomes rather than derived quantities, so no step reduces to its own inputs by construction. The central claim remains independent of any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning study with no free parameters, axioms, or invented entities in a mathematical sense; all claims are based on experimental observations.

Reference graph

Works this paper leans on

50 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Introduction Parkinson’s disease (PD) is a progressive neurodegenerative disorder characterized by both motor and non-motor impairments, including bradykinesia, rigidity, tremor, cognitive decline, mood disorders, and autonomic dysfunction [1]. Globally, PD affects more than 10 million people [2] and represents the fastest-growing neurological disor...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

The workflow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference

Methodology Figure 1 provides an overview of the proposed zero-shot pipeline for PD screening using LLMs and LALMs. The workflow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference. 2.1. Datasets To investigate how input modality influences the zero-shot in...
[3]

Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework

Experiments and Results 3.1. Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework. LLaMA 3 (8B) and Qwen2-Audio (7B-Instruct) were obtained from the Hugging Face repository. Pengi and Audio-Reasoner were implemented from their official repositories with default inference configurations. All ex...
[4]

handcrafted acoustic features) influence how zero-shot LLM systems process and interpret speech-based clinical signals

Discussion The results from this study suggest that input modalities (direct audio waveform vs. handcrafted acoustic features) influence how zero-shot LLM systems process and interpret speech-based clinical signals. Rather than reflecting a uniform performance hierarchy, the observed patterns indicate that different input modalities interact differ...
[5]

Conclusion This study examines the impacts of zero-shot speech-based PD detection, comparing handcrafted acoustic features analyzed by a text-based LLM with raw waveform input processed by LALMs. Across four datasets in four different languages under a unified evaluation protocol, we observed that model performance is modality-dependent: feature-based p...
[6]

Generative AI Use Disclosure ChatGPT (version 5.2, OpenAI) was used for language editing and refinement of the manuscript
[7]

Parkinson’s disease,

B. R. Bloem, M. S. Okun, and C. Klein, “Parkinson’s disease,” The Lancet, vol. 397, no. 10291, pp. 2284–2303, 2021

2021
[8]

Statistics on parkinson’s disease,

Parkinson’s Foundation, “Statistics on parkinson’s disease,” 2022. [Online]. Available: https://www.parkinson.org/understanding-parkinsons/statistics

2022
[9]

Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021,

J. D. Steinmetz, K. M. Seeher, N. Schiess, E. Nichols, B. Cao, C. Servili, V. Cavallera, E. Cousin, H. Hagins, M. E. Moberg et al., “Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021,” The Lancet Neurology, vol. 23, no. 4, pp. 344–381, 2024

2024
[10]

Mds clinical diagnostic criteria for parkinson’s disease,

R. B. Postuma, D. Berg, M. Stern, W. Poewe, C. W. Olanow, W. Oertel, J. Obeso, K. Marek, I. Litvan, A. E. Lang et al., “Mds clinical diagnostic criteria for parkinson’s disease,” Movement disorders, vol. 30, no. 12, pp. 1591–1601, 2015

2015
[11]

Speech treatment for parkinson’s disease,

L. O. Ramig, C. Fox, and S. Sapir, “Speech treatment for parkinson’s disease,” Expert review of neurotherapeutics, vol. 8, no. 2, pp. 297–309, 2008

2008
[12]

Speech impairment in a large sample of patients with parkinson’s disease,

A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, and S. Gates, “Speech impairment in a large sample of patients with parkinson’s disease,” Behavioural neurology, vol. 11, no. 3, pp. 131–137, 1999

1999
[13]

Suitability of dysphonia measurements for telemonitoring of parkinson’s disease,

M. Little, P. McSharry, E. Hunter, J. Spielman, and L. Ramig, “Suitability of dysphonia measurements for telemonitoring of parkinson’s disease,” Nature Precedings, pp. 1–1, 2008

2008
[14]

Speech rate and rhythm in parkinson’s disease,

S. Skodda and U. Schlegel, “Speech rate and rhythm in parkinson’s disease,” Movement disorders: official journal of the Movement Disorder Society, vol. 23, no. 7, pp. 985–992, 2008

2008
[15]

Novel speech signal processing algorithms for high-accuracy classification of parkinson’s disease,

A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech signal processing algorithms for high-accuracy classification of parkinson’s disease,” IEEE transactions on biomedical engineering, vol. 59, no. 5, pp. 1264–1271, 2012

2012
[16]

Machine learning for the diagnosis of parkinson’s disease: a review of literature,

J. Mei, C. Desrosiers, and J. Frasnelli, “Machine learning for the diagnosis of parkinson’s disease: a review of literature,” Frontiers in aging neuroscience, vol. 13, p. 633752, 2021

2021
[17]

Machine learning models for parkinson disease: systematic review,

T. Tabashum, R. C. Snyder, M. K. O’Brien, and M. V. Albert, “Machine learning models for parkinson disease: systematic review,” JMIR medical informatics, vol. 12, no. 1, p. e50117, 2024

2024
[18]

A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform,

C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin, “A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform,” Applied Soft Computing, vol. 74, pp. 255–263, 2019

2019
[19]

Automatic detection of parkinson’s disease in running speech spoken in three different languages,

J. R. Orozco-Arroyave, F. Hönig, J. Arias-Londoño, J. Vargas-Bonilla, K. Daqrouq, S. Skodda, J. Rusz, and E. Nöth, “Automatic detection of parkinson’s disease in running speech spoken in three different languages,” The Journal of the Acoustical Society of America, vol. 139, no. 1, pp. 481–500, 2016

2016
[20]

Cnn-based identification of parkinson’s disease from continuous speech in noisy environments,

P. Faragó, S.-A. Ștefănigă, C.-G. Cordoș, L.-I. Mihăilă, S. Hintea, A.-S. Peștean, M. Beyer, L. Perju-Dumbravă, and R. R. Ileșan, “Cnn-based identification of parkinson’s disease from continuous speech in noisy environments,” Bioengineering, vol. 10, no. 5, p. 531, 2023

2023
[21]

A multitask learning approach to assess the dysarthria severity in patients with parkinson’s disease

J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, and E. Nöth, “A multitask learning approach to assess the dysarthria severity in patients with parkinson’s disease.” in Interspeech, 2018, pp. 456–460

2018
[22]

Multimodal assessment of parkinson’s disease: a deep learning approach,

J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, B. Eskofier, J. Klucken, and E. Nöth, “Multimodal assessment of parkinson’s disease: a deep learning approach,” IEEE journal of biomedical and health informatics, vol. 23, no. 4, pp. 1618–1630, 2018

2018
[23]

Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,

J. C. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Noeth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of communication disorders, vol. 76, pp. 21–36, 2018

2018
[24]

Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease,

J. Rusz, R. Cmejla, H. Ruzickova, and E. Ruzicka, “Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease,” The journal of the Acoustical Society of America, vol. 129, no. 1, pp. 350–367, 2011

2011
[25]

Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder,

J. Hlavnicka, R. Cmejla, T. Tykalova, K. Sonka, E. Ruzicka, and J. Rusz, “Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder,” Scientific reports, vol. 7, no. 1, p. 12, 2017

2017
[26]

New spanish speech corpus database for the analysis of people suffering from parkinson’s disease

J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. Gonzalez-Rátiva, and E. Nöth, “New spanish speech corpus database for the analysis of people suffering from parkinson’s disease.” in Lrec, vol. 14, 2014, pp. 342–347

2014
[27]

BenSParX: A robust explainable machine learning framework for parkinson’s disease detection from bengali conversational speech,

R. Hossain, M. A. Kabir, A. I. G. Mowla, A. C. Roy, and R. K. Ghosh, “BenSParX: A robust explainable machine learning framework for parkinson’s disease detection from bengali conversational speech,” arXiv preprint arXiv:2505.12192, 2025

work page arXiv 2025
[28]

Prompting and fine-tuning large language models for parkinson disease diagnosis: Comparative evaluation study using the ppmi structured dataset,

H.-J. Shin, Y. J. Jeong, S. Jun, and D.-Y. Kang, “Prompting and fine-tuning large language models for parkinson disease diagnosis: Comparative evaluation study using the ppmi structured dataset,” JMIR Medical Informatics, vol. 14, p. e77561, 2026

2026
[29]

A llms-assisted framework for parkinson’s disease assessment based on ppmi dataset,

Z. Gao, Q. Ni, W. Liu, and L. Zhang, “A llms-assisted framework for parkinson’s disease assessment based on ppmi dataset,” in 2024 7th International conference on algorithms, computing and artificial intelligence (ACAI). IEEE, 2024, pp. 1–5

2024
[30]

Detecting neuropsychiatric fluctuations in parkinson’s disease using patients’ own words: the potential of large language models,

M. Castelli, M. Sousa, I. Vojtech, M. Single, D. Amstutz, M. E. Maradan-Gachet, A. D. Magalhães, I. Debove, J. Rusz, P. Martinez-Martin et al., “Detecting neuropsychiatric fluctuations in parkinson’s disease using patients’ own words: the potential of large language models,” npj Parkinson’s Disease, vol. 11, no. 1, p. 79, 2025

2025
[31]

Leveraging large language models for personalized parkinson’s disease treatment,

R. Zhang, G. Xie, J. Ying, and Z. Hua, “Leveraging large language models for personalized parkinson’s disease treatment,” IEEE journal of biomedical and health informatics, 2025

2025
[32]

Parka ai: A sensor-integrated mobile application for parkinson’s disease monitoring and self-management,

K. S. Bhalala and H. Mansoor, “Parka ai: A sensor-integrated mobile application for parkinson’s disease monitoring and self-management,” Bioengineering, vol. 12, no. 10, p. 1059, 2025

2025
[33]

Autohealth: Advanced llm-empowered wearable personalized medical butler for parkinson’s disease management,

L. Cardenas, K. Parajes, M. Zhu, and S. Zhai, “Autohealth: Advanced llm-empowered wearable personalized medical butler for parkinson’s disease management,” in 2024 IEEE 14th annual computing and communication workshop and conference (CCWC). IEEE, 2024, pp. 0375–0379

2024
[34]

Llms for the engineering of a parkinson disease monitoring and alerting ontology

G. Bouchouras, P. Bitilis, K. Kotis, and G. A. Vouros, “Llms for the engineering of a parkinson disease monitoring and alerting ontology.” in ESWC workshops, 2024

2024
[35]

Zero-shot cognitive impairment detection from speech using audiollm,

M. Shahin, B. Ahmed, and J. Epps, “Zero-shot cognitive impairment detection from speech using audiollm,” arXiv preprint arXiv:2506.17351, 2025

work page arXiv 2025
[36]

A survey on speech large language models for understanding,

J. Peng, Y. Wang, B. Li, Y. Guo, H. Wang, Y. Fang, Y. Xi, H. Li, X. Li, K. Zhang et al., “A survey on speech large language models for understanding,” IEEE Journal of Selected Topics in Signal Processing, 2025

2025
[37]

Opensmile: the munich versatile and fast open-source audio feature extractor,

F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462

2010
[38]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012

2012
[39]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

2020
[40]

Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls,

H. Jaeger, D. Trivedi, and M. Stadtschnitzer, “Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls,” Zenodo, 2019

2019
[41]

Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system,

G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system,” Ieee Access, vol. 5, pp. 22 199–22 208, 2017

2017
[42]

Neurovoz: a castillian spanish corpus of parkinsonian speech,

J. Mendes-Laureano, J. A. Gómez-García, A. Guerrero-López, E. Luque-Buzo, J. D. Arias-Londoño, F. J. Grandas-Pérez, and J. I. Godino-Llorente, “Neurovoz: a castillian spanish corpus of parkinsonian speech,” Scientific Data, vol. 11, no. 1, p. 1367, 2024

2024
[43]

Introducing meta llama 3: The most capable openly available llm to date,

Meta AI, “Introducing meta llama 3: The most capable openly available llm to date,” 2024, technical report. [Online]. Available: https://ai.meta.com/blog/meta-llama-3/

2024
[44]

Qwen2-Audio Technical Report

Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Pengi: An audio language model for audio tasks,

S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023

2023
[46]

Audio-reasoner: Improving reasoning capability in large audio language models,

Z. Xie, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025

work page arXiv 2025
[47]

Ethics of large language models in medicine and medical research,

H. Li, J. T. Moon, S. Purkayastha, L. A. Celi, H. Trivedi, and J. W. Gichoya, “Ethics of large language models in medicine and medical research,” The Lancet Digital Health, vol. 5, no. 6, pp. e333–e335, 2023

2023
[48]

A future role for health applications of large language models depends on regulators enforcing safety standards,

O. Freyer, I. C. Wiest, J. N. Kather, and S. Gilbert, “A future role for health applications of large language models depends on regulators enforcing safety standards,” The Lancet Digital Health, vol. 6, no. 9, pp. e662–e672, 2024

2024
[49]

From text to treatment: the crucial role of validation for generative large language models in health care,

A. de Hond, T. Leeuwenberg, R. Bartels, M. van Buchem, I. Kant, K. G. Moons, and M. van Smeden, “From text to treatment: the crucial role of validation for generative large language models in health care,” The Lancet Digital Health, vol. 6, no. 7, pp. e441–e443, 2024

2024
[50]

Ethical and regulatory challenges of large language models in medicine,

J. C. L. Ong, S. Y.-H. Chang, W. William, A. J. Butte, N. H. Shah, L. S. T. Chew, N. Liu, F. Doshi-Velez, W. Lu, J. Savulescu, and D. S. W. Ting, “Ethical and regulatory challenges of large language models in medicine,” The Lancet Digital Health, vol. 6, no. 6, pp. e428–e432, 2024

2024

[1] [1]

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Introduction Parkinson’s disease (PD) is a progressive neurodegenerative disorder characterized by both motor and non-motor impair- ments, including bradykinesia, rigidity, tremor, cognitive de- cline, mood disorders, and autonomic dysfunction [1]. Glob- ally, PD affects more than 10 million people [2] and represents the fastest-growing neurological disor...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

The work- flow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference

Methodology Figure 1 provides an overview of the proposed zero-shot pipeline for PD Screening using LLMs and LALMs. The work- flow consists of four main steps: (i) dataset preprocessing, (ii) extracting handcrafted features, (iii) prompt construction, and (iv) zero-shot inference. 2.1. Datasets To investigate how input modality influences the zero-shot in...

[3] [3]

Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework

Experiments and Results 3.1. Experiments We evaluated four large-language and audio-language models under a unified zero-shot framework. LLaMA 3 (8B) 1 and Qwen2-Audio (7B-Instruct)2 were obtained from the Hugging Face repository. Pengi 3 and Audio-Reasoner 4 were imple- mented from their official repositories with default inference configurations. All ex...

2070

[4] [4]

handcrafted acoustic features) influ- ence how zero-shot LLM systems process and interpret speech- based clinical signals

Discussion The results from this study suggest that input modalities (di- rect audio waveform vs. handcrafted acoustic features) influ- ence how zero-shot LLM systems process and interpret speech- based clinical signals. Rather than reflecting a uniform perfor- mance hierarchy, the observed patterns indicate that different input modalities interact differ...

[5] [5]

Conclusion This study examines the impacts of zero-shot speech-based PD detection, comparing handcrafted acoustic features analyzed by a text-based LLM with raw waveform input processed by LALMs. Across four datasets in four different languages under a unified evaluation protocol, we observed that model perfor- mance is modality-dependent: feature-based p...

[6] [6]

Generative AI Use Disclosure ChatGPT (version 5.2, OpenAI) was used for language editing and refinement of the manuscript

[7] [7]

Parkinson’s disease,

B. R. Bloem, M. S. Okun, and C. Klein, “Parkinson’s disease,” The Lancet, vol. 397, no. 10291, pp. 2284–2303, 2021

2021

[8] [8]

Statistics on parkinson’s disease,

P. Foundation, “Statistics on parkinson’s disease,” 2022. [Online]. Available: https://www.parkinson.org/understanding-parkinsons/ statistics

2022

[9] J. D. Steinmetz, K. M. Seeher, N. Schiess, E. Nichols, B. Cao, C. Servili, V. Cavallera, E. Cousin, H. Hagins, M. E. Moberg et al., “Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021,” The Lancet Neurology, vol. 23, no. 4, pp. 344–381, 2024.

[10] R. B. Postuma, D. Berg, M. Stern, W. Poewe, C. W. Olanow, W. Oertel, J. Obeso, K. Marek, I. Litvan, A. E. Lang et al., “Mds clinical diagnostic criteria for parkinson’s disease,” Movement disorders, vol. 30, no. 12, pp. 1591–1601, 2015.

[11] L. O. Ramig, C. Fox, and S. Sapir, “Speech treatment for parkinson’s disease,” Expert review of neurotherapeutics, vol. 8, no. 2, pp. 297–309, 2008.

[12] A. K. Ho, R. Iansek, C. Marigliani, J. L. Bradshaw, and S. Gates, “Speech impairment in a large sample of patients with parkinson’s disease,” Behavioural neurology, vol. 11, no. 3, pp. 131–137, 1999.

[13] M. Little, P. McSharry, E. Hunter, J. Spielman, and L. Ramig, “Suitability of dysphonia measurements for telemonitoring of parkinson’s disease,” Nature Precedings, pp. 1–1, 2008.

[14] S. Skodda and U. Schlegel, “Speech rate and rhythm in parkinson’s disease,” Movement disorders: official journal of the Movement Disorder Society, vol. 23, no. 7, pp. 985–992, 2008.

[15] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech signal processing algorithms for high-accuracy classification of parkinson’s disease,” IEEE transactions on biomedical engineering, vol. 59, no. 5, pp. 1264–1271, 2012.

[16] J. Mei, C. Desrosiers, and J. Frasnelli, “Machine learning for the diagnosis of parkinson’s disease: a review of literature,” Frontiers in aging neuroscience, vol. 13, p. 633752, 2021.

[17] T. Tabashum, R. C. Snyder, M. K. O’Brien, and M. V. Albert, “Machine learning models for parkinson disease: systematic review,” JMIR medical informatics, vol. 12, no. 1, p. e50117, 2024.

[18] C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin, “A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform,” Applied Soft Computing, vol. 74, pp. 255–263, 2019.

[19] J. R. Orozco-Arroyave, F. Hönig, J. Arias-Londoño, J. Vargas-Bonilla, K. Daqrouq, S. Skodda, J. Rusz, and E. Nöth, “Automatic detection of parkinson’s disease in running speech spoken in three different languages,” The Journal of the Acoustical Society of America, vol. 139, no. 1, pp. 481–500, 2016.

[20] P. Faragó, S.-A. Ștefănigă, C.-G. Cordoș, L.-I. Mihăilă, S. Hintea, A.-S. Peștean, M. Beyer, L. Perju-Dumbravă, and R. R. Ileșan, “Cnn-based identification of parkinson’s disease from continuous speech in noisy environments,” Bioengineering, vol. 10, no. 5, p. 531, 2023.

[21] J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, and E. Nöth, “A multitask learning approach to assess the dysarthria severity in patients with parkinson’s disease,” in Interspeech, 2018, pp. 456–460.

[22] J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, B. Eskofier, J. Klucken, and E. Nöth, “Multimodal assessment of parkinson’s disease: a deep learning approach,” IEEE journal of biomedical and health informatics, vol. 23, no. 4, pp. 1618–1630, 2018.

[23] J. C. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Noeth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of communication disorders, vol. 76, pp. 21–36, 2018.

[24] J. Rusz, R. Cmejla, H. Ruzickova, and E. Ruzicka, “Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease,” The journal of the Acoustical Society of America, vol. 129, no. 1, pp. 350–367, 2011.

[25] J. Hlavnicka, R. Cmejla, T. Tykalova, K. Sonka, E. Ruzicka, and J. Rusz, “Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder,” Scientific reports, vol. 7, no. 1, p. 12, 2017.

[26] J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, M. C. Gonzalez-Rátiva, and E. Nöth, “New spanish speech corpus database for the analysis of people suffering from parkinson’s disease,” in Lrec, vol. 14, 2014, pp. 342–347.

[27] R. Hossain, M. A. Kabir, A. I. G. Mowla, A. C. Roy, and R. K. Ghosh, “BenSParX: A robust explainable machine learning framework for parkinson’s disease detection from bengali conversational speech,” arXiv preprint arXiv:2505.12192, 2025.

[28] H.-J. Shin, Y. J. Jeong, S. Jun, and D.-Y. Kang, “Prompting and fine-tuning large language models for parkinson disease diagnosis: Comparative evaluation study using the ppmi structured dataset,” JMIR Medical Informatics, vol. 14, p. e77561, 2026.

[29] Z. Gao, Q. Ni, W. Liu, and L. Zhang, “A llms-assisted framework for parkinson’s disease assessment based on ppmi dataset,” in 2024 7th International conference on algorithms, computing and artificial intelligence (ACAI). IEEE, 2024, pp. 1–5.

[30] M. Castelli, M. Sousa, I. Vojtech, M. Single, D. Amstutz, M. E. Maradan-Gachet, A. D. Magalhães, I. Debove, J. Rusz, P. Martinez-Martin et al., “Detecting neuropsychiatric fluctuations in parkinson’s disease using patients’ own words: the potential of large language models,” npj Parkinson’s Disease, vol. 11, no. 1, p. 79, 2025.

[31] R. Zhang, G. Xie, J. Ying, and Z. Hua, “Leveraging large language models for personalized parkinson’s disease treatment,” IEEE journal of biomedical and health informatics, 2025.

[32] K. S. Bhalala and H. Mansoor, “Parka ai: A sensor-integrated mobile application for parkinson’s disease monitoring and self-management,” Bioengineering, vol. 12, no. 10, p. 1059, 2025.

[33] L. Cardenas, K. Parajes, M. Zhu, and S. Zhai, “Autohealth: Advanced llm-empowered wearable personalized medical butler for parkinson’s disease management,” in 2024 IEEE 14th annual computing and communication workshop and conference (CCWC). IEEE, 2024, pp. 0375–0379.

[34] G. Bouchouras, P. Bitilis, K. Kotis, and G. A. Vouros, “Llms for the engineering of a parkinson disease monitoring and alerting ontology,” in ESWC workshops, 2024.

[35] M. Shahin, B. Ahmed, and J. Epps, “Zero-shot cognitive impairment detection from speech using audiollm,” arXiv preprint arXiv:2506.17351, 2025.

[36] J. Peng, Y. Wang, B. Li, Y. Guo, H. Wang, Y. Fang, Y. Xi, H. Li, X. Li, K. Zhang et al., “A survey on speech large language models for understanding,” IEEE Journal of Selected Topics in Signal Processing, 2025.

[37] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462.

[38] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.

[39] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.

[40] H. Jaeger, D. Trivedi, and M. Stadtschnitzer, “Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls,” Zenodo, 2019.

[41] G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system,” Ieee Access, vol. 5, pp. 22 199–22 208, 2017.

[42] J. Mendes-Laureano, J. A. Gómez-García, A. Guerrero-López, E. Luque-Buzo, J. D. Arias-Londoño, F. J. Grandas-Pérez, and J. I. Godino-Llorente, “Neurovoz: a castillian spanish corpus of parkinsonian speech,” Scientific Data, vol. 11, no. 1, p. 1367, 2024.

[43] Meta AI, “Introducing meta llama 3: The most capable openly available llm to date,” 2024, technical report. [Online]. Available: https://ai.meta.com/blog/meta-llama-3/

[44] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024.

[45] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 090–18 108, 2023.

[46] Z. Xie, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025.

[47] H. Li, J. T. Moon, S. Purkayastha, L. A. Celi, H. Trivedi, and J. W. Gichoya, “Ethics of large language models in medicine and medical research,” The Lancet Digital Health, vol. 5, no. 6, pp. e333–e335, 2023.

[48] O. Freyer, I. C. Wiest, J. N. Kather, and S. Gilbert, “A future role for health applications of large language models depends on regulators enforcing safety standards,” The Lancet Digital Health, vol. 6, no. 9, pp. e662–e672, 2024.

[49] A. de Hond, T. Leeuwenberg, R. Bartels, M. van Buchem, I. Kant, K. G. Moons, and M. van Smeden, “From text to treatment: the crucial role of validation for generative large language models in health care,” The Lancet Digital Health, vol. 6, no. 7, pp. e441–e443, 2024.

[50] J. C. L. Ong, S. Y.-H. Chang, W. William, A. J. Butte, N. H. Shah, L. S. T. Chew, N. Liu, F. Doshi-Velez, W. Lu, J. Savulescu, and D. S. W. Ting, “Ethical and regulatory challenges of large language models in medicine,” The Lancet Digital Health, vol. 6, no. 6, pp. e428–e432, 2024.