SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Aina Merchant; Alex Mariakakis; Eyal de Lara; Larry Kieu; Sejal Bhalla

arxiv: 2606.17339 · v1 · pith:3YG364OUnew · submitted 2026-06-15 · 💻 cs.AI · cs.CL· cs.SD

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Sejal Bhalla , Larry Kieu , Aina Merchant , Eyal de Lara , Alex Mariakakis This is my paper

Pith reviewed 2026-06-27 03:05 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.SD

keywords clinical speech AIbenchmarkgeneralizationaudio encodersspeech production stagesmulti-task evaluationzero-shot transferhealth conditions

0 comments

The pith

No current audio representation generalizes reliably across the clinical speech landscape.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpeechDx as a benchmark covering 12 datasets and 27 tasks to move clinical speech AI beyond isolated, condition-specific studies. Tasks are grouped by the stage of speech production disrupted—conceptualization, formulation, or articulation—so that evaluation can target shared mechanisms rather than dataset quirks. Systematic testing of 12 audio encoders reveals that large-scale speech models give the best overall results, domain-specific models help only on closely matched tasks, and no representation performs consistently across conditions or datasets. This setup lets researchers track whether models are capturing clinically meaningful patterns instead of artefacts. If the claim holds, progress depends on building representations that transfer across the full range of health conditions rather than excelling on single tasks.

Core claim

SpeechDx shows that large-scale speech models are the strongest baselines overall, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape when tested on tasks with limited labels and the same condition across multiple datasets.

What carries the argument

SpeechDx benchmark, which organizes 27 tasks across 12 datasets by the stage of speech production disrupted (conceptualization, formulation, articulation) to test shared clinical mechanisms.

If this is right

Large-scale pretrained speech models should serve as the default starting point for new clinical speech applications.
Domain-specific adaptation yields gains only when the target task closely matches the adaptation data.
Cross-condition and cross-dataset transfer remains an open requirement for any general-purpose clinical speech system.
Future models can be compared directly on the same 27-task suite to measure genuine generalization gains.
The stage-based structure provides a way to diagnose which parts of the speech pipeline a representation fails to capture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that succeed on this benchmark could support diagnostic tools usable across multiple neurological and motor conditions without retraining per disease.
The benchmark could guide development of hybrid encoders that combine broad pretraining with targeted adaptation for underrepresented stages like conceptualization.
If generalization improves, clinical speech AI could shift from single-condition tools to integrated systems that monitor overall speech health.
Extending the benchmark with more languages or recording conditions would test whether the current generalization failures are language- or hardware-specific.

Load-bearing premise

Grouping tasks by the stage of speech production they disrupt lets evaluation measure shared clinical mechanisms rather than dataset-specific artefacts.

What would settle it

A single audio encoder achieving consistently high performance on all 27 tasks, including zero-shot cross-condition transfer and repeated conditions across different datasets, without performance dropping on any subset.

Figures

Figures reproduced from arXiv: 2606.17339 by Aina Merchant, Alex Mariakakis, Eyal de Lara, Larry Kieu, Sejal Bhalla.

**Figure 2.** Figure 2: The evaluation of zero-shot transfer for classification tasks. The top four grids show [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: The data efficiency of Qwen3-TTS-Tokenizer [ [PITH_FULL_IMAGE:figures/full_fig_p027_3.png] view at source ↗

read the original abstract

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpeechDx pulls together a useful multi-task benchmark for clinical speech but the stage grouping lacks validation and the results reporting is thin on details.

read the letter

The main point is that SpeechDx organizes 12 datasets into 27 tasks grouped by speech production stage (conceptualization, formulation, articulation) and runs 12 audio encoders across them, including zero-shot cross-condition tests. The headline result is that large pretrained models perform best overall, domain-specific ones only help on close matches, and nothing generalizes reliably across the set.

What stands out is the scale and the cross-dataset structure for the same conditions plus the limited-data setups. This moves past the usual single-condition papers and gives a practical way to check if models are picking up clinical signals or just dataset quirks. The systematic encoder comparison supplies a baseline that others can build on.

The soft spot is the stage grouping. The paper treats it as a way to evaluate shared mechanisms, but there is no description of how tasks were assigned to stages or any check that within-stage tasks show more similar patterns than across-stage ones. Without that, the generalization failures could trace to differences in conditions, recording setups, or label distributions instead. The abstract also presents conclusions without splits, error bars, or statistical tests, so the strength of the claims is difficult to assess.

This is for groups working on clinical speech AI or health monitoring from voice data who need a shared testbed. A reader focused on benchmarks or generalization would get concrete value from the task list and model rankings. It deserves peer review because the benchmark scope is substantial even if the stage validation and reporting need work.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpeechDx, a benchmark spanning 12 datasets and 27 tasks across health conditions, structured by speech production stages (conceptualization, formulation, articulation) to evaluate generalization of audio encoders. It systematically tests 12 state-of-the-art encoders under in-domain and zero-shot cross-condition settings, concluding that large-scale models are strongest overall, domain-specific models help only on matched tasks, and no current representation generalizes reliably across the clinical speech landscape.

Significance. If the benchmark's task grouping and evaluation protocol hold, the work would provide a valuable shared framework for clinical speech AI, moving beyond isolated condition-specific studies and enabling tracking of progress toward general-purpose representations. The scale (12 datasets, 27 tasks) and systematic comparison of multiple encoders are strengths that could influence future model development.

major comments (2)

[Abstract and §4] Abstract and §4: The abstract states evaluation results and conclusions but supplies no details on data splits, statistical tests, error bars, or exclusion criteria, so it is impossible to verify whether the data supports the generalization claims.
[§2.3] §2.3: The stage-based task grouping (conceptualization, formulation, articulation) is presented without explicit mapping to clinical literature or ablation studies confirming that within-stage tasks exhibit more similar performance patterns than across-stage tasks; this grouping is load-bearing for the central claim that observed failures reflect representation limitations rather than dataset artefacts.

minor comments (2)

[Table 1] Table 1: Dataset and task metadata would benefit from explicit column headers for recording conditions and label distributions to aid reproducibility.
[Figure 2] Figure 2: Axis labels on the performance heatmaps are small and could be enlarged for readability in the published version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve transparency and grounding in the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4: The abstract states evaluation results and conclusions but supplies no details on data splits, statistical tests, error bars, or exclusion criteria, so it is impossible to verify whether the data supports the generalization claims.

Authors: We agree that the abstract would benefit from greater transparency on the evaluation protocol. In the revised manuscript we will add a concise clause to the abstract noting the use of dataset-specific train/test splits, bootstrap-derived 95% confidence intervals, and audio-quality-based exclusion criteria. Section 4 already specifies the splits, the bootstrap procedure for error bars, and the exclusion rules; we will add an explicit summary paragraph at the start of §4 that cross-references these elements and reports the statistical tests (paired t-tests with Bonferroni correction) used for model comparisons. These changes will make the support for the generalization claims directly verifiable from the abstract and §4. revision: yes
Referee: [§2.3] §2.3: The stage-based task grouping (conceptualization, formulation, articulation) is presented without explicit mapping to clinical literature or ablation studies confirming that within-stage tasks exhibit more similar performance patterns than across-stage tasks; this grouping is load-bearing for the central claim that observed failures reflect representation limitations rather than dataset artefacts.

Authors: The grouping follows established clinical models of speech production (Levelt 1989; Duffy 2019 on motor speech disorders). We will revise §2.3 to include a table that explicitly maps each of the 27 tasks to one of the three stages together with primary clinical citations. We did not perform an intra- versus inter-stage performance correlation ablation; the paper’s cross-condition zero-shot experiments already separate representation limitations from dataset-specific artefacts. We will therefore add the literature mapping and table (revision_made = partial) but note that a full ablation would require substantial additional compute and is not required to support the main claims. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

full rationale

The paper introduces SpeechDx as a multi-task benchmark spanning datasets and tasks, structures them by speech production stages as a methodological choice, and reports empirical evaluations of audio encoders. No equations, parameter fitting, predictions, or first-principles derivations are present; the central claim about generalization failures is an empirical observation from the benchmark results rather than a reduction to inputs or self-citations. The stage-based grouping is asserted for enabling cross-mechanism evaluation but is not derived from or equivalent to the results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5722 in / 997 out tokens · 44461 ms · 2026-06-27T03:05:16.350256+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

117 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Call, and Guy Fagherazzi

Vesna Despotovic, Mohamad Ismael, Marc Cornil, Romain M. Call, and Guy Fagherazzi. Detection of covid-19 from voice, cough and breathing patterns: Dataset and preliminary results.Computer Biology and Medicine, 138:104944, 2021

2021
[2]

Speech as a biomarker for covid-19 detection using machine learning

Mohammed Usman, Vinit Kumar Gunjan, Mohd Wajid, Mohammed Zubair, and Kazy Noor- e-alam Siddiquee. Speech as a biomarker for covid-19 detection using machine learning. Computational Intelligence and Neuroscience, 2022(1):6093613, 2022

2022
[3]

Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih

David H. Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih. Dysarthria speech detection using convolutional neural networks with gated recurrent unit.Healthcare, 10(10):1956, 2022

1956
[4]

Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R

Carlos D. Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R. Orozco-Arroyave. Automatic classification of hypokinetic and hyperkinetic dysarthria based on gmm-supervectors. In Proceedings of INTERSPEECH 2023. ISCA, 2023

2023
[6]

Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

Mittapalle Kiran Reddy and Paavo Alku. Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

2023
[7]

Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

Rania Khaskhoussy and Yosra Ben Ayed. Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

2023
[8]

Gomez-Garcia, Juan D

Laura Moro-Velazquez, Juan A. Gomez-Garcia, Juan D. Arias-Londoño, Najim Dehak, and Juan I. Godino-Llorente. Advances in parkinson’s disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects.Biomedical Signal Processing and Control, 66:102418, 2021

2021
[9]

Mahsa Zolnoori, Arash Zolnour, and Maxim Topaz. Adscreen: A speech processing-based screening system for automatic identification of patients with alzheimer’s disease and related dementia.Artificial Intelligence in Medicine, 143:102624, 2023

2023
[10]

Israel Martínez-Nicolás, Thide E Llorente, Francisco Martínez-Sánchez, and Juan José G Meilán. Ten years of research on automatic voice and speech analysis of people with alzheimer’s disease and mild cognitive impairment: a systematic review article.Frontiers in Psychology, 12:620251, 2021

2021
[11]

Classifying dementia in the presence of depression: A cross-corpus study

Felix Braun, Maria Schuster, Florian Honig, Elmar Noeth, and Juan Rafael Orozco-Arroyave. Classifying dementia in the presence of depression: A cross-corpus study. InProceedings of INTERSPEECH 2023. ISCA, 2023

2023
[12]

Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

Pingping Wu, Ruihao Wang, Han Lin, Fanlong Zhang, Juan Tu, and Miao Sun. Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

2023
[13]

Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

Sanne Koops, Sanne G Brederoo, Janna N De Boer, Femke G Nadema, Alban E V oppel, and Iris E Sommer. Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

2023
[14]

A review of depression and suicide risk assessment using speech analysis

Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech communication, 71:10–49, 2015

2015
[15]

Sung, and Philip C

Guo-Shiang Liu, Nikola Jovanovic, Chang K. Sung, and Philip C. Doyle. A scoping review of artificial intelligence detection of voice pathology: Challenges and opportunities. Otolaryngology–Head and Neck Surgery, 171(3):658–666, 2024. 10

2024
[16]

MVP: Multi-source V oice Pathology detection

Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, and Elena Baralis. MVP: Multi-source V oice Pathology detection. InInterspeech 2025, pages 3548–3552, 2025

2025
[17]

The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stuttering, activity, & mosquitoes

Björn Schuller, Anton Batliner, Shahin Amiriparian, Christian Bergler, Maurice Gerczuk, Natalie Holz, Pauline Larrouy-Maestri, Sebastien Bayerl, Korbinian Riedhammer, Adria Mallol-Ragolta, Maria Pateraki, Harry Coppock, Ivan Kiskin, Marianne Sinka, and Stephen Roberts. The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stutte...

2022
[18]

Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick

Richard C. Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. The post-stroke speech transcription (psst) challenge. InProceedings of the LREC 2022 RaPID-4 Workshop, pages 41–55, 2022

2022
[19]

Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge

Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, and Brian MacWhinney. Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge. InInterspeech 2020, pages 2172–2176, 2020

2020
[20]

The 1st speechwellness challenge: Detecting suicide risk among adolescents

Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, and Chao Zhang. The 1st speechwellness challenge: Detecting suicide risk among adolescents. InInterspeech 2025, pages 399–403. ISCA, 2025

2025
[21]

Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, and Viral Nanda. Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

2021
[22]

Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

2023
[23]

Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457

Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, and Julie Liss. Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457. ISCA, 09 2022

2022
[24]

Guilherme Schu, Parvaneh Janbakhshi, and Ina Kodrasi. On using the ua-speech and torgo databases to validate automatic dysarthric speech classification approaches.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2022

2023
[25]

Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms

Yi Zhu, Mohamed Imoussaine, Carolyn Côté-Lussier, and Tiago Falk. Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms. pages 46–54, 08 2023

2023
[26]

Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

Visar Berisha and Julie Liss. Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

2024
[27]

Rosen, and Ryan C

Katherine Verdolini, Clark A. Rosen, and Ryan C. Branski, editors.Classification Manual for Voice Disorders-I. Psychology Press, 1 edition, 2006

2006
[28]

V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

Guy Fagherazzi, Aurélie Fischer, Muhannad Ismael, and Vladimir Despotovic. V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

2021
[29]

Harrison, Liam D

Jessica Robin, John E. Harrison, Liam D. Kaufman, Frank Rudzicz, William Simpson, and Maria Yancheva. Evaluation of speech-based digital biomarkers: Review and recommendations. Digital Biomarkers, 4(3):99–108, 10 2020

2020
[30]

Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022

Vikram Ramanarayanan, Adam Lammert, Hannah Rowe, Thomas Quatieri, and Jordan Green. Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022. 11

2022
[31]

An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

Si-Ioi Ng, Lingfeng Xu, Ingo Siegert, Nicholas Cummins, Nina R Benway, Julie Liss, and Visar Berisha. An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

2026
[32]

Opensmile: the munich versatile and fast open-source audio feature extractor

Florian Eyben, Martin Wollmer, and Björn Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. InProceedings of the 18th ACM International Conference on Multimedia, MM ’10, page 1459–1462, New York, NY , USA, 2010. Association for Computing Machinery

2010
[33]

Covid-19 detection via fusion of modulation spectrum and linear prediction speech features

Yi Zhu, Abhishek Tiwari, João Monteiro, Shruti Kshirsagar, and Tiago Henrique Falk. Covid-19 detection via fusion of modulation spectrum and linear prediction speech features. IEEE/ACM transactions on audio, speech, and language processing, 31:1536–1549, 2023

2023
[34]

Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

Tomas Arias-Vergara, Juan Camilo Vasquez-Correa, and Juan Rafael Orozco-Arroyave. Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

2017
[35]

Effectiveness of voice quality features in detecting depression

Ahmed Afshan, Jian Guo, Seong Joon Park, Venkatesh Ravi, Jonathan Flint, and Abeer Alwan. Effectiveness of voice quality features in detecting depression. InProceedings of INTERSPEECH 2018, pages 1676–1680. ISCA, 2018

2018
[36]

Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

Sejal Bhalla, Deshang Kong, Salaar Liaqat, Daniyal Liaqat, Robert Wu, Andrea Gershon, Eyal de Lara, and Alex Mariakakis. Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

2025
[37]

Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease

Juan Camilo Vasquez-Correa, Juan Rafael Orozco-Arroyave, Tobias Bocklet, and Elmar Noeth. Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease. Journal of Communication Disorders, 76:21–36, 2018

2018
[38]

Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization

Chunying Fang, Haifeng Li, Lin Ma, and Mancai Zhang. Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization. Computational and Mathematical Methods in Medicine, 2017(1):2431573, 2017

2017
[39]

Schuller

Nicholas Cummins, Alice Baird, and Björn W. Schuller. Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning.Methods, 151:41–54, 2018. Health Informatics and Translational Data Analytics

2018
[40]

Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

Qin Yang, Xin Li, Xinyun Ding, Feiyang Xu, and Zhenhua Ling. Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

2022
[41]

V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals

Rumana Islam, Esam Abdel-Raheem, and Mohammed Tarique. V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals. Computer Methods and Programs in Biomedicine Update, 2:100074, 2022

2022
[42]

End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

Harry Coppock, Alex Gaskell, Panagiotis Tzirakis, Alice Baird, Lyn Jones, and Björn Schuller. End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

2021
[43]

Automatic assessment of aphasic speech sensed by audio sensors for classification into aphasia severity levels to recommend speech therapies.Sensors, 22(18), 2022

Herath Mudiyanselage Dhammike Piyumal Madhurajith Herath, Weraniyagoda Arachchilage Sahanaka Anuththara Weraniyagoda, Rajapakshage Thilina Madhushan Rajapaksha, Patikiri Arachchige Don Shehan Nilmantha Wijesekara, Kalupahana Liyanage Kushan Sudheera, and Peter Han Joo Chong. Automatic assessment of aphasic speech sensed by audio sensors for classification...

2022
[44]

Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023

Ah Young Kim, Eun Hye Jang, Seung-Hwan Lee, Kwang-Yeon Choi, Jeon Gue Park, and Hyun-Chool Shin. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023. 12

2023
[45]

Yi Zhu, Alex Mariakakis, Eyal De Lara, and Tiago H. Falk. How generalizable and interpretable are speech-based covid-19 detection systems?: A comparative analysis and new system proposal. In2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 1–5, 2022

2022
[46]

Masked autoencoders are scalable vision learners, 2021

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021

2021
[47]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

2024
[48]

Self-supervised learning from images with a joint-embedding predictive architecture, 2023

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023

2023
[49]

Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

2019
[50]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018

2018
[51]

Llama: Open and efficient foundation language models, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

2023
[52]

wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

2020
[53]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

2021
[54]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, October 2022

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Pr...

2022
[55]

Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

Farhad Javanmardi, Sudarsana Reddy Kadiri, and Paavo Alku. Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

2024
[56]

Gruia, Patrick A

Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, and Fatemeh Geranmayeh. When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. InInterspeech 2024, pages 1995–1999, 2024

2024
[57]

Exploring self-supervised models for depressive disorder detection: A study on speech corpora

Bubai Maji, Shazia Nasreen, Rajlakshmi Guha, Aurobinda Routray, Debabrata Majumdar, and Km Poonam. Exploring self-supervised models for depressive disorder detection: A study on speech corpora. In2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1–4, 2024

2024
[58]

Emotion Recognition from Speech Using wav2vec 2.0 Embeddings

Leonardo Pepino, Pablo Riera, and Luciana Ferrer. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. InInterspeech 2021, pages 3400–3404, 2021

2021
[59]

Probing whisper for dysarthric speech in detection and assessment, 2025

Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, and Erfan Loweimi. Probing whisper for dysarthric speech in detection and assessment, 2025. 13

2025
[60]

Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals

Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, and Li-Rong Dai. Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 561–565. IEEE, May 2022

2022
[61]

Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

Yi Zhu and Tiago Falk. Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

2025
[62]

Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila

Sebastien Baur, Zaid Nabulsi, Wei-Hung Weng, Jake Garrison, Louis Blankemeier, Sam Fishman, Christina Chen, Sujay Kakarmath, Minyoi Maimbolwa, Nsala Sanjase, Brian Shuma, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila. Hear – health acoustic representations, 2024

2024
[63]

Lin, Andy T

Shu wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y . Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei- Cheng Tseng, Ko tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung yi Lee. SUPERB: Speech Processing Universal PERformance Benchm...

2021
[64]

Schuller, Christian J

Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, and Yonat...

2022
[65]

The distress analysis interview corpus of human and computer interviews

Jonathan Gratch, Ron Artstein, Gale Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, David Traum, Skip Rizzo, and Louis-Philippe Morency. The distress analysis interview corpus of human and computer interviews. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Mae...

2014
[66]

Livingstone and Frank A

Steven R. Livingstone and Frank A. Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PLOS ONE, 13(5):e0196391, 2018

2018
[67]

Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

2008
[68]

Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge

Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2023

2023
[69]

Forbes, Davida Fromm, and Brian MacWhinney

Margaret M. Forbes, Davida Fromm, and Brian MacWhinney. Aphasiabank: A resource for clinicians.Aphasiology, 26(11):1281–1295, 2012

2012
[70]

The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

Frank Rudzicz, Aravind Namasivayam, and Talya Wolff. The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

2010
[71]

Huang, Kenneth Watkin, and Simone Frame

Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman, Jon Gunderson, Thomas S. Huang, Kenneth Watkin, and Simone Frame. Dysarthric speech database for universal access research. InInterspeech 2008, pages 1741–1744, 2008

2008
[72]

Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019

Hagen Jaeger, Dhaval Trivedi, and Michael Stadtschnitzer. Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019. 14

2019
[73]

Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering

Sebastian Peter Bayerl, Alexander Wolff von Gudenberg, Florian Hönig, Elmar Noeth, and Korbinian Riedhammer. Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering. InProceedings of the Language Resources and Evaluation Conference, pages 1780–1787, Marseille, France, June 2022. European Language Resources Association

2022
[74]

Covid-19 sounds: A large-scale audio dataset for digital respiratory screening

Tong Xia, Dimitris Spathis, Chloe Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han, Apinan Hasthanasombat, Erika Bondareva, Ting Dang, Andres Floto, Pietro Cicuta, and Cecilia Mascolo. Covid-19 sounds: A large-scale audio dataset for digital respiratory screening. InProceedings of the 35th Conference on Neural Information Processing Systems Datasets a...

2021
[75]

Chandrakiran, Sahiti Nori, K

Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, C. Chandrakiran, Sahiti Nori, K. K. Suhail, Sadhana Gonuguntla, and Murali Alagesan. Coswara: A respiratory sounds and symptoms dataset for remote screening of sars-cov-2 infection.Scientific Data, 10(1):397, 2023

2023
[76]

Jesus, Inês Belo, Jessica Machado, and Andreia Hall

Luis M.T. Jesus, Inês Belo, Jessica Machado, and Andreia Hall. The advanced voice function assessment databases (avfad): Tools for voice clinicians and speech research. InAdvances in Speech-language Pathology, chapter 14. IntechOpen, London, 2017

2017
[77]

Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

Lester Phillip Violeta, Wen-Chin Huang, and Tomoki Toda. Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

2022
[78]

V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

Jie Cai, Yuliang Song, Jianghao Wu, and Xiong Chen. V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

2024
[79]

Scaling speech technology to 1,000+ languages, 2023

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages, 2023

2023
[80]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[81]

Robust speech recognition via large-scale weak supervision, 2022

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022

2022

Showing first 80 references.

[1] [1]

Call, and Guy Fagherazzi

Vesna Despotovic, Mohamad Ismael, Marc Cornil, Romain M. Call, and Guy Fagherazzi. Detection of covid-19 from voice, cough and breathing patterns: Dataset and preliminary results.Computer Biology and Medicine, 138:104944, 2021

2021

[2] [2]

Speech as a biomarker for covid-19 detection using machine learning

Mohammed Usman, Vinit Kumar Gunjan, Mohd Wajid, Mohammed Zubair, and Kazy Noor- e-alam Siddiquee. Speech as a biomarker for covid-19 detection using machine learning. Computational Intelligence and Neuroscience, 2022(1):6093613, 2022

2022

[3] [3]

Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih

David H. Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih. Dysarthria speech detection using convolutional neural networks with gated recurrent unit.Healthcare, 10(10):1956, 2022

1956

[4] [4]

Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R

Carlos D. Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R. Orozco-Arroyave. Automatic classification of hypokinetic and hyperkinetic dysarthria based on gmm-supervectors. In Proceedings of INTERSPEECH 2023. ISCA, 2023

2023

[5] [6]

Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

Mittapalle Kiran Reddy and Paavo Alku. Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

2023

[6] [7]

Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

Rania Khaskhoussy and Yosra Ben Ayed. Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

2023

[7] [8]

Gomez-Garcia, Juan D

Laura Moro-Velazquez, Juan A. Gomez-Garcia, Juan D. Arias-Londoño, Najim Dehak, and Juan I. Godino-Llorente. Advances in parkinson’s disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects.Biomedical Signal Processing and Control, 66:102418, 2021

2021

[8] [9]

Mahsa Zolnoori, Arash Zolnour, and Maxim Topaz. Adscreen: A speech processing-based screening system for automatic identification of patients with alzheimer’s disease and related dementia.Artificial Intelligence in Medicine, 143:102624, 2023

2023

[9] [10]

Israel Martínez-Nicolás, Thide E Llorente, Francisco Martínez-Sánchez, and Juan José G Meilán. Ten years of research on automatic voice and speech analysis of people with alzheimer’s disease and mild cognitive impairment: a systematic review article.Frontiers in Psychology, 12:620251, 2021

2021

[10] [11]

Classifying dementia in the presence of depression: A cross-corpus study

Felix Braun, Maria Schuster, Florian Honig, Elmar Noeth, and Juan Rafael Orozco-Arroyave. Classifying dementia in the presence of depression: A cross-corpus study. InProceedings of INTERSPEECH 2023. ISCA, 2023

2023

[11] [12]

Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

Pingping Wu, Ruihao Wang, Han Lin, Fanlong Zhang, Juan Tu, and Miao Sun. Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

2023

[12] [13]

Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

Sanne Koops, Sanne G Brederoo, Janna N De Boer, Femke G Nadema, Alban E V oppel, and Iris E Sommer. Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

2023

[13] [14]

A review of depression and suicide risk assessment using speech analysis

Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech communication, 71:10–49, 2015

2015

[14] [15]

Sung, and Philip C

Guo-Shiang Liu, Nikola Jovanovic, Chang K. Sung, and Philip C. Doyle. A scoping review of artificial intelligence detection of voice pathology: Challenges and opportunities. Otolaryngology–Head and Neck Surgery, 171(3):658–666, 2024. 10

2024

[15] [16]

MVP: Multi-source V oice Pathology detection

Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, and Elena Baralis. MVP: Multi-source V oice Pathology detection. InInterspeech 2025, pages 3548–3552, 2025

2025

[16] [17]

The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stuttering, activity, & mosquitoes

Björn Schuller, Anton Batliner, Shahin Amiriparian, Christian Bergler, Maurice Gerczuk, Natalie Holz, Pauline Larrouy-Maestri, Sebastien Bayerl, Korbinian Riedhammer, Adria Mallol-Ragolta, Maria Pateraki, Harry Coppock, Ivan Kiskin, Marianne Sinka, and Stephen Roberts. The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stutte...

2022

[17] [18]

Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick

Richard C. Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. The post-stroke speech transcription (psst) challenge. InProceedings of the LREC 2022 RaPID-4 Workshop, pages 41–55, 2022

2022

[18] [19]

Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge

Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, and Brian MacWhinney. Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge. InInterspeech 2020, pages 2172–2176, 2020

2020

[19] [20]

The 1st speechwellness challenge: Detecting suicide risk among adolescents

Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, and Chao Zhang. The 1st speechwellness challenge: Detecting suicide risk among adolescents. InInterspeech 2025, pages 399–403. ISCA, 2025

2025

[20] [21]

Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, and Viral Nanda. Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

2021

[21] [22]

Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

2023

[22] [23]

Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457

Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, and Julie Liss. Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457. ISCA, 09 2022

2022

[23] [24]

Guilherme Schu, Parvaneh Janbakhshi, and Ina Kodrasi. On using the ua-speech and torgo databases to validate automatic dysarthric speech classification approaches.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2022

2023

[24] [25]

Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms

Yi Zhu, Mohamed Imoussaine, Carolyn Côté-Lussier, and Tiago Falk. Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms. pages 46–54, 08 2023

2023

[25] [26]

Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

Visar Berisha and Julie Liss. Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

2024

[26] [27]

Rosen, and Ryan C

Katherine Verdolini, Clark A. Rosen, and Ryan C. Branski, editors.Classification Manual for Voice Disorders-I. Psychology Press, 1 edition, 2006

2006

[27] [28]

V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

Guy Fagherazzi, Aurélie Fischer, Muhannad Ismael, and Vladimir Despotovic. V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

2021

[28] [29]

Harrison, Liam D

Jessica Robin, John E. Harrison, Liam D. Kaufman, Frank Rudzicz, William Simpson, and Maria Yancheva. Evaluation of speech-based digital biomarkers: Review and recommendations. Digital Biomarkers, 4(3):99–108, 10 2020

2020

[29] [30]

Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022

Vikram Ramanarayanan, Adam Lammert, Hannah Rowe, Thomas Quatieri, and Jordan Green. Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022. 11

2022

[30] [31]

An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

Si-Ioi Ng, Lingfeng Xu, Ingo Siegert, Nicholas Cummins, Nina R Benway, Julie Liss, and Visar Berisha. An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

2026

[31] [32]

Opensmile: the munich versatile and fast open-source audio feature extractor

Florian Eyben, Martin Wollmer, and Björn Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. InProceedings of the 18th ACM International Conference on Multimedia, MM ’10, page 1459–1462, New York, NY , USA, 2010. Association for Computing Machinery

2010

[32] [33]

Covid-19 detection via fusion of modulation spectrum and linear prediction speech features

Yi Zhu, Abhishek Tiwari, João Monteiro, Shruti Kshirsagar, and Tiago Henrique Falk. Covid-19 detection via fusion of modulation spectrum and linear prediction speech features. IEEE/ACM transactions on audio, speech, and language processing, 31:1536–1549, 2023

2023

[33] [34]

Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

Tomas Arias-Vergara, Juan Camilo Vasquez-Correa, and Juan Rafael Orozco-Arroyave. Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

2017

[34] [35]

Effectiveness of voice quality features in detecting depression

Ahmed Afshan, Jian Guo, Seong Joon Park, Venkatesh Ravi, Jonathan Flint, and Abeer Alwan. Effectiveness of voice quality features in detecting depression. InProceedings of INTERSPEECH 2018, pages 1676–1680. ISCA, 2018

2018

[35] [36]

Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

Sejal Bhalla, Deshang Kong, Salaar Liaqat, Daniyal Liaqat, Robert Wu, Andrea Gershon, Eyal de Lara, and Alex Mariakakis. Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

2025

[36] [37]

Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease

Juan Camilo Vasquez-Correa, Juan Rafael Orozco-Arroyave, Tobias Bocklet, and Elmar Noeth. Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease. Journal of Communication Disorders, 76:21–36, 2018

2018

[37] [38]

Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization

Chunying Fang, Haifeng Li, Lin Ma, and Mancai Zhang. Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization. Computational and Mathematical Methods in Medicine, 2017(1):2431573, 2017

2017

[38] [39]

Schuller

Nicholas Cummins, Alice Baird, and Björn W. Schuller. Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning.Methods, 151:41–54, 2018. Health Informatics and Translational Data Analytics

2018

[39] [40]

Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

Qin Yang, Xin Li, Xinyun Ding, Feiyang Xu, and Zhenhua Ling. Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

2022

[40] [41]

V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals

Rumana Islam, Esam Abdel-Raheem, and Mohammed Tarique. V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals. Computer Methods and Programs in Biomedicine Update, 2:100074, 2022

2022

[41] [42]

End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

Harry Coppock, Alex Gaskell, Panagiotis Tzirakis, Alice Baird, Lyn Jones, and Björn Schuller. End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

2021

[42] [43]

Automatic assessment of aphasic speech sensed by audio sensors for classification into aphasia severity levels to recommend speech therapies.Sensors, 22(18), 2022

Herath Mudiyanselage Dhammike Piyumal Madhurajith Herath, Weraniyagoda Arachchilage Sahanaka Anuththara Weraniyagoda, Rajapakshage Thilina Madhushan Rajapaksha, Patikiri Arachchige Don Shehan Nilmantha Wijesekara, Kalupahana Liyanage Kushan Sudheera, and Peter Han Joo Chong. Automatic assessment of aphasic speech sensed by audio sensors for classification...

2022

[43] [44]

Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023

Ah Young Kim, Eun Hye Jang, Seung-Hwan Lee, Kwang-Yeon Choi, Jeon Gue Park, and Hyun-Chool Shin. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023. 12

2023

[44] [45]

Yi Zhu, Alex Mariakakis, Eyal De Lara, and Tiago H. Falk. How generalizable and interpretable are speech-based covid-19 detection systems?: A comparative analysis and new system proposal. In2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 1–5, 2022

2022

[45] [46]

Masked autoencoders are scalable vision learners, 2021

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021

2021

[46] [47]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

2024

[47] [48]

Self-supervised learning from images with a joint-embedding predictive architecture, 2023

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023

2023

[48] [49]

Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

2019

[49] [50]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018

2018

[50] [51]

Llama: Open and efficient foundation language models, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

2023

[51] [52]

wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

2020

[52] [53]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

2021

[53] [54]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, October 2022

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Pr...

2022

[54] [55]

Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

Farhad Javanmardi, Sudarsana Reddy Kadiri, and Paavo Alku. Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

2024

[55] [56]

Gruia, Patrick A

Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, and Fatemeh Geranmayeh. When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. InInterspeech 2024, pages 1995–1999, 2024

2024

[56] [57]

Exploring self-supervised models for depressive disorder detection: A study on speech corpora

Bubai Maji, Shazia Nasreen, Rajlakshmi Guha, Aurobinda Routray, Debabrata Majumdar, and Km Poonam. Exploring self-supervised models for depressive disorder detection: A study on speech corpora. In2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1–4, 2024

2024

[57] [58]

Emotion Recognition from Speech Using wav2vec 2.0 Embeddings

Leonardo Pepino, Pablo Riera, and Luciana Ferrer. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. InInterspeech 2021, pages 3400–3404, 2021

2021

[58] [59]

Probing whisper for dysarthric speech in detection and assessment, 2025

Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, and Erfan Loweimi. Probing whisper for dysarthric speech in detection and assessment, 2025. 13

2025

[59] [60]

Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals

Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, and Li-Rong Dai. Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 561–565. IEEE, May 2022

2022

[60] [61]

Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

Yi Zhu and Tiago Falk. Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

2025

[61] [62]

Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila

Sebastien Baur, Zaid Nabulsi, Wei-Hung Weng, Jake Garrison, Louis Blankemeier, Sam Fishman, Christina Chen, Sujay Kakarmath, Minyoi Maimbolwa, Nsala Sanjase, Brian Shuma, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila. Hear – health acoustic representations, 2024

2024

[62] [63]

Lin, Andy T

Shu wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y . Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei- Cheng Tseng, Ko tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung yi Lee. SUPERB: Speech Processing Universal PERformance Benchm...

2021

[63] [64]

Schuller, Christian J

Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, and Yonat...

2022

[64] [65]

The distress analysis interview corpus of human and computer interviews

Jonathan Gratch, Ron Artstein, Gale Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, David Traum, Skip Rizzo, and Louis-Philippe Morency. The distress analysis interview corpus of human and computer interviews. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Mae...

2014

[65] [66]

Livingstone and Frank A

Steven R. Livingstone and Frank A. Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PLOS ONE, 13(5):e0196391, 2018

2018

[66] [67]

Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

2008

[67] [68]

Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge

Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2023

2023

[68] [69]

Forbes, Davida Fromm, and Brian MacWhinney

Margaret M. Forbes, Davida Fromm, and Brian MacWhinney. Aphasiabank: A resource for clinicians.Aphasiology, 26(11):1281–1295, 2012

2012

[69] [70]

The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

Frank Rudzicz, Aravind Namasivayam, and Talya Wolff. The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

2010

[70] [71]

Huang, Kenneth Watkin, and Simone Frame

Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman, Jon Gunderson, Thomas S. Huang, Kenneth Watkin, and Simone Frame. Dysarthric speech database for universal access research. InInterspeech 2008, pages 1741–1744, 2008

2008

[71] [72]

Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019

Hagen Jaeger, Dhaval Trivedi, and Michael Stadtschnitzer. Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019. 14

2019

[72] [73]

Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering

Sebastian Peter Bayerl, Alexander Wolff von Gudenberg, Florian Hönig, Elmar Noeth, and Korbinian Riedhammer. Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering. InProceedings of the Language Resources and Evaluation Conference, pages 1780–1787, Marseille, France, June 2022. European Language Resources Association

2022

[73] [74]

Covid-19 sounds: A large-scale audio dataset for digital respiratory screening

Tong Xia, Dimitris Spathis, Chloe Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han, Apinan Hasthanasombat, Erika Bondareva, Ting Dang, Andres Floto, Pietro Cicuta, and Cecilia Mascolo. Covid-19 sounds: A large-scale audio dataset for digital respiratory screening. InProceedings of the 35th Conference on Neural Information Processing Systems Datasets a...

2021

[74] [75]

Chandrakiran, Sahiti Nori, K

Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, C. Chandrakiran, Sahiti Nori, K. K. Suhail, Sadhana Gonuguntla, and Murali Alagesan. Coswara: A respiratory sounds and symptoms dataset for remote screening of sars-cov-2 infection.Scientific Data, 10(1):397, 2023

2023

[75] [76]

Jesus, Inês Belo, Jessica Machado, and Andreia Hall

Luis M.T. Jesus, Inês Belo, Jessica Machado, and Andreia Hall. The advanced voice function assessment databases (avfad): Tools for voice clinicians and speech research. InAdvances in Speech-language Pathology, chapter 14. IntechOpen, London, 2017

2017

[76] [77]

Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

Lester Phillip Violeta, Wen-Chin Huang, and Tomoki Toda. Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

2022

[77] [78]

V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

Jie Cai, Yuliang Song, Jianghao Wu, and Xiong Chen. V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

2024

[78] [79]

Scaling speech technology to 1,000+ languages, 2023

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages, 2023

2023

[79] [80]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[80] [81]

Robust speech recognition via large-scale weak supervision, 2022

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022

2022