pith. sign in

arxiv: 2606.17339 · v1 · pith:3YG364OUnew · submitted 2026-06-15 · 💻 cs.AI · cs.CL· cs.SD

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Pith reviewed 2026-06-27 03:05 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.SD
keywords clinical speech AIbenchmarkgeneralizationaudio encodersspeech production stagesmulti-task evaluationzero-shot transferhealth conditions
0
0 comments X

The pith

No current audio representation generalizes reliably across the clinical speech landscape.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpeechDx as a benchmark covering 12 datasets and 27 tasks to move clinical speech AI beyond isolated, condition-specific studies. Tasks are grouped by the stage of speech production disrupted—conceptualization, formulation, or articulation—so that evaluation can target shared mechanisms rather than dataset quirks. Systematic testing of 12 audio encoders reveals that large-scale speech models give the best overall results, domain-specific models help only on closely matched tasks, and no representation performs consistently across conditions or datasets. This setup lets researchers track whether models are capturing clinically meaningful patterns instead of artefacts. If the claim holds, progress depends on building representations that transfer across the full range of health conditions rather than excelling on single tasks.

Core claim

SpeechDx shows that large-scale speech models are the strongest baselines overall, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape when tested on tasks with limited labels and the same condition across multiple datasets.

What carries the argument

SpeechDx benchmark, which organizes 27 tasks across 12 datasets by the stage of speech production disrupted (conceptualization, formulation, articulation) to test shared clinical mechanisms.

If this is right

  • Large-scale pretrained speech models should serve as the default starting point for new clinical speech applications.
  • Domain-specific adaptation yields gains only when the target task closely matches the adaptation data.
  • Cross-condition and cross-dataset transfer remains an open requirement for any general-purpose clinical speech system.
  • Future models can be compared directly on the same 27-task suite to measure genuine generalization gains.
  • The stage-based structure provides a way to diagnose which parts of the speech pipeline a representation fails to capture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that succeed on this benchmark could support diagnostic tools usable across multiple neurological and motor conditions without retraining per disease.
  • The benchmark could guide development of hybrid encoders that combine broad pretraining with targeted adaptation for underrepresented stages like conceptualization.
  • If generalization improves, clinical speech AI could shift from single-condition tools to integrated systems that monitor overall speech health.
  • Extending the benchmark with more languages or recording conditions would test whether the current generalization failures are language- or hardware-specific.

Load-bearing premise

Grouping tasks by the stage of speech production they disrupt lets evaluation measure shared clinical mechanisms rather than dataset-specific artefacts.

What would settle it

A single audio encoder achieving consistently high performance on all 27 tasks, including zero-shot cross-condition transfer and repeated conditions across different datasets, without performance dropping on any subset.

Figures

Figures reproduced from arXiv: 2606.17339 by Aina Merchant, Alex Mariakakis, Eyal de Lara, Larry Kieu, Sejal Bhalla.

Figure 1
Figure 1. Figure 1: The benchmark evaluation of 12 audio encoders across 27 clinical speech AI tasks, grouped [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The evaluation of zero-shot transfer for classification tasks. The top four grids show [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The data efficiency of Qwen3-TTS-Tokenizer [ [PITH_FULL_IMAGE:figures/full_fig_p027_3.png] view at source ↗
read the original abstract

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpeechDx, a benchmark spanning 12 datasets and 27 tasks across health conditions, structured by speech production stages (conceptualization, formulation, articulation) to evaluate generalization of audio encoders. It systematically tests 12 state-of-the-art encoders under in-domain and zero-shot cross-condition settings, concluding that large-scale models are strongest overall, domain-specific models help only on matched tasks, and no current representation generalizes reliably across the clinical speech landscape.

Significance. If the benchmark's task grouping and evaluation protocol hold, the work would provide a valuable shared framework for clinical speech AI, moving beyond isolated condition-specific studies and enabling tracking of progress toward general-purpose representations. The scale (12 datasets, 27 tasks) and systematic comparison of multiple encoders are strengths that could influence future model development.

major comments (2)
  1. [Abstract and §4] Abstract and §4: The abstract states evaluation results and conclusions but supplies no details on data splits, statistical tests, error bars, or exclusion criteria, so it is impossible to verify whether the data supports the generalization claims.
  2. [§2.3] §2.3: The stage-based task grouping (conceptualization, formulation, articulation) is presented without explicit mapping to clinical literature or ablation studies confirming that within-stage tasks exhibit more similar performance patterns than across-stage tasks; this grouping is load-bearing for the central claim that observed failures reflect representation limitations rather than dataset artefacts.
minor comments (2)
  1. [Table 1] Table 1: Dataset and task metadata would benefit from explicit column headers for recording conditions and label distributions to aid reproducibility.
  2. [Figure 2] Figure 2: Axis labels on the performance heatmaps are small and could be enlarged for readability in the published version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve transparency and grounding in the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4: The abstract states evaluation results and conclusions but supplies no details on data splits, statistical tests, error bars, or exclusion criteria, so it is impossible to verify whether the data supports the generalization claims.

    Authors: We agree that the abstract would benefit from greater transparency on the evaluation protocol. In the revised manuscript we will add a concise clause to the abstract noting the use of dataset-specific train/test splits, bootstrap-derived 95% confidence intervals, and audio-quality-based exclusion criteria. Section 4 already specifies the splits, the bootstrap procedure for error bars, and the exclusion rules; we will add an explicit summary paragraph at the start of §4 that cross-references these elements and reports the statistical tests (paired t-tests with Bonferroni correction) used for model comparisons. These changes will make the support for the generalization claims directly verifiable from the abstract and §4. revision: yes

  2. Referee: [§2.3] §2.3: The stage-based task grouping (conceptualization, formulation, articulation) is presented without explicit mapping to clinical literature or ablation studies confirming that within-stage tasks exhibit more similar performance patterns than across-stage tasks; this grouping is load-bearing for the central claim that observed failures reflect representation limitations rather than dataset artefacts.

    Authors: The grouping follows established clinical models of speech production (Levelt 1989; Duffy 2019 on motor speech disorders). We will revise §2.3 to include a table that explicitly maps each of the 27 tasks to one of the three stages together with primary clinical citations. We did not perform an intra- versus inter-stage performance correlation ablation; the paper’s cross-condition zero-shot experiments already separate representation limitations from dataset-specific artefacts. We will therefore add the literature mapping and table (revision_made = partial) but note that a full ablation would require substantial additional compute and is not required to support the main claims. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

full rationale

The paper introduces SpeechDx as a multi-task benchmark spanning datasets and tasks, structures them by speech production stages as a methodological choice, and reports empirical evaluations of audio encoders. No equations, parameter fitting, predictions, or first-principles derivations are present; the central claim about generalization failures is an empirical observation from the benchmark results rather than a reduction to inputs or self-citations. The stage-based grouping is asserted for enabling cross-mechanism evaluation but is not derived from or equivalent to the results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5722 in / 997 out tokens · 44461 ms · 2026-06-27T03:05:16.350256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

117 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Call, and Guy Fagherazzi

    Vesna Despotovic, Mohamad Ismael, Marc Cornil, Romain M. Call, and Guy Fagherazzi. Detection of covid-19 from voice, cough and breathing patterns: Dataset and preliminary results.Computer Biology and Medicine, 138:104944, 2021

  2. [2]

    Speech as a biomarker for covid-19 detection using machine learning

    Mohammed Usman, Vinit Kumar Gunjan, Mohd Wajid, Mohammed Zubair, and Kazy Noor- e-alam Siddiquee. Speech as a biomarker for covid-19 detection using machine learning. Computational Intelligence and Neuroscience, 2022(1):6093613, 2022

  3. [3]

    Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih

    David H. Shih, Chih-Hao Liao, Tzu-Wei Wu, Xiao-Yu Xu, and Ming-Hsiang Shih. Dysarthria speech detection using convolutional neural networks with gated recurrent unit.Healthcare, 10(10):1956, 2022

  4. [4]

    Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R

    Carlos D. Ríos-Urrego, Jan Rusz, Elmar Nöth, and Juan R. Orozco-Arroyave. Automatic classification of hypokinetic and hyperkinetic dysarthria based on gmm-supervectors. In Proceedings of INTERSPEECH 2023. ISCA, 2023

  5. [6]

    Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

    Mittapalle Kiran Reddy and Paavo Alku. Exemplar-based sparse representations for detection of parkinson’s disease from speech.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1386–1396, 2023

  6. [7]

    Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

    Rania Khaskhoussy and Yosra Ben Ayed. Improving parkinson’s disease recognition through voice analysis using deep learning.Pattern Recognition Letters, 168:64–70, 2023

  7. [8]

    Gomez-Garcia, Juan D

    Laura Moro-Velazquez, Juan A. Gomez-Garcia, Juan D. Arias-Londoño, Najim Dehak, and Juan I. Godino-Llorente. Advances in parkinson’s disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects.Biomedical Signal Processing and Control, 66:102418, 2021

  8. [9]

    Mahsa Zolnoori, Arash Zolnour, and Maxim Topaz. Adscreen: A speech processing-based screening system for automatic identification of patients with alzheimer’s disease and related dementia.Artificial Intelligence in Medicine, 143:102624, 2023

  9. [10]

    Israel Martínez-Nicolás, Thide E Llorente, Francisco Martínez-Sánchez, and Juan José G Meilán. Ten years of research on automatic voice and speech analysis of people with alzheimer’s disease and mild cognitive impairment: a systematic review article.Frontiers in Psychology, 12:620251, 2021

  10. [11]

    Classifying dementia in the presence of depression: A cross-corpus study

    Felix Braun, Maria Schuster, Florian Honig, Elmar Noeth, and Juan Rafael Orozco-Arroyave. Classifying dementia in the presence of depression: A cross-corpus study. InProceedings of INTERSPEECH 2023. ISCA, 2023

  11. [12]

    Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

    Pingping Wu, Ruihao Wang, Han Lin, Fanlong Zhang, Juan Tu, and Miao Sun. Automatic depression recognition by intelligent speech signal processing: A systematic survey.CAAI Transactions on Intelligence Technology, 8(3):701–711, 2023

  12. [13]

    Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

    Sanne Koops, Sanne G Brederoo, Janna N De Boer, Femke G Nadema, Alban E V oppel, and Iris E Sommer. Speech as a biomarker for depression.CNS & Neurological Disorders-Drug Targets-CNS & Neurological Disorders), 22(2):152–160, 2023

  13. [14]

    A review of depression and suicide risk assessment using speech analysis

    Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech communication, 71:10–49, 2015

  14. [15]

    Sung, and Philip C

    Guo-Shiang Liu, Nikola Jovanovic, Chang K. Sung, and Philip C. Doyle. A scoping review of artificial intelligence detection of voice pathology: Challenges and opportunities. Otolaryngology–Head and Neck Surgery, 171(3):658–666, 2024. 10

  15. [16]

    MVP: Multi-source V oice Pathology detection

    Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, and Elena Baralis. MVP: Multi-source V oice Pathology detection. InInterspeech 2025, pages 3548–3552, 2025

  16. [17]

    The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stuttering, activity, & mosquitoes

    Björn Schuller, Anton Batliner, Shahin Amiriparian, Christian Bergler, Maurice Gerczuk, Natalie Holz, Pauline Larrouy-Maestri, Sebastien Bayerl, Korbinian Riedhammer, Adria Mallol-Ragolta, Maria Pateraki, Harry Coppock, Ivan Kiskin, Marianne Sinka, and Stephen Roberts. The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stutte...

  17. [18]

    Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick

    Richard C. Gale, Megan Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. The post-stroke speech transcription (psst) challenge. InProceedings of the LREC 2022 RaPID-4 Workshop, pages 41–55, 2022

  18. [19]

    Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge

    Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, and Brian MacWhinney. Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge. InInterspeech 2020, pages 2172–2176, 2020

  19. [20]

    The 1st speechwellness challenge: Detecting suicide risk among adolescents

    Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, and Chao Zhang. The 1st speechwellness challenge: Detecting suicide risk among adolescents. InInterspeech 2025, pages 399–403. ISCA, 2025

  20. [21]

    Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

    Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, and Viral Nanda. Dicova challenge: Dataset, task, and baseline system for covid-19 diagnosis using acoustics, 2021

  21. [22]

    Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

    Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge, 2023

  22. [23]

    Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457

    Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, and Julie Liss. Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of INTERSPEECH 2022, pages 2453–2457. ISCA, 09 2022

  23. [24]

    Guilherme Schu, Parvaneh Janbakhshi, and Ina Kodrasi. On using the ua-speech and torgo databases to validate automatic dysarthric speech classification approaches.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2022

  24. [25]

    Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms

    Yi Zhu, Mohamed Imoussaine, Carolyn Côté-Lussier, and Tiago Falk. Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms. pages 46–54, 08 2023

  25. [26]

    Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

    Visar Berisha and Julie Liss. Responsible development of clinical speech ai: Bridging the gap between clinical research and technology.npj Digital Medicine, 7, 08 2024

  26. [27]

    Rosen, and Ryan C

    Katherine Verdolini, Clark A. Rosen, and Ryan C. Branski, editors.Classification Manual for Voice Disorders-I. Psychology Press, 1 edition, 2006

  27. [28]

    V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

    Guy Fagherazzi, Aurélie Fischer, Muhannad Ismael, and Vladimir Despotovic. V oice for health: the use of vocal biomarkers from research to clinical practice.Digital biomarkers, 5(1):78–88, 2021

  28. [29]

    Harrison, Liam D

    Jessica Robin, John E. Harrison, Liam D. Kaufman, Frank Rudzicz, William Simpson, and Maria Yancheva. Evaluation of speech-based digital biomarkers: Review and recommendations. Digital Biomarkers, 4(3):99–108, 10 2020

  29. [30]

    Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022

    Vikram Ramanarayanan, Adam Lammert, Hannah Rowe, Thomas Quatieri, and Jordan Green. Speech as a biomarker: Opportunities, interpretability, and challenges.Perspectives of the ASHA Special Interest Groups, 7:276–283, 01 2022. 11

  30. [31]

    An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

    Si-Ioi Ng, Lingfeng Xu, Ingo Siegert, Nicholas Cummins, Nina R Benway, Julie Liss, and Visar Berisha. An end-to-end overview of clinical speech ai.IEEE Transactions on Audio, Speech and Language Processing, 34:1016–1048, 2026

  31. [32]

    Opensmile: the munich versatile and fast open-source audio feature extractor

    Florian Eyben, Martin Wollmer, and Björn Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. InProceedings of the 18th ACM International Conference on Multimedia, MM ’10, page 1459–1462, New York, NY , USA, 2010. Association for Computing Machinery

  32. [33]

    Covid-19 detection via fusion of modulation spectrum and linear prediction speech features

    Yi Zhu, Abhishek Tiwari, João Monteiro, Shruti Kshirsagar, and Tiago Henrique Falk. Covid-19 detection via fusion of modulation spectrum and linear prediction speech features. IEEE/ACM transactions on audio, speech, and language processing, 31:1536–1549, 2023

  33. [34]

    Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

    Tomas Arias-Vergara, Juan Camilo Vasquez-Correa, and Juan Rafael Orozco-Arroyave. Parkinson’s disease and aging: Analysis of their effect in phonation and articulation of speech.Cognitive Computation, 9(6):731–748, 2017

  34. [35]

    Effectiveness of voice quality features in detecting depression

    Ahmed Afshan, Jian Guo, Seong Joon Park, Venkatesh Ravi, Jonathan Flint, and Abeer Alwan. Effectiveness of voice quality features in detecting depression. InProceedings of INTERSPEECH 2018, pages 1676–1680. ISCA, 2018

  35. [36]

    Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

    Sejal Bhalla, Deshang Kong, Salaar Liaqat, Daniyal Liaqat, Robert Wu, Andrea Gershon, Eyal de Lara, and Alex Mariakakis. Association of daily lung condition in copd patients with wearable speech and physiological data.Scientific reports., 15(1), 2025-12-29

  36. [37]

    Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease

    Juan Camilo Vasquez-Correa, Juan Rafael Orozco-Arroyave, Tobias Bocklet, and Elmar Noeth. Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease. Journal of Communication Disorders, 76:21–36, 2018

  37. [38]

    Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization

    Chunying Fang, Haifeng Li, Lin Ma, and Mancai Zhang. Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization. Computational and Mathematical Methods in Medicine, 2017(1):2431573, 2017

  38. [39]

    Schuller

    Nicholas Cummins, Alice Baird, and Björn W. Schuller. Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning.Methods, 151:41–54, 2018. Health Informatics and Translational Data Analytics

  39. [40]

    Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

    Qin Yang, Xin Li, Xinyun Ding, Feiyang Xu, and Zhenhua Ling. Deep learning-based speech analysis for alzheimer’s disease detection: A literature review.Alzheimer’s Research & Therapy, 14(1):186, 2022

  40. [41]

    V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals

    Rumana Islam, Esam Abdel-Raheem, and Mohammed Tarique. V oice pathology detection using convolutional neural networks with electroglottographic (egg) and speech signals. Computer Methods and Programs in Biomedicine Update, 2:100074, 2022

  41. [42]

    End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

    Harry Coppock, Alex Gaskell, Panagiotis Tzirakis, Alice Baird, Lyn Jones, and Björn Schuller. End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study.BMJ Innovations, 7(2):356–362, 2021

  42. [43]

    Automatic assessment of aphasic speech sensed by audio sensors for classification into aphasia severity levels to recommend speech therapies.Sensors, 22(18), 2022

    Herath Mudiyanselage Dhammike Piyumal Madhurajith Herath, Weraniyagoda Arachchilage Sahanaka Anuththara Weraniyagoda, Rajapakshage Thilina Madhushan Rajapaksha, Patikiri Arachchige Don Shehan Nilmantha Wijesekara, Kalupahana Liyanage Kushan Sudheera, and Peter Han Joo Chong. Automatic assessment of aphasic speech sensed by audio sensors for classification...

  43. [44]

    Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023

    Ah Young Kim, Eun Hye Jang, Seung-Hwan Lee, Kwang-Yeon Choi, Jeon Gue Park, and Hyun-Chool Shin. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach.J Med Internet Res, 25:e34474, Jan 2023. 12

  44. [45]

    Yi Zhu, Alex Mariakakis, Eyal De Lara, and Tiago H. Falk. How generalizable and interpretable are speech-based covid-19 detection systems?: A comparative analysis and new system proposal. In2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 1–5, 2022

  45. [46]

    Masked autoencoders are scalable vision learners, 2021

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021

  46. [47]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  47. [48]

    Self-supervised learning from images with a joint-embedding predictive architecture, 2023

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023

  48. [49]

    Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

  49. [50]

    Improving language understanding by generative pre-training

    Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018

  50. [51]

    Llama: Open and efficient foundation language models, 2023

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

  51. [52]

    wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

    Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

  52. [53]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021

  53. [54]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, October 2022

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Pr...

  54. [55]

    Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

    Farhad Javanmardi, Sudarsana Reddy Kadiri, and Paavo Alku. Pre-trained models for detection and severity level classification of dysarthria from speech.Speech Communication, 158:103047, 2024

  55. [56]

    Gruia, Patrick A

    Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, and Fatemeh Geranmayeh. When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. InInterspeech 2024, pages 1995–1999, 2024

  56. [57]

    Exploring self-supervised models for depressive disorder detection: A study on speech corpora

    Bubai Maji, Shazia Nasreen, Rajlakshmi Guha, Aurobinda Routray, Debabrata Majumdar, and Km Poonam. Exploring self-supervised models for depressive disorder detection: A study on speech corpora. In2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1–4, 2024

  57. [58]

    Emotion Recognition from Speech Using wav2vec 2.0 Embeddings

    Leonardo Pepino, Pablo Riera, and Luciana Ferrer. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. InInterspeech 2021, pages 3400–3404, 2021

  58. [59]

    Probing whisper for dysarthric speech in detection and assessment, 2025

    Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, and Erfan Loweimi. Probing whisper for dysarthric speech in detection and assessment, 2025. 13

  59. [60]

    Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals

    Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, and Li-Rong Dai. Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 561–565. IEEE, May 2022

  60. [61]

    Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

    Yi Zhu and Tiago Falk. Wavrx: A disease-agnostic, generalizable, and privacy-preserving speech health diagnostic model.IEEE Journal of Biomedical and Health Informatics, 29(9):6353–6365, 2025

  61. [62]

    Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila

    Sebastien Baur, Zaid Nabulsi, Wei-Hung Weng, Jake Garrison, Louis Blankemeier, Sam Fishman, Christina Chen, Sujay Kakarmath, Minyoi Maimbolwa, Nsala Sanjase, Brian Shuma, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Shruthi Prabhakara, Monde Muyoyeta, and Diego Ardila. Hear – health acoustic representations, 2024

  62. [63]

    Lin, Andy T

    Shu wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y . Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei- Cheng Tseng, Ko tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung yi Lee. SUPERB: Speech Processing Universal PERformance Benchm...

  63. [64]

    Schuller, Christian J

    Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, and Yonat...

  64. [65]

    The distress analysis interview corpus of human and computer interviews

    Jonathan Gratch, Ron Artstein, Gale Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, David Traum, Skip Rizzo, and Louis-Philippe Morency. The distress analysis interview corpus of human and computer interviews. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Mae...

  65. [66]

    Livingstone and Frank A

    Steven R. Livingstone and Frank A. Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PLOS ONE, 13(5):e0196391, 2018

  66. [67]

    Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42:335–359, 12 2008

  67. [68]

    Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge

    Saturnino Luz, Fasih Haider, Davida Fromm, Ioulietta Lazarou, Ioannis Kompatsiaris, and Brian MacWhinney. Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2023

  68. [69]

    Forbes, Davida Fromm, and Brian MacWhinney

    Margaret M. Forbes, Davida Fromm, and Brian MacWhinney. Aphasiabank: A resource for clinicians.Aphasiology, 26(11):1281–1295, 2012

  69. [70]

    The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

    Frank Rudzicz, Aravind Namasivayam, and Talya Wolff. The torgo database of acoustic and articulatory speech from speakers with dysarthria.Language Resources and Evaluation, 46:1–19, 01 2010

  70. [71]

    Huang, Kenneth Watkin, and Simone Frame

    Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman, Jon Gunderson, Thomas S. Huang, Kenneth Watkin, and Simone Frame. Dysarthric speech database for universal access research. InInterspeech 2008, pages 1741–1744, 2008

  71. [72]

    Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019

    Hagen Jaeger, Dhaval Trivedi, and Michael Stadtschnitzer. Mobile device voice recordings at king’s college london (mdvr-kcl) from both early and advanced parkinson’s disease patients and healthy controls.Zenodo, 2019. 14

  72. [73]

    Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering

    Sebastian Peter Bayerl, Alexander Wolff von Gudenberg, Florian Hönig, Elmar Noeth, and Korbinian Riedhammer. Ksof: The kassel state of fluency dataset – a therapy centered dataset of stuttering. InProceedings of the Language Resources and Evaluation Conference, pages 1780–1787, Marseille, France, June 2022. European Language Resources Association

  73. [74]

    Covid-19 sounds: A large-scale audio dataset for digital respiratory screening

    Tong Xia, Dimitris Spathis, Chloe Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han, Apinan Hasthanasombat, Erika Bondareva, Ting Dang, Andres Floto, Pietro Cicuta, and Cecilia Mascolo. Covid-19 sounds: A large-scale audio dataset for digital respiratory screening. InProceedings of the 35th Conference on Neural Information Processing Systems Datasets a...

  74. [75]

    Chandrakiran, Sahiti Nori, K

    Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, C. Chandrakiran, Sahiti Nori, K. K. Suhail, Sadhana Gonuguntla, and Murali Alagesan. Coswara: A respiratory sounds and symptoms dataset for remote screening of sars-cov-2 infection.Scientific Data, 10(1):397, 2023

  75. [76]

    Jesus, Inês Belo, Jessica Machado, and Andreia Hall

    Luis M.T. Jesus, Inês Belo, Jessica Machado, and Andreia Hall. The advanced voice function assessment databases (avfad): Tools for voice clinicians and speech research. InAdvances in Speech-language Pathology, chapter 14. IntechOpen, London, 2017

  76. [77]

    Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

    Lester Phillip Violeta, Wen-Chin Huang, and Tomoki Toda. Investigating self-supervised pretraining frameworks for pathological speech recognition, 2022

  77. [78]

    V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

    Jie Cai, Yuliang Song, Jianghao Wu, and Xiong Chen. V oice disorder classification using wav2vec 2.0 feature extraction.Journal of Voice, 2024

  78. [79]

    Scaling speech technology to 1,000+ languages, 2023

    Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages, 2023

  79. [80]

    Qwen3-TTS Technical Report

    Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

  80. [81]

    Robust speech recognition via large-scale weak supervision, 2022

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022

Showing first 80 references.