Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0

Dominik Wagner; Korbinian Riedhammer; Natalie Engert; Tobias Bocklet

arxiv: 2604.21628 · v1 · submitted 2026-04-23 · 💻 cs.SD

Natalie Engert, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

Pith reviewed 2026-05-08 13:33 UTC · model grok-4.3

classification 💻 cs.SD

keywords dysarthria · wav2vec 2.0 · speech descriptors · layer-wise aggregation · time-wise modeling · intelligibility · pathological speech analysis

The pith

Wav2vec 2.0 layer-wise features best predict intelligibility in dysarthric speech, while time-wise features suit imprecise consonants, harsh voice and monoloudness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests which parts of wav2vec 2.0's speech representations are most useful for predicting different problems in dysarthric speech from the Speech Accessibility Project. It compares pulling features from different layers of the model against pulling features across the time dimension of the audio. Results indicate that overall speech understandability is best predicted by layer information, pointing to how the model builds up higher-level understanding in its deeper layers. Problems with consonant precision, voice harshness, and loudness consistency improve when using time-based features, which may pick up on local timing and variation issues. Inappropriate silences did not favor one method over the other.

Core claim

Using attentive statistics pooling on wav2vec 2.0 features extracted from dysarthric speech, layer-wise aggregation yields the best regression performance for intelligibility scores, whereas time-wise aggregation performs better for imprecise consonants, harsh voice, and monoloudness. No advantage is found for either strategy when predicting inappropriate silences.

What carries the argument

Attentive statistics pooling applied either layer-wise or time-wise to the representations from a wav2vec 2.0 feature extractor.
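
As a rough illustration of the mechanism named above, the following dependency-free sketch applies single-head attentive statistics pooling to a toy feature tensor indexed as `feats[layer][time][dim]`. The scoring vector `w`, the helper names, and the choice to collapse the non-pooled axis by a plain mean are all assumptions made for this example; the reviewed text does not specify how the paper arranges the two variants.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats_pool(vectors, w):
    """Single-head attentive statistics pooling over a sequence of vectors.

    vectors: list of N vectors (each a list of D floats) along the pooled axis.
    w: hypothetical scoring vector of length D (an assumption; real
       implementations usually score each vector with a small network).
    Returns the attention-weighted mean and standard deviation, concatenated.
    """
    alpha = softmax([sum(wi * xi for wi, xi in zip(w, x)) for x in vectors])
    dim = len(vectors[0])
    mean = [sum(a * x[d] for a, x in zip(alpha, vectors)) for d in range(dim)]
    var = [sum(a * (x[d] - mean[d]) ** 2 for a, x in zip(alpha, vectors))
           for d in range(dim)]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std

def mean_over(vectors):
    """Plain (unweighted) mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def layer_wise(feats, w):
    """Average over time, then attend over layers (assumed arrangement)."""
    per_layer = [mean_over(layer) for layer in feats]
    return attentive_stats_pool(per_layer, w)

def time_wise(feats, w):
    """Average over layers, then attend over time frames (assumed arrangement)."""
    n_frames = len(feats[0])
    per_frame = [mean_over([layer[t] for layer in feats]) for t in range(n_frames)]
    return attentive_stats_pool(per_frame, w)

# Toy features: 2 layers, 3 frames, 2 dimensions (illustrative values only).
feats = [
    [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
]
w = [0.0, 0.0]  # zero scores give uniform attention, i.e. plain mean and std
layer_vec = layer_wise(feats, w)
time_vec = time_wise(feats, w)
```

Both variants return a vector of length 2·D that a regression head could consume, so the only thing that differs between them is the axis the attention weights range over.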

Load-bearing premise

The layer-wise and time-wise aggregation strategies using attentive statistics pooling are able to isolate the most predictive cues for each descriptor without substantial loss of relevant information from the wav2vec 2.0 representations.

What would settle it

A replication study on an independent dysarthric speech corpus that finds time-wise aggregation superior for intelligibility prediction would falsify the claim that layer-wise representations are best for that descriptor.

Original abstract

Wav2vec 2.0 (W2V2) has shown strong performance in pathological speech analysis by effectively capturing the characteristics of atypical speech. Despite its success, it remains unclear which components of its learned representations are most informative for specific downstream tasks. In this study, we address this question by investigating the regression of dysarthric speech descriptors using annotations from the Speech Accessibility Project dataset. We focus on five descriptors, each addressing a different aspect of speech or voice production: intelligibility, imprecise consonants, inappropriate silences, harsh voice and monoloudness. Speech representations are derived from a W2V2-based feature extractor, and we systematically compare layer-wise and time-wise aggregation strategies using attentive statistics pooling. Our results show that intelligibility is best captured through layer-wise representations, whereas imprecise consonants, harsh voice and monoloudness benefit from time-wise modeling. For inappropriate silences, no clear advantage could be observed for either approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows layer-wise pooling from wav2vec 2.0 predicts intelligibility better while time-wise helps more with imprecise consonants, harsh voice, and monoloudness on the Speech Accessibility Project data.

The core finding is straightforward: when you hold the pooling method fixed and vary only whether you aggregate across layers or across time, the best choice flips depending on the descriptor. Intelligibility favors the layer-wise route, while imprecise consonants, harsh voice, and monoloudness favor time-wise; inappropriate silences show no clear winner. That split is the new piece, and it comes from a direct head-to-head on the same attentive statistics pooling applied to the same W2V2 extractor and the same five annotations.

Referee Report

2 major / 3 minor

Summary. The manuscript investigates the location of predictive information in wav2vec 2.0 representations for regressing five dysarthric speech descriptors (intelligibility, imprecise consonants, inappropriate silences, harsh voice, monoloudness) from the Speech Accessibility Project dataset. It compares layer-wise versus time-wise aggregation, both using attentive statistics pooling, and reports that layer-wise aggregation is superior for intelligibility while time-wise is better for imprecise consonants, harsh voice and monoloudness, with no clear advantage for inappropriate silences.

Significance. If the empirical rankings hold under proper statistical scrutiny, the work offers a useful descriptive map of where W2V2 encodes different aspects of dysarthria. The isolation of the aggregation dimension (layer vs. time) on a public dataset is a strength that supports reproducibility and can inform feature-extraction choices in pathological speech modeling.

major comments (2)

[Results] Results section: the abstract and reported directional findings lack any mention of statistical tests, p-values, confidence intervals, error bars, dataset size, or cross-validation procedure. Without these, it is impossible to determine whether the claimed advantages (e.g., layer-wise for intelligibility) are statistically reliable or could arise from sampling variability.
[§3] §3 (Methods): the claim that the comparison isolates the layer-vs-time dimension rests on the assumption that attentive statistics pooling introduces no differential bias across the two aggregation axes. No ablation with alternative pooling (mean, max, or learned attention variants) is described, leaving open the possibility that the observed preferences are pooling-specific rather than dimension-specific.
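
One way the confidence intervals requested in the first major comment could be obtained is a percentile bootstrap over paired per-utterance errors. The sketch below is only an illustration: the function name, the ratings, and both prediction lists are hypothetical, and resampling utterances independently ignores speaker clustering, which a real analysis of this dataset would need to handle.

```python
import random

def bootstrap_ci_mse_diff(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for MSE(a) - MSE(b) on paired predictions.

    Utterances are resampled with replacement. An interval that lies entirely
    below zero suggests system a has the lower error on this sample.
    """
    rng = random.Random(seed)
    # Per-utterance difference in squared error between the two systems.
    diffs = [(a - y) ** 2 - (b - y) ** 2 for y, a, b in zip(y_true, pred_a, pred_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical ratings and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 3.0, 4.0]
layer_pred = [1.1, 2.1, 2.9, 4.1, 4.9, 2.1, 3.1, 3.9]  # small errors
time_pred = [1.5, 2.6, 2.4, 4.5, 4.4, 2.5, 3.6, 3.4]   # larger errors
lo, hi = bootstrap_ci_mse_diff(y_true, layer_pred, time_pred)
```

In this toy case every per-utterance difference is negative, so the interval cannot include zero; with real data the interesting question is whether it does.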

minor comments (3)

[§2] Provide the exact number of utterances and speakers used for each descriptor, along with any speaker-independent partitioning details.
[§3] Clarify whether the same W2V2 checkpoint and fine-tuning regime is used for all five descriptors or whether descriptor-specific adaptation occurs.
[Results] Table or figure presenting the regression metrics should include both the primary metric and a secondary one (e.g., Pearson r alongside MSE) to allow readers to judge effect sizes.
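
The point of the last minor comment, that MSE and Pearson r answer different questions, can be seen in a small stdlib-only sketch. The rating values are made up for illustration and the helper names are not taken from the paper.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between reference ratings and predictions."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

ratings = [1.0, 2.0, 3.0, 4.0]   # hypothetical annotator scores
biased = [2.0, 3.0, 4.0, 5.0]    # right ordering, constant offset of +1

error = mse(ratings, biased)
corr = pearson_r(ratings, biased)
```

Here the predictions track the ratings perfectly in rank and spacing, so the correlation is essentially 1, while the constant offset still yields an MSE of 1.0. Reporting both makes such a systematic bias visible.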

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below, along with our plans for revision.

Referee: [Results] Results section: the abstract and reported directional findings lack any mention of statistical tests, p-values, confidence intervals, error bars, dataset size, or cross-validation procedure. Without these, it is impossible to determine whether the claimed advantages (e.g., layer-wise for intelligibility) are statistically reliable or could arise from sampling variability.

Authors: We fully agree that the absence of statistical tests and related details weakens the interpretability of our results. Although the experiments were conducted using cross-validation on the Speech Accessibility Project dataset (with speaker-independent folds to avoid data leakage), we did not report p-values, confidence intervals, or error bars in the submitted version. In the revised manuscript, we will include these elements: dataset sizes per descriptor, a detailed description of the cross-validation procedure, error bars on the performance plots, and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing layer-wise and time-wise aggregations. This will allow readers to assess the reliability of the directional findings.

revision: yes

Referee: [§3] §3 (Methods): the claim that the comparison isolates the layer-vs-time dimension rests on the assumption that attentive statistics pooling introduces no differential bias across the two aggregation axes. No ablation with alternative pooling (mean, max, or learned attention variants) is described, leaving open the possibility that the observed preferences are pooling-specific rather than dimension-specific.

Authors: This is a valid concern. Our experimental design holds the pooling function constant (attentive statistics pooling) across both aggregation strategies to focus on the layer versus time dimension. We selected this pooling method because it has proven effective in prior speech processing work for capturing relevant statistics. Nevertheless, we acknowledge that the results may depend on the choice of pooling. In the revision, we will expand the discussion in §3 and the conclusion to explicitly state this limitation and recommend that future studies explore alternative pooling methods (such as simple mean pooling or max pooling) to test the generalizability of the layer/time preferences. We believe this addresses the core issue without requiring extensive new experiments at this stage.

revision: partial
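
The Wilcoxon test mentioned in the first response can be sketched exactly for a handful of cross-validation folds by enumerating every sign pattern. This is a minimal illustration that assumes no zero and no tied differences; the per-fold MSE values and the function name are hypothetical and do not come from the paper.

```python
from itertools import product

def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences, no tied absolute differences, and a small
    number of pairs, since all 2**n sign patterns are enumerated.
    Returns (W, p) with W = min(W+, W-).
    """
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis each rank is positive or negative with
    # equal probability, so every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w_obs:
            hits += 1
    return w_obs, hits / 2 ** len(ranks)

# Hypothetical per-fold MSEs for one descriptor (illustrative numbers only).
layer_mse = [0.40, 0.42, 0.39, 0.45, 0.41]
time_mse = [0.47, 0.46, 0.48, 0.50, 0.43]
w_stat, p_value = wilcoxon_signed_rank_exact(layer_mse, time_mse)
```

Even when one strategy wins on all five folds, the smallest attainable two-sided p-value is 2/32 = 0.0625, so fold-level tests with few folds cannot reach the conventional 0.05 threshold. That is an argument for testing at the utterance or speaker level, or for using more folds.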

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical study that extracts representations from a pre-trained wav2vec 2.0 model, applies two standard aggregation strategies (layer-wise vs. time-wise with attentive statistics pooling), and regresses five dysarthric speech descriptors on the public Speech Accessibility Project annotations. No equations, derivations, or parameter-fitting steps are described that would reduce any reported result to its own inputs by construction. The central claims are comparative performance rankings obtained from the experimental design itself; these rankings are externally falsifiable on the named dataset and do not rely on self-citations, uniqueness theorems, or ansatzes imported from prior author work. The work therefore contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only, the paper relies on standard assumptions from self-supervised speech modeling literature without detailing new free parameters or invented entities; attentive statistics pooling is invoked as the aggregation method.

axioms (1)

domain assumption: Attentive statistics pooling is a suitable method for aggregating W2V2 representations to capture predictive cues for dysarthric descriptors.
Used as the core aggregation strategy in the study but not justified or compared to alternatives in the abstract.

pith-pipeline@v0.9.0 · 5478 in / 1366 out tokens · 37646 ms · 2026-05-08T13:33:16.573021+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] INTRODUCTION: Pretrained wav2vec 2.0 (W2V2) [1] models have become a popular backbone architecture for various pathological speech processing tasks [2, 3, 4, 5]. Beyond linguistic and speaker-related features, W2V2 embeddings capture paralinguistic properties such as voice quality, prosody and speaking style, making them well-suited for clinical applicatio...
[2] DATA: The Speech Accessibility Project [21] (SAP) dataset aims to facilitate research and technological development in dysarthric speech recognition. We use the 2024-11-30 release, containing recordings from 430 participants with Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS), cerebral palsy, stroke, or Down syndrome. These conditions are...
[3] METHODS: For the regression tasks, we utilize a W2V2-based feature extractor, followed by either a layer-wise or a time-wise attentive pooling mechanism. The extracted features are subsequently fed into a regression head, implemented as a standard fully connected feedforward neural network with a ReLU activation function in the hidden layers. The out...
[4] EXPERIMENTS: All experiments were formulated as regression tasks targeting the speech descriptors intelligibility, harsh voice, monoloudness, inappropriate silences and imprecise consonants. Besides the ASP methods described earlier, mean pooling without attention is used as a baseline pooling mechanism on the mean layer representations (Exp. 1)
[5] and layer 12 representations (Exp. 2). All models were trained using the Adam [27] optimizer with β1 = 0.9 and β2 = 0.999, early stopping after 15 epochs, a fixed learning rate of 10^-5, and a batch size of 32. For each model configuration, attention heads ah ∈ {1, 5, 64, 128} were evaluated. Model performance was assessed using mean squared error (MSE) and t...
[6] RESULTS AND DISCUSSION: Table 2 provides an overview of the experimental results comparing time- and layer-wise ASP mechanisms across various attention head configurations. Overall, the comparison of ASP models with the baseline (Exp. 1-2) demonstrates that the best performance is consistently achieved using an ASP method, with MSE values significantly l...
[7] CONCLUSION: In this work, we investigated the impact of layer-wise and time-wise ASP mechanisms using features extracted from a W2V2-based model. Our results indicate that intelligibility is more effectively modeled with layer-wise representations, whereas imprecise consonants, harsh voice and monoloudness are better captured with time-wise modeling. For inapp...
[8] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460
[9] S. Hu, X. Xie, Z. Jin, M. Geng, Y. Wang, M. Cui, J. Deng, X. Liu, and H. Meng, “Exploring self-supervised pre-trained ASR models for dysarthric and elderly speech recognition,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5
[10] M. Amiri and I. Kodrasi, “Adversarial robustness analysis in automatic pathological speech detection approaches,” in Interspeech 2024, 2024, pp. 1415–1419
[11] S. P. Bayerl, G. Roccabruna, S. A. Chowdhury, T. Ciulli, M. Danieli, K. Riedhammer, and G. Riccardi, “What can speech and language tell us about the working alliance in psychotherapy,” in Interspeech 2022, 2022, pp. 2443–2447
[12] D. Wagner, I. Baumann, F. Braun, S. P. Bayerl, E. Nöth, K. Riedhammer, and T. Bocklet, “Multi-class detection of pathological speech with latent features: How does it perform on unseen data?,” in Interspeech 2023, 2023, pp. 2318–2322
[13] T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring pathological speech quality assessment with ASR-powered Wav2Vec2 in data-scarce context,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 6935–6944
[14] Y. Getman, R. Al-Ghezi, K. Voskoboinik, T. Grósz, M. Kurimo, G. Salvi, T. Svendsen, and S. Strömbergsson, “wav2vec2-based speech rating system for children with speech sound disorder,” in Interspeech 2022, 2022, pp. 3618–3622
[15] I. Baumann, D. Wagner, M. Schuster, E. Nöth, and T. Bocklet, “Towards interpretability of automatic phoneme analysis in cleft lip and palate speech,” in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12602–12606
[16] M. K. Baskar, T. Herzig, D. Nguyen, M. Diez, T. Polzehl, L. Burget, and J. Černocký, “Speaker adaptation for wav2vec2 based dysarthric ASR,” in Interspeech 2022, 2022, pp. 3403–3407
[17] F. Javanmardi, S. Tirronen, M. Kodali, S. R. Kadiri, and P. Alku, “Wav2vec-based detection and severity level classification of dysarthria from speech,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5
[18] J. Narain, V. Kowtha, C. Lea, L. Tooley, D. Yee, V. Mitra, Z. Huang, M. Espi Marques, J. Huang, C. Avendano, and S. Ren, “Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect,” in Interspeech 2025, 2025, pp. 4628–4632
[19] Z. Fan, M. Li, S. Zhou, and B. Xu, “Exploring wav2vec 2.0 on speaker verification and language identification,” in Interspeech 2021, 2021, pp. 1509–1513
[20] D. Wagner, S. P. Bayerl, I. Baumann, E. Nöth, K. Riedhammer, and T. Bocklet, “Large language models for dysfluency detection in stuttered speech,” in Interspeech 2024, 2024, pp. 5118–5122
[21] D. Wagner, I. Baumann, N. Engert, S. Lee, E. Nöth, K. Riedhammer, and T. Bocklet, “Personalized fine-tuning with controllable synthetic speech from LLM-generated transcripts for dysarthric speech recognition,” in Interspeech 2025, 2025, pp. 3294–3298
[22] D. A. Wiepert, R. L. Utianski, J. R. Duffy, J. L. Stricker, L. R. Barnard, D. T. Jones, and H. Botha, “Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction,” in Interspeech 2024, 2024, pp. 4618–4622
[23] A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921
[24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210
[25] T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring ASR-based wav2vec2 for automated speech disorder assessment: Insights and analysis,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 975–982
[26] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. S. Huang, K. Watkin, and S. Frame, “Dysarthric speech database for universal access research,” in Interspeech 2008, 2008, pp. 1741–1744
[27] I. Baumann, D. Wagner, M. Schuster, K. Riedhammer, E. Nöth, and T. Bocklet, “Towards self-attention understanding for automatic articulatory processes analysis in cleft lip and palate speech,” in Interspeech 2024, 2024, pp. 2430–2434
[28] M. Hasegawa-Johnson, X. Zheng, H. Kim, C. Mendes, M. Dickinson, E. Hege, C. Zwilling, M. M. Channell, L. Mattie, H. Hodges, L. Ramig, M. Bellard, M. Shebanek, L. Sariota, K. Kalgaonkar, D. Frerichs, J. P. Bigham, L. Findlater, C. Lea, S. Herrlinger, P. Korn, S. Abou-Zahra, R. Heywood, K. Tomanek, and B. MacDonald, “Community-supported shared infrastructure...
[29] F. L. Darley, A. E. Aronson, and J. R. Brown, “Differential diagnostic patterns of dysarthria,” Journal of Speech, Language, and Hearing Research, vol. 12, no. 2, pp. 246–269, 1969
[30] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222
[31] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech 2020, 2020, pp. 3830–3834
[32] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. De Mori, and Y. Bengio, “Speechbrain: A general-purpose speech toolkit,” 2021
[33] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015
