What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients

Tuan Nguyen (LIA, AU); Corinne Fredouille (AU, LIA); Alain Ghio (LPL); Muriel Lalain (LPL); Virginie Woisard (UT2J, UT3, LNPL)

arXiv: 2606.24949 · v1 · pith:DAMXIVAJnew · submitted 2026-06-23 · 💻 cs.SD · cs.AI · cs.NE

Pith reviewed 2026-06-25 22:49 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.NE

keywords pathological speech assessment · speech intelligibility · wav2vec 2.0 · canonical correlation analysis · acoustic features · oral cancer · model interpretability

The pith

A Wav2Vec 2.0 model for cancer patient speech assessment encodes spectral and prosodic features more strongly than voice quality ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how much standard acoustic information appears inside the internal representations of a Wav2Vec 2.0 model trained to judge speech intelligibility in oral and oropharyngeal cancer patients. It does this by running canonical correlation analysis between the model's layer embeddings and a reference set of low-level acoustic descriptors. The results show the strongest links to spectral and prosodic descriptors, with the first MFCC coefficient standing out across every layer examined. At the group level the spectral descriptors reach 0.77 correlation and the prosodic ones reach 0.71, while voice quality descriptors reach 0.65. The work also points to practical choices for which acoustic features to use when building or interpreting such assessment systems.

Core claim

The learned representations are most strongly correlated with spectral and prosodic features, with the first MFCC coefficient yielding the highest correlations across all layers. At the group level, spectral and prosodic groups achieve correlations of 0.77 and 0.71 respectively, while voice quality reaches 0.65.

What carries the argument

Canonical correlation analysis between the model's layer-wise embeddings and eGeMAPS low-level descriptors, performed both per descriptor and by acoustic group.
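As a rough illustration of the measurement described above, the following is a minimal sketch of plain CCA on synthetic data, not the paper's pipeline. The function name `canonical_correlations` and the arrays `embeddings`, `descriptors`, and `unrelated` are hypothetical stand-ins; the sketch assumes both matrices have full column rank and many more rows (frames) than columns.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between X (n x p) and Y (n x q), largest first."""
    # Center each column so the result does not depend on offsets.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the two column spaces (assumes full column rank).
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # which are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((500, 16))   # stand-in for one layer's frame embeddings
mixing = rng.standard_normal((16, 3))
descriptors = embeddings @ mixing             # descriptors that are exactly linear in the embeddings
unrelated = rng.standard_normal((500, 3))     # descriptors independent of the embeddings

rho_linear = canonical_correlations(embeddings, descriptors)
rho_unrelated = canonical_correlations(embeddings, unrelated)
print(rho_linear, rho_unrelated)
```

Descriptors that are linear functions of the embeddings give correlations near 1, while independent descriptors give small but nonzero values. That nonzero floor is one reason a raw correlation is hard to read without a baseline.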

If this is right

Spectral and prosodic descriptors should receive priority when selecting features for pathological speech assessment models.
The model appears to rely more on frequency content and timing patterns than on voice quality measures for its judgments.
Layer-wise inspection can indicate which parts of the network capture the most usable acoustic detail for this task.
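A layer-wise ranking of the kind implied by the last point can be sketched as follows. This is an assumption-laden toy, not the authors' code: it relies on the fact that for a single descriptor the canonical correlation reduces to the multiple correlation between that descriptor and the embedding, and the names `multiple_correlation` and `rank_features_per_layer` are hypothetical.

```python
import numpy as np

def multiple_correlation(H, f):
    """Correlation between feature f (n,) and its best linear prediction from H (n x d)."""
    Hc = H - H.mean(axis=0)
    fc = f - f.mean()
    coef, *_ = np.linalg.lstsq(Hc, fc, rcond=None)
    pred = Hc @ coef
    # R = sqrt(explained sum of squares / total sum of squares)
    return float(np.sqrt((pred @ pred) / (fc @ fc)))

def rank_features_per_layer(layers, features):
    """layers: list of (n x d) arrays; features: (n x k). Rank 1 = highest correlation."""
    scores = np.array([[multiple_correlation(H, features[:, j])
                        for j in range(features.shape[1])] for H in layers])
    order = np.argsort(-scores, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return scores, ranks

rng = np.random.default_rng(1)
n = 400
layers = [rng.standard_normal((n, 8)) for _ in range(3)]   # synthetic "layers"
encoded = layers[2] @ rng.standard_normal(8) + 0.1 * rng.standard_normal(n)
noise = rng.standard_normal(n)
features = np.column_stack([encoded, noise])               # synthetic "descriptors"
scores, ranks = rank_features_per_layer(layers, features)
print(ranks)
```

In this toy setup only the last synthetic layer encodes the first feature, so that feature takes rank 1 there; in the other layers the ranking is driven by chance.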

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same correlation method could be applied to compare what different speech foundation models have learned about acoustic structure.
Feature-selection pipelines for cancer speech tasks could be guided by these correlation rankings rather than by general-purpose acoustic sets.
The emphasis on spectral and prosodic information may reflect the specific effects of oral and oropharyngeal cancer on intelligibility.

Load-bearing premise

The eGeMAPS low-level descriptors supply a sufficient and unbiased reference for the acoustic information the model actually uses.

What would settle it

A re-analysis that replaces eGeMAPS descriptors with a different acoustic feature set and finds either higher overall correlations or a reversal in which group leads.

Figures

Figures reproduced from arXiv: 2606.24949 by Tuan Nguyen (LIA, AU), Corinne Fredouille (AU, LIA), Alain Ghio (LPL), Muriel Lalain (LPL), Virginie Woisard (UT2J, UT3, LNPL).

**Figure 1.** Layer-wise PWCCA correlation between eGeMAPS LLDs and Wav2Vec 2.0 representations (rank 1 = highest correlation).
Adjacent text: … speech assessment [4, 29]. Due to its efficiency and wide adoption in the literature, eGeMAPS is employed in this work as the interpretable feature set to analyze the Wav2Vec 2.0-based assessment model. Regarding feature type, we use LLD features rather than functional features to preserve the…

**Figure 2.** Heatmap of layer-wise PWCCA correlations between eGeMAPS LLDs and Wav2Vec 2.0 representations. Features are sorted by mean correlation across layers.

**Figure 3.** PWCCA correlation between the layer 24 of Wav2Vec 2.0 and feature groups: Prosodic, Spectral, and Voice Quality.
Adjacent text: … opens future work toward extending the analysis to a wider set of features, and ultimately toward a common reference set well suited for explaining pathological speech assessment models. Overall, the layer-wise analysis demonstrates how the decisions of Wav2Vec 2.0 relate to interpretable acoust…

Original abstract

This work investigates the interpretability of a Wav2Vec 2.0based speech intelligibility assessment model for oral and oropharyngeal cancer patients through canonical correlation analysis. By measuring the correlation between the model embeddings and eGeMAPS low-level descriptors (LLDs) as an interpretable reference, we analyze how acoustic information is encoded across the model layers. The analysis is conducted at two levels: individual LLDs layer-wise, and group-level: prosodic, spectral, and voice quality. Results show that the learned representations are most strongly correlated with spectral and prosodic features, with the first MFCC coefficient yielding the highest correlations across all layers. At the group level, spectral and prosodic groups achieve correlations of 0.77 and 0.71 respectively, while voice quality reaches 0.65. Beyond model interpretability, this work also offers practical guidance on acoustic feature selection for pathological speech assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper runs CCA between Wav2Vec 2.0 layers and eGeMAPS LLDs on oral cancer speech and finds highest alignment with spectral and prosodic groups, but the reference set choice is untested.

The key point is that this work measures how Wav2Vec 2.0 embeddings for oral and oropharyngeal cancer patients line up with eGeMAPS low-level descriptors through canonical correlation analysis, reporting group correlations of 0.77 for spectral features, 0.71 for prosodic, and 0.65 for voice quality, with the first MFCC coefficient strongest across layers.

What is new is the application to this specific patient population and the layer-wise breakdown for a speech intelligibility model. The paper does a straightforward job of separating individual LLD correlations from the grouped ones and draws a practical link to feature selection for assessment tasks.

The soft spot is the reliance on eGeMAPS as the complete reference without any check against a broader or pathology-specific feature set. The concern is concrete: if formant dynamics, glottal source details, or other cancer-related cues fall outside the eGeMAPS groups, the reported numbers capture only the overlap with that subset, not the full acoustic information the model uses. The abstract gives no dataset size, no statistical testing details, and no controls for confounds, so the results stay at the level of suggestive patterns.

This paper is for researchers already working on interpretability of self-supervised speech models in clinical settings. A reader who wants concrete correlation numbers for this domain will find something usable, but anyone needing rigorous validation of the reference features will need more.

It deserves a serious referee to examine the methods section and data handling, even if revisions are likely. The core idea is reasonable and the execution appears honest, but the reference assumption needs scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper applies canonical correlation analysis (CCA) between embeddings from a Wav2Vec 2.0-based speech intelligibility assessment model and eGeMAPS low-level descriptors (LLDs) on oral/oropharyngeal cancer patient speech. It reports layer-wise correlations, finding the strongest associations with spectral and prosodic features (group-level CCA values 0.77 and 0.71), with the first MFCC coefficient highest across layers, and voice quality at 0.65; the work positions this as both interpretability analysis and guidance for acoustic feature selection in pathological speech assessment.

Significance. If the reported correlations are shown to be robust, the work provides a concrete demonstration of how self-supervised speech representations encode standard acoustic dimensions in a clinical population, which could inform both model debugging and the design of compact feature sets for intelligibility assessment. The use of CCA with an external reference set is a standard and transparent approach that avoids circularity.

major comments (2)

[Abstract] The central results (layer-wise MFCC1 dominance; group CCA values of 0.77 spectral, 0.71 prosodic, 0.65 voice quality) are presented without any information on dataset size, number of speakers or utterances, statistical testing procedures, or controls for speaker-level confounds. These omissions make it impossible to assess whether the reported correlations are reliable or generalizable, directly undermining evaluation of the interpretability claims.
[Abstract] The analysis treats the eGeMAPS LLD set (prosodic/spectral/voice-quality groups) as a sufficient reference for the acoustic information encoded by the model. No ablation against an expanded descriptor bank (e.g., formant trajectories, additional spectral moments, or glottal-source parameters) is described, so the reported correlations only quantify overlap with this particular subset and do not establish that the model has or has not captured other pathology-relevant cues.
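One form the statistical testing requested in the first comment could take is a permutation test on the first canonical correlation. The sketch below is illustrative only: the helper names are hypothetical, the data are synthetic, and it assumes rows are exchangeable, which would not hold for consecutive frames of the same utterance without shuffling at the utterance or speaker level.

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """Largest canonical correlation between X (n x p) and Y (n x q)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0])

def permutation_p_value(X, Y, n_perm=199, seed=0):
    """Observed correlation and the share of row-shuffled datasets that match or beat it."""
    rng = np.random.default_rng(seed)
    observed = first_canonical_correlation(X, Y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(X.shape[0])   # break the pairing between X and Y rows
        if first_canonical_correlation(X, Y[perm]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(42)
X = rng.standard_normal((120, 5))
Y_related = X[:, :2] + 0.5 * rng.standard_normal((120, 2))   # depends on X
Y_null = rng.standard_normal((120, 2))                       # independent of X

obs_related, p_related = permutation_p_value(X, Y_related)
obs_null, p_null = permutation_p_value(X, Y_null)
print(obs_related, p_related, obs_null, p_null)
```

The add-one correction in the returned p-value keeps it strictly positive, so the smallest reportable value is 1 / (n_perm + 1).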

minor comments (1)

[Abstract] The abstract contains the typo "Wav2Vec 2.0based" (missing space).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract requires additional details on the dataset and methods to support evaluation of the results, and we will revise it accordingly. Below we respond point by point to the major comments.

Referee: [Abstract] The central results (layer-wise MFCC1 dominance; group CCA values of 0.77 spectral, 0.71 prosodic, 0.65 voice quality) are presented without any information on dataset size, number of speakers or utterances, statistical testing procedures, or controls for speaker-level confounds. These omissions make it impossible to assess whether the reported correlations are reliable or generalizable, directly undermining evaluation of the interpretability claims.

Authors: We agree that the abstract as currently written omits key contextual information needed to evaluate the reported correlations. The full manuscript describes the dataset (speech recordings from oral and oropharyngeal cancer patients), the number of speakers and utterances, the CCA procedure, and speaker-level controls via per-speaker normalization. We will revise the abstract to include a concise statement of dataset size, speaker count, and the statistical approach, thereby addressing this concern directly. Revision: yes.

Referee: [Abstract] The analysis treats the eGeMAPS LLD set (prosodic/spectral/voice-quality groups) as a sufficient reference for the acoustic information encoded by the model. No ablation against an expanded descriptor bank (e.g., formant trajectories, additional spectral moments, or glottal-source parameters) is described, so the reported correlations only quantify overlap with this particular subset and do not establish that the model has or has not captured other pathology-relevant cues.

Authors: We acknowledge the limitation: the reported CCA values quantify overlap specifically with the eGeMAPS LLD groups and do not claim to cover all possible pathology-relevant acoustic cues. The choice of eGeMAPS follows its established use as a compact, interpretable reference set in clinical speech analysis. We will add explicit language in the abstract and discussion clarifying that the results pertain to this reference set and do not constitute an exhaustive mapping of all acoustic dimensions. Revision: partial.
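The per-speaker normalization mentioned in the first response is not specified on this page. One common reading, shown here purely as an assumption, is a z-score computed separately for each speaker so that speaker-level offsets and scales cannot drive the correlations. The function name `normalize_per_speaker` and the toy values are hypothetical.

```python
import numpy as np

def normalize_per_speaker(features, speaker_ids, eps=1e-8):
    """Z-score each feature column within each speaker's rows."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        block = features[mask]
        # eps guards against division by zero for constant columns.
        out[mask] = (block - block.mean(axis=0)) / (block.std(axis=0) + eps)
    return out

# Two toy speakers with very different offsets but the same within-speaker pattern.
feats = np.array([[1.0, 10.0], [3.0, 14.0], [5.0, 18.0],
                  [100.0, 0.0], [102.0, 2.0], [104.0, 4.0]])
ids = np.array(["a", "a", "a", "b", "b", "b"])
z = normalize_per_speaker(feats, ids)
print(z)
```

After normalization the two toy speakers become indistinguishable, which is the point: any correlation that survives reflects within-speaker variation rather than who is speaking.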

Circularity Check

0 steps flagged

No circularity: standard CCA with external reference features

The paper computes canonical correlations between Wav2Vec embeddings and the independent eGeMAPS LLD set (prosodic/spectral/voice-quality groups) at layer and group levels. This is a direct statistical measurement with no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations or ansatzes. The reported values (e.g., MFCC1 highest, group correlations 0.77/0.71/0.65) are outputs of the CCA procedure applied to external descriptors, not reductions of the inputs by construction. The analysis is self-contained against the chosen reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard statistical methods (CCA) and an established acoustic feature set (eGeMAPS) without introducing new fitted parameters, axioms beyond basic linear correlation assumptions, or invented entities.

axioms (1)

standard math: Canonical correlation analysis captures the linear relationships between model embeddings and acoustic descriptors.
Invoked throughout the analysis to quantify correlations.
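For reference, the linear assumption can be written out. The block below gives the standard definition of the first canonical correlation between an embedding matrix X and a descriptor matrix Y, followed by the usual projection-weighted aggregate (PWCCA) as commonly defined in the representational-similarity literature. The symbols are this note's own notation, and the page does not state which exact variant the paper uses.

```latex
% First canonical correlation between embeddings X and descriptors Y
\rho_1 = \max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb)
       = \max_{a,\,b}\ \frac{a^\top \Sigma_{XY}\, b}
                            {\sqrt{a^\top \Sigma_{XX}\, a}\ \sqrt{b^\top \Sigma_{YY}\, b}}

% Projection-weighted aggregate over the k canonical pairs,
% where h_i is the i-th canonical variate of X and x_j is the j-th column of X
\rho_{\mathrm{PW}} = \frac{\sum_{i=1}^{k} \alpha_i\, \rho_i}{\sum_{i=1}^{k} \alpha_i},
\qquad
\alpha_i = \sum_{j} \left| \langle h_i,\ x_j \rangle \right|
```

Because both quantities are built from linear projections, any nonlinear dependence between embeddings and descriptors is invisible to them.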

pith-pipeline@v0.9.1-grok · 5740 in / 1068 out tokens · 21808 ms · 2026-06-25T22:49:54.290893+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages

[1]

Introduction. Speech processing has a long history of development, with advances spanning text-to-speech, speech recognition, and speech translation, among others, and pathological speech assessment has increasingly benefited from these advances. Traditionally, speech and voice disorders are evaluated through human-based clinical assessment, which is inh...
[2]

To this end, we employ Canonical Correlation Analysis (CCA) [12] to bridge these two worlds

Methodology. In this paper, we address the trade-off between handcrafted features and deep learning-based representations by focusing on the interpretability of the latter. To this end, we employ Canonical Correlation Analysis (CCA) [12] to bridge these two worlds. CCA, first introduced by Hotelling in 1936 [13], is a statistical method that measures...

1936
[3]

Corpus. This work uses the French corpus C2SI [21], which contains recordings from both healthy control speakers and patients diagnosed with oral and oropharyngeal cancer (OOC)

Experimental Setup. 3.1. Corpus. This work uses the French corpus C2SI [21], which contains recordings from both healthy control speakers and patients diagnosed with oral and oropharyngeal cancer (OOC). C2SI was developed under the Carcinologic Speech Severity Index project from 2015 to 2017, initiated to address the need for objective assessment tools fo...

2015
[4]

Results. 4.1. Individual-level analysis. Figure 1 illustrates the layer-wise evolution of individual eGeMAPS LLDs in terms of their PWCCA correlation ranking with Wav2Vec 2.0 representations, where rank 1 indicates the highest correlation and rank 25 the lowest. The ranking reveals that certain features undergo considerable changes across layers, sugges...
[5]

Conclusion. This work presents an analysis of a Wav2Vec 2.0-based speech intelligibility assessment model, focused on interpretability, using eGeMAPS LLDs as an interpretable reference and the PWCCA-based approach. The results show that the model representations are most correlated with Spectral and Prosodic features, while first MFCC coefficient is th...
[6]

This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011016558 made by GENCI

Acknowledgments. This research was funded, in whole, by Chair LIAvignon, and in part, by the French National Research Agency (ANR), project OLINPIC (ANR-24-CE38-2819). This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011016558 made by GENCI. For the purpose of open access, the author has applied a CC-BY public cop...

2025
[7]

All scientific content, ideas, analyses, and conclusions are solely the work of the authors

Generative AI Use Disclosure. Generative AI tools were used for editing and polishing the manuscript. All scientific content, ideas, analyses, and conclusions are solely the work of the authors
[8]

Interpretable speech features vs. dnn embeddings: What to use in the automatic assessment of parkinson’s disease in multi-lingual scenarios,

A. Favaro, Y.-T. Tsai, A. Butala, T. Thebaud, J. Villalba, N. Dehak, and L. Moro-Velázquez, “Interpretable speech features vs. dnn embeddings: What to use in the automatic assessment of parkinson’s disease in multi-lingual scenarios,” Computers in Biology and Medicine, vol. 166, p. 107559, 2023. [Online]. Available: https://www.sciencedirect.com/scie...

2023
[9]

Prediction of speech impairment in patients treated for oral or oropharyngeal cancer using automatic speech analysis,

M. Balaguer, J. Pinquier, J. Farinas, and V. Woisard, “Prediction of speech impairment in patients treated for oral or oropharyngeal cancer using automatic speech analysis,” International Journal of Language & Communication Disorders, vol. 60, no. 5, p. e70103, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/1460-6984.70103

work page doi:10.1111/1460-6984.70103 2025
[10]

Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,

J. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Nöth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of Communication Disorders, vol. 76, pp. 21–36, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S002199241730076X

2018
[11]

Acoustic correlates of speech intelligibility: the usability of the eGeMAPS feature set for atypical speech,

W. Xue, C. Cucchiarini, R. van Hout, and H. Strik, “Acoustic correlates of speech intelligibility: the usability of the eGeMAPS feature set for atypical speech,” in 8th ISCA Workshop on Speech and Language Technology in Education (SLaTE 2019), 2019, pp. 48–52

2019
[12]

Dysarthric Speech Classification Using Glottal Features Computed from Non-words, Words and Sentences,

N. N P and P. Alku, “Dysarthric Speech Classification Using Glottal Features Computed from Non-words, Words and Sentences,” in Interspeech 2018, 2018, pp. 3403–3407

2018
[13]

Glottal flow patterns analyses for parkinson’s disease detection: Acoustic and nonlinear approaches,

E. A. Belalcázar-Bolaños, J. R. Orozco-Arroyave, J. F. Vargas-Bonilla, T. Haderlein, and E. Nöth, “Glottal flow patterns analyses for parkinson’s disease detection: Acoustic and nonlinear approaches,” in Text, Speech, and Dialogue, P. Sojka, A. Horák, I. Kopeček, and K. Pala, Eds. Cham: Springer International Publishing, 2016, pp. 400–407

2016
[14]

Prosody-based measures for automatic severity assessment of dysarthric speech,

A. Hernandez, S. Kim, and M. Chung, “Prosody-based measures for automatic severity assessment of dysarthric speech,” Applied Sciences, vol. 10, no. 19, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/19/6999

2020
[15]

Exploring pathological speech quality assessment with ASR-powered Wav2Vec2 in data-scarce context,

T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring pathological speech quality assessment with ASR-powered Wav2Vec2 in data-scarce context,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S...

2024
[16]

Towards reducing patient effort for the automatic prediction of speech intelligibility in head and neck cancers,

S. Quintas, A. Abad, J. Mauclair, V. Woisard, and J. Pinquier, “Towards reducing patient effort for the automatic prediction of speech intelligibility in head and neck cancers,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023
[17]

Multimodal assessment of parkinson’s disease: A deep learning approach,

J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, B. Eskofier, J. Klucken, and E. Nöth, “Multimodal assessment of parkinson’s disease: A deep learning approach,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 4, pp. 1618–1630, 2019

2019
[18]

Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson’s Disease,

J. Vásquez-Correa, J. R. Orozco-Arroyave, and E. Nöth, “Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson’s Disease,” in Interspeech 2017, 2017, pp. 314–318

2017
[19]

Breakthroughs in statistics: methodology and distribution,

K. Pearson, S. Kotz, and N. Johnson, “Breakthroughs in statistics: methodology and distribution,” 1992

1992
[20]

Simplified calculation of principal components,

H. Hotelling, “Simplified calculation of principal components,” Psychometrika, vol. 1, no. 1, pp. 27–35, 1936

1936
[21]

Cca based feature selection with application to continuous depression recognition from acoustic speech features,

H. Kaya, F. Eyben, A. A. Salah, and B. Schuller, “Cca based feature selection with application to continuous depression recognition from acoustic speech features,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3729–3733

2014
[22]

Multi-modality canonical feature selection for alzheimer’s disease diagnosis,

X. Zhu, H.-I. Suk, and D. Shen, “Multi-modality canonical feature selection for alzheimer’s disease diagnosis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2014, pp. 162–169

2014
[23]

Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,

M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [...

2017
[24]

Insights on representational similarity in neural networks with canonical correlation,

A. Morcos, M. Raghu, and S. Bengio, “Insights on representational similarity in neural networks with canonical correlation,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neuri...

2018
[25]

Layer-wise analysis of a self-supervised speech representation model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921

2021
[26]

Comparative layer-wise analysis of self-supervised speech models,

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023
[27]

Exploring asr-based wav2vec2 for automated speech disorder assessment: Insights and analysis,

T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring asr-based wav2vec2 for automated speech disorder assessment: Insights and analysis,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 975–982

2024
[28]

C2si corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers,

V. Woisard, C. Astésano, M. Balaguer, J. Farinas, C. Fredouille, P. Gaillard, A. Ghio, L. Giusti, I. Laaridh, M. Lalain et al., “C2si corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers,” Language Resources and Evaluation, vol. 55, no. 1, pp. 173–190, 2021

2021
[29]

Wav2vec-based detection and severity level classification of dysarthria from speech,

F. Javanmardi, S. Tirronen, M. Kodali, S. R. Kadiri, and P. Alku, “Wav2vec-based detection and severity level classification of dysarthria from speech,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[30]

Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals,

S. R. Kadiri, F. Javanmardi, and P. Alku, “Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals,” Computer Speech & Language, vol. 83, p. 101550, 2024

2024
[31]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460. [Online]. Available: https://proceedings.ne...

2020
[32]

Can we use speaker embeddings on spontaneous speech obtained from medical conversations to predict intelligibility?

S. Quintas, M. Balaguer, J. Mauclair, V. Woisard, and J. Pinquier, “Can we use speaker embeddings on spontaneous speech obtained from medical conversations to predict intelligibility?” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–7

2023
[33]

Interpretable assessment of speech intelligibility using deep learning: A case study on speech disorders due to head and neck cancers,

S. Abderrazek, C. Fredouille, A. Ghio, M. Lalain, C. Meunier, M. Balaguer, and V. Woisard, “Interpretable assessment of speech intelligibility using deep learning: A case study on speech disorders due to head and neck cancers,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

2024
[34]

The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

2016
[35]

The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism,

B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. R. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism,” in INTERSPEECH 2013 14th Annual Conference of the Int...

2013
[36]

Cross-language speech emotion recognition using bag-of-word representations, domain adaptation, and data augmentation,

S. Kshirsagar and T. H. Falk, “Cross-language speech emotion recognition using bag-of-word representations, domain adaptation, and data augmentation,” Sensors, vol. 22, no. 17, p. 6445, 2022

2022
[37]

Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs,

B. Schuller, M. Wöllmer, F. Eyben, and G. Rigoll, Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs, 01 2009

2009
[38]

Prosodic feature analysis for automatic speech assessment and individual report generation in people with down syndrome,

M. Corrales-Astorgano, C. González-Ferreras, D. Escudero-Mancebo, and V. Cardeñoso-Payo, “Prosodic feature analysis for automatic speech assessment and individual report generation in people with down syndrome,” Applied Sciences, vol. 14, no. 1, p. 293, 2023

2023
[39]

Opensmile: the munich versatile and fast open-source audio feature extractor,

F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 1459–1462. [Online]. Available: https://doi.org/10.1145/1873951.1874246

work page doi:10.1145/1873951.1874246 2010
[40]

Interpreting deep representations of phonetic features via neuro-based concept detector: Application to speech disorders due to head and neck cancer,

S. Abderrazek, C. Fredouille, A. Ghio, M. Lalain, C. Meunier, and V. Woisard, “Interpreting deep representations of phonetic features via neuro-based concept detector: Application to speech disorders due to head and neck cancer,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 200–214, 2023

2023
[41]

Automatic intelligibility assessment of dysarthric speech using glottal parameters,

N. P. Narendra and P. Alku, “Automatic intelligibility assessment of dysarthric speech using glottal parameters,” Speech Communication, vol. 123, pp. 1–9, 2020

2020

[1] [1]

Introduction Speech processing has a long history of development, with ad- vances spanning text-to-speech, speech recognition, and speech translation, among others, and pathological speech assessment has increasingly benefited from these advances. Traditionally, speech and voice disorders are evaluated through human-based clinical assessment, which is inh...

[2] [2]

To this end, we employ Canoni- cal Correlation Analysis (CCA) [12] to bridge these two worlds

Methodology In this paper, we address the trade-off between handcrafted fea- tures and deep learning-based representations by focusing on the interpretability of the latter. To this end, we employ Canoni- cal Correlation Analysis (CCA) [12] to bridge these two worlds. CCA, first introduced by Hotelling in 1936 [13], is a statis- tical method that measures...

Pith/arXiv arXiv 1936

[3] [3]

Corpus This work uses the French corpus C2SI [21], which contains recordings from both healthy control speakers and patients di- agnosed with oral and oropharyngeal cancer (OOC)

Experimental Setup 3.1. Corpus This work uses the French corpus C2SI [21], which contains recordings from both healthy control speakers and patients di- agnosed with oral and oropharyngeal cancer (OOC). C2SI was developed under the Carcinologic Speech Severity Index project from 2015 to 2017, initiated to address the need for objective assessment tools fo...

2015

[4] [4]

Results 4.1. Individual-level analysis Figure 1 illustrates the layer-wise evolution of individual eGeMAPS LLDs in terms of their PWCCA correlation rank- ing with Wav2Vec 2.0 representations, where rank 1 indicates the highest correlation and rank 25 the lowest. The ranking re- veals that certain features undergo considerable changes across layers, sugges...

[5] [5]

Conclusion This work presents an analysis of a Wav2Vec 2.0-based speech intelligibility assessment model, focused on interpretability, using eGeMAPS LLDs as an interpretable reference and the PWCCA-based approach. The results show that the model representations are most correlated with Spectral and Prosodic features, while the first MFCC coefficient is th...

[6] [6]

This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011016558 made by GENCI

Acknowledgments This research was funded, in whole, by Chair LIAvignon, and in part, by the French National Research Agency (ANR), project OLINPIC (ANR-24-CE38-2819). This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011016558 made by GENCI. For the purpose of open access, the author has applied a CC-BY public cop...

2025

[7] [7]

All scientific content, ideas, analyses, and conclusions are solely the work of the authors

Generative AI Use Disclosure Generative AI tools were used for editing and polishing the manuscript. All scientific content, ideas, analyses, and conclusions are solely the work of the authors

[8] [8]

Interpretable speech features vs. dnn embeddings: What to use in the automatic assessment of parkinson’s disease in multi-lingual scenarios,

A. Favaro, Y.-T. Tsai, A. Butala, T. Thebaud, J. Villalba, N. Dehak, and L. Moro-Velázquez, “Interpretable speech features vs. dnn embeddings: What to use in the automatic assessment of parkinson’s disease in multi-lingual scenarios,” Computers in Biology and Medicine, vol. 166, p. 107559, 2023. [Online]. Available: https://www.sciencedirect.com/scie...

2023

[9] [9]

Prediction of speech impairment in patients treated for oral or oropharyngeal cancer using automatic speech analysis,

M. Balaguer, J. Pinquier, J. Farinas, and V. Woisard, “Prediction of speech impairment in patients treated for oral or oropharyngeal cancer using automatic speech analysis,” International Journal of Language & Communication Disorders, vol. 60, no. 5, p. e70103, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/1460-6984.70103

doi:10.1111/1460-6984.70103 2025

[10] [10]

Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,

J. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Nöth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of Communication Disorders, vol. 76, pp. 21–36, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S002199241730076X

2018

[11] [11]

Acoustic correlates of speech intelligibility: the usability of the eGeMAPS feature set for atypical speech,

W. Xue, C. Cucchiarini, R. van Hout, and H. Strik, “Acoustic correlates of speech intelligibility: the usability of the eGeMAPS feature set for atypical speech,” in 8th ISCA Workshop on Speech and Language Technology in Education (SLaTE 2019), 2019, pp. 48–52

2019

[12] [12]

Dysarthric Speech Classification Using Glottal Features Computed from Non-words, Words and Sentences,

N. P. Narendra and P. Alku, “Dysarthric Speech Classification Using Glottal Features Computed from Non-words, Words and Sentences,” in Interspeech 2018, 2018, pp. 3403–3407

2018

[13] [13]

Glottal flow patterns analyses for parkinson’s disease detection: Acoustic and nonlinear approaches,

E. A. Belalcázar-Bolaños, J. R. Orozco-Arroyave, J. F. Vargas-Bonilla, T. Haderlein, and E. Nöth, “Glottal flow patterns analyses for parkinson’s disease detection: Acoustic and nonlinear approaches,” in Text, Speech, and Dialogue, P. Sojka, A. Horák, I. Kopeček, and K. Pala, Eds. Cham: Springer International Publishing, 2016, pp. 400–407

2016

[14] [14]

Prosody-based measures for automatic severity assessment of dysarthric speech,

A. Hernandez, S. Kim, and M. Chung, “Prosody-based measures for automatic severity assessment of dysarthric speech,” Applied Sciences, vol. 10, no. 19, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/19/6999

2020

[15] [15]

Exploring pathological speech quality assessment with ASR-powered Wav2Vec2 in data-scarce context,

T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring pathological speech quality assessment with ASR-powered Wav2Vec2 in data-scarce context,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S...

2024

[16] [16]

Towards reducing patient effort for the automatic prediction of speech intelligibility in head and neck cancers,

S. Quintas, A. Abad, J. Mauclair, V. Woisard, and J. Pinquier, “Towards reducing patient effort for the automatic prediction of speech intelligibility in head and neck cancers,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023

[17] [17]

Multimodal assessment of parkinson’s disease: A deep learning approach,

J. C. Vásquez-Correa, T. Arias-Vergara, J. R. Orozco-Arroyave, B. Eskofier, J. Klucken, and E. Nöth, “Multimodal assessment of parkinson’s disease: A deep learning approach,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 4, pp. 1618–1630, 2019

2019

[18] [18]

Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson’s Disease,

J. Vásquez-Correa, J. R. Orozco-Arroyave, and E. Nöth, “Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson’s Disease,” in Interspeech 2017, 2017, pp. 314–318

2017

[19] [19]

Breakthroughs in statistics: methodology and distribution,

K. Pearson, S. Kotz, and N. Johnson, “Breakthroughs in statistics: methodology and distribution,” 1992

1992

[20] [20]

Simplified calculation of principal components,

H. Hotelling, “Simplified calculation of principal components,” Psychometrika, vol. 1, no. 1, pp. 27–35, 1936

1936

[21] [21]

Cca based feature selection with application to continuous depression recognition from acoustic speech features,

H. Kaya, F. Eyben, A. A. Salah, and B. Schuller, “Cca based feature selection with application to continuous depression recognition from acoustic speech features,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3729–3733

2014

[22] [22]

Multi-modality canonical feature selection for alzheimer’s disease diagnosis,

X. Zhu, H.-I. Suk, and D. Shen, “Multi-modality canonical feature selection for alzheimer’s disease diagnosis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2014, pp. 162–169

2014

[23] [23]

Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,

M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [...

2017

[24] [24]

Insights on representational similarity in neural networks with canonical correlation,

A. Morcos, M. Raghu, and S. Bengio, “Insights on representational similarity in neural networks with canonical correlation,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neuri...

2018

[25] [25]

Layer-wise analysis of a self-supervised speech representation model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921

2021

[26] [26]

Comparative layer-wise analysis of self-supervised speech models,

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023

[27] [27]

Exploring asr-based wav2vec2 for automated speech disorder assessment: Insights and analysis,

T. Nguyen, C. Fredouille, A. Ghio, M. Balaguer, and V. Woisard, “Exploring asr-based wav2vec2 for automated speech disorder assessment: Insights and analysis,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 975–982

2024

[28] [28]

C2si corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers,

V. Woisard, C. Astésano, M. Balaguer, J. Farinas, C. Fredouille, P. Gaillard, A. Ghio, L. Giusti, I. Laaridh, M. Lalain et al., “C2si corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers,” Language Resources and Evaluation, vol. 55, no. 1, pp. 173–190, 2021

2021

[29] [29]

Wav2vec-based detection and severity level classification of dysarthria from speech,

F. Javanmardi, S. Tirronen, M. Kodali, S. R. Kadiri, and P. Alku, “Wav2vec-based detection and severity level classification of dysarthria from speech,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[30] [30]

Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals,

S. R. Kadiri, F. Javanmardi, and P. Alku, “Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals,” Computer Speech & Language, vol. 83, p. 101550, 2024

2024

[31] [31]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460. [Online]. Available: https://proceedings.ne...

2020

[32] [32]

Can we use speaker embeddings on spontaneous speech obtained from medical conversations to predict intelligibility?

S. Quintas, M. Balaguer, J. Mauclair, V. Woisard, and J. Pinquier, “Can we use speaker embeddings on spontaneous speech obtained from medical conversations to predict intelligibility?” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–7

2023

[33] [33]

Interpretable assessment of speech intelligibility using deep learning: A case study on speech disorders due to head and neck cancers,

S. Abderrazek, C. Fredouille, A. Ghio, M. Lalain, C. Meunier, M. Balaguer, and V. Woisard, “Interpretable assessment of speech intelligibility using deep learning: A case study on speech disorders due to head and neck cancers,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

2024

[34] [34]

The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

2016

[35] [35]

The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism,

B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. R. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism,” in INTERSPEECH 2013, 14th Annual Conference of the Int...

2013

[36] [36]

Cross-language speech emotion recognition using bag-of-word representations, domain adaptation, and data augmentation,

S. Kshirsagar and T. H. Falk, “Cross-language speech emotion recognition using bag-of-word representations, domain adaptation, and data augmentation,” Sensors, vol. 22, no. 17, p. 6445, 2022

2022

[37] [37]

Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs,

B. Schuller, M. Wöllmer, F. Eyben, and G. Rigoll, Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs, 01 2009

2009

[38] [38]

Prosodic feature analysis for automatic speech assessment and individual report generation in people with down syndrome,

M. Corrales-Astorgano, C. González-Ferreras, D. Escudero-Mancebo, and V. Cardeñoso-Payo, “Prosodic feature analysis for automatic speech assessment and individual report generation in people with down syndrome,” Applied Sciences, vol. 14, no. 1, p. 293, 2023

2023

[39] [39]

Opensmile: the munich versatile and fast open-source audio feature extractor,

F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM ’10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 1459–1462. [Online]. Available: https://doi.org/10.1145/1873951.1874246

doi:10.1145/1873951.1874246 2010

[40] [40]

Interpreting deep representations of phonetic features via neuro-based concept detector: Application to speech disorders due to head and neck cancer,

S. Abderrazek, C. Fredouille, A. Ghio, M. Lalain, C. Meunier, and V. Woisard, “Interpreting deep representations of phonetic features via neuro-based concept detector: Application to speech disorders due to head and neck cancer,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 200–214, 2023

2023

[41] [41]

Automatic intelligibility assessment of dysarthric speech using glottal parameters,

N. P. Narendra and P. Alku, “Automatic intelligibility assessment of dysarthric speech using glottal parameters,” Speech Communication, vol. 123, pp. 1–9, 2020

2020