Pretrained self-supervised speech models can recognize unseen consonants

Chihiro Taguchi; David Chiang; Emily Prud'hommeaux; Éric Le Ferrand; Hirosi Nakagawa; Hitomi Ono; Kanji Kato

arxiv: 2606.11542 · v1 · pith:7QLXYOUZnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI

keywords speech · models · data · languages · consonants · pretrained · recognize · self-supervised

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.
