Pretrained self-supervised speech models can recognize unseen consonants

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

It remains unclear whether self-supervised speech models can effectively recognize rare phonemes—such as the click consonants found in Khoisan languages—under low-resource conditions. This study addresses this gap by fine-tuning Wav2Vec2 and HuBERT on G|ui and West !Xoon, two Khoisan languages rich in click consonants. The results demonstrate, for the first time, that pretrained self-supervised models can generalize to clicks without explicit exposure during pretraining, challenging the assumption that data skew inevitably leads to representational gaps. Notably, after fine-tuning, the models achieve significantly higher accuracy on click consonants compared to other consonant classes, revealing a remarkable cross-phonemic generalization capacity inherent in self-supervised learning frameworks.

📝 Abstract

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.

Problem

Research questions and friction points this paper is trying to address.

self-supervised speech models

click consonants

low-resource languages

speech recognition

phoneme representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised speech models

click consonants

low-resource languages