Speaker Group Encoding in Self-supervised Speech Recognition Models

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how self-supervised speech models (S3Ms) encode speaker demographic attributes (gender, age, dialect, ethnicity, and native-speaker status) and how different fine-tuning strategies affect how much of that information is retained or amplified. It compares pretrained S3Ms with models fine-tuned for speaker identification (SID), for automatic speech recognition (ASR), and for ASR with a fairness-enhancing algorithm, using layer-wise representation probing and analysis of embedding subdimensions to characterize where demographic information is encoded. The findings indicate that SID fine-tuning amplifies speaker-group information whose variation is mainly phonetic but not information whose variation is mainly semantic, whereas ASR fine-tuning discards the phonetically variant information and retains the semantically variant information. Fairness-oriented ASR algorithms change how strongly speaker-group information is encoded, mostly for phonetically variant categories and less so for semantically variant ones, which the authors discuss as a basis for designing fairer ASR systems.
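The layer-wise probing idea mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the embeddings are synthetic stand-ins generated by a hypothetical `synthetic_layer` helper, the speaker-group label is binary, and the probe is a closed-form ridge classifier chosen only to keep the example dependency-free. In practice each "layer" would be the utterance-level embeddings extracted from one layer of an S3M.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-3):
    """Fit a ridge-regression probe onto +/-1 targets; returns weights incl. bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    t = np.where(y == 1, 1.0, -1.0)
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ t)

def probe_accuracy(w, X, y):
    """Held-out accuracy of the linear probe (sign of the score decides the class)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = (Xb @ w > 0).astype(int)
    return float((pred == y).mean())

def synthetic_layer(rng, y, dim, signal):
    """Hypothetical stand-in for one layer's embeddings (assumption, not real data).

    `signal` controls how strongly the group label is encoded along dimension 0.
    """
    direction = np.zeros(dim)
    direction[0] = 1.0
    noise = rng.normal(size=(y.size, dim))
    return noise + signal * np.where(y == 1, 1.0, -1.0)[:, None] * direction

def layerwise_probe(signals, n=400, dim=16, seed=0):
    """Train one probe per layer on the first half, score it on the second half."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    split = n // 2
    accs = []
    for s in signals:
        X = synthetic_layer(rng, y, dim, s)
        w = fit_linear_probe(X[:split], y[:split])
        accs.append(probe_accuracy(w, X[split:], y[split:]))
    return accs

# Three simulated layers with increasing amounts of group information.
accs = layerwise_probe([0.0, 1.0, 3.0])
for layer, acc in enumerate(accs):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

A probe accuracy near chance suggests the layer carries little linearly decodable group information, while higher accuracy suggests more; comparing such curves across pretrained, SID-finetuned, and ASR-finetuned checkpoints is the kind of contrast the summary describes.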

📝 Abstract

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including speakers' gender, age, dialect, ethnicity, and whether they are native speakers. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, but does not amplify other SGCs, namely those whose variance is more semantic in nature. Finetuning for ASR, on the other hand, discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed to improve fairness change the extent to which SGI is encoded in S3Ms; however, this holds primarily for phonetically variant SGCs and less so for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.
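One simple way to look for embedding subdimensions that carry a given SGC is to score each dimension by how well it separates two groups. The sketch below is an illustration under stated assumptions, not the method used in the paper: the data are synthetic, the label is binary, the informative dimensions (2 and 7) are planted by hand, and the score is a Fisher-style ratio (squared mean gap over summed variance) chosen for simplicity.

```python
import numpy as np

def dimension_scores(X, y):
    """Per-dimension separation: squared gap between group means over summed variance."""
    a, b = X[y == 0], X[y == 1]
    gap = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    pooled = a.var(axis=0) + b.var(axis=0) + 1e-12   # small constant avoids divide-by-zero
    return gap / pooled

def top_dimensions(X, y, k):
    """Indices of the k highest-scoring dimensions, best first."""
    scores = dimension_scores(X, y)
    return np.argsort(scores)[::-1][:k].tolist()

# Synthetic embeddings (assumption): 12 dimensions of noise, with group
# information planted in dimensions 2 and 7 only.
rng = np.random.default_rng(0)
n, dim = 500, 12
y = np.arange(n) % 2                      # balanced binary group label
X = rng.normal(size=(n, dim))
informative = [2, 7]
X[:, informative] += 2.5 * y[:, None]     # shift group 1 along the planted dimensions

top = top_dimensions(X, y, k=2)
print("highest-scoring dimensions:", sorted(top))
```

On real S3M embeddings the same ranking could be computed per layer and per SGC, and the selected dimensions checked by ablating them and re-running a probe; a univariate score like this one misses information that is only decodable from combinations of dimensions.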

Problem

Research questions and friction points this paper is trying to address.

self-supervised speech recognition

speaker group encoding

fairness in ASR

speaker identification

speech representation
