AI Summary
This study investigates the implicit encoding of speaker information within the feed-forward layers of self-supervised speech Transformers. To identify speaker-sensitive neurons, we align k-means-clustered self-supervised features with i-vectors. We discover, for the first time, that feed-forward neurons implicitly encode speaker gender and broad phoneme categories. Building on this finding, we propose a speaker-relevance-aware structured pruning strategy: retaining highly speaker-correlated neurons while removing weakly correlated ones. Experiments demonstrate that, under substantial parameter compression (up to 40% pruning), speaker verification and identification performance remains nearly intact, with equal error rate (EER) degradation of less than 0.2%. This confirms that the identified neurons serve as critical carriers of speaker representations. Our work advances understanding of internal representational mechanisms in self-supervised speech models and provides a principled, interpretable approach to model compression.
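To make the neuron-identification step concrete, here is a minimal sketch, not the paper's exact pipeline: it clusters i-vectors with k-means as proxy speaker labels, scores each feed-forward neuron with a one-way ANOVA F-statistic (a simple stand-in for the correlation analysis described above), and protects the top-scoring 60% of neurons, i.e., a 40% pruning budget. The variable names and the random placeholder data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import f_classif

# Hypothetical inputs (placeholders, not the paper's data):
#   ffn_acts  - (n_frames, n_neurons) activations of one feed-forward layer
#   i_vectors - (n_frames, ivec_dim) frame-aligned i-vectors
rng = np.random.default_rng(0)
ffn_acts = rng.standard_normal((5000, 3072))
i_vectors = rng.standard_normal((5000, 400))

# Cluster the i-vectors; each cluster acts as a proxy speaker/gender label.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(i_vectors)

# Score each neuron by how well its activation separates the clusters
# (one-way ANOVA F-statistic; higher = more speaker-sensitive).
f_scores, _ = f_classif(ffn_acts, labels)

# Protect the top 60% of neurons by speaker relevance (40% pruning budget).
keep = np.argsort(f_scores)[-int(0.6 * ffn_acts.shape[1]):]
print(f"protecting {keep.size} of {ffn_acts.shape[1]} neurons")
```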
Abstract
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can largely preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
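For the pruning step itself, the sketch below shows one way structured pruning of a Transformer feed-forward block could be applied in PyTorch while keeping a protected set of hidden neurons; the function name and the two-`nn.Linear` layer layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def prune_ffn(fc1: nn.Linear, fc2: nn.Linear, keep: torch.Tensor):
    """Structurally prune the FFN hidden dimension, keeping only `keep` indices.

    fc1 projects model_dim -> hidden_dim; fc2 projects hidden_dim -> model_dim.
    (Hypothetical helper, not from the paper.)
    """
    new_fc1 = nn.Linear(fc1.in_features, keep.numel(), bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep.numel(), fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        # Rows of fc1.weight and columns of fc2.weight index the hidden neurons.
        new_fc1.weight.copy_(fc1.weight[keep])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Usage with random stand-in scores in place of real speaker-relevance scores:
fc1, fc2 = nn.Linear(768, 3072), nn.Linear(3072, 768)
keep = torch.topk(torch.rand(3072), k=int(0.6 * 3072)).indices
fc1, fc2 = prune_ffn(fc1, fc2, keep)
```

Because whole neurons (rows/columns) are removed rather than individual weights, the pruned block stays dense and needs no sparse kernels, which is what makes this form of pruning "structured".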