Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

πŸ“… 2025-06-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study investigates the implicit encoding of speaker information within the feed-forward layers of self-supervised speech Transformers. To identify speaker-sensitive neurons, we align k-means-clustered self-supervised features with i-vectors. We discover, for the first time, that feed-forward neurons implicitly encode speaker gender and broad phoneme categories. Building on this finding, we propose a speaker-relevance-aware structured pruning strategy: retaining highly speaker-correlated neurons while removing low-correlation ones. Experiments demonstrate that, under substantial parameter compression (up to 40% pruning), speaker verification and identification performance remains nearly intact: equal error rate (EER) degradation is less than 0.2%. This confirms that the identified neurons serve as critical carriers of speaker representations. Our work advances understanding of internal representational mechanisms in self-supervised speech models and provides a principled, interpretable approach to model compression.
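The neuron-identification step described above can be sketched as scoring each feed-forward neuron by how strongly its activation tracks a frame-level speaker attribute (such as gender) derived from clustering. The sketch below is illustrative only: the array shapes, the stand-in random data, and the `neuron_label_correlation` helper are assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical setup: activations of one feed-forward layer over N frames,
# and a binary frame-level label (e.g. speaker gender) obtained by aligning
# k-means clusters of self-supervised features with i-vectors.
rng = np.random.default_rng(0)
N, D = 1000, 64                       # frames, feed-forward neurons (assumed sizes)
acts = rng.standard_normal((N, D))    # neuron activations (stand-in data)
labels = rng.integers(0, 2, size=N)   # per-frame speaker attribute (stand-in data)

def neuron_label_correlation(acts, labels):
    """Absolute Pearson correlation between each neuron's activation
    and a binary frame-level label, computed for all neurons at once."""
    y = labels - labels.mean()
    x = acts - acts.mean(axis=0)
    num = x.T @ y                                            # (D,) covariance terms
    den = np.linalg.norm(x, axis=0) * np.linalg.norm(y) + 1e-8
    return np.abs(num / den)

scores = neuron_label_correlation(acts, labels)
top_neurons = np.argsort(scores)[::-1]  # most speaker-correlated neurons first
```

High-scoring neurons are the candidates treated as carriers of speaker information; any score threshold or top-k cutoff is a design choice, not something specified here.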

πŸ“ Abstract
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
Problem

Research questions and friction points this paper is trying to address.

Identify neurons encoding speaker information in Transformers
Analyze clusters for phonetic and gender class correlations
Protect speaker-related neurons during model pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyze neurons in feed-forward layers
Use k-means clusters and i-vectors
Protect speaker-related neurons during pruning
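The pruning idea listed above can be sketched as a keep-mask over feed-forward neurons: neurons with high speaker-correlation scores are protected, the rest are removed. This is a minimal sketch under assumed inputs; the `speaker_aware_keep_mask` helper, the score values, and the 40% ratio used here are illustrative, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64
scores = rng.random(D)   # per-neuron speaker-correlation scores (stand-in data)
prune_ratio = 0.4        # fraction of neurons to remove (example value)

def speaker_aware_keep_mask(scores, prune_ratio):
    """Return a boolean mask keeping the most speaker-correlated neurons
    and marking the lowest-scoring ones for structured pruning."""
    n_keep = int(round(len(scores) * (1 - prune_ratio)))
    keep = np.zeros(len(scores), dtype=bool)
    keep[np.argsort(scores)[::-1][:n_keep]] = True
    return keep

mask = speaker_aware_keep_mask(scores, prune_ratio)
# Structured pruning would then drop the rows/columns of the feed-forward
# weight matrices corresponding to mask == False.
```

Because the mask removes whole neurons rather than individual weights, the pruned layer stays dense and no sparse kernels are needed, which is the usual appeal of structured over unstructured pruning.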
πŸ”Ž Similar Papers
No similar papers found.
Tzu-Quan Lin
National Taiwan University
Self-Supervised Learning, Spoken Language Models, Model Compression, Interpretability
Hsi-Chun Cheng
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing
Hao Tang
Centre of Speech Technology Research, University of Edinburgh, United Kingdom