🤖 AI Summary
To address the high parameter count and computational cost of large-scale self-supervised pre-trained models (PTMs) in speaker verification (SV), this paper proposes a co-designed framework for efficient fine-tuning and compression. It uses w2v-BERT 2.0 to extract speech representations, integrates a Multi-layer Feature Aggregation (MFA) structure with Layer Adapters to fuse hierarchical features, applies Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, and performs knowledge-distillation-guided structured pruning. The framework removes 80% of the parameters with near-lossless performance: on the VoxCeleb1-O and VoxCeleb1-H test sets it attains EERs of 0.12% and 0.55%, respectively, and pruning increases EER by only 0.04 percentage points, significantly outperforming prior state-of-the-art methods. This work demonstrates a principled accuracy-efficiency trade-off in SV, enabling scalable deployment of large PTMs without compromising verification performance.
📝 Abstract
Large-scale self-supervised Pre-Trained Models (PTMs) have yielded significant improvements in the speaker verification (SV) task by providing rich feature representations. In this paper, we utilize w2v-BERT 2.0, a model with approximately 600 million parameters trained on 4.5 million hours of unlabeled audio spanning 143 languages, for the SV task. The MFA structure with Layer Adapters is employed to process the multi-layer feature outputs of the PTM and extract speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results with 0.12% and 0.55% EER on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge-distillation-guided structured pruning, reducing the model size by 80% while incurring only a 0.04% EER degradation. Source code and models are released at https://github.com/ZXHY-82/w2v-BERT-2.0_SV.
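The parameter-efficient fine-tuning step above relies on LoRA, whose core idea is to freeze each pre-trained weight matrix W and train only a low-rank update B·A added alongside it. A minimal NumPy sketch of that idea follows; it is not the paper's implementation, and the dimensions, scaling factor, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                              # hypothetical layer width and LoRA rank
W = rng.standard_normal((d, d))          # frozen pre-trained weight (not trained)
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so training starts from W
alpha = 16.0                             # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path x @ W.T plus the scaled low-rank update (alpha/r) * x @ (B A).T.
    # During fine-tuning, gradients flow only into A and B.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d))
# Because B starts at zero, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)
```

This illustrates why LoRA is parameter-efficient: the trainable factors hold 2·d·r values instead of the d² values in W, a large saving once d is in the thousands, as in a 600M-parameter PTM.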