Self-Supervised Reflective Learning Through Self-Distillation and Online Clustering for Speaker Representation Learning

📅 2024-01-03
🏛️ IEEE Transactions on Audio, Speech, and Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing two-stage iterative frameworks for speaker representation learning from unlabeled speech suffer from high computational overhead and severe pseudo-label noise. To address these issues, we propose a self-supervised reflective learning framework that eliminates multi-round iteration. The method jointly refines pseudo-labels via teacher-student self-distillation and online clustering, and incorporates explicit noise-aware label modeling together with a temporally consistent pseudo-label queue, enabling single-pass, continuous purification of pseudo-labels during training. This work introduces the "reflective learning" paradigm to speaker representation learning for the first time. On VoxCeleb, the framework surpasses a conventional five-iteration baseline in a single training pass; pseudo-label quality improves steadily and the estimated cluster count converges rapidly, demonstrating its efficiency and robustness in deciphering large-scale unlabeled speech data.
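To make the teacher-student self-distillation concrete, here is a minimal sketch of an EMA (exponential moving average) teacher whose soft outputs supervise the student on a different augmented view. The class and parameter names (SelfDistiller, ema_decay, temperature) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: EMA teacher-student self-distillation for a generic
# PyTorch speaker encoder. Names and hyperparameters are assumptions.
import copy
import torch
import torch.nn.functional as F


class SelfDistiller:
    def __init__(self, student: torch.nn.Module, ema_decay: float = 0.999):
        self.student = student
        # The teacher is a frozen exponential moving average of the student.
        self.teacher = copy.deepcopy(student)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        # teacher <- decay * teacher + (1 - decay) * student
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    def distillation_loss(self, view_a, view_b, temperature: float = 0.1):
        # The student sees one augmented view; the teacher's output on the
        # other view serves as a soft target (no gradient to the teacher).
        student_logits = self.student(view_a)
        with torch.no_grad():
            teacher_logits = self.teacher(view_b)
        targets = F.softmax(teacher_logits / temperature, dim=-1)
        log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()
```

In a training loop, one would call distillation_loss on two augmentations of the same utterance, backpropagate through the student only, then call update_teacher after each optimizer step.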

📝 Abstract
Speaker representation learning is crucial for voice recognition systems, with recent advances in self-supervised approaches reducing dependency on labeled data. Current two-stage iterative frameworks, while effective, suffer from significant computational overhead due to repeated rounds of clustering and training. They also struggle with noisy pseudo labels that can impair model learning. This paper introduces self-supervised reflective learning (SSRL), an improved framework that addresses these limitations by enabling continuous refinement of pseudo labels during training. Through a teacher-student architecture and online clustering mechanism, SSRL eliminates the need for iterative training rounds. To handle label noise, we incorporate noisy label modeling and pseudo label queues that maintain temporal consistency. Experiments on VoxCeleb show SSRL's superiority over current two-stage iterative approaches, surpassing the performance of a 5-round method in just a single training round. Ablation studies validate the contributions of key components like noisy label modeling and pseudo label queues. Moreover, consistent improvements in pseudo labeling and the convergence of cluster counts demonstrate SSRL's effectiveness in deciphering unlabeled data. This work marks an important advancement in efficient and accurate self-supervised speaker representation learning through the novel reflective learning paradigm.
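As a rough illustration of the online clustering mechanism described above, the sketch below assigns pseudo labels by nearest centroid on the unit sphere and updates centroids with a momentum rule. This is a generic online k-means variant under assumed names (OnlineClusterer, momentum), not the paper's exact clustering algorithm.

```python
# Hedged sketch: online clustering that yields pseudo labels from nearest
# centroids and refines the centroids on the fly. Names are assumptions.
import torch
import torch.nn.functional as F


class OnlineClusterer:
    def __init__(self, num_clusters: int, dim: int, momentum: float = 0.99):
        # Unit-norm centroids so that dot product equals cosine similarity.
        self.centroids = F.normalize(torch.randn(num_clusters, dim), dim=-1)
        self.momentum = momentum

    @torch.no_grad()
    def assign(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Cosine similarity to every centroid; the argmax is the pseudo label.
        sims = F.normalize(embeddings, dim=-1) @ self.centroids.t()
        return sims.argmax(dim=-1)

    @torch.no_grad()
    def update(self, embeddings: torch.Tensor, labels: torch.Tensor):
        # Move each assigned centroid toward the mean of its new members,
        # then renormalize to stay on the unit sphere.
        for k in labels.unique():
            members = F.normalize(embeddings[labels == k], dim=-1).mean(dim=0)
            self.centroids[k] = F.normalize(
                self.momentum * self.centroids[k]
                + (1.0 - self.momentum) * members,
                dim=-1,
            )
```

Because assignment and centroid updates happen inside the training loop, pseudo labels can improve continuously instead of being frozen between offline clustering rounds.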
Problem

Research questions and friction points this paper is trying to address.

High computational overhead from repeated rounds of clustering and training in self-supervised speaker representation learning
Noisy pseudo labels that impair model learning
No mechanism for continuous refinement of pseudo labels during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised reflective learning with self-distillation
Online clustering for continuous label refinement
Noisy label modeling and pseudo label queues for temporal consistency (see the sketch after this list)
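Below is a hedged sketch of a pseudo-label queue for temporal consistency: each utterance keeps its most recent pseudo labels, and a label is treated as clean only when recent assignments agree. The queue length and agreement threshold are illustrative choices, not values from the paper.

```python
# Hedged sketch: per-utterance pseudo-label queue. A label is trusted only
# when a majority of recent assignments agree, flagging noisy labels.
from collections import Counter, defaultdict, deque


class PseudoLabelQueue:
    def __init__(self, maxlen: int = 5, agree_ratio: float = 0.6):
        # One FIFO of recent pseudo labels per utterance id.
        self.history = defaultdict(lambda: deque(maxlen=maxlen))
        self.agree_ratio = agree_ratio

    def push(self, utt_id: str, label: int):
        self.history[utt_id].append(label)

    def consensus(self, utt_id: str):
        # Returns (label, trusted): the majority label over the queue,
        # and whether it is consistent enough to treat as clean.
        labels = self.history[utt_id]
        if not labels:
            return None, False
        label, count = Counter(labels).most_common(1)[0]
        return label, count / len(labels) >= self.agree_ratio
```

Untrusted labels can then be down-weighted or excluded by the noisy label modeling component rather than training on them at full weight.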
Authors
Danwei Cai
Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27705, USA
Zexin Cai
Johns Hopkins University
Ming Li
Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27705, USA
Data Science Research Center at Duke Kunshan University, Kunshan, China