Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining

📅 2025-01-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the poor generalization of target-speaker voice activity detection (TS-VAD) to unseen noisy environments, where labelled training data is costly to obtain, this paper proposes a causal self-supervised pretraining framework called Denoising Autoregressive Predictive Coding (DN-APC): a causal encoder is trained on unlabelled audio to predict clean future frames from noisy input, encouraging noise-robust representations of speech and non-speech. The paper also compares several speaker conditioning methods and finds that Feature-wise Linear Modulation (FiLM) gives the best overall TS-VAD performance among those evaluated. Representation analysis via t-SNE plots shows that pretraining yields well-separated speech and non-speech representations that hold up under noise. Evaluated under both seen and unseen noise conditions, the pretrained model improves detection performance by approximately 2%, strengthening robustness for real-world deployment.
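As a concrete illustration of the FiLM conditioning the summary refers to, below is a minimal PyTorch sketch: a target-speaker embedding predicts a per-channel scale and shift that modulate frame-level acoustic features before the TS-VAD classifier. The module name, dimensions, and the single-layer projection are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """FiLM sketch: the speaker embedding produces a per-channel scale
    (gamma) and shift (beta) applied to every acoustic frame."""

    def __init__(self, feat_dim: int, spk_dim: int):
        super().__init__()
        # A single linear layer predicts both gamma and beta; the paper's
        # exact parameterization may differ.
        self.to_gamma_beta = nn.Linear(spk_dim, 2 * feat_dim)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats:   (batch, time, feat_dim) frame-level acoustic features
        # spk_emb: (batch, spk_dim)        target-speaker embedding
        gamma, beta = self.to_gamma_beta(spk_emb).chunk(2, dim=-1)
        # Broadcast over time: every frame is modulated by the same
        # speaker-dependent scale and shift.
        return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)

# Shape check: 4 utterances, 100 frames, 256-dim features, 192-dim speaker embedding.
film = FiLMConditioning(feat_dim=256, spk_dim=192)
out = film(torch.randn(4, 100, 256), torch.randn(4, 192))  # -> (4, 100, 256)
```

Conditioning the shared frame features this way lets one encoder serve any target speaker, which is presumably part of why FiLM compares favorably with simpler conditioning schemes in the paper's experiments.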

📝 Abstract
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame. Recently, deep neural network-based models have shown good performance in this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal, Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approx. 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via tSNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
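To make the pretraining objective concrete, here is a minimal PyTorch sketch of denoising autoregressive predictive coding as described in the abstract: a causal (unidirectional) encoder reads noisy features and is trained to regress the clean features a few frames into the future. The LSTM encoder, layer sizes, prediction shift, and L1 loss follow common APC practice and are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNAPC(nn.Module):
    """Denoising APC sketch: a unidirectional (hence causal) encoder over
    noisy features, with a head that regresses future clean frames."""

    def __init__(self, feat_dim: int = 80, hidden: int = 512, shift: int = 3):
        super().__init__()
        self.shift = shift  # how many frames ahead to predict (assumed value)
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, time, feat_dim) features of the noise-corrupted signal
        hidden, _ = self.encoder(noisy)  # unidirectional LSTM = causal in time
        return self.head(hidden)

def dn_apc_loss(model: DNAPC, noisy: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Predict frame t + shift of the CLEAN signal from noisy frames <= t,
    forcing the encoder to separate speech structure from additive noise."""
    pred = model(noisy)
    n = model.shift
    return F.l1_loss(pred[:, :-n], clean[:, n:])

# Toy usage: pretraining needs no labels, only parallel noisy/clean features
# (presumably obtained by adding noise to clean speech).
model = DNAPC()
noisy, clean = torch.randn(2, 200, 80), torch.randn(2, 200, 80)
loss = dn_apc_loss(model, noisy, clean)
loss.backward()
```

After pretraining on unlabelled noisy/clean pairs, the encoder would be reused and fine-tuned with a TS-VAD head (e.g. the FiLM conditioning sketched above) on the smaller labelled set.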
Problem

Research questions and friction points this paper is trying to address.

Voice Activity Detection
Noisy Environment
Deep Learning Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

DN-APC
Self-supervised Pretraining
Noise Robustness