AI Summary
To address the poor generalization of target speaker voice activity detection (TS-VAD) to unseen noisy environments, where labelled data is unavailable, this paper proposes a causal self-supervised pretraining framework called Denoising Autoregressive Predictive Coding (DN-APC). The framework incorporates causal modeling into the pretraining objective and integrates a Feature-wise Linear Modulation (FiLM) mechanism for speaker-conditioned representation learning, encouraging noise-invariant speech/non-speech features. t-SNE visualization confirms that pretraining improves inter-class separability and noise robustness, and FiLM is empirically found to be the best-performing conditioning method for TS-VAD. Evaluated under both seen and unseen noise conditions, the model achieves an average improvement of approximately 2% in detection performance, improving adaptability for real-world deployment.
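The denoising variant of autoregressive predictive coding described above can be sketched as follows: a causal model reads *noisy* features up to frame t and is trained to predict the *clean* features a few frames ahead. This is a minimal illustration, not the paper's implementation; the `predict_fn` stand-in, the prediction `shift`, and the L1 loss are assumptions for the sketch.

```python
import numpy as np

def dn_apc_loss(noisy_frames, clean_frames, predict_fn, shift=3):
    """Denoising APC objective (sketch): a causal model reads noisy
    features up to frame t and predicts the CLEAN features at t + shift.
    `predict_fn` stands in for the causal encoder (hypothetical here;
    in practice this would be e.g. a unidirectional recurrent network)."""
    preds = predict_fn(noisy_frames[:-shift])   # predictions for frames shift..T-1
    targets = clean_frames[shift:]              # clean targets, shifted ahead in time
    return np.mean(np.abs(preds - targets))     # L1 reconstruction loss (assumed)

# Toy example on synthetic features: 20 frames of 8-dim features,
# with the "model" replaced by the identity function.
T, D = 20, 8
rng = np.random.default_rng(0)
clean = rng.standard_normal((T, D))
noisy = clean + 0.1 * rng.standard_normal((T, D))  # additive noise corruption
loss = dn_apc_loss(noisy, clean, predict_fn=lambda x: x, shift=3)
print(float(loss) >= 0.0)
```

Because the model only ever sees past frames of the noisy signal, the pretraining stays causal, which is what allows the downstream TS-VAD system to run in a streaming fashion.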
Abstract
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target speaker in an audio frame. Recently, deep neural network-based models have shown good performance on this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal, Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approximately 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via t-SNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
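The FiLM conditioning mentioned in the abstract applies a per-channel affine transform to the acoustic features, with the scale and shift predicted from the target-speaker embedding. The sketch below shows the generic FiLM operation under assumed shapes and randomly initialized projection matrices (`w_gamma`, `w_beta` are hypothetical names, not from the paper):

```python
import numpy as np

def film_condition(features, speaker_embedding, w_gamma, w_beta):
    """Feature-wise Linear Modulation (FiLM): scale and shift each feature
    channel using parameters predicted from the speaker embedding, so the
    same frame-level features are modulated per target speaker."""
    gamma = speaker_embedding @ w_gamma   # (feat_dim,) per-channel scale
    beta = speaker_embedding @ w_beta     # (feat_dim,) per-channel shift
    return gamma * features + beta        # broadcast over time frames

# Toy shapes: 10 frames of 8-dim acoustic features, a 4-dim speaker embedding.
rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 8))
spk = rng.standard_normal(4)
w_g = rng.standard_normal((4, 8))
w_b = rng.standard_normal((4, 8))

out = film_condition(feats, spk, w_g, w_b)
print(out.shape)  # (10, 8)
```

Because the modulation is a simple per-channel affine map, swapping the speaker embedding re-targets the detector without retraining the feature extractor, which is one plausible reason FiLM compares favorably against other conditioning methods in the paper's experiments.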