SC-TSE: Speaker Consistency-Aware Target Speaker Extraction

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address speaker identity confusion in target speaker extraction (TSE) caused by unreliable audio cues, this paper proposes a speaker-consistency-aware framework. The method integrates two key innovations into an end-to-end TSE model without additional modules or pretraining: (1) a centroid-based speaker consistency loss that explicitly enforces identity alignment between enrollment and separated speech embeddings in the embedding space; and (2) a conditional loss suppression strategy that dynamically down-weights gradient contributions from inconsistent samples during training. Experiments on WSJ0-2mix+DEMAND and LibriCSS demonstrate significant improvements: speaker identification accuracy (SID-Acc) increases by 3.2%, and average SI-SNR improvement (SI-SNRi) rises by 1.8 dB. These results confirm that the proposed approach effectively mitigates speaker identity confusion and enhances separation robustness.
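
For readers unfamiliar with the SI-SNRi metric quoted above, the following is a minimal sketch of the standard scale-invariant SNR definition and its improvement over the unprocessed mixture. The function names and the `eps` stabilizer are illustrative choices, not taken from the paper, and the signals are plain Python lists rather than audio tensors.

```python
import math

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to isolate the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the extracted signal over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Toy example: the interferer is orthogonal to the reference signal.
reference = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]
improvement = si_snr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is preferred over plain SNR for separation tasks.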

📝 Abstract
Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of our proposed methods in improving TSE performance. A speech demo is available online at https://sc-tse.netlify.app/
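
The abstract does not spell out the form of the centroid-based speaker consistency loss, so the following is only a sketch of one plausible reading. It assumes the centroid is the mean of L2-normalized embeddings from a speaker's enrollment utterances and that the loss is the cosine distance between the extracted speech embedding and that centroid. All function names are hypothetical, and the embeddings are plain lists standing in for the output of a speaker encoder.

```python
import math

def l2_normalize(vec, eps=1e-8):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]

def speaker_centroid(embeddings):
    """Assumed centroid: mean of the unit-normalized enrollment embeddings."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    return [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]

def consistency_loss(extracted_embedding, centroid):
    """Assumed loss: cosine distance (1 - cosine similarity) to the centroid."""
    a = l2_normalize(extracted_embedding)
    b = l2_normalize(centroid)
    cosine = sum(x * y for x, y in zip(a, b))
    return 1.0 - cosine

# Toy 2-D embeddings for two enrollment utterances of the same speaker.
centroid = speaker_centroid([[2.0, 0.0], [0.0, 3.0]])
loss_consistent = consistency_loss([1.0, 1.0], centroid)     # points at the centroid
loss_orthogonal = consistency_loss([1.0, -1.0], centroid)    # orthogonal to it
```

Under this reading, the loss is near zero when the extracted speech embeds close to the target speaker's centroid and grows as the embedding drifts toward another identity. In an actual training setup the same computation would be done on differentiable tensors so the gradient can flow back into the extraction model.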
Problem

Research questions and friction points this paper is trying to address.

Improves Target Speaker Extraction via speaker consistency
Reduces speaker identity confusion in TSE systems
Enhances performance with centroid-based consistency loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Centroid-based speaker consistency loss
Conditional loss suppression integration
Speaker consistency-aware extraction method
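
The page gives no formula for conditional loss suppression. As a rough illustration only, the sketch below follows the AI summary's description of down-weighting inconsistent samples: a sample counts as inconsistent when the cosine similarity between its extracted embedding and the speaker centroid falls below a threshold, and its loss is then scaled by a small factor. The `threshold` and `suppress_weight` values, and the function name, are hypothetical and not taken from the paper.

```python
def suppressed_batch_loss(sample_losses, similarities,
                          threshold=0.5, suppress_weight=0.1):
    """Average per-sample losses, down-weighting samples flagged as inconsistent.

    sample_losses: per-sample training losses for one batch.
    similarities:  cosine similarity of each sample's extracted embedding
                   to its speaker centroid (assumed to be precomputed).
    """
    # Assumed condition: similarity below the threshold marks a sample as inconsistent.
    weights = [1.0 if sim >= threshold else suppress_weight
               for sim in similarities]
    weighted = [w * loss for w, loss in zip(weights, sample_losses)]
    return sum(weighted) / len(weighted)

# Toy batch: the second sample is inconsistent, so its loss is scaled by 0.1.
batch_loss = suppressed_batch_loss([2.0, 4.0], [0.9, 0.1])
```

A hard threshold is the simplest choice for a sketch; a smooth weighting that varies with the similarity would be an equally reasonable design, and the paper may use a different condition altogether.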
Shu Wu
School of Information and Electronics, Beijing Institute of Technology, China
Anbin Qi
School of Information and Electronics, Beijing Institute of Technology, China
Yanzhang Xie
School of Information and Electronics, Beijing Institute of Technology, China
Xiang Xie
PADO Labs
Cryptography, Privacy-Preserving Machine Learning