🤖 AI Summary
To address speaker identity confusion in target speaker extraction (TSE) caused by unreliable audio cues, this paper proposes a speaker-consistency-aware framework. The method integrates two key innovations into an end-to-end TSE model without additional modules or pretraining: (1) a centroid-based speaker consistency loss that explicitly enforces identity alignment between enrollment and separated speech embeddings in the embedding space; and (2) a conditional loss suppression strategy that dynamically down-weights gradient contributions from inconsistent samples during training. Experiments on WSJ0-2mix+DEMAND and LibriCSS demonstrate significant improvements: speaker identification accuracy (SID-Acc) increases by 3.2%, and average SI-SNR improvement (SI-SNRi) rises by 1.8 dB. These results confirm that the proposed approach effectively mitigates speaker identity confusion and enhances separation robustness.
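The two training-time ideas above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): the function name, the cosine-similarity choice, and the suppression threshold `tau` are all assumptions made for clarity.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon for numerical safety
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def speaker_consistency_loss(extracted_emb, enroll_embs, tau=0.5):
    """Hypothetical sketch of a centroid-based speaker consistency loss
    with conditional loss suppression (names and threshold are assumed)."""
    # centroid of the enrollment embeddings acts as the speaker identity anchor
    centroid = enroll_embs.mean(axis=0)
    sim = cosine(extracted_emb, centroid)
    # penalize disagreement between the extracted speech and the enrollment centroid
    loss = 1.0 - sim
    # conditional loss suppression: drop the gradient contribution of
    # samples whose identity is clearly inconsistent with the enrollment
    if sim < tau:
        loss = 0.0
    return loss
```

In practice such a term would be added, with a weighting factor, to the usual signal-level objective (e.g., SI-SNR), so that consistent samples are pulled toward the enrollment centroid while inconsistent ones no longer dominate training.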
📝 Abstract
Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems that rely on audio cues, the speaker embedding derived from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker-consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by enforcing speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. Experimental results validate the effectiveness of the proposed methods in improving TSE performance. A speech demo is available online (https://sc-tse.netlify.app/).