🤖 AI Summary
To address speaker identity confusion in target speaker extraction (TSE) caused by unreliable audio cues, this paper proposes a speaker-consistency-aware framework. The method integrates two key innovations into an end-to-end TSE model without additional modules or pretraining: (1) a centroid-based speaker consistency loss that explicitly enforces identity alignment between enrollment and separated speech embeddings in the embedding space; and (2) a conditional loss suppression strategy that dynamically down-weights gradient contributions from inconsistent samples during training. Experiments on WSJ0-2mix+DEMAND and LibriCSS demonstrate significant improvements: speaker identification accuracy (SID-Acc) increases by 3.2%, and average SI-SNR improvement (SI-SNRi) rises by 1.8 dB. These results confirm that the proposed approach effectively mitigates speaker identity confusion and enhances separation robustness.
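The two training-time ideas above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): the function name, the cosine-similarity choice, and the suppression threshold `tau` are all assumptions made for clarity.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon for numerical safety
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def speaker_consistency_loss(extracted_emb, enroll_embs, tau=0.5):
    """Hypothetical sketch of a centroid-based speaker consistency loss
    with conditional loss suppression (names and threshold are assumed)."""
    # centroid of the enrollment embeddings acts as the speaker identity anchor
    centroid = enroll_embs.mean(axis=0)
    sim = cosine(extracted_emb, centroid)
    # penalize disagreement between the extracted speech and the enrollment centroid
    loss = 1.0 - sim
    # conditional loss suppression: drop the gradient contribution of
    # samples whose identity is clearly inconsistent with the enrollment
    if sim < tau:
        loss = 0.0
    return loss
```

In practice such a term would be added, with a weighting factor, to the usual signal-level objective (e.g., SI-SNR), so that consistent samples are pulled toward the enrollment centroid while inconsistent ones no longer dominate training.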
📝 Abstract
Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems that rely on audio cues, the speaker embedding derived from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker-consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by enforcing speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. Experimental results validate the effectiveness of the proposed methods in improving TSE performance. A speech demo is available online (https://sc-tse.netlify.app/).