A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

📅 2025-08-20

🤖 AI Summary

This study identifies a mismatch that arises when SI-SDR serves as both training objective and evaluation metric for speech separation with noisy reference signals, as in WSJ0-2Mix. A derivation shows that noise in the references either caps the achievable SI-SDR or pushes models to reproduce that noise in their outputs. To address this, the authors enhance the references and augment the mixtures with WHAM! noise, so that models are not trained to reproduce noisy references. Two models trained on the enhanced data are evaluated non-intrusively with NISQA.v2. The separated speech contains less noise, but enhancing the references may introduce artefacts, which limits the overall quality gain. Across models on the WSJ0-2Mix and Libri2Mix test sets, SI-SDR correlates negatively with perceived noisiness, which supports the derivation and calls the metric's validity into question when references are noisy.

📝 Abstract

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
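
The claim that reference noise limits the achievable SI-SDR can be checked numerically. The following is a minimal sketch, not the paper's derivation or code. It assumes the standard SI-SDR definition without mean removal, uses white Gaussian signals as stand-ins for speech and noise, and forces the noise to be orthogonal to the clean signal. The names `si_sdr` and `demo_reference_snr_cap` are hypothetical. Under these assumptions, a perfectly clean estimate scores the SNR of the noisy reference, while an estimate that keeps half of the reference noise scores higher.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (simplified: no mean removal)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def demo_reference_snr_cap(snr_db=20.0, n_samples=16000, seed=0):
    """Score a clean and a partly noisy estimate against a noisy reference."""
    rng = np.random.default_rng(seed)
    clean = rng.standard_normal(n_samples)  # stand-in for clean speech (assumption)
    noise = rng.standard_normal(n_samples)  # stand-in for reference noise (assumption)

    # Make the noise exactly orthogonal to the clean signal (assumption).
    noise -= np.dot(noise, clean) / np.dot(clean, clean) * clean
    # Scale the noise so the reference has the requested SNR.
    gain = np.sqrt(np.dot(clean, clean) / (np.dot(noise, noise) * 10 ** (snr_db / 10)))
    noise *= gain

    noisy_reference = clean + noise
    clean_score = si_sdr(clean, noisy_reference)                # ideal, noise-free output
    noisy_score = si_sdr(clean + 0.5 * noise, noisy_reference)  # output retaining noise
    return clean_score, noisy_score


if __name__ == "__main__":
    clean_score, noisy_score = demo_reference_snr_cap(snr_db=20.0)
    print(f"clean estimate: {clean_score:.2f} dB")
    print(f"estimate with half the reference noise: {noisy_score:.2f} dB")
```

With `snr_db=20.0`, the clean estimate scores about 20 dB and the estimate that retains noise scores higher, which illustrates how an SI-SDR objective computed against noisy references can reward reproducing the noise.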

Problem

Research questions and friction points this paper is trying to address.

Evaluating SI-SDR metric limitations with noisy training references

Proposing enhanced references to avoid learning noise artifacts

Investigating the negative correlation between SI-SDR and perceived noisiness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced references to avoid learning noise

Augmented mixtures using WHAM! dataset

Non-intrusive NISQA.v2 metric evaluation
