🤖 AI Summary
Existing benchmark evaluations for self-supervised speech models focus predominantly on single-speaker scenarios and therefore do not reflect real-world use in noisy, multi-speaker environments, where a target speaker must first be identified before their speech information can be extracted.
Method: We introduce TS-SUPERB, the first target-speaker-oriented benchmark for noisy, multi-speaker conditions, comprising four enrollment-guided tasks: target-speaker speech separation, recognition, verification, and synthesis. Speaker embeddings extracted from enrollment speech condition the downstream models, and a unified SSL-based target-speech encoder, consisting of a speaker encoder and an extractor module, enables joint optimization across TS tasks.
Contribution/Results: Experiments show that single-speaker performance does not reliably predict target-speaker task performance, underscoring the need for TS-specific evaluation. Joint optimization across TS tasks further improves results, demonstrating the benefit of sharing information between tasks.
📝 Abstract
Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark results reveal the importance of evaluating SSL models in target-speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target-speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
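The conditioning scheme described above -- a speaker encoder producing an embedding from enrollment speech, which then conditions an extractor operating on mixture features -- can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names are hypothetical, mean pooling stands in for the speaker encoder, and element-wise multiplication is just one common fusion choice.

```python
import numpy as np

def speaker_embedding(enroll_feats: np.ndarray) -> np.ndarray:
    """Stand-in speaker encoder: mean-pool enrollment frames
    into one fixed-size, L2-normalized embedding."""
    emb = enroll_feats.mean(axis=0)
    return emb / np.linalg.norm(emb)

def condition_extractor_input(mixture_feats: np.ndarray,
                              spk_emb: np.ndarray) -> np.ndarray:
    """Condition each mixture frame on the speaker embedding via
    element-wise multiplication (one simple fusion option)."""
    return mixture_feats * spk_emb[None, :]

# Toy shapes: 50 enrollment frames, 120 mixture frames, 256-dim SSL features.
rng = np.random.default_rng(0)
enroll = rng.standard_normal((50, 256))
mixture = rng.standard_normal((120, 256))

emb = speaker_embedding(enroll)
conditioned = condition_extractor_input(mixture, emb)
print(conditioned.shape)  # (120, 256)
```

In a real system the conditioned features would feed the extractor module, whose output serves the downstream TS tasks; the fusion operator (multiplication, concatenation, FiLM, etc.) is a design choice the benchmark leaves to the downstream model.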