🤖 AI Summary
This study investigates how the choice of speaker embedding affects zero-shot multi-speaker text-to-speech (TTS) synthesis quality for unseen speakers. We systematically compare three speaker encoders (H/ASP, x-vector, and ECAPA-TDNN) within the unified YourTTS framework on cross-domain TTS tasks, training on Czech speech data and evaluating speaker similarity via subjective listening tests and objective cosine similarity between embeddings of synthesized and real utterances. Results show that H/ASP significantly outperforms both ECAPA-TDNN and x-vector in TTS fidelity and speaker similarity preservation, despite ECAPA-TDNN's strong reputation in speaker verification. This reveals a critical mismatch: high speaker discrimination capability does not guarantee effective transfer to TTS. To our knowledge, this is the first work to empirically demonstrate such task-specific divergence in speaker representation learning, providing key insights for designing speaker encoders tailored to generative speech synthesis rather than discriminative speaker recognition.
📝 Abstract
Zero-shot multi-speaker text-to-speech (TTS) systems rely on speaker embeddings to synthesize speech in the voice of an unseen speaker, using only a short reference utterance. While many speaker embeddings have been developed for speaker recognition, their relative effectiveness in zero-shot TTS remains underexplored. In this work, we employ a YourTTS-based TTS system to compare three different speaker encoders (YourTTS's original H/ASP encoder, x-vector embeddings, and ECAPA-TDNN embeddings) within an otherwise fixed zero-shot TTS framework. All models were trained on the same dataset of Czech read speech and evaluated on 24 out-of-domain target speakers using both subjective and objective methods. The subjective evaluation was conducted via a listening test focused on speaker similarity, while the objective evaluation measured cosine distances between speaker embeddings extracted from synthesized and real utterances. Across both evaluations, the original H/ASP encoder consistently outperformed the alternatives, with ECAPA-TDNN showing better results than x-vectors. These findings suggest that, despite the popularity of ECAPA-TDNN in speaker recognition, it does not necessarily improve speaker similarity in zero-shot TTS in this configuration. Our study highlights the importance of empirical evaluation when reusing speaker recognition embeddings in TTS and provides a framework for future comparisons.
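The objective metric described above, cosine similarity between speaker embeddings of synthesized and real utterances, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the embedding vectors here are randomly generated stand-ins (a 512-dimensional size is assumed), whereas in the study they would come from the respective speaker encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dim embeddings standing in for vectors extracted
# from a real utterance and a synthesized utterance of the same speaker.
rng = np.random.default_rng(0)
real_emb = rng.standard_normal(512)
synth_emb = real_emb + 0.1 * rng.standard_normal(512)  # small perturbation

# Higher similarity (closer to 1.0) indicates better speaker preservation.
print(round(cosine_similarity(real_emb, synth_emb), 3))
```

In practice, this score would be averaged over many utterance pairs per target speaker; the cosine *distance* reported in the abstract is simply one minus this similarity.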