🤖 AI Summary
This study investigates how the choice of speaker embedding affects zero-shot multi-speaker text-to-speech (TTS) synthesis quality for unseen speakers. We systematically compare three speaker encoders (H/ASP, x-vector, and ECAPA-TDNN) within the unified YourTTS framework on cross-domain TTS tasks, training on Czech speech data and evaluating speaker similarity via subjective listening tests and objective cosine similarity between embeddings of synthesized and real utterances. Results show that H/ASP significantly outperforms both ECAPA-TDNN and x-vector in TTS fidelity and speaker similarity preservation, despite ECAPA-TDNN's strong reputation in speaker verification. This reveals a critical mismatch: high speaker discrimination capability does not guarantee effective transfer to TTS. To our knowledge, this is the first work to empirically demonstrate such task-specific divergence in speaker representation learning, providing key insights for designing speaker encoders tailored to generative speech synthesis rather than discriminative speaker recognition.
📝 Abstract
Zero-shot multi-speaker text-to-speech (TTS) systems rely on speaker embeddings to synthesize speech in the voice of an unseen speaker, using only a short reference utterance. While many speaker embeddings have been developed for speaker recognition, their relative effectiveness in zero-shot TTS remains underexplored. In this work, we employ a YourTTS-based TTS system to compare three different speaker encoders (YourTTS's original H/ASP encoder, x-vector embeddings, and ECAPA-TDNN embeddings) within an otherwise fixed zero-shot TTS framework. All models were trained on the same dataset of Czech read speech and evaluated on 24 out-of-domain target speakers using both subjective and objective methods. The subjective evaluation was conducted via a listening test focused on speaker similarity, while the objective evaluation measured cosine distances between speaker embeddings extracted from synthesized and real utterances. Across both evaluations, the original H/ASP encoder consistently outperformed the alternatives, with ECAPA-TDNN showing better results than x-vectors. These findings suggest that, despite the popularity of ECAPA-TDNN in speaker recognition, it does not necessarily improve speaker similarity in zero-shot TTS in this configuration. Our study highlights the importance of empirical evaluation when reusing speaker recognition embeddings in TTS and provides a framework for future comparisons.
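The objective metric described above, cosine similarity between speaker embeddings of synthesized and real utterances, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the embedding vectors here are randomly generated stand-ins (a 512-dimensional size is assumed), whereas in the study they would come from the respective speaker encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dim embeddings standing in for vectors extracted
# from a real utterance and a synthesized utterance of the same speaker.
rng = np.random.default_rng(0)
real_emb = rng.standard_normal(512)
synth_emb = real_emb + 0.1 * rng.standard_normal(512)  # small perturbation

# Higher similarity (closer to 1.0) indicates better speaker preservation.
print(round(cosine_similarity(real_emb, synth_emb), 3))
```

In practice, this score would be averaged over many utterance pairs per target speaker; the cosine *distance* reported in the abstract is simply one minus this similarity.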