An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS

📅 2025-06-25
🤖 AI Summary
This study investigates how the choice of speaker embedding affects zero-shot multi-speaker text-to-speech (TTS) synthesis for unseen speakers. It compares three speaker encoders (YourTTS's original H/ASP encoder, x-vector, and ECAPA-TDNN) within an otherwise fixed YourTTS-based system. All models were trained on the same Czech read-speech dataset and evaluated on 24 out-of-domain target speakers, using a listening test focused on speaker similarity and an objective measure based on cosine distances between speaker embeddings of synthesized and real utterances. In both evaluations, H/ASP consistently outperformed the alternatives, with ECAPA-TDNN ranking ahead of x-vector. The results suggest that an embedding's popularity in speaker recognition does not guarantee better speaker similarity in zero-shot TTS, at least in this configuration, and that speaker recognition embeddings should be evaluated empirically before being reused in TTS.

📝 Abstract
Zero-shot multi-speaker text-to-speech (TTS) systems rely on speaker embeddings to synthesize speech in the voice of an unseen speaker, using only a short reference utterance. While many speaker embeddings have been developed for speaker recognition, their relative effectiveness in zero-shot TTS remains underexplored. In this work, we employ a YourTTS-based TTS system to compare three different speaker encoders - YourTTS's original H/ASP encoder, x-vector embeddings, and ECAPA-TDNN embeddings - within an otherwise fixed zero-shot TTS framework. All models were trained on the same dataset of Czech read speech and evaluated on 24 out-of-domain target speakers using both subjective and objective methods. The subjective evaluation was conducted via a listening test focused on speaker similarity, while the objective evaluation measured cosine distances between speaker embeddings extracted from synthesized and real utterances. Across both evaluations, the original H/ASP encoder consistently outperformed the alternatives, with ECAPA-TDNN showing better results than x-vectors. These findings suggest that, despite the popularity of ECAPA-TDNN in speaker recognition, it does not necessarily offer improvements for speaker similarity in zero-shot TTS in this configuration. Our study highlights the importance of empirical evaluation when reusing speaker recognition embeddings in TTS and provides a framework for additional future comparisons.
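The objective evaluation described above rests on cosine distances between speaker embeddings of synthesized and real utterances. The following is a minimal sketch of that kind of measurement, not the authors' implementation: the function names `cosine_distance` and `mean_distance_to_references` are hypothetical, embeddings are assumed to be plain lists of floats, and the all-pairs averaging scheme is an assumption, since the abstract does not specify how utterances were paired or which encoder produced the evaluation embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(a, b); 0 means same direction, 2 means opposite."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_references(
    synthesized: Sequence[Sequence[float]],
    references: Sequence[Sequence[float]],
) -> float:
    """Average cosine distance over all (synthesized, real) embedding pairs.

    Assumption: every synthesized utterance of a speaker is compared with
    every real utterance of the same speaker; the paper may pair differently.
    """
    distances = [
        cosine_distance(s, r) for s in synthesized for r in references
    ]
    if not distances:
        raise ValueError("need at least one synthesized and one real embedding")
    return sum(distances) / len(distances)
```

Under this scheme, a lower mean distance for a target speaker indicates that the synthesized speech lies closer to that speaker's real utterances in the embedding space, so the per-encoder averages can be compared directly across the three TTS systems.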
Problem

Research questions and friction points this paper is trying to address.

Compare speaker encoders for zero-shot TTS performance
Evaluate ECAPA-TDNN vs x-vector in unseen speaker synthesis
Assess embedding effectiveness for speaker similarity in TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compare H/ASP, x-vector, ECAPA-TDNN encoders
Evaluate using subjective and objective methods
H/ASP encoder outperforms ECAPA-TDNN and x-vector
Marie Kunešová
University of West Bohemia in Pilsen
speech processing, speaker diarization, speech synthesis
Zdeněk Hanzlíček
University of West Bohemia
speech processing
Jindřich Matoušek
NTIS Research Centre and Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czechia