Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of model bloat, heavy data dependency, high deployment cost, and privacy risk in zero-shot text-to-speech (TTS), this paper proposes a lightweight and robust voice cloning method. Its core innovation is a two-stage self-distillation framework that tackles disentanglement from the training-data side: it constructs pseudo-parallel speech-text corpora that enable explicit separation of linguistic content from speaker identity. Additionally, we design a lightweight dual-path TTS backbone supporting CPU/GPU co-inference. Without requiring any target-speaker utterances, our method significantly improves cross-speaker generalization stability and real-time performance, achieving real-time factors (RTF) of 0.13 on CPU and 0.012 on GPU. Compared to state-of-the-art approaches, it substantially reduces computational overhead and data requirements while jointly optimizing accuracy, efficiency, and privacy preservation.
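The summary above describes pseudo-parallel corpora that share linguistic content but differ in speaker identity. A minimal sketch of one plausible way such pairs could be built is shown below; the names `teacher_tts`, `Pair`, and `build_pseudo_parallel`, and the pairing strategy itself, are illustrative assumptions and do not come from the paper.

```python
# Hypothetical sketch: building pseudo-parallel pairs with a teacher model.
# All function and class names here are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Pair:
    text: str
    speech_a: bytes   # same text rendered in speaker A's voice
    speech_b: bytes   # same text rendered in speaker B's voice

def teacher_tts(text: str, speaker_prompt: str) -> bytes:
    # Placeholder for a pretrained teacher TTS model; a real system
    # would return a waveform, not this stand-in byte string.
    return f"{speaker_prompt}:{text}".encode()

def build_pseudo_parallel(texts, prompt_a, prompt_b):
    # Synthesize each sentence with two different speaker prompts.
    # Each resulting pair shares linguistic content but differs in
    # speaker identity, giving a student model an explicit signal
    # for content/speaker disentanglement.
    return [Pair(t, teacher_tts(t, prompt_a), teacher_tts(t, prompt_b))
            for t in texts]
```

The pair construction is the training-data-side idea the summary refers to: because content is held fixed within a pair, any residual difference is attributable to speaker attributes.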

📝 Abstract
Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS rely heavily on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with RTFs of 0.13 and 0.012 on CPU and GPU, respectively.
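The RTF figures quoted in the abstract follow the standard definition: synthesis time divided by the duration of the audio produced, with values below 1.0 meaning faster than real time. A small helper makes the metric concrete (the function name is ours, not the paper's):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    Values below 1.0 mean synthesis runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Matching the paper's CPU figure: 1.3 s of compute for 10 s of speech.
print(round(real_time_factor(1.3, 10.0), 3))   # prints 0.13
# And the GPU figure: 0.12 s of compute for 10 s of speech.
print(round(real_time_factor(0.12, 10.0), 3))  # prints 0.012
```

An RTF of 0.13 on CPU thus means roughly 7.7 seconds of speech can be generated per second of compute.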
Problem

Research questions and friction points this paper is trying to address.

Zero-shot Text-to-Speech
Model Efficiency
Data Security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot Text-to-Speech
Self-distillation Framework
Speaker-content Disentanglement
Qianniu Chen (Zhejiang University, China)
Xiaoyang Hao (Tencent)
Bowen Li (AMAP, China)
Yue Liu (AMAP, China)
Li Lu (Zhejiang University, China)