Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference of autoregressive text-to-speech (TTS) models, caused by the sequential prediction of long speech-token sequences, this paper proposes VADUSA, a framework that adapts speculative decoding to TTS. It introduces learnable draft heads that predict several future speech tokens per decoding step, together with a tolerance-based sampling mechanism that accepts near-miss draft tokens, improving generation efficiency without compromising speech quality. Experiments demonstrate strong generalization across diverse tokenization schemes and large-scale datasets: VADUSA achieves up to 2.3× inference speedup while maintaining or slightly improving synthesized-speech quality (e.g., MOS, SIM). The core contribution is one of the first principled integrations of speculative decoding into speech sequence modeling, enabling efficient, robust, and high-fidelity autoregressive TTS synthesis.
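The draft-and-verify loop described above can be sketched in a few lines. This is a toy illustration of Medusa-style speculative decoding, not the VADUSA implementation: `toy_verifier` and `toy_draft_heads` are hypothetical stand-ins for the base autoregressive model and its draft heads, using a trivial deterministic rule so the accept/reject logic can be exercised.

```python
def toy_verifier(prefix):
    # Hypothetical stand-in for the base AR model's greedy next token:
    # here simply (sum of prefix token ids) mod 10.
    return sum(prefix) % 10

def toy_draft_heads(prefix, num_heads=3):
    # Draft heads cheaply guess the next num_heads tokens in one shot.
    # For illustration they reuse the verifier, so guesses are perfect;
    # real draft heads are separate learned predictors and can be wrong.
    guesses, p = [], list(prefix)
    for _ in range(num_heads):
        g = toy_verifier(p)
        p.append(g)
        guesses.append(g)
    return guesses

def speculative_step(prefix):
    # 1) Draft heads propose a block of future tokens in one forward pass.
    draft = toy_draft_heads(prefix)
    # 2) The verifier checks the block token-by-token; keep the longest
    #    prefix of the draft that the verifier itself would have produced.
    accepted, p = [], list(prefix)
    for g in draft:
        if toy_verifier(p) != g:
            break
        accepted.append(g)
        p.append(g)
    # 3) Always emit at least one verified token so decoding advances.
    if not accepted:
        accepted.append(toy_verifier(p))
    return prefix + accepted
```

When all drafted tokens are accepted, one step emits several tokens for roughly the cost of a single verifier pass, which is where the speedup comes from.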

📝 Abstract
The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
Problem

Research questions and friction points this paper is trying to address.

Accelerate auto-regressive TTS inference
Enhance speech synthesis performance
Maintain quality with faster sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative decoding accelerates TTS
Draft heads predict future speech
Tolerance mechanism ensures quality
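The tolerance mechanism listed above can be sketched as a relaxed acceptance rule: rather than accepting a drafted token only when it matches the verifier's single most likely token, accept it whenever it falls inside the verifier's top-k. This is an illustrative sketch with assumed parameter names, not the paper's exact criterion.

```python
def accept_with_tolerance(verifier_probs, draft_token, top_k=3):
    # verifier_probs: list of probabilities indexed by token id.
    # Accept the drafted token if it ranks within the verifier's top_k.
    ranked = sorted(range(len(verifier_probs)),
                    key=lambda t: verifier_probs[t],
                    reverse=True)
    return draft_token in ranked[:top_k]
```

A larger `top_k` raises the acceptance rate (and thus the speedup) at the cost of letting lower-probability tokens through, which is the quality/speed trade-off the tolerance mechanism navigates.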
Bohan Li
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Situo Zhang
Shanghai Jiao Tong University
Large Language Models · Reinforcement Learning
Yiwei Guo
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China