Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference of autoregressive text-to-speech (TTS) models, caused by the sequential prediction of long speech-token sequences, this paper proposes VADUSA, a framework that adapts speculative decoding to TTS. It introduces learnable draft heads that predict several future speech tokens per decoding step, together with a tolerance-based sampling mechanism that accepts near-miss draft tokens, improving generation efficiency without compromising speech quality. Experiments demonstrate strong generalization across diverse tokenization schemes and large-scale datasets: VADUSA achieves up to 2.3× inference speedup while maintaining or slightly improving synthesized-speech quality (e.g., MOS, SIM). The core contribution is one of the first principled integrations of speculative decoding into speech sequence modeling, enabling efficient, robust, and high-fidelity autoregressive TTS synthesis.
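The draft-and-verify loop described above can be sketched in a few lines. This is a toy illustration of Medusa-style speculative decoding, not the VADUSA implementation: `toy_verifier` and `toy_draft_heads` are hypothetical stand-ins for the base autoregressive model and its draft heads, using a trivial deterministic rule so the accept/reject logic can be exercised.

```python
def toy_verifier(prefix):
    # Hypothetical stand-in for the base AR model's greedy next token:
    # here simply (sum of prefix token ids) mod 10.
    return sum(prefix) % 10

def toy_draft_heads(prefix, num_heads=3):
    # Draft heads cheaply guess the next num_heads tokens in one shot.
    # For illustration they reuse the verifier, so guesses are perfect;
    # real draft heads are separate learned predictors and can be wrong.
    guesses, p = [], list(prefix)
    for _ in range(num_heads):
        g = toy_verifier(p)
        p.append(g)
        guesses.append(g)
    return guesses

def speculative_step(prefix):
    # 1) Draft heads propose a block of future tokens in one forward pass.
    draft = toy_draft_heads(prefix)
    # 2) The verifier checks the block token-by-token; keep the longest
    #    prefix of the draft that the verifier itself would have produced.
    accepted, p = [], list(prefix)
    for g in draft:
        if toy_verifier(p) != g:
            break
        accepted.append(g)
        p.append(g)
    # 3) Always emit at least one verified token so decoding advances.
    if not accepted:
        accepted.append(toy_verifier(p))
    return prefix + accepted
```

When all drafted tokens are accepted, one step emits several tokens for roughly the cost of a single verifier pass, which is where the speedup comes from.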

📝 Abstract
The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
Problem

Research questions and friction points this paper is trying to address.

Accelerate auto-regressive TTS inference
Enhance speech synthesis performance
Maintain quality with faster sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative decoding accelerates TTS
Draft heads predict future speech
Tolerance mechanism ensures quality
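The tolerance mechanism listed above can be sketched as a relaxed acceptance rule: rather than accepting a drafted token only when it matches the verifier's single most likely token, accept it whenever it falls inside the verifier's top-k. This is an illustrative sketch with assumed parameter names, not the paper's exact criterion.

```python
def accept_with_tolerance(verifier_probs, draft_token, top_k=3):
    # verifier_probs: list of probabilities indexed by token id.
    # Accept the drafted token if it ranks within the verifier's top_k.
    ranked = sorted(range(len(verifier_probs)),
                    key=lambda t: verifier_probs[t],
                    reverse=True)
    return draft_token in ranked[:top_k]
```

A larger `top_k` raises the acceptance rate (and thus the speedup) at the cost of letting lower-probability tokens through, which is the quality/speed trade-off the tolerance mechanism navigates.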
Bohan Li
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Situo Zhang
Shanghai Jiao Tong University
Large Language Models · Reinforcement Learning
Yiwei Guo
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China