TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

📅 2025-01-10

🤖 AI Summary

Text-to-speech systems built on neural audio codecs must predict several tokens per frame, one from each codebook of a residual vector quantizer (RVQ), and many systems also depend on an explicit duration predictor. TTS-Transducer addresses both problems by combining a neural transducer with an RVQ-based neural audio codec. The transducer learns a monotonic alignment between tokenized text and the first-codebook speech tokens, so no explicit duration predictor is needed. A non-autoregressive Transformer then predicts the remaining codebooks using the alignment extracted from the transducer loss, and the system is trained end-to-end. The authors report that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
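To make the idea of implicit durations concrete, the sketch below counts how many speech frames a monotonic transducer path assigns to each text token. It is an illustration only, not the paper's implementation: the function name `durations_from_path` and the `"emit"`/`"blank"` step labels are hypothetical, and it assumes the common TTS transducer convention in which a non-blank step emits one speech frame for the current text token and a blank step advances to the next text token.

```python
def durations_from_path(path, num_text_tokens):
    """Count emitted frames per text token along a monotonic path.

    path: sequence of "emit" (output one frame for the current token)
          or "blank" (advance to the next text token).
    Both labels are hypothetical names chosen for this sketch.
    """
    durations = [0] * num_text_tokens
    t = 0  # index of the text token currently being spoken
    for step in path:
        if step == "emit":
            if t >= num_text_tokens:
                raise ValueError("frame emitted after the last text token")
            durations[t] += 1
        elif step == "blank":
            t += 1  # monotonic: the path only ever moves forward in the text
        else:
            raise ValueError(f"unknown step: {step!r}")
    return durations


# Toy path over 4 text tokens; the third token receives no frames.
path = ["emit", "emit", "blank", "emit", "blank",
        "blank", "emit", "emit", "emit", "blank"]
durations = durations_from_path(path, 4)
print(durations)  # [2, 1, 0, 3]
```

Because the path can only stay on a token or move forward, the per-token counts always sum to the number of emitted frames, which is why a separate duration model is not required under this assumption.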

📝 Abstract

This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, known for their quality and robustness in speech recognition, are employed to learn monotonic alignments, which removes the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, which makes it possible to apply text modeling approaches to speech generation. However, audio codec models with residual quantizers require predicting multiple tokens per frame from several codebooks, which poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from the transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
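The "multiple tokens per frame from several codebooks" problem comes from how residual quantization works: each codebook encodes whatever error the previous codebooks left behind. The pure-Python sketch below shows the general technique on a toy 2-dimensional frame. The codebook values and the function names `nearest`, `rvq_encode`, and `rvq_decode` are made up for illustration and are not taken from the paper or from any particular codec.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def rvq_encode(frame, codebooks):
    """Return one code per codebook; each codebook quantizes the remaining residual."""
    residual = list(frame)
    codes = []
    for cb in codebooks:
        k = nearest(residual, cb)
        codes.append(k)
        # Subtract the chosen entry so the next codebook sees only the leftover error.
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a frame by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Toy codebooks (assumed values): a coarse one followed by a finer one.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]
codes = rvq_encode([0.9, -0.4], codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)  # [0, 3] [0.75, -0.25]
```

In this picture, the first code in each frame carries the coarse content and later codes refine it, which is consistent with the abstract's split between a transducer for the first codebook and a separate model for the remaining codes.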

Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech

Naturalness

Audio Encoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

TTS-Transducer

Audio Encoding

Neural Networks
