TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

📅 2025-01-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current TTS systems remain limited in naturalness, duration modeling, and audio coding, and they are often brittle on out-of-vocabulary (OOV) words and noisy text. TTS-Transducer addresses these issues by combining a neural transducer with a neural audio codec that uses residual vector quantization (RVQ). The transducer learns a monotonic alignment between text tokens and the first-codebook speech tokens, so no explicit duration predictor is needed. A non-autoregressive Transformer then predicts the remaining residual codebooks using that alignment, and the system is trained end-to-end. Experiments are reported to show speech quality and naturalness comparable to state-of-the-art TTS systems, along with improved robustness to OOV words and noisy input text.
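
To make the "multi-codebook discrete speech tokens" concrete, here is a minimal sketch of residual vector quantization in plain Python. It is an illustration only: the two toy codebooks, their values, and the function names (`rvq_encode`, `rvq_decode`) are assumptions for this example and are not taken from the paper or from any specific codec. Each frame is quantized by the first codebook, then the leftover residual is quantized by the next one, which is why every frame yields one token per codebook.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(frame: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize one frame into one token per codebook."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next codebook only sees what this one failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct a frame by summing the selected entry of each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse one followed by a fine one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 1.0], CODEBOOKS)   # one token per codebook
approx = rvq_decode(codes, CODEBOOKS)       # coarse entry + fine entry
```

In this picture, the first token of each frame carries the coarse content and the later tokens refine it, which is consistent with the summary's split between a transducer for the first codebook and a separate predictor for the residual codebooks.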

📝 Abstract
This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments, which removes the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, which makes it possible to apply text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from the transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
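
The idea of learning a monotonic alignment through the transducer loss can be sketched with the standard transducer lattice. The example below is a simplified illustration, not the paper's implementation. It assumes that text tokens index the encoder axis `t` and first-codebook frames index the output axis `u`, that a blank advances to the next text token, and that the per-node probabilities are already given. The table values and function names are hypothetical, and a Viterbi best path stands in for whatever alignment extraction the authors actually use.

```python
import math
from typing import List

NEG_INF = -math.inf
Table = List[List[float]]


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank: Table, log_emit: Table) -> float:
    """Sum over all monotonic paths through the T x (U+1) lattice.

    log_blank[t][u]: log-prob of blank at node (t, u), moving to text token t+1.
    log_emit[t][u]:  log-prob of emitting target frame u+1 at node (t, u).
    """
    T, U = len(log_blank), len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            via_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            via_emit = alpha[t][u - 1] + log_emit[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = log_add(via_blank, via_emit)
    # A final blank at the last node terminates the sequence.
    return alpha[T - 1][U] + log_blank[T - 1][U]


def viterbi_durations(log_blank: Table, log_emit: Table) -> List[int]:
    """Frames emitted per text token along the single best path."""
    T, U = len(log_blank), len(log_blank[0]) - 1
    score = [[NEG_INF] * (U + 1) for _ in range(T)]
    by_emit = [[False] * (U + 1) for _ in range(T)]
    score[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            via_blank = score[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            via_emit = score[t][u - 1] + log_emit[t][u - 1] if u > 0 else NEG_INF
            by_emit[t][u] = via_emit >= via_blank
            score[t][u] = max(via_blank, via_emit)
    durations = [0] * T
    t, u = T - 1, U
    while (t, u) != (0, 0):  # walk back from the final node
        if by_emit[t][u]:
            durations[t] += 1
            u -= 1
        else:
            t -= 1
    return durations


def to_log(table: Table) -> Table:
    return [[math.log(p) for p in row] for row in table]


# Hypothetical toy lattice: T = 2 text tokens, U = 3 first-codebook frames.
P_BLANK = [[0.1, 0.9, 0.9, 0.9], [0.1, 0.1, 0.1, 0.9]]
P_EMIT = [[0.9, 0.1, 0.1], [0.9, 0.9, 0.9]]

durations = viterbi_durations(to_log(P_BLANK), to_log(P_EMIT))
```

Under these assumptions the durations sum to the number of frames, so the alignment gives each text token a frame span without a separate duration model. A second-stage predictor for the remaining codebooks could then look up which text token each frame belongs to.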
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
Naturalness
Audio Encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

TTS-Transducer
Audio Encoding
Neural Networks