WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes WavTTS, a zero-shot text-to-speech (TTS) model that operates directly on raw waveforms and is described by the authors as the first raw-waveform generative TTS model to substantially narrow the gap with latent-space generative models. Existing approaches rely on compressed representations such as mel-spectrograms or VAE latents, which incur information loss and prevent end-to-end training. Built on flow matching with a Diffusion Transformer (DiT), WavTTS handles long audio sequences by patchifying the waveform and adds multi-scale mel-spectrogram supervision to provide perceptual guidance during training. The authors also study prediction targets and noise scheduling in waveform diffusion and design a schedule that improves generation quality. On open-source benchmarks, WavTTS closely approaches current state-of-the-art latent-space zero-shot TTS models and substantially outperforms prior end-to-end speech generation models, supporting the feasibility of scaling diffusion-based TTS directly in the waveform domain.
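To make the sequence-length argument concrete, the following is a minimal sketch of splitting a waveform into fixed-size patches so that each patch becomes one Transformer token. It is an illustration only, not the authors' implementation: the names `patchify` and `unpatchify`, the patch size of 256, the 24 kHz sample rate, and the zero-padding of the tail are all assumptions that do not come from the paper.

```python
from typing import List


def patchify(wave: List[float], patch_size: int) -> List[List[float]]:
    """Split a 1-D waveform into non-overlapping patches (one token per patch).

    The tail is zero-padded so the length divides evenly; zero-padding is an
    assumption made for this sketch.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    pad = (-len(wave)) % patch_size
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches: List[List[float]], length: int) -> List[float]:
    """Flatten patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


# One second of silence at an assumed 24 kHz sample rate.
wave = [0.0] * 24000
tokens = patchify(wave, patch_size=256)
print(len(wave), "samples ->", len(tokens), "tokens")
```

Larger patches shorten the token sequence the Transformer must attend over, at the cost of each token having to represent more samples.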

📝 Abstract

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably incur information loss and preclude end-to-end training. In theory, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult because of the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon flow matching with a Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
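As a rough illustration of the flow matching objective the abstract refers to, the sketch below builds one training pair on a straight path between Gaussian noise and a data patch, then integrates the velocity field with Euler steps. It assumes the common linear-interpolation form with a velocity target; the abstract does not state which prediction target or noise schedule WavTTS uses, so the path, the fixed `t`, the step count, and the names `flow_matching_pair` and `euler_sample` are all hypothetical. An oracle velocity stands in for a trained DiT.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_pair(x1: Vector, t: float, rng: random.Random) -> Tuple[Vector, Vector, Vector]:
    """Build one training example on the straight path from noise x0 to data x1.

    Returns (x0, x_t, v_target) with x_t = (1 - t) * x0 + t * x1 and
    v_target = x1 - x0, the velocity a network would be trained to regress.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x0, x1)]
    v_target = [d - n for n, d in zip(x0, x1)]
    return x0, x_t, v_target


def euler_sample(velocity_fn: Callable[[Vector, float], Vector], x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
patch = [0.5, -0.25, 0.125, 0.0]  # toy waveform patch standing in for real audio
x0, x_t, v_target = flow_matching_pair(patch, t=0.3, rng=rng)


def oracle(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: returns the true (constant) velocity."""
    return v_target


recovered = euler_sample(oracle, x0, steps=8)
```

With the oracle velocity, `recovered` matches `patch` up to floating-point error, which is the behavior a well-trained velocity network would approximate. In a real system the timestep `t` would be sampled per example according to a noise schedule, which is the design choice the abstract says the authors investigate.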

Problem

Research questions and friction points this paper is trying to address.

zero-shot TTS

raw waveform modeling

diffusion models

end-to-end speech generation

information loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

raw waveform modeling

zero-shot TTS

diffusion transformer