🤖 AI Summary
Existing TTS systems under the flow matching framework predominantly rely on waveform or spectrogram representations, suffering from poor disentanglement of speech attributes, stringent training constraints, and high computational overhead. This paper proposes the first zero-shot end-to-end TTS method integrating conditional flow matching guided by optimal transport theory and learned priors. It explicitly disentangles timbre, prosody, and linguistic content within a discrete speech token space and enables single-step, high-fidelity speech generation without frame-level dependency on reference audio. The approach synergistically combines flow matching, optimal transport, discrete tokenization, and prior-guided conditioning. Experiments demonstrate significant improvements over state-of-the-art methods in content accuracy, naturalness, prosodic controllability, and cross-speaker style preservation. Moreover, the method supports high-quality voice cloning and real-time synthesis.
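To make the core mechanism concrete, here is a minimal sketch of optimal-transport conditional flow matching with a learned prior and one-step sampling. This is an illustrative toy, not the OZSpeech implementation: `model`, `prior` (standing in for the learned prior x0), `target` (the clean speech-token representation x1), and `cond` are all assumed names, and real systems operate on token embeddings rather than raw arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def ot_cfm_loss(model, prior, target, cond):
    """Toy OT conditional flow matching loss (illustrative sketch).

    With straight-line (optimal-transport) paths
        x_t = (1 - t) * x0 + t * x1,
    the target velocity is constant along the path: u_t = x1 - x0.
    The model is trained to regress this velocity.
    """
    t = rng.random((target.shape[0], 1))        # random time in [0, 1)
    x_t = (1 - t) * prior + t * target          # point on the straight path
    u_t = target - prior                        # constant target velocity
    v_pred = model(x_t, t, cond)                # predicted velocity field
    return np.mean((v_pred - u_t) ** 2)

def one_step_sample(model, prior, cond):
    """Single Euler step from the learned prior: x1 ~ x0 + v(x0, 0)."""
    t0 = np.zeros((prior.shape[0], 1))
    return prior + model(prior, t0, cond)
```

Because the straight-line path makes the target velocity constant, a well-trained model can traverse the whole path in one Euler step, which is what enables the single-step sampling the summary describes.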
📝 Abstract
Text-to-speech (TTS) systems have seen significant advancements in recent years, driven by improvements in deep learning and neural network architectures. Viewing the output speech as a data distribution, previous approaches often employ traditional speech representations, such as waveforms or spectrograms, within the Flow Matching framework. However, these methods have limitations, including overlooking various speech attributes and incurring high computational costs due to additional constraints introduced during training. To address these challenges, we introduce OZSpeech, the first TTS method to explore optimal transport conditional flow matching with one-step sampling and a learned prior as the condition, effectively disregarding preceding states and reducing the number of sampling steps. Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute and enhancing the TTS system's ability to precisely clone the prompt speech. Experimental results show that our method outperforms existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation. Audio samples are available at our demo page https://ozspeech.github.io/OZSpeech_Web/.