BareWave: Waveform-Native Flow-Matching Text-to-Speech

📅 2026-06-08

🤖 AI Summary

This work proposes BareWave, a purely waveform-based flow-matching text-to-speech (TTS) framework that generates raw audio directly from text, with no intermediate acoustic representation and no pretrained components at inference time. It targets three training challenges: raw-waveform modeling lacks a pretrained representational scaffold, different training stages benefit from different noise schedules, and data-space perceptual objectives do not share the temporal structure of the velocity-space flow objective. To address them, the method combines training-time representation alignment, a staged noise schedule, and velocity-aware perceptual alignment (VAPA). On zero-shot voice cloning, the authors report strong intelligibility, speaker similarity, and naturalness, which supports waveform-native flow-matching TTS as a practical direction for end-to-end speech synthesis.

📝 Abstract

Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We argue that this setting raises three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to combine with effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. A project page with audio demos is available at https://barewave.github.io/.
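
For readers unfamiliar with the velocity-space versus data-space distinction in the abstract, the following is a minimal sketch of a generic flow-matching training pair built directly on waveform samples. It assumes the common linear interpolation path between Gaussian noise and data; the abstract does not state which path BareWave uses, and this is not the authors' VAPA formulation. The function names (flow_matching_pair, velocity_mse, implied_clean_wave) are hypothetical and chosen only for illustration.

```python
import random


def flow_matching_pair(wave, t, rng):
    """Build one flow-matching training example from raw waveform samples.

    Assumption (not from the source): a linear path
    x_t = (1 - t) * noise + t * wave, whose velocity target is wave - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in wave]
    x_t = [(1.0 - t) * n + t * w for n, w in zip(noise, wave)]
    target_velocity = [w - n for n, w in zip(noise, wave)]
    return x_t, target_velocity


def velocity_mse(predicted, target):
    """Velocity-space objective: mean squared error between two velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def implied_clean_wave(x_t, velocity, t):
    """Data-space estimate implied by a velocity under the linear path:
    x_1 = x_t + (1 - t) * v. A perceptual loss would act on this estimate."""
    return [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]


if __name__ == "__main__":
    rng = random.Random(0)
    wave = [0.0, 0.5, -0.5, 0.25]  # toy stand-in for audio samples
    x_t, v = flow_matching_pair(wave, t=0.3, rng=rng)
    # With the exact target velocity, the implied waveform matches the input.
    print(implied_clean_wave(x_t, v, t=0.3))
```

Under this assumed path, the weight (1 - t) on the velocity varies with the timestep, which is one simple way to see why a loss defined on the implied waveform does not automatically share the temporal structure of the velocity objective.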

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

waveform generation

flow-matching

perceptual refinement

training optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

waveform-native

flow-matching

text-to-speech