Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper proposes Shallow Flow Matching (SFM) to improve flow matching (FM)-based text-to-speech (TTS) models that follow a coarse-to-fine generation paradigm. SFM constructs an intermediate state on the FM path from the coarse output and starts inference from that state instead of pure noise, so computation is spent only on the later, fine-grained part of the path. During training, an orthogonal projection adaptively determines the temporal position of the intermediate state, and the state is constructed according to a single-segment piecewise flow. SFM is integrated into multiple TTS models through a lightweight SFM head. Experiments show that SFM consistently improves speech naturalness in both objective and subjective evaluations and significantly reduces inference time when adaptive-step ODE solvers are used. A demo and code are publicly available.

📝 Abstract
We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. SFM inference starts from the intermediate state rather than from pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference time when using adaptive-step ODE solvers. A demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
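The idea of projecting a coarse output onto the FM path and starting from the resulting intermediate state can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's formulation: the path is a plain straight line `x_t = (1 - t) * x0 + t * x1`, the coarse output `g` is synthetic, the helper names `project_onto_path` and `euler_integrate` are hypothetical, and the velocity field is the known constant `x1 - x0` rather than a trained network. The projection uses both endpoints, which are only available at training time.

```python
import numpy as np

def project_onto_path(g, x0, x1):
    """Return t in [0, 1] of the point on the segment x0 -> x1 closest to g."""
    d = (x1 - x0).ravel()
    t = float(np.dot((g - x0).ravel(), d) / np.dot(d, d))
    return min(max(t, 0.0), 1.0)

def euler_integrate(x, t_start, velocity, n_steps):
    """Integrate dx/dt = velocity(x, t) from t_start to 1 with fixed Euler steps."""
    dt = (1.0 - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))  # stand-in for the target acoustic features
x0 = rng.normal(size=(4, 8))  # pure-noise starting point of the full path

# Hypothetical coarse output: close to the path at t = 0.7, plus a small error.
g = 0.7 * x1 + 0.3 * x0 + 0.05 * rng.normal(size=x1.shape)

# Orthogonal projection gives the temporal position of the intermediate state.
t_mid = project_onto_path(g, x0, x1)
x_mid = (1.0 - t_mid) * x0 + t_mid * x1

# Toy velocity field (assumption): the constant velocity of the straight path.
velocity = lambda x, t: x1 - x0

x_full = euler_integrate(x0, 0.0, velocity, n_steps=10)        # from pure noise
x_shallow = euler_integrate(x_mid, t_mid, velocity, n_steps=3)  # from the intermediate state
```

In this toy setting both runs end at `x1`, but the shallow run only covers the interval from `t_mid` to 1. With a learned velocity field and an adaptive-step solver, a shorter interval would generally need fewer function evaluations, which is the kind of saving the abstract describes.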
Problem

Research questions and friction points this paper is trying to address.

Enhance flow matching for coarse-to-fine text-to-speech synthesis
Improve synthesized speech naturalness with adaptive intermediate states
Reduce inference time by starting from intermediate states and using adaptive-step ODE solvers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shallow flow matching enhances TTS synthesis
Orthogonal projection optimizes temporal state positioning
Lightweight SFM head integrates SFM into multiple TTS models; inference starts from the intermediate state