🤖 AI Summary
To address the limited speech intelligibility and speaker similarity of flow-matching TTS systems, this paper proposes F5R-TTS. Methodologically, it introduces Group Relative Policy Optimization (GRPO) into the flow-matching framework for the first time; reinforcement learning becomes applicable because the model's deterministic output is reformulated as a sample from a predicted Gaussian distribution. A dual-reward GRPO stage then jointly optimizes intelligibility, measured by the word error rate (WER) of an ASR system, and speaker identity preservation, measured by speaker similarity (SIM). The core contribution is this probabilistic reformulation, which places flow matching and policy optimization in a single probabilistic modeling framework. In zero-shot voice cloning experiments, F5R-TTS achieves a 29.5% relative reduction in WER and a 4.6% relative improvement in SIM over a baseline flow-matching TTS system, indicating gains in both intelligibility and speaker similarity.
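As an illustration of why a Gaussian output makes policy-gradient methods applicable, the sketch below scores a sampled output under a diagonal Gaussian. It is a minimal sketch, not the paper's implementation: the function name `gaussian_log_prob` and the assumption that the model predicts a per-dimension mean and standard deviation are hypothetical.

```python
import math

def gaussian_log_prob(x, mean, std):
    """Log-density of x under a diagonal Gaussian N(mean, std^2).

    Hypothetical helper: assumes the model predicts one mean and one
    standard deviation per output dimension.
    """
    total = 0.0
    for xi, mi, si in zip(x, mean, std):
        var = si * si
        # Per-dimension log N(xi | mi, var), summed over dimensions.
        total += -0.5 * math.log(2.0 * math.pi * var) - (xi - mi) ** 2 / (2.0 * var)
    return total

# A sample close to the predicted mean is more likely than a distant one.
near = gaussian_log_prob([0.1, -0.1], mean=[0.0, 0.0], std=[1.0, 1.0])
far = gaussian_log_prob([2.0, -2.0], mean=[0.0, 0.0], std=[1.0, 1.0])
```

A deterministic output has no such likelihood, so there is nothing for a policy-gradient update to raise or lower; a log-probability of this kind is what an RL objective would weight by the reward signal.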
📝 Abstract
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
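The dual-reward GRPO stage can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the linear reward combination, the weights `wer_weight` and `sim_weight`, and both function names are hypothetical. The only part taken from GRPO itself is that each sample's advantage is its reward normalized against the other samples generated for the same prompt.

```python
import statistics

def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: higher SIM is better, lower WER is better."""
    return sim_weight * sim - wer_weight * wer

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Three candidate utterances for one prompt, as (WER, SIM) pairs (made-up values).
samples = [(0.10, 0.70), (0.02, 0.75), (0.30, 0.60)]
rewards = [combined_reward(wer, sim) for wer, sim in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, samples that beat their siblings on the combined reward are reinforced and the rest are penalized, without a separately trained value model.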