PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes a hybrid flow-matching text-to-speech (TTS) system that targets three problems in flow-matching TTS: the trade-off between stability and naturalness, weak cross-lingual voice cloning, and the audio quality ceiling imposed by low-rate mel features. The system combines a duration-guided decoder with an alignment-free decoder through vector-field fusion at inference time, conditions on a sequence of speech-prompt embeddings so that cross-lingual voice cloning needs no prompt transcript, and uses a modified PeriodWave vocoder with super-resolution to 48 kHz for higher fidelity. On cross-lingual in-the-wild data, the method outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving a 23% relative reduction in word error rate (6.9% vs. 9.0%), and exceeds ElevenLabs in speaker similarity (+0.32 SMOS), all with only a short reference utterance and no additional training.
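The 23% figure is a relative reduction, not a difference in percentage points (the absolute gap between 9.0% and 6.9% is 2.1 points). A minimal sketch of the arithmetic follows; the helper name `relative_reduction` is hypothetical and only the two WER values come from the summary.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


# WER values quoted in the summary: 9.0% for the comparison system, 6.9% for PFluxTTS.
wer_reduction = relative_reduction(baseline=9.0, system=6.9)
print(f"{wer_reduction:.1%}")  # prints "23.3%", reported as 23% after rounding
```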

πŸ“ Abstract
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/
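The abstract does not spell out how the two decoders' vector fields are fused, so the following is only a toy sketch of one plausible reading: at every ODE step, both models predict a velocity for the same state, the velocities are blended linearly, and a fixed-step Euler solver integrates the blend. The names `fuse_fields` and `euler_sample`, the `weight` parameter, the linear blending rule, and the constant toy fields are all assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[Vector, float], Vector]


def fuse_fields(field_a: VectorField, field_b: VectorField, weight: float) -> VectorField:
    """Blend two vector fields linearly (assumed fusion rule, not from the paper)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    def fused(x: Vector, t: float) -> Vector:
        va = field_a(x, t)
        vb = field_b(x, t)
        return [weight * a + (1.0 - weight) * b for a, b in zip(va, vb)]

    return fused


def euler_sample(field: VectorField, x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = field(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins for the two decoders: each returns a constant velocity.
def duration_guided(x: Vector, t: float) -> Vector:
    return [1.0, 0.0]


def alignment_free(x: Vector, t: float) -> Vector:
    return [0.0, 1.0]


fused = fuse_fields(duration_guided, alignment_free, weight=0.25)
sample = euler_sample(fused, x0=[0.0, 0.0], num_steps=10)
print([round(v, 6) for v in sample])
```

With these constant toy fields the sample lands at approximately `[0.25, 0.75]`, the weighted mix of the two velocities. In a real system the fields would be neural decoders evaluated on the same latent, and the weight could vary with the time step; both points are left open here.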
Problem

Research questions and friction points this paper is trying to address.

flow-matching TTS
cross-lingual voice cloning
audio quality
stability-naturalness trade-off
mel features
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid flow-matching TTS
inference-time model fusion
cross-lingual voice cloning
prompt-based speaker embedding
high-fidelity vocoder
Authors
Vikentii Pankov
Saint Petersburg University
speech processing, generative modelling, TTS, multiagent technologies, compressive sensing
Artem Gribul
Rask AI, USA
Oktai Tatanov
Rask AI, USA
Vladislav Proskurov
Rask AI, USA
Yuliya Korotkova
École Polytechnique, France
Darima Mylzenova
TBC Bank, Georgia
Dmitrii Vypirailenko
Rask AI, USA