🤖 AI Summary
Problem: Existing zero-shot text-to-speech (TTS) models achieve low speaker similarity when conditioned on a single noisy, spontaneous speech prompt, and they rely heavily on clean, high-quality, multi-sample recordings.
Method: This work introduces low-rank adaptation (LoRA) to TTS personalization for the first time, proposing a lightweight and efficient speaker adaptation framework that fine-tunes both the acoustic model and speaker embedding using only one unscripted, noisy, non-professional utterance.
Contribution/Results: The method removes the conventional reliance on clean, multi-sample recordings while preserving text accuracy and speech naturalness. It improves speaker similarity by up to 30 percentage points over baseline zero-shot approaches. By enabling speaker adaptation from real-world, low-resource audio, the framework broadens the range of usable speech data and supports the construction of diverse, in-the-wild speech corpora.
📝 Abstract
Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial for all speech-related tasks.
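
For readers unfamiliar with LoRA, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes a single linear layer with a frozen pretrained weight matrix and a trainable low-rank update; the class name `LoRALinear`, the rank, the `alpha` scaling factor, and the 256-dimensional layer size are all hypothetical choices made for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weights, kept frozen during adaptation
        # Trainable factors: A is small random, B starts at zero so the
        # adapted layer initially reproduces the pretrained layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T, i.e. W is replaced by W + scale * B A
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256))   # hypothetical pretrained weight matrix
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 256))     # a batch of two hypothetical feature vectors
y = layer.forward(x)
fraction = layer.trainable_params() / W.size
```

With these assumed sizes, the two low-rank factors hold 2,048 parameters against 65,536 in the frozen matrix, about 3% of the layer. This small trainable footprint is what makes adapting to a single short recording plausible without overwriting the pretrained model.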