LoRP-TTS: Low-Rank Personalized Text-To-Speech

📅 2025-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing zero-shot text-to-speech (TTS) models show low speaker similarity when conditioned on a single spontaneous speech prompt recorded in noisy conditions, and they rely heavily on clean, high-quality, multi-sample recordings. Method: This work introduces low-rank adaptation (LoRA) to TTS personalization for the first time, proposing a lightweight speaker adaptation framework that fine-tunes both the acoustic model and the speaker embedding using only one unscripted, noisy, non-professional utterance. Contribution/Results: The method removes the conventional reliance on clean, multi-sample recordings while preserving text accuracy and speech naturalness. It improves speaker similarity by up to 30 percentage points over baseline zero-shot approaches. Experiments show that the framework broadens the range of usable speech data by enabling robust speaker adaptation from real-world, low-resource audio, which supports the construction of diverse, in-the-wild speech corpora.

📝 Abstract

Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advances have led to zero-shot systems that generate realistic speech for a wide range of speakers, using samples of their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training data. In this work, we demonstrate that Low-Rank Adaptation (LoRA) allows us to successfully use even a single recording of spontaneous speech in a noisy environment as a prompt. This approach enhances speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial for all speech-related tasks.
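The abstract does not spell out how LoRA is applied, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes the standard LoRA formulation, in which a frozen weight matrix W is augmented with a trainable low-rank product scaled by alpha / r, so that y = W x + (alpha / r) B (A x). All names (`lora_forward`, `init_lora`, `trainable_fraction`) and the layer sizes are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_lora(d_in, d_out, rank, seed=0):
    """Create LoRA factors: A is small random (rank x d_in), B is zero (d_out x rank).

    Starting B at zero means the adapted layer initially matches the frozen one.
    """
    rng = random.Random(seed)
    lora_a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(d_out)]
    return lora_a, lora_b


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Share of parameters that are trainable relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4, 3, 2  # toy sizes chosen only for illustration
    rng = random.Random(1)
    w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    a, b = init_lora(d_in, d_out, rank)
    x = [1.0, -2.0, 0.5, 3.0]
    print(lora_forward(x, w, a, b, alpha=4.0))
    print(trainable_fraction(1024, 1024, 8))
```

Because only the two small factors are updated, the number of trainable parameters grows with the rank rather than with the full layer size, which is one plausible reason a single short recording can be enough to adapt to a speaker without overwriting the base model.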

Problem

Research questions and friction points this paper is trying to address.

Enhance personalized text-to-speech synthesis

Improve speaker similarity in noisy environments

Utilize low-rank adaptation for diverse speech corpora

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Adaptation (LoRA)

Single noisy recordings

Enhanced speaker similarity
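The page does not say how speaker similarity is measured. A common choice, assumed here purely for illustration, is the cosine similarity between speaker embeddings of the reference recording and the synthesized speech, with gains reported in percentage points. The sketch below uses hypothetical function names and made-up toy embeddings; it reproduces no result from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gain_in_pp(sim_before, sim_after):
    """Difference between two similarity scores on a 0-1 scale, in percentage points."""
    return 100.0 * (sim_after - sim_before)


if __name__ == "__main__":
    # Toy embeddings (made up): the target speaker, a zero-shot output,
    # and an output after hypothetical speaker adaptation.
    target = [1.0, 0.0, 0.0]
    zero_shot = [0.5, 0.8, 0.3]
    adapted = [0.9, 0.3, 0.1]
    before = cosine_similarity(target, zero_shot)
    after = cosine_similarity(target, adapted)
    print(f"gain: {gain_in_pp(before, after):.1f} pp")
```

A percentage-point gain is an absolute difference between two scores, not a relative improvement, so "up to 30 pp" would correspond to, say, a score moving from 0.50 to 0.80 on a 0-1 scale.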
