LoRP-TTS: Low-Rank Personalized Text-To-Speech

📅 2025-02-11
🤖 AI Summary
Problem: Existing zero-shot text-to-speech (TTS) models achieve low speaker similarity when conditioned on a single spontaneous speech prompt recorded in noisy conditions, and they rely heavily on clean, high-quality, multi-sample recordings. Method: This work introduces low-rank adaptation (LoRA) to TTS personalization for the first time, proposing a lightweight speaker adaptation framework that fine-tunes both the acoustic model and the speaker embedding using only one unscripted, noisy, non-professional utterance. Contribution/Results: The method removes the conventional reliance on clean, multi-sample recordings while preserving text accuracy and speech naturalness. It improves speaker similarity by up to 30 percentage points over baseline zero-shot approaches. Experiments indicate that the framework broadens the range of usable speech data by enabling speaker adaptation from real-world, low-resource audio, which supports the construction of diverse, in-the-wild speech corpora.

📝 Abstract
Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, they still struggle with imitating non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that utilizing Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial for all speech-related tasks.
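
To make the low-rank adaptation idea concrete, here is a minimal NumPy sketch of a LoRA-style update on a single linear layer. It is an illustration under stated assumptions, not the paper's implementation: the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are hypothetical choices, and the abstract does not say which layers of the TTS model are adapted.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration; names and defaults are assumptions, not the paper's.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weights, kept frozen
        # Only A and B would be updated during speaker adaptation.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))  # zero init: the update starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                            # frozen path
        update = self.scale * ((x @ self.A.T) @ self.B.T)   # low-rank path
        return base + update

    def trainable_parameters(self):
        return self.A.size + self.B.size
```

For a 512 x 512 layer with rank 4, this sketch would train 4,096 parameters while leaving the 262,144 pretrained weights untouched, which is why adapting from a single short recording becomes tractable.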
Problem

Research questions and friction points this paper is trying to address.

Enhance personalized text-to-speech synthesis
Improve speaker similarity in noisy environments
Utilize low-rank adaptation for diverse speech corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Adaptation (LoRA)
Single noisy recordings
Enhanced speaker similarity
Lukasz Bondaruk
Samsung R&D Institute Poland, Plac Europejski 1, 00-844 Warszawa, Poland
Jakub Kubiak
Samsung R&D Institute Poland
Artificial intelligence, text-to-speech, automatic speech recognition, reinforcement learning