When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of frozen large language models (LLMs) in speech synthesis, which struggle to capture speaker-specific acoustic and perceptual characteristics, resulting in insufficient voice consistency and speech quality. To overcome this, the authors propose an efficient fine-tuning approach based on Low-Rank Adaptation (LoRA) applied to the Qwen-0.5B backbone, combined with training data exhibiting high acoustic diversity. This method significantly enhances voice cloning performance in terms of naturalness, speaker similarity, and signal-to-noise ratio. Experimental results demonstrate improvements of up to 0.42 points in DNS-MOS scores and a 34% increase in signal-to-noise ratio, confirming that LoRA serves not only as a parameter-efficient fine-tuning strategy but also as a critical mechanism for enabling speaker adaptation in compact LLM-based text-to-speech systems.

📝 Abstract
Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments with fine-tuning the language-model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model along three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data: speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when served as a quantized GGUF model.
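To make the adaptation mechanism concrete, here is a minimal, dependency-free sketch of the LoRA update rule the paper relies on (shapes, rank, and scaling are illustrative assumptions, not the paper's actual Qwen-0.5B configuration). Instead of updating a frozen weight matrix W, LoRA learns two small matrices B (d_out × r) and A (r × d_in) with rank r much smaller than the layer dimensions, and the effective weight becomes W + (alpha / r) · B·A.

```python
# Minimal LoRA sketch in plain Python (hypothetical toy dimensions).
# The frozen base weight W stays untouched; only A and B would be trained.

def matmul(X, Y):
    """Multiply two matrices given as lists of row lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight used at inference."""
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update, rank <= r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 2x2 frozen weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[2.0, 0.0]]     # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=1.0, r=1)
# W_eff == [[3.0, 0.0], [0.0, 1.0]]
```

Because only A and B are trained (r·(d_in + d_out) parameters per layer rather than d_in·d_out), the adapter is cheap to fine-tune per speaker, which is why the abstract frames LoRA as both a parameter-efficiency tool and a speaker-adaptation mechanism.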
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Text-to-Speech
Speaker Adaptation
Voice Cloning
Data Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA fine-tuning
data diversity
speaker adaptation
LLM-based TTS
voice cloning