ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of speaker similarity that occurs in low-resource personalized text-to-speech (TTS) when training data is naively augmented with speech synthesized by zero-shot TTS (ZS-TTS). To mitigate this issue, the authors propose a lightweight domain-conditioned training framework that distinguishes real from synthetic speech through a domain embedding, without altering the base model architecture. Combined with oversampling of the real data, this approach preserves speaker characteristics during augmentation. Notably, it is the first to apply domain conditioning to ZS-TTS-based data augmentation. Experiments on LibriTTS and an internal dataset show that the method improves speaker similarity over naive augmentation under extremely limited target-speaker data, while preserving intelligibility and perceptual quality.

📝 Abstract
We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
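The two ingredients named in the abstract, a domain label per utterance and oversampling of the real recordings, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the names `build_training_pool`, `make_domain_table`, and `condition`, the oversampling factor of 4, the embedding size, and the choice to add the domain embedding to a conditioning vector are all hypothetical.

```python
import random

# Hypothetical domain ids: 0 marks real recordings, 1 marks ZS-TTS output.
REAL, SYNTHETIC = 0, 1


def build_training_pool(real_utts, synth_utts, real_oversample=4):
    """Tag every utterance with its domain id and repeat the real ones.

    Repeating the small real set keeps it from being drowned out by the
    much larger synthetic set. The factor is an illustrative assumption.
    """
    pool = [(utt, REAL) for utt in real_utts] * real_oversample
    pool += [(utt, SYNTHETIC) for utt in synth_utts]
    return pool


def make_domain_table(dim, seed=0):
    """Create a 2 x dim table standing in for a learnable domain embedding."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2)]


def condition(hidden, domain_id, table):
    """Add the domain embedding to a conditioning vector (assumed injection point)."""
    return [h + e for h, e in zip(hidden, table[domain_id])]


# Toy data: 2 real recordings and 8 synthetic utterances (made-up file names).
real = ["real_000.wav", "real_001.wav"]
synth = [f"synth_{i:03d}.wav" for i in range(8)]

pool = build_training_pool(real, synth, real_oversample=4)
table = make_domain_table(dim=4)

# During training each item is conditioned on its own domain id.
utt, domain_id = pool[0]
conditioned = condition([0.0] * 4, domain_id, table)
```

In a real system the table would be a trainable embedding layer updated during fine-tuning, and the pool would feed a shuffled data loader. A natural choice at synthesis time, assumed here and not stated in the abstract, is to condition on the `REAL` id so that the output follows the real-speech domain.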
Problem

Research questions and friction points this paper is trying to address.

zero-shot TTS
data augmentation
personalized speech synthesis
speaker similarity
low-resource
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot TTS
data augmentation
domain-conditioned training
personalized speech synthesis
speaker similarity
Youngwon Choi
MAUM AI Inc.
Conversational AI
Jinwoo Oh
Humelo Inc., Republic of Korea
Hwayeon Kim
Maum AI Inc., Republic of Korea
Hyeonyu Kim
Maum AI Inc., Republic of Korea