The Codec Language Model-Based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024

📅 2024-11-07
🏛️ International Symposium on Chinese Spoken Language Processing
🤖 AI Summary
This work addresses zero-shot spontaneous-style voice cloning for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). The system is a LLaMA-based codec language model that uses a delay pattern when predicting speech tokens. Classifier-Free Guidance (CFG) is applied in the language model to strengthen conditional guidance on token prediction and improve speech intelligibility. To raise output quality, the authors apply dedicated data preprocessing and fine-tune the model on selected high-quality spontaneous speech data. In the official CoVoC constrained-track evaluation, the system achieves the best speech naturalness MOS of 3.80, with considerable speech quality and speaker similarity results.

📝 Abstract
This paper describes the zero-shot spontaneous style TTS system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous style voice cloning. To improve speech intelligibility, we introduce the Classifier-Free Guidance (CFG) strategy in the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we adopt effective data preprocessing operations and fine-tune our model with selected high-quality spontaneous speech data. The official evaluations in the CoVoC constrained track show that our system achieves the best speech naturalness MOS of 3.80 and obtains considerable speech quality and speaker similarity results.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot spontaneous style TTS system
Improve speech intelligibility with CFG
High-quality utterance generation via fine-tuning
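
To make the CFG item concrete, here is a minimal sketch of how classifier-free guidance is commonly applied to next-token logits in a language model. The page does not give the authors' exact formula or guidance scale, so the blending rule, the function names (cfg_logits, argmax) and the example values are assumptions for illustration only.

```python
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Blend conditional and unconditional logits (a common CFG form, assumed here).

    cond   -- logits predicted with the text/prompt condition
    uncond -- logits predicted with the condition dropped
    scale  -- guidance scale (hypothetical parameter; not given in the source)
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    # scale = 1.0 returns the conditional logits unchanged;
    # scale > 1.0 moves the prediction further toward the condition.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    cond = [2.0, 1.0]    # toy logits over a 2-token vocabulary
    uncond = [0.0, 3.0]
    guided = cfg_logits(cond, uncond, scale=2.0)
    print(guided, argmax(guided))
```

In this toy case the unconditional model prefers token 1, while the guided logits prefer token 0, which is the token favoured by the condition.
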
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLaMA-based codec language model
Classifier-Free Guidance strategy
High-quality data preprocessing
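
The abstract mentions that the codec language model uses a delay pattern. The page does not describe the exact layout, so the sketch below assumes the widely used variant in which codebook k is shifted right by k steps, letting one decoding step emit one token per codebook. The padding id PAD and the helper names apply_delay and undo_delay are hypothetical.

```python
from typing import List

PAD = -1  # hypothetical padding token id


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps.

    codes holds K rows (codebooks) of T tokens each; the result has
    K rows of T + K - 1 tokens, padded where no token is available yet.
    """
    k_total = len(codes)
    t_total = len(codes[0]) if codes else 0
    if any(len(row) != t_total for row in codes):
        raise ValueError("all codebooks must have the same number of frames")
    return [
        [pad] * k + list(row) + [pad] * (k_total - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay, recovering the K x T frame-aligned codes."""
    k_total = len(delayed)
    if k_total == 0:
        return []
    t_total = len(delayed[0]) - (k_total - 1)
    return [row[k:k + t_total] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6]]  # 2 codebooks, 3 frames
    delayed = apply_delay(codes)
    print(delayed)
    print(undo_delay(delayed) == codes)
```

Under this assumed layout, the token of a later codebook for a given frame is predicted one step after the earlier codebook's token for that frame, so it can be conditioned on it.
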
Authors

Shuoyi Zhou
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Yixuan Zhou
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Weiqin Li
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Jun Chen
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Runchuan Ye
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Weihao Wu
Tsinghua University
Zijian Lin
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Shunwei Lei
Shenzhen International Graduate School, Tsinghua University, Shenzhen
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen