Hierarchical Control of Emotion Rendering in Speech Synthesis

📅 2024-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained, quantifiable control of multi-level emotion intensity remains challenging in expressive speech synthesis. Method: The paper proposes a hierarchical emotion rendering framework built on a diffusion-based TTS model, enabling controllable synthesis of phoneme-, word-, and utterance-level emotion intensities. It introduces a hierarchical Emotion Distribution (ED) extractor that produces quantifiable emotion embeddings across speech segments and correlates them with linguistic and speaker information. The method combines reference-audio-driven emotion embedding learning, hierarchical feature encoding, and an analysis of how acoustic features (F0, energy, duration, and Mel-spectral dynamics) affect emotion intensity modeling. Results: Experiments show improvements over baselines in both objective metrics (e.g., MCD, F0 RMSE) and subjective evaluations (e.g., MOS, Emo-MOS, emotion control accuracy), yielding fine-grained, adjustable, high-fidelity emotional speech.
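To make the extraction step concrete, here is a minimal sketch of how a hierarchical ED could be read as segment-wise pooling of frame-level emotion scores. It assumes soft per-frame scores from a pretrained speech-emotion classifier and phoneme/word boundaries from forced alignment; neither the function names nor the pooling choice comes from the paper, they are illustrative only.

```python
import numpy as np

def segment_intensity(frame_scores: np.ndarray, start: int, end: int) -> np.ndarray:
    """Average per-frame emotion scores over the segment [start, end)."""
    return frame_scores[start:end].mean(axis=0)

def hierarchical_ed(frame_scores, phoneme_bounds, word_bounds):
    """
    frame_scores: (T, E) soft scores over E emotion classes per frame
                  (assumed to come from a pretrained emotion classifier).
    phoneme_bounds / word_bounds: lists of (start, end) frame indices,
                  assumed to come from forced alignment.
    Returns phoneme-, word-, and utterance-level ED vectors.
    """
    phoneme_ed = np.stack([segment_intensity(frame_scores, s, e)
                           for s, e in phoneme_bounds])
    word_ed = np.stack([segment_intensity(frame_scores, s, e)
                        for s, e in word_bounds])
    utterance_ed = frame_scores.mean(axis=0, keepdims=True)
    return phoneme_ed, word_ed, utterance_ed

# Toy usage: 100 frames, 4 emotion classes.
rng = np.random.default_rng(0)
scores = rng.random((100, 4))
scores /= scores.sum(axis=1, keepdims=True)  # normalize each frame to a distribution
ph_ed, w_ed, utt_ed = hierarchical_ed(
    scores, phoneme_bounds=[(0, 25), (25, 60), (60, 100)], word_bounds=[(0, 60), (60, 100)])
print(ph_ed.shape, w_ed.shape, utt_ed.shape)  # (3, 4) (2, 4) (1, 4)
```

Mean pooling is just one plausible aggregation; the point is that each phoneme, word, and the utterance ends up with its own quantifiable emotion vector, which is what makes per-segment control possible downstream.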

📝 Abstract
Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
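The abstract's claim of quantitative control over speech constituents suggests that, at inference, the hierarchical ED vectors can be edited before they condition the acoustic model. The sketch below is a hypothetical interface for that step; the scaling function and the emotion-channel ordering are assumptions, not the paper's API.

```python
import numpy as np

def scale_word_intensity(word_ed: np.ndarray, word_idx: int,
                         emotion_idx: int, factor: float) -> np.ndarray:
    """Scale one emotion channel's intensity for one word, clipped to [0, 1]."""
    ed = word_ed.copy()
    ed[word_idx, emotion_idx] = np.clip(ed[word_idx, emotion_idx] * factor, 0.0, 1.0)
    return ed

# Example: boost the third emotion channel on the second word by 50%
# before passing the edited ED to the TTS model as conditioning.
word_ed = np.array([[0.1, 0.6, 0.2, 0.1],
                    [0.2, 0.3, 0.4, 0.1]])  # (words, emotion channels), assumed layout
controlled = scale_word_intensity(word_ed, word_idx=1, emotion_idx=2, factor=1.5)
print(controlled[1])  # -> [0.2 0.3 0.6 0.1]
```

Clipping keeps the edited value in the same normalized range as the extracted EDs, so the conditioning distribution at inference stays comparable to what the model saw in training.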
Problem

Research questions and friction points this paper is trying to address.

Emotional Expression
Text-to-Speech
Computer-Generated Voice
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Intensity Control
Multilevel Emotional Expression
Voice Quality and Emotional Delivery
Sho Inoue
The Chinese University of Hong Kong, Shenzhen, PhD Candidate in Computer Science
Speech Synthesis
Kun Zhou
Alibaba Group, Singapore
Shuai Wang
School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China; Shenzhen Research Institute of Big Data, Shenzhen, China
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; National University of Singapore (NUS), Singapore
Automatic Speech Recognition
Speaker Recognition
Language Recognition
Voice Conversion
Machine Translation