Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of quantitatively modeling and disentangling multi-level emotions (utterance-, word-, and phoneme-level) in text-to-speech (TTS). We propose a multi-step hierarchical emotion distribution (ED) prediction framework. Methodologically, we design a hierarchical variance prediction module and a multi-step inference strategy, incorporating global emotional context to constrain local emotional variations, and introduce a variance adapter for plug-and-play integration with mainstream TTS architectures. To our knowledge, this is the first approach enabling fine-grained, cross-granularity emotional rendering with independent control over emotion intensity at each level. Objective evaluations and subjective MOS tests demonstrate significant improvements in emotional expressiveness, outperforming existing baselines in both semantic consistency and emotional naturalness. The framework supports interpretable, fine-grained intervention on emotion intensity.
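The "interpretable, fine-grained intervention on emotion intensity" mentioned above can be pictured with a tiny sketch: an emotion distribution (ED) at any level is a probability vector per unit (utterance, word, or phoneme), so scaling one emotion's weight and renormalizing yields a controllable intensity knob. This is an illustrative NumPy reading of that idea; the function name, API, and numbers are assumptions, not the paper's implementation.

```python
import numpy as np

def scale_intensity(ed, emotion_idx, factor):
    """Scale one emotion's weight in an ED, then renormalize.

    ed: array of shape (..., K) whose last axis sums to 1
        (K emotion categories). Illustrative helper only.
    """
    ed = np.asarray(ed, dtype=float).copy()
    ed[..., emotion_idx] *= factor
    # Renormalize so each unit's ED remains a valid distribution
    return ed / ed.sum(axis=-1, keepdims=True)

# Hypothetical word-level ED for 2 words over 3 emotions
# (e.g. neutral, happy, sad)
ed_word = np.array([[0.6, 0.3, 0.1],
                    [0.5, 0.2, 0.3]])

# Double the "happy" intensity at the word level only,
# leaving other levels untouched
boosted = scale_intensity(ed_word, emotion_idx=1, factor=2.0)
print(boosted.round(3))
```

Because each level has its own ED, the same operation applied at the utterance, word, or phoneme level gives the independent per-level control the summary describes.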

📝 Abstract
We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech synthesis (TTS). We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. Our approach is validated through its integration into a variance adaptor and an external module design compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
Problem

Research questions and friction points this paper is trying to address.

Achieve multi-level, quantitative control of emotion rendering in TTS
Predict hierarchical emotion variance at the utterance, word, and phoneme levels
Enhance emotional expressiveness and enable precise emotion rendering in synthesized speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step hierarchical ED prediction module quantifying emotion variance at utterance, word, and phoneme levels
Global emotional context constrains and refines local emotional variations
Variance-adaptor design for plug-and-play integration with various TTS systems
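The multi-step idea above, where global emotional context conditions progressively finer levels, can be sketched in a few lines of NumPy. Everything here is illustrative: the dimensions, the random linear predictors standing in for trained layers, and the mean-pooling "text encoder" are assumptions, not the authors' architecture. The point is the three-step flow: predict the utterance-level ED first, then condition word-level predictions on it, then condition phoneme-level predictions on each phoneme's parent word.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative dimensions (not from the paper)
K = 4                          # emotion categories
D = 8                          # text-encoder feature size
n_words = 3
phones_per_word = [2, 3, 2]    # phoneme counts per word

# Toy "text encoder" output: one feature vector per phoneme
phone_feats = rng.normal(size=(sum(phones_per_word), D))
# Word features = mean-pool over each word's phonemes; utterance = global mean
bounds = np.cumsum([0] + phones_per_word)
word_feats = np.stack([phone_feats[a:b].mean(0)
                       for a, b in zip(bounds, bounds[1:])])
utt_feat = phone_feats.mean(0)

# Random weights stand in for trained linear predictors
W_utt = rng.normal(size=(D, K))
W_word = rng.normal(size=(D + K, K))
W_phone = rng.normal(size=(D + K, K))

# Step 1: utterance-level ED from global features
ed_utt = softmax(utt_feat @ W_utt)

# Step 2: word-level ED, conditioned on the utterance ED (global -> local)
ed_word = softmax(np.concatenate(
    [word_feats, np.tile(ed_utt, (n_words, 1))], axis=1) @ W_word)

# Step 3: phoneme-level ED, conditioned on the parent word's ED
word_of_phone = np.repeat(np.arange(n_words), phones_per_word)
ed_phone = softmax(np.concatenate(
    [phone_feats, ed_word[word_of_phone]], axis=1) @ W_phone)

print(ed_utt.shape, ed_word.shape, ed_phone.shape)  # (4,) (3, 4) (7, 4)
```

In a real system these predictors would be learned modules inside a variance adaptor, and the three EDs would condition the acoustic decoder; the sketch only shows how coarser predictions constrain finer ones.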