🤖 AI Summary
This work addresses the challenge of quantitatively modeling and disentangling multi-level emotions (utterance-, word-, and phoneme-level) in text-to-speech (TTS). We propose a multi-step hierarchical emotion distribution (ED) prediction framework. Methodologically, we design a hierarchical variance prediction module and a multi-step inference strategy that uses global emotional context to constrain local emotional variations, and we introduce a variance adaptor for plug-and-play integration with mainstream TTS architectures. To our knowledge, this is the first approach to enable fine-grained, cross-granularity emotional rendering with independent control over emotion intensity at each level. Objective evaluations and subjective MOS tests show that the framework significantly improves emotional expressiveness and outperforms existing baselines in both semantic consistency and emotional naturalness, while supporting interpretable, fine-grained intervention on emotion intensity.
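As a rough illustration of the plug-and-play idea, the sketch below shows one way hierarchical EDs (soft per-emotion intensity vectors at the utterance, word, and phoneme levels) could be injected into phoneme-level encoder outputs in a variance-adaptor-style module. All names, layer sizes, and the emotion inventory are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 5  # hypothetical emotion inventory, e.g. {neutral, happy, sad, angry, surprise}


class HierarchicalEDAdaptor(nn.Module):
    """Sketch of a variance-adaptor-style module that injects utterance-,
    word-, and phoneme-level emotion distributions (EDs) into phoneme
    encodings before the decoder. Layer sizes are assumptions."""

    def __init__(self, hidden_dim: int = 256, num_emotions: int = NUM_EMOTIONS):
        super().__init__()
        # One projection per granularity; each ED is a soft-label vector
        # of per-emotion intensities in [0, 1].
        self.utt_proj = nn.Linear(num_emotions, hidden_dim)
        self.word_proj = nn.Linear(num_emotions, hidden_dim)
        self.phon_proj = nn.Linear(num_emotions, hidden_dim)

    def forward(self, phoneme_hidden, utt_ed, word_ed, phon_ed, word_to_phoneme):
        """
        phoneme_hidden : (B, T_phon, hidden_dim) phoneme encoder outputs
        utt_ed         : (B, num_emotions)          utterance-level ED
        word_ed        : (B, T_word, num_emotions)  word-level ED
        phon_ed        : (B, T_phon, num_emotions)  phoneme-level ED
        word_to_phoneme: (B, T_phon) long indices mapping each phoneme
                         to its parent word (to broadcast word EDs)
        """
        # Broadcast the utterance-level ED over all phoneme positions.
        utt_emb = self.utt_proj(utt_ed).unsqueeze(1)                 # (B, 1, H)
        # Gather each phoneme's parent-word ED, then project.
        gathered = torch.gather(
            word_ed, 1,
            word_to_phoneme.unsqueeze(-1).expand(-1, -1, word_ed.size(-1)),
        )                                                            # (B, T_phon, E)
        word_emb = self.word_proj(gathered)                          # (B, T_phon, H)
        phon_emb = self.phon_proj(phon_ed)                           # (B, T_phon, H)
        # Residual injection keeps the module plug-and-play: the base
        # TTS model is recovered when all EDs are zero, and each level's
        # intensity can be scaled independently at inference time.
        return phoneme_hidden + utt_emb + word_emb + phon_emb
```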
📝 Abstract
We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech (TTS) synthesis. We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. We validate our approach both by integrating it into a variance adaptor and as an external module compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
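To make the multi-step, coarse-to-fine idea concrete, the sketch below predicts the utterance-level ED first and feeds it back as context when predicting word- and phoneme-level EDs, so the global emotion constrains local variation. The module structure (GRUs, mean pooling, sigmoid heads) and all sizes are assumptions for illustration only, not the authors' architecture.

```python
import torch
import torch.nn as nn


class MultiStepEDPredictor(nn.Module):
    """Sketch of a coarse-to-fine ED predictor: utterance ED is predicted
    first, then conditions the word-level prediction, which in turn
    conditions the phoneme-level prediction."""

    def __init__(self, text_dim: int = 256, num_emotions: int = 5):
        super().__init__()
        self.utt_head = nn.Sequential(nn.Linear(text_dim, num_emotions), nn.Sigmoid())
        self.word_rnn = nn.GRU(text_dim + num_emotions, text_dim, batch_first=True)
        self.word_head = nn.Sequential(nn.Linear(text_dim, num_emotions), nn.Sigmoid())
        self.phon_rnn = nn.GRU(text_dim + 2 * num_emotions, text_dim, batch_first=True)
        self.phon_head = nn.Sequential(nn.Linear(text_dim, num_emotions), nn.Sigmoid())

    def forward(self, word_feats, phon_feats, word_to_phoneme):
        """
        word_feats     : (B, T_word, text_dim) word-level text encodings
        phon_feats     : (B, T_phon, text_dim) phoneme-level text encodings
        word_to_phoneme: (B, T_phon) parent-word index of each phoneme
        """
        # Step 1: utterance-level ED from mean-pooled word features.
        utt_ed = self.utt_head(word_feats.mean(dim=1))                     # (B, E)

        # Step 2: word-level ED, conditioned on the utterance-level ED.
        utt_ctx = utt_ed.unsqueeze(1).expand(-1, word_feats.size(1), -1)
        word_hid, _ = self.word_rnn(torch.cat([word_feats, utt_ctx], dim=-1))
        word_ed = self.word_head(word_hid)                                 # (B, T_word, E)

        # Step 3: phoneme-level ED, conditioned on utterance and parent-word EDs.
        parent_ed = torch.gather(
            word_ed, 1,
            word_to_phoneme.unsqueeze(-1).expand(-1, -1, word_ed.size(-1)),
        )
        utt_ctx_p = utt_ed.unsqueeze(1).expand(-1, phon_feats.size(1), -1)
        phon_hid, _ = self.phon_rnn(torch.cat([phon_feats, utt_ctx_p, parent_ed], dim=-1))
        phon_ed = self.phon_head(phon_hid)                                 # (B, T_phon, E)
        return utt_ed, word_ed, phon_ed
```

The design choice this sketch illustrates is the refinement order: each finer level only sees the coarser predictions as conditioning input, which is one simple way to let the global emotional context constrain local emotional variation while still allowing per-level intensities to be edited independently.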