🤖 AI Summary
Existing emotional TTS approaches rely on single-modality inputs (e.g., textual emotion labels) or coarse-grained emotion categories, and so fail to capture the complexity and multidimensionality of human affect, resulting in limited expressiveness and weak emotional resonance. To address this, the paper proposes UMETTS, a unified multimodal prompt-driven emotional TTS framework. The method introduces, for the first time, a cross-modal Emotion Prompt Alignment Module (EP-Align) that enforces semantic consistency among textual, acoustic, and visual emotional cues, and pairs it with an Emotion Embedding-Induced TTS Module (EMI-TTS) that maps the aligned embeddings to precise prosody and timbre. Built on FastSpeech 2 and VITS backbones, the framework combines contrastive learning with end-to-end acoustic modeling. Experiments report state-of-the-art performance: a +12.3% improvement in emotion recognition accuracy and a +0.8 MOS gain in naturalness, with consistent superiority on both objective and subjective metrics.
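The alignment step summarized above lends itself to a CLIP-style contrastive objective. Below is a minimal sketch of how such cross-modal emotion alignment could look in PyTorch; the module name, projection sizes, temperature handling, and pairwise InfoNCE structure are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

class EmotionPromptAlign(nn.Module):
    """Projects text/audio/visual emotion features into one shared space."""
    def __init__(self, d_text=768, d_audio=512, d_visual=512, d_shared=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_shared)
        self.audio_proj = nn.Linear(d_audio, d_shared)
        self.visual_proj = nn.Linear(d_visual, d_shared)
        # Learnable temperature, parameterized in log space for stability.
        self.log_temp = nn.Parameter(torch.tensor(0.07).log())

    def forward(self, text_feat, audio_feat, visual_feat):
        # L2-normalize so dot products become cosine similarities.
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        v = F.normalize(self.visual_proj(visual_feat), dim=-1)
        return t, a, v

def info_nce(x, y, temp):
    """Symmetric InfoNCE: matched batch items (the diagonal) are positives."""
    logits = x @ y.t() / temp
    labels = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def ep_align_loss(t, a, v, temp):
    # Pull each modality pair together so emotional semantics agree.
    return info_nce(t, a, temp) + info_nce(t, v, temp) + info_nce(a, v, temp)

# Usage with random placeholder features (batch of 8 utterances).
model = EmotionPromptAlign()
t, a, v = model(torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 512))
loss = ep_align_loss(t, a, v, model.log_temp.exp())
```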
📝 Abstract
Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, relying primarily on oversimplified emotion labels or single-modality inputs. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) EMI-TTS then integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations show that UMETTS achieves significant improvements in emotion accuracy and speech naturalness, outperforming traditional E-TTS methods on both objective and subjective metrics.
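As one way to make the EMI-TTS idea concrete, the sketch below conditions a FastSpeech 2-like encoder on a per-utterance emotion embedding by broadcast-adding a projected copy onto the phoneme hidden states before variance (pitch/energy/duration) prediction. The injection point and dimensions are assumptions for illustration; the paper's actual integration with FastSpeech 2 and VITS may differ.

```python
import torch
from torch import nn

class EmotionConditioner(nn.Module):
    """Broadcasts a per-utterance emotion embedding over phoneme hidden states."""
    def __init__(self, d_emotion=256, d_hidden=256):
        super().__init__()
        self.proj = nn.Linear(d_emotion, d_hidden)

    def forward(self, hidden, emotion_emb):
        # hidden: (batch, phoneme_len, d_hidden); emotion_emb: (batch, d_emotion)
        cond = self.proj(emotion_emb).unsqueeze(1)  # (batch, 1, d_hidden)
        # Every phoneme position receives the same utterance-level emotion cue.
        return hidden + cond

# Usage: condition encoder outputs before the variance adaptor.
conditioner = EmotionConditioner()
hidden = torch.randn(2, 37, 256)   # placeholder encoder outputs
emotion = torch.randn(2, 256)      # aligned embedding from the EP-Align stage
conditioned = conditioner(hidden, emotion)
```

Broadcast-addition is the simplest conditioning choice; concatenation followed by a linear layer, or FiLM-style scale-and-shift, are common alternatives with the same interface.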