UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts

📅 2024-04-29
📈 Citations: 4
Influential: 0
🤖 AI Summary
Existing emotional TTS approaches rely on single-modality inputs (e.g., textual emotion labels) or coarse-grained emotion categories, and so fail to capture the complexity and multidimensionality of human affect, resulting in limited expressiveness and weak emotional resonance. To address this, we propose a unified multimodal prompt-driven emotional TTS framework. The method introduces, for the first time, a cross-modal Emotion Prompt Alignment Module (EP-Align) that enforces semantic consistency among textual, acoustic, and visual emotional cues, together with an Emotion Embedding-Induced TTS Module (EMI-TTS) that maps the aligned embeddings to precise prosody and timbre. Built upon FastSpeech 2 and VITS architectures, the framework integrates contrastive learning with end-to-end modeling. Experiments demonstrate state-of-the-art performance: a 12.3% improvement in emotion recognition accuracy and a 0.8 MOS gain in naturalness, with consistent superiority across both objective and subjective evaluation metrics.
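
To make the alignment idea concrete, below is a minimal sketch of an InfoNCE-style contrastive loss over paired text, audio, and visual emotion embeddings, in the spirit of EP-Align. The encoder outputs, dimensions, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: pairwise InfoNCE alignment across three modalities,
# assuming each encoder already produces one emotion embedding per utterance.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matching rows of `a` and `b` (same utterance, different modality)
    are positives; all other rows in the batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def ep_align_loss(text_emb, audio_emb, visual_emb):
    # Align every modality pair so all three share one emotion space.
    return (info_nce(text_emb, audio_emb) +
            info_nce(text_emb, visual_emb) +
            info_nce(audio_emb, visual_emb))
```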

📝 Abstract
Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations show that UMETTS achieves significant improvements in emotion accuracy and speech naturalness, outperforming traditional E-TTS methods on both objective and subjective metrics.
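
As a rough illustration of the EMI-TTS conditioning step, the sketch below broadcast-adds a projected emotion embedding to the hidden states of a FastSpeech 2-style phoneme encoder. The projection-then-add scheme and the dimensions are assumptions for illustration; the paper's exact injection mechanism is not specified here.

```python
# Sketch only: injecting an aligned emotion embedding into an acoustic model.
import torch
import torch.nn as nn

class EmotionInjection(nn.Module):
    """Broadcast-add a projected emotion embedding to encoder hidden states."""

    def __init__(self, emotion_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(emotion_dim, hidden_dim)

    def forward(self, encoder_out: torch.Tensor, emotion_emb: torch.Tensor) -> torch.Tensor:
        # encoder_out: (B, T, hidden_dim) phoneme-level hidden states
        # emotion_emb: (B, emotion_dim) aligned embedding from EP-Align
        cond = self.proj(emotion_emb).unsqueeze(1)  # (B, 1, hidden_dim)
        return encoder_out + cond                   # broadcast over time steps
```

The conditioned hidden states would then feed the usual variance adaptor and decoder (FastSpeech 2) or posterior/flow modules (VITS) unchanged.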
Problem

Research questions and friction points this paper is trying to address.

How to enhance emotional accuracy in text-to-speech synthesis.
How to integrate multimodal cues for expressive speech generation.
How to improve the naturalness and coherence of synthesized emotional speech.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal emotional feature alignment
Contrastive learning for emotion synthesis
State-of-the-art TTS integration