Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing conversational speech synthesis (CSS) approaches primarily model utterance-level contextual interactions, neglecting fine-grained semantic–prosodic couplings within multimodal dialogue history (MDH). This work proposes a dual fine-grained interaction graph model that, for the first time, jointly captures cross-modal semantic–prosodic dynamics at the word level. Specifically, we construct separate semantic and prosodic graphs using a multimodal graph neural network and introduce a context-aware feature enhancement mechanism to enable fine-grained, bidirectional interaction encoding. Evaluated on the DailyTalk dataset, our method significantly improves prosodic naturalness and expressiveness of synthesized speech over state-of-the-art baselines. To foster reproducibility and further research, we publicly release both source code and audio samples.
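The summary above describes building word-level semantic and prosodic graphs over the multimodal dialogue history and encoding them with a graph neural network. As a minimal sketch of that idea, the toy example below builds one word-level interaction graph (edges between adjacent words within an utterance, plus links from each utterance's words to the next utterance's words) and applies one generic GCN-style message-passing layer to separate semantic and prosody features before fusing them. All shapes, the edge-construction rule, and the layer form are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dialogue history: two utterances, word-level nodes.
words = ["hello", "there", "how", "are", "you"]
utt_id = [0, 0, 1, 1, 1]   # which utterance each word belongs to
d = 8                      # feature dimension (assumed)

sem = rng.normal(size=(len(words), d))  # word-level semantic features (e.g. text-encoder outputs)
pro = rng.normal(size=(len(words), d))  # word-level prosody features (e.g. speech-encoder outputs)

# Fine-grained interaction graph: self-loops, adjacent words within an
# utterance, and links from every word of utterance t to utterance t+1.
n = len(words)
A = np.eye(n)
for i in range(n - 1):
    if utt_id[i] == utt_id[i + 1]:
        A[i, i + 1] = A[i + 1, i] = 1.0
for i in range(n):
    for j in range(n):
        if utt_id[j] == utt_id[i] + 1:
            A[i, j] = A[j, i] = 1.0

# Symmetric normalisation D^{-1/2} A D^{-1/2}, as in a vanilla GCN layer.
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))

def gcn_layer(x, w):
    """One round of message passing over the word graph, then ReLU."""
    return np.maximum(A_hat @ x @ w, 0.0)

W_sem = rng.normal(size=(d, d))
W_pro = rng.normal(size=(d, d))
sem_ctx = gcn_layer(sem, W_sem)  # semantic interaction graph encoding
pro_ctx = gcn_layer(pro, W_pro)  # prosody interaction graph encoding

# Cross-modal fusion: concatenate the two context-aware encodings per word.
fused = np.concatenate([sem_ctx, pro_ctx], axis=1)
print(fused.shape)  # (5, 16)
```

The two graphs share topology here only for brevity; the paper's semantic and prosody graphs are described as separate, specialized structures.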

📝 Abstract
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). Recent work predicts the prosodic expression of the target utterance by modeling utterance-level interactions between the MDH and the target utterance. However, the MDH also contains fine-grained semantic and prosodic knowledge at the word level, which existing methods overlook. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These graphs encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in the MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
Problem

Research questions and friction points this paper is trying to address.

Modeling word-level semantic and prosodic interactions in dialogue history
Enhancing conversational speech synthesis with natural prosody
Addressing overlooked fine-grained multimodal context interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fine-grained semantic interaction graph modeling
Multimodal fine-grained prosody interaction graph modeling
Word-level semantic-prosody interaction encoding for synthesis
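The last bullet, together with the summary's "context-aware feature enhancement mechanism", suggests that graph-encoded word-level context is used to condition the target utterance before synthesis. The sketch below shows one generic way such an enhancement could work: a single attention read from the target-utterance representation over the word-level context vectors, followed by a residual update. The attention form, dimensions, and variable names are assumptions for illustration, not the paper's specified mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # assumed feature dimension

# 5 word-level context vectors from the interaction graphs, plus one
# target-utterance query vector (both hypothetical).
word_ctx = rng.normal(size=(5, d))  # graph-encoded MDH word features
target = rng.normal(size=(d,))      # target-utterance representation

# Scaled dot-product attention over the word nodes.
scores = word_ctx @ target / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ word_ctx        # prosody-relevant dialogue context

# Residual enhancement of the target features before the TTS decoder.
enhanced = target + context
print(enhanced.shape)  # (16,)
```

In a full system, `enhanced` would feed the acoustic decoder so that the predicted prosody reflects the fine-grained dialogue context rather than the target text alone.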