🤖 AI Summary
Current speech synthesis datasets suffer from high construction costs, limited speaker diversity, and insufficient coverage of scenarios and emotional expressions. To address these limitations, this work proposes DialogueAgents, a hybrid agent-based framework for bilingual, multi-party, multi-turn conversational text-to-speech (TTS), built on a three-agent collaborative architecture (a script writer, a speech synthesizer, and a dialogue critic) combined with character-pool-grounded iterative refinement and emotion-enhanced modeling. The framework leverages large language models (LLMs) for script generation and uses the critic's speech reviews as feedback to iteratively improve the synthesized dialogues. Using this framework, the authors contribute MultiTalk, an open-source bilingual, multi-party conversational speech dataset covering diverse topics. Experimental results demonstrate the effectiveness of the framework and the quality of MultiTalk, with the synthesized speech exhibiting speaker and prosodic diversity as well as enhanced emotional expressivity.
📝 Abstract
Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on the critic's speech reviews, boosting the emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
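The write-synthesize-critique loop described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all class and method names here (`ScriptWriter`, `SpeechSynthesizer`, `DialogueCritic`, `generate_dialogue`) are hypothetical stand-ins for the LLM-backed agents and TTS engine the framework actually uses.

```python
# Hypothetical sketch of DialogueAgents' three-agent refinement loop.
# Real agents would call an LLM (script writer, critic) and a TTS model
# (synthesizer); here they are stubbed to show the control flow only.
from dataclasses import dataclass


@dataclass
class Character:
    """An entry in the character pool that grounds script generation."""
    name: str
    persona: str


class ScriptWriter:
    """Drafts a multi-turn dialogue script; revises it when given feedback."""

    def write(self, characters, feedback=None):
        turns = [(c.name, f"[{c.persona}] line on the topic") for c in characters]
        if feedback is not None:
            # On a revision pass, incorporate the critic's feedback.
            turns.append((characters[0].name, f"(revised per: {feedback})"))
        return turns


class SpeechSynthesizer:
    """Stand-in for a TTS engine: maps each script turn to an utterance."""

    def synthesize(self, script):
        return [f"<audio:{speaker}>" for speaker, _text in script]


class DialogueCritic:
    """Reviews synthesized speech; returns feedback until satisfied."""

    def __init__(self, max_rounds=2):
        self.rounds = 0
        self.max_rounds = max_rounds

    def review(self, audio):
        self.rounds += 1
        if self.rounds < self.max_rounds:
            return "increase emotional expressiveness"
        return None  # accept the dialogue


def generate_dialogue(characters):
    """Iteratively refine a dialogue until the critic accepts it."""
    writer, tts, critic = ScriptWriter(), SpeechSynthesizer(), DialogueCritic()
    feedback = None
    while True:
        script = writer.write(characters, feedback)
        audio = tts.synthesize(script)
        feedback = critic.review(audio)
        if feedback is None:
            return script, audio
```

Under these assumptions, the loop terminates when the critic stops emitting feedback; each extra critique round yields a revised script that the synthesizer re-renders, mirroring the iterative refinement the abstract describes.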