DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech synthesis datasets suffer from high construction costs, limited speaker diversity, and insufficient coverage of scenarios and emotional expressions. To address these limitations, this work proposes a multi-speaker, multi-turn, multilingual conversational text-to-speech (TTS) framework featuring a novel three-agent collaborative architecture—script generation, speech synthesis, and dialogue evaluation—integrated with role-pool-driven iterative optimization and emotion-enhanced modeling. The framework leverages large language models (LLMs), role-aware acoustic modeling, and reinforcement-based feedback evaluation. We release MultiTalk, the first open-source bilingual, multi-speaker conversational TTS dataset. Experimental results demonstrate that the synthesized speech achieves high fidelity, strong speaker and prosodic diversity, and enhanced emotional expressivity. Furthermore, MultiTalk significantly improves performance on downstream TTS and conversational understanding tasks.

📝 Abstract
Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting the emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
Problem

Research questions and friction points this paper is trying to address.

High construction costs and limited diversity in speech datasets
Lack of emotional expressiveness in synthesized dialogues
Need for multi-party, multi-turn dialogue generation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid agent-based framework for dialogue synthesis
Iterative script refinement with diverse characters
Generates bilingual multi-party dialogue dataset
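The iterative refinement described above — a script writer drafts a dialogue, a synthesizer renders it, and a critic reviews the result to drive the next revision — can be sketched as a simple loop. This is an illustrative mock-up only: the class names, scoring scheme, and feedback strings are placeholders, not the authors' actual API or models.

```python
class ScriptWriter:
    """Stands in for the LLM script-writing agent (placeholder)."""
    def write(self, topic, characters, feedback=None):
        # A real agent would draft or revise a multi-turn script with an LLM.
        note = f" (revised per: {feedback})" if feedback else ""
        return [f"{c}: line about {topic}{note}" for c in characters]

class SpeechSynthesizer:
    """Stands in for the TTS agent (placeholder)."""
    def synthesize(self, script):
        # A real agent would render each turn with a TTS model.
        return [f"audio<{turn}>" for turn in script]

class DialogueCritic:
    """Stands in for the dialogue-review agent (placeholder scoring)."""
    def __init__(self):
        self.rounds = 0
    def review(self, script, audio):
        # A real critic would judge expressiveness and paralinguistics;
        # this toy score simply improves with each refinement round.
        self.rounds += 1
        score = min(1.0, 0.5 + 0.2 * self.rounds)
        feedback = "add more emotional cues" if score < 0.8 else None
        return score, feedback

def generate_dialogue(topic, characters, threshold=0.8, max_rounds=3):
    """Write -> synthesize -> critique, looping until the critic is satisfied."""
    writer, tts, critic = ScriptWriter(), SpeechSynthesizer(), DialogueCritic()
    feedback = None
    for _ in range(max_rounds):
        script = writer.write(topic, characters, feedback)
        audio = tts.synthesize(script)
        score, feedback = critic.review(script, audio)
        if score >= threshold:
            break
    return script, audio, score
```

The key design point is that the critic's feedback feeds back into the writer, so each round's script (and hence its synthesized speech) improves until a quality threshold or a round budget is hit — the role pool would supply the `characters` argument.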
Xiang Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Duyi Pan
Department of Computer Science and Engineering, Hong Kong University of Science and Technology (Guangzhou)
Hongru Xiao
Tongji University
LLMs, ALMs, Speech
Jiale Han
The Hong Kong University of Science and Technology
Natural Language Processing
Jing Tang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology (Guangzhou)
Jiabao Ma
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Wei Wang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology (Guangzhou), Hong Kong University of Science and Technology
Bo Cheng
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications