🤖 AI Summary
Current speech synthesis datasets suffer from high construction costs, limited speaker diversity, and insufficient coverage of scenarios and emotional expressions. To address these limitations, this work proposes DialogueAgents, a hybrid agent-based framework for bilingual, multi-party, multi-turn conversational text-to-speech (TTS), built on a three-agent collaborative architecture (a script writer, a speech synthesizer, and a dialogue critic) combined with character-pool-grounded iterative refinement and emotion-enhanced modeling. The framework leverages large language models (LLMs) for script generation and uses the critic's speech reviews as feedback to iteratively improve the synthesized dialogues. Using this framework, the authors contribute MultiTalk, an open-source bilingual, multi-party conversational speech dataset covering diverse topics. Experimental results demonstrate the effectiveness of the framework and the quality of MultiTalk, with the synthesized speech exhibiting speaker and prosodic diversity as well as enhanced emotional expressivity.
📝 Abstract
Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on the critic's speech reviews, boosting the emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
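The write-synthesize-critique loop described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all class and method names here (`ScriptWriter`, `SpeechSynthesizer`, `DialogueCritic`, `generate_dialogue`) are hypothetical stand-ins for the LLM-backed agents and TTS engine the framework actually uses.

```python
# Hypothetical sketch of DialogueAgents' three-agent refinement loop.
# Real agents would call an LLM (script writer, critic) and a TTS model
# (synthesizer); here they are stubbed to show the control flow only.
from dataclasses import dataclass


@dataclass
class Character:
    """An entry in the character pool that grounds script generation."""
    name: str
    persona: str


class ScriptWriter:
    """Drafts a multi-turn dialogue script; revises it when given feedback."""

    def write(self, characters, feedback=None):
        turns = [(c.name, f"[{c.persona}] line on the topic") for c in characters]
        if feedback is not None:
            # On a revision pass, incorporate the critic's feedback.
            turns.append((characters[0].name, f"(revised per: {feedback})"))
        return turns


class SpeechSynthesizer:
    """Stand-in for a TTS engine: maps each script turn to an utterance."""

    def synthesize(self, script):
        return [f"<audio:{speaker}>" for speaker, _text in script]


class DialogueCritic:
    """Reviews synthesized speech; returns feedback until satisfied."""

    def __init__(self, max_rounds=2):
        self.rounds = 0
        self.max_rounds = max_rounds

    def review(self, audio):
        self.rounds += 1
        if self.rounds < self.max_rounds:
            return "increase emotional expressiveness"
        return None  # accept the dialogue


def generate_dialogue(characters):
    """Iteratively refine a dialogue until the critic accepts it."""
    writer, tts, critic = ScriptWriter(), SpeechSynthesizer(), DialogueCritic()
    feedback = None
    while True:
        script = writer.write(characters, feedback)
        audio = tts.synthesize(script)
        feedback = critic.review(audio)
        if feedback is None:
            return script, audio
```

Under these assumptions, the loop terminates when the critic stops emitting feedback; each extra critique round yields a revised script that the synthesizer re-renders, mirroring the iterative refinement the abstract describes.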