🤖 AI Summary
Existing role-playing agents (RPAs) are confined to text-based interaction, neglecting critical vocal attributes—such as speaking style and emotional prosody—that significantly enhance immersion. Method: We propose the first low-latency, multimodal RPA supporting joint speech–language–personality alignment, ensuring consistent expression of personality, intonation, and emotion across multi-turn dialogues. Our approach introduces the first end-to-end joint modeling of vocal traits (style/emotion) and linguistic personality; designs an LLM-based speech–language co-architecture integrating emotion-aware prosody modeling with efficient TTS; and trains on our newly curated OmniCharacter-10K dataset (20 characters, 10K dialogues, 135K speech samples). Contribution/Results: Experiments demonstrate substantial improvements over state-of-the-art RPAs and speech-language models in both content quality and stylistic consistency, with an end-to-end latency of only 289 ms. We publicly release both code and dataset.
📝 Abstract
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's vocal traits (e.g., voice style and emotion), which play a crucial role in interaction and make realistic scenarios far more immersive. Toward this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model for immersive, low-latency RPAs. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, producing a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which features distinctive characters (20), richly contextualized multi-round dialogues (10K), and dynamic speech responses (135K). Experimental results show that our method yields better responses in terms of both content and style than existing RPAs and mainstream speech-language models, with a response latency as low as 289 ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
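As a rough illustration of the interaction loop the abstract describes — a persona-conditioned text reply followed by emotion-aware speech synthesis — here is a minimal Python sketch. All names here (`Persona`, `Turn`, `generate_reply`, `synthesize`) are hypothetical stand-ins for exposition, not the paper's actual API or architecture.

```python
from dataclasses import dataclass

# Hypothetical sketch: an RPA turn produces both a text response (conditioned
# on the role's linguistic personality) and an audio response (conditioned on
# the role's vocal traits). These stubs only mimic the data flow.

@dataclass
class Persona:
    name: str
    traits: list        # linguistic personality traits
    voice_style: str    # vocal trait, e.g. "calm", "excited"

@dataclass
class Turn:
    text: str
    emotion: str

def generate_reply(persona: Persona, history: list, user_text: str) -> Turn:
    """Stand-in for the LLM: condition on persona traits and dialogue history."""
    reply = f"[{persona.name}|{','.join(persona.traits)}] responding to: {user_text}"
    # A real model would also predict an emotion label to drive prosody.
    return Turn(text=reply, emotion=persona.voice_style)

def synthesize(turn: Turn, voice_style: str) -> bytes:
    """Stand-in for emotion-aware TTS: a real system returns waveform audio."""
    return f"<audio style={voice_style} emotion={turn.emotion}>{turn.text}</audio>".encode()

persona = Persona("Sherlock", ["analytical", "terse"], voice_style="calm")
history = []
turn = generate_reply(persona, history, "Where were you last night?")
audio = synthesize(turn, persona.voice_style)
history.append(turn)
print(turn.text)
```

The point of the sketch is only the joint output: every turn carries both a personality-consistent text and a style-consistent audio rendering, which is the coupling the paper's co-architecture learns end to end.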