OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing role-playing agents (RPAs) are confined to text-based interaction, neglecting critical vocal attributes, such as speaking style and emotional prosody, that significantly enhance immersion. Method: We propose OmniCharacter, the first low-latency multimodal RPA with joint speech-language-personality alignment, ensuring consistent expression of personality, intonation, and emotion across multi-turn dialogues. Our approach jointly models vocal traits (style/emotion) and linguistic personality end to end; designs an LLM-based speech-language co-architecture that integrates emotion-aware prosody modeling with efficient TTS; and trains on the newly curated OmniCharacter-10K dataset (20 characters, 10K dialogues, 135K speech samples). Contribution/Results: Experiments demonstrate substantial improvements over state-of-the-art RPAs and speech-language models in both content quality and stylistic consistency, with an end-to-end latency of only 289 ms. Code and dataset are publicly released.

📝 Abstract
Role-Playing Agents (RPAs), which benefit from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily mimic dialogues among roles in textual form, neglecting the role's vocal traits (e.g., voice style and emotions), which play a crucial part in interaction and make the experience far more immersive in realistic scenarios. Towards this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model for immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, producing a mixture of speech and language responses. To align the model with speech-language scenarios, we construct OmniCharacter-10K, a dataset featuring 20 distinctive characters, 10K richly contextualized multi-round dialogues, and 135K dynamic speech responses. Experimental results show that our method yields better responses in both content and style than existing RPAs and mainstream speech-language models, with a response latency as low as 289 ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
Problem

Research questions and friction points this paper is trying to address.

Enhancing role-playing agents with voice traits for immersive interaction
Aligning speech, language, and personality traits in multi-round dialogues
Reducing response latency in speech-language role-playing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Seamless speech-language personality interaction model
Consistent role-specific personality and vocal traits
Low-latency responses (289 ms end-to-end)