🤖 AI Summary
Existing role-playing agents (RPAs) are confined to text-based interaction, neglecting critical vocal attributes—such as speaking style and emotional prosody—that significantly enhance immersion. Method: We propose the first low-latency, multimodal RPA supporting joint speech–language–personality alignment, ensuring consistent expression of personality, intonation, and emotion across multi-turn dialogues. Our approach introduces the first end-to-end joint modeling of vocal traits (style/emotion) and linguistic personality; designs an LLM-based speech–language co-architecture integrating emotion-aware prosody modeling with efficient TTS; and trains on our newly curated OmniCharacter-10K dataset (20 characters, 10K dialogues, 135K speech samples). Contribution/Results: Experiments demonstrate substantial improvements over state-of-the-art RPAs and speech-language models in both content quality and stylistic consistency, with an end-to-end latency of only 289 ms. We publicly release both code and dataset.
📝 Abstract
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's vocal traits (e.g., voice style and emotion), which play a crucial role in interaction and make realistic scenarios far more immersive. Toward this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model for immersive, low-latency RPAs. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, producing a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which features distinctive characters (20), richly contextualized multi-round dialogues (10K), and dynamic speech responses (135K). Experimental results show that our method yields better responses in terms of both content and style than existing RPAs and mainstream speech-language models, with a response latency as low as 289 ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
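As a rough illustration of the interaction loop the abstract describes — a persona-conditioned text reply followed by emotion-aware speech synthesis — here is a minimal Python sketch. All names here (`Persona`, `Turn`, `generate_reply`, `synthesize`) are hypothetical stand-ins for exposition, not the paper's actual API or architecture.

```python
from dataclasses import dataclass

# Hypothetical sketch: an RPA turn produces both a text response (conditioned
# on the role's linguistic personality) and an audio response (conditioned on
# the role's vocal traits). These stubs only mimic the data flow.

@dataclass
class Persona:
    name: str
    traits: list        # linguistic personality traits
    voice_style: str    # vocal trait, e.g. "calm", "excited"

@dataclass
class Turn:
    text: str
    emotion: str

def generate_reply(persona: Persona, history: list, user_text: str) -> Turn:
    """Stand-in for the LLM: condition on persona traits and dialogue history."""
    reply = f"[{persona.name}|{','.join(persona.traits)}] responding to: {user_text}"
    # A real model would also predict an emotion label to drive prosody.
    return Turn(text=reply, emotion=persona.voice_style)

def synthesize(turn: Turn, voice_style: str) -> bytes:
    """Stand-in for emotion-aware TTS: a real system returns waveform audio."""
    return f"<audio style={voice_style} emotion={turn.emotion}>{turn.text}</audio>".encode()

persona = Persona("Sherlock", ["analytical", "terse"], voice_style="calm")
history = []
turn = generate_reply(persona, history, "Where were you last night?")
audio = synthesize(turn, persona.voice_style)
history.append(turn)
print(turn.text)
```

The point of the sketch is only the joint output: every turn carries both a personality-consistent text and a style-consistent audio rendering, which is the coupling the paper's co-architecture learns end to end.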