🤖 AI Summary
Existing audiobook generation systems suffer from monotonous prosody, reliance on manual prosody configuration, or costly speaker-specific training. This paper proposes MultiActor-Audiobook, the first framework enabling zero-shot, multi-character, and multimodal (speech + facial animation) audiobook generation. The approach addresses these limitations through two core innovations: (1) Multimodal Speaker Persona (MSP) modeling, which ensures cross-character prosodic consistency and expressiveness via joint speech–face representation learning; and (2) Large Language Model–driven Script Instruction Generation (LSI), which automatically injects character identity, emotion, and intonation priors into the synthesis pipeline. The method integrates multimodal representation learning, LLM-based instruction engineering, synchronized speech–facial animation modeling, and zero-shot text-to-speech. Human evaluation and assessment by multimodal large language models (MLLMs) show performance competitive with commercial systems. Ablation studies confirm that MSP and LSI significantly improve emotional expressiveness accuracy (+18.7%) and character consistency (+22.3%).
📝 Abstract
We introduce MultiActor-Audiobook, a zero-shot approach for generating audiobooks that automatically produces consistent, expressive, and speaker-appropriate prosody, including intonation and emotion. Previous audiobook systems have several limitations: they require users to manually configure the speaker's prosody, read each sentence in a monotonous tone compared with voice actors, or rely on costly training. MultiActor-Audiobook addresses these issues by introducing two novel processes: (1) MSP (**Multimodal Speaker Persona Generation**) and (2) LSI (**LLM-based Script Instruction Generation**). With these two processes, MultiActor-Audiobook generates more emotionally expressive audiobooks with consistent speaker prosody and without additional training. We compare our system with commercial products through human and MLLM evaluations and achieve competitive results. Furthermore, we demonstrate the effectiveness of MSP and LSI through ablation studies.