Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical challenges in long-duration digital human animation, namely low lip-sync accuracy, limited facial expressiveness, and identity drift, by proposing the first high-fidelity method driven jointly by a single-frame portrait, text prompts, and audio. Methodologically, it introduces Soul-1M, a million-scale, finely annotated multi-scene dataset built with an automated annotation pipeline, along with Soul-Bench, a dedicated benchmark; it designs threshold-aware codebook replacement to enhance long-range temporal consistency; and it builds on the Wan2.2-5B foundation model augmented with audio-injection layers, while classifier-free guidance (CFG) and step distillation together with a lightweight VAE optimize inference. Experiments demonstrate state-of-the-art performance across video quality, text–video alignment, identity preservation, and lip-sync accuracy, surpassing leading open-source and commercial models. The method achieves 11.4× faster inference and has been deployed in virtual broadcasting and film production.
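The summary names threshold-aware codebook replacement as the long-term consistency mechanism but does not detail it. Below is a minimal sketch of one plausible reading, assuming the method snaps drifting latent tokens back to their nearest cached codebook entry only when the match is confident; the function name, the cosine-similarity criterion, and the threshold `tau` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def threshold_aware_codebook_replace(latents, codebook, tau=0.9):
    """Hypothetical sketch: snap drifting latent tokens back to their
    nearest codebook entry when the match is confident (cos-sim >= tau).

    latents:  (N, D) latent tokens from the current generation window
    codebook: (K, D) reference entries cached from earlier frames
    tau:      similarity threshold; below it the generated token is kept
    """
    lat_n = F.normalize(latents, dim=-1)      # (N, D) unit vectors
    cb_n = F.normalize(codebook, dim=-1)      # (K, D) unit vectors
    sim = lat_n @ cb_n.t()                    # (N, K) cosine similarities
    best_sim, best_idx = sim.max(dim=-1)      # nearest entry per token
    replace = best_sim >= tau                 # threshold-aware mask
    out = latents.clone()
    out[replace] = codebook[best_idx[replace]]
    return out, replace
```

The threshold guards against over-correction: tokens that genuinely depart from the reference (new poses, new expressions) fall below `tau` and are left untouched, while near-duplicates of cached identity content are re-anchored.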

📝 Abstract
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/
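The abstract credits step/CFG distillation for much of the 11.4× speedup. CFG distillation itself is a standard technique: a student learns to reproduce the teacher's guidance-combined prediction in a single forward pass, removing the extra unconditional pass at inference. The sketch below assumes epsilon-prediction diffusion models with the placeholder signature `model(x_t, t, cond)`; it is an illustration of the general technique, not Soul's training code.

```python
import torch
import torch.nn.functional as F

def cfg_distill_loss(student, teacher, x_t, t, cond, guidance=5.0):
    """Minimal CFG-distillation sketch (assumed setup, not the paper's code).

    The teacher's guided prediction, built from two forward passes
    (conditional + unconditional), becomes the regression target for a
    single student pass, roughly halving per-step inference cost.
    """
    with torch.no_grad():
        eps_c = teacher(x_t, t, cond)                 # conditional pass
        eps_u = teacher(x_t, t, None)                 # unconditional pass
        target = eps_u + guidance * (eps_c - eps_u)   # CFG-combined target
    pred = student(x_t, t, cond)                      # one pass, no CFG needed
    return F.mse_loss(pred, target)
```

Step distillation (reducing the number of denoising steps) would be layered on top of this, and the lightweight VAE trims decoding cost further; the abstract reports that the combination costs negligible quality.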
Problem

Research questions and friction points this paper is trying to address.

Generating semantically coherent video from a single portrait image, text prompts, and audio
Maintaining precise lip synchronization, vivid facial expressions, and identity preservation over long durations
Mitigating the scarcity of finely annotated data for digital human animation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework for high-fidelity digital human animation, built on Wan2.2-5B with audio-injection layers (see the sketch after this list)
Automated annotation pipeline to mitigate data scarcity
Threshold-aware codebook replacement for long-term consistency
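
The abstract describes audio-injection layers integrated into the Wan2.2-5B backbone but does not give their form. Below is a minimal, hypothetical sketch assuming a residual cross-attention design in which video tokens attend to per-frame audio features; the class name, dimensions, and the use of wav2vec-style embeddings are all assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AudioInjectionLayer(nn.Module):
    """Hypothetical audio-injection layer: video tokens cross-attend to
    per-frame audio features (e.g. wav2vec-style embeddings) with a
    residual connection, so the layer can be inserted into a pretrained
    backbone without disturbing its original pathway.
    """
    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_feats):
        # video_tokens: (B, N, dim); audio_feats: (B, T, audio_dim)
        a = self.audio_proj(audio_feats)   # project audio into token space
        q = self.norm(video_tokens)
        out, _ = self.attn(q, a, a)        # video queries attend to audio
        return video_tokens + out          # residual injection
```

The residual form matters for a pretrained backbone like Wan2.2-5B: with the injection output near zero at initialization, the model starts from its original video prior and learns audio conditioning gradually.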