🤖 AI Summary
This work addresses critical challenges in long-duration digital human animation—namely, low lip-sync accuracy, limited facial expressiveness, and identity drift—by proposing the first high-fidelity method driven jointly by a single-frame portrait, text, and audio. Methodologically, it introduces Soul-1M, a million-scale, multi-scene, finely annotated dataset built with an automated annotation pipeline, along with Soul-Bench, a dedicated benchmark; it designs threshold-aware codebook replacement and multi-stage distillation to enhance long-range temporal consistency; and it builds on the Wan2.2-5B foundation model, augmented with audio-injection layers, a lightweight VAE, and classifier-free guidance (CFG) and step-wise distillation. Experiments demonstrate state-of-the-art performance across video quality, text-video alignment, identity preservation, and lip-sync accuracy, surpassing leading open-source and commercial models. The method achieves 11.4x faster inference and has been deployed in virtual broadcasting and film production.
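The classifier-free guidance mentioned above is a standard diffusion-sampling technique: at each denoising step, the model's conditional and unconditional predictions are combined, extrapolating toward the condition. The sketch below is a minimal illustration of that combination rule only; the function name, scalar inputs, and the guidance scale of 2.0 are illustrative assumptions, not details from Soul.

```python
def cfg_combine(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by guidance_scale. A scale > 1
    amplifies the conditioning signal beyond the conditional prediction."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

# Toy scalar example: with scale 2.0 the guided output overshoots
# the conditional prediction (1.0), landing at 2.0.
print(cfg_combine(0.0, 1.0, 2.0))  # -> 2.0
```

In practice the predictions are noise (or velocity) tensors from two forward passes of the diffusion backbone, which is why CFG doubles per-step compute and why distilling it away, as the summary notes, speeds up inference.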
📝 Abstract
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/