🤖 AI Summary
This work addresses the challenge of generating realistic and temporally coherent human motion videos from a single portrait image and audio input, overcoming the limitations of conventional keypoint-based methods in modeling subtle dynamics. The authors propose an end-to-end, two-stage disentangled framework that operates without keypoint supervision. In the first stage, an implicit motion representation is constructed through a region-aware attention mechanism that fuses appearance priors with hierarchical depth information. In the second stage, a Mamba-enhanced diffusion model directly predicts this motion representation from the input audio and source image. Evaluated on multiple public benchmarks as well as a newly curated 380-hour high-quality dataset, the method achieves state-of-the-art performance in terms of motion naturalness, temporal consistency, and generation accuracy.
📝 Abstract
Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Conventional keypoint-based methods often struggle to capture subtle motion dynamics. To move beyond them, we propose a novel implicit-motion framework that takes a single static image and an audio signal as input. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to predict these features directly from the audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work on multiple public benchmarks and on our collected data in accuracy, naturalness, and temporal coherence, setting a new state of the art.
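The decoupling of motion prediction from rendering can be pictured as two functions that share only a sequence of latent motion vectors. The following is a toy sketch of that data flow and not the authors' implementation: every name (`predict_motion_latents`, `render_frames`, `generate`) is hypothetical, the "denoiser" is a hand-written placeholder rather than a learned Mamba-enhanced diffusion network, the "renderer" is a simple additive offset rather than a region-aware attention model, and audio and image features are assumed to have the same dimensionality.

```python
import random
from typing import List

Vector = List[float]


def predict_motion_latents(audio_feats: List[Vector], image_feat: Vector,
                           steps: int = 10, seed: int = 0) -> List[Vector]:
    """Stand-in for the second stage: refine noise into one latent per audio frame.

    Hypothetical placeholder: each step moves a latent halfway toward the
    average of the audio and image features. A real model would use a
    learned denoising network here.
    """
    rng = random.Random(seed)
    dim = len(image_feat)
    # Start from Gaussian noise, as a diffusion sampler would.
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in audio_feats]
    for _ in range(steps):
        for t, audio in enumerate(audio_feats):
            target = [(a + s) / 2.0 for a, s in zip(audio, image_feat)]
            latents[t] = [z + 0.5 * (g - z) for z, g in zip(latents[t], target)]
    return latents


def render_frames(image_feat: Vector, latents: List[Vector]) -> List[Vector]:
    """Stand-in for the first stage: turn motion latents into frames.

    Hypothetical placeholder: each latent is applied as an additive offset
    to the source image features.
    """
    return [[s + z for s, z in zip(image_feat, latent)] for latent in latents]


def generate(audio_feats: List[Vector], image_feat: Vector) -> List[Vector]:
    """Full pipeline: audio + source image -> motion latents -> frames."""
    latents = predict_motion_latents(audio_feats, image_feat)
    return render_frames(image_feat, latents)


if __name__ == "__main__":
    audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three audio frames
    source = [1.0, -1.0]                          # source image features
    frames = generate(audio, source)
    print(len(frames), "frames of dimension", len(frames[0]))
```

Because the two functions communicate only through the latent sequence, either one can in principle be replaced or retrained without touching the other, which is one way to read the flexibility claim above.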