🤖 AI Summary
Audio-driven gesture video generation typically relies on complex two-stage pipelines and on auxiliary temporal modules for sequence modeling. Method: This paper proposes an end-to-end, single-stage diffusion framework that uses 2D skeletal poses as the intermediate representation, jointly modeling audio features and skeletal motion dynamics without a separately trained temporal module. The approach builds on pretrained diffusion weights, enables lightweight character-specific fine-tuning with only a few thousand frames of data, and incorporates a temporal-consistency inference mechanism to produce natural, temporally coherent gestures. Contribution/Results: Experiments demonstrate significant improvements over state-of-the-art GAN-based and diffusion-based methods across multiple benchmarks. The generated videos exhibit superior temporal coherence and visual fidelity, while the streamlined architecture substantially improves deployment efficiency.
📄 Abstract
Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging for gesture-to-video systems. To improve generation quality, previous works adopt complex inputs and training strategies and require large datasets for pre-training, which hinders practical application. We propose a simple one-stage training method and a temporal inference method based on a diffusion model that synthesize realistic, continuous gesture videos without additionally training temporal modules. The entire model reuses existing pre-trained weights, and only a few thousand frames of data per character are needed to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline that synthesizes co-speech videos, using the 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
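The temporal inference idea described above, generating a long sequence without a trained temporal module, is commonly realized as overlapping-window generation with cross-faded seams. A minimal sketch of that pattern follows; note that `generate_window` is a hypothetical stand-in for the diffusion generator, and the window size, overlap, and linear blending weights are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def generate_window(feats: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the diffusion generator: maps per-frame
    # audio features to per-frame pose values (toy linear mapping here).
    return feats * 2.0

def infer_with_overlap(feats: np.ndarray, window: int = 8, overlap: int = 2) -> np.ndarray:
    """Generate a long sequence in overlapping windows and cross-fade the
    overlapping frames, so consecutive windows agree at their seams.
    Assumes 0 < overlap < window."""
    n = len(feats)
    out = np.zeros(n)
    acc = np.zeros(n)  # accumulated blend weights per frame
    step = window - overlap
    start = 0
    while start < n:
        end = min(start + window, n)
        chunk = generate_window(feats[start:end])
        w = np.ones(end - start)
        if start > 0:
            # Ramp the leading frames from 0 to 1 so they blend into
            # the tail of the previous window instead of replacing it.
            ramp = np.linspace(0.0, 1.0, min(overlap, end - start))
            w[:len(ramp)] = ramp
        out[start:end] += chunk * w
        acc[start:end] += w
        if end == n:
            break
        start += step
    return out / np.maximum(acc, 1e-8)  # normalize by total blend weight
```

Because each frame in an overlap region is a weighted average of two independently generated windows, the stitched sequence avoids hard discontinuities at window boundaries without any learned temporal component.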