EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

📅 2025-04-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Audio-driven gesture video generation typically relies on complex two-stage pipelines and requires auxiliary temporal modules for sequence modeling. Method: This paper proposes an end-to-end, single-stage diffusion framework that employs 2D skeletal poses as intermediate representations, jointly modeling audio features and skeletal motion dynamics without a separately trained temporal module. The approach leverages pretrained diffusion weights and enables lightweight character-specific fine-tuning with only a few thousand frames of data; it further incorporates a temporal consistency inference mechanism to ensure natural, temporally coherent gestures. Contribution/Results: Experiments demonstrate significant improvements over state-of-the-art GAN-based and diffusion-based methods across multiple benchmarks. The generated videos exhibit superior temporal coherence and visual fidelity, while the streamlined architecture substantially enhances deployment efficiency.

πŸ“ Abstract
Audio-driven cospeech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. In order to improve the generation effect, previous works adopted complex input and training strategies and required a large amount of data sets for pre-training, which brought inconvenience to practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional training of temporal modules.The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
Problem

Research questions and friction points this paper is trying to address.

Improving natural gesture synthesis in audio-driven video generation
Reducing data and training complexity for practical applications
Enhancing realism and continuity in co-speech gesture videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-stage training method for gesture video synthesis
Temporal inference using diffusion model
Audio-to-video pipeline with 2D skeleton representation
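The pipeline above — audio features mapped to 2D skeletal poses, then rendered to video with overlapping-window diffusion inference for temporal consistency — can be sketched conceptually as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names, the 18-joint skeleton size, and the window/overlap blending scheme are assumptions.

```python
import numpy as np

N_JOINTS = 18  # assumed 2D skeleton size (e.g. OpenPose-style body keypoints)

def audio_to_pose(audio_features: np.ndarray) -> np.ndarray:
    """Stand-in for the speech-to-gesture stage: map per-frame audio
    features to 2D skeletal poses of shape (frames, joints, 2)."""
    frames = audio_features.shape[0]
    rng = np.random.default_rng(0)  # placeholder instead of a learned model
    return rng.standard_normal((frames, N_JOINTS, 2))

def denoise_window(poses: np.ndarray) -> np.ndarray:
    """Stand-in for one diffusion denoising pass conditioned on a pose
    window; here it simply echoes the conditioning poses."""
    return poses.copy()

def generate_video(poses: np.ndarray, window: int = 16, overlap: int = 4) -> np.ndarray:
    """Temporal-consistency inference: denoise overlapping windows and
    average the overlapping frames so adjacent windows agree at the seam."""
    out = np.zeros_like(poses)
    weights = np.zeros(poses.shape[0])
    start = 0
    while start < poses.shape[0]:
        end = min(start + window, poses.shape[0])
        out[start:end] += denoise_window(poses[start:end])
        weights[start:end] += 1.0
        if end == poses.shape[0]:
            break
        start = end - overlap  # next window overlaps the previous one
    return out / weights[:, None, None]

audio = np.zeros((40, 128))   # 40 frames of assumed 128-dim audio features
poses = audio_to_pose(audio)  # intermediate 2D skeleton motion
video = generate_video(poses) # blended, temporally consistent output
print(video.shape)            # (40, 18, 2)
```

The overlap-and-average step is one common way to stitch fixed-length diffusion windows into an arbitrarily long, seam-free sequence without a separately trained temporal module.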
Renda Li — University of Science and Technology of China, Hefei, China
Xiaohua Qi — University of Science and Technology of China, Hefei, China
Qiang Ling — University of Science and Technology of China, Hefei, China
Jun Yu — University of Science and Technology of China, Hefei, China
Ziyi Chen — PAII Inc.
Peng Chang — PAII Inc.
Mei Han — PAII Inc.
Jing Xiao — PAII Inc.