🤖 AI Summary
Audio-driven gesture video generation typically relies on complex two-stage pipelines and on auxiliary temporal modules for sequence modeling. Method: This paper proposes an end-to-end, single-stage diffusion framework that uses 2D skeletal poses as the intermediate representation, jointly modeling audio features and skeletal motion dynamics without a separately trained temporal module. The approach builds on pretrained diffusion weights, enables lightweight character-specific fine-tuning with only a few thousand frames of data, and incorporates a temporal-consistency inference mechanism to produce natural, temporally coherent gestures. Contribution/Results: Experiments demonstrate significant improvements over state-of-the-art GAN-based and diffusion-based methods across multiple benchmarks. The generated videos exhibit superior temporal coherence and visual fidelity, while the streamlined architecture substantially improves deployment efficiency.
📄 Abstract
Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging for gesture-to-video systems. To improve generation quality, previous works adopt complex inputs and training strategies and require large datasets for pre-training, which hinders practical application. We propose a simple one-stage training method and a temporal inference method based on a diffusion model that synthesize realistic, continuous gesture videos without additionally training temporal modules. The entire model reuses existing pre-trained weights, and only a few thousand frames of data per character are needed to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline that synthesizes co-speech videos, using the 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
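The temporal inference idea described above, generating a long sequence without a trained temporal module, is commonly realized as overlapping-window generation with cross-faded seams. A minimal sketch of that pattern follows; note that `generate_window` is a hypothetical stand-in for the diffusion generator, and the window size, overlap, and linear blending weights are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def generate_window(feats: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the diffusion generator: maps per-frame
    # audio features to per-frame pose values (toy linear mapping here).
    return feats * 2.0

def infer_with_overlap(feats: np.ndarray, window: int = 8, overlap: int = 2) -> np.ndarray:
    """Generate a long sequence in overlapping windows and cross-fade the
    overlapping frames, so consecutive windows agree at their seams.
    Assumes 0 < overlap < window."""
    n = len(feats)
    out = np.zeros(n)
    acc = np.zeros(n)  # accumulated blend weights per frame
    step = window - overlap
    start = 0
    while start < n:
        end = min(start + window, n)
        chunk = generate_window(feats[start:end])
        w = np.ones(end - start)
        if start > 0:
            # Ramp the leading frames from 0 to 1 so they blend into
            # the tail of the previous window instead of replacing it.
            ramp = np.linspace(0.0, 1.0, min(overlap, end - start))
            w[:len(ramp)] = ramp
        out[start:end] += chunk * w
        acc[start:end] += w
        if end == n:
            break
        start += step
    return out / np.maximum(acc, 1e-8)  # normalize by total blend weight
```

Because each frame in an overlap region is a weighted average of two independently generated windows, the stitched sequence avoids hard discontinuities at window boundaries without any learned temporal component.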