🤖 AI Summary
Speech-driven gesture generation for virtual characters faces a critical challenge: the weak correlation between audio and motion often yields unnatural movement. This paper proposes a decoupled diffusion framework that integrates motion priors: (1) a motion-centric prior model with low audio dependency, which the authors present as the first of its kind; (2) hybrid constraints combining implicit joint limits with explicit geometric and conditional guidance to improve diffusion efficiency; and (3) a shared gesture–text embedding space that ensures fine-grained semantic alignment. The method preserves full-body coordination and expressiveness while significantly improving gesture realism (approaching ground truth), global motion stability, and finger-level articulation fidelity, and it substantially accelerates inference. A user study confirms strong immersion and practical applicability.
📝 Abstract
Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems, constrained by the weak correlation between audio and gestures, often produce physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture–transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are available at https://cyk990422.github.io/HoloGest.github.io/.
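The abstract describes the shared gesture–text embedding space only at a high level. One common way to realize such an alignment is a CLIP-style symmetric contrastive objective, where matched gesture/transcript pairs are pulled together and mismatched pairs pushed apart. A minimal NumPy sketch of that idea follows; all names, shapes, and the temperature value are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def contrastive_alignment_loss(gesture_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of each matrix is a matched pair."""
    # L2-normalize so dot products become cosine similarities.
    g = gesture_emb / np.linalg.norm(gesture_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = g @ t.T / temperature          # (B, B) similarity logits
    labels = np.arange(len(g))              # matched pairs lie on the diagonal

    def xent(l):                            # row-wise softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average over both directions: gesture->text and text->gesture.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 32))
loss_matched = contrastive_alignment_loss(g, g)                        # perfectly aligned pairs
loss_random = contrastive_alignment_loss(g, rng.normal(size=(4, 32)))  # unrelated pairs
print(loss_matched, loss_random)
```

Under this objective, perfectly aligned embeddings yield a much lower loss than random pairings, which is the property a shared embedding space needs for fine-grained semantic correspondence between gestures and transcripts.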