🤖 AI Summary
This work addresses high-fidelity, full-body 3D dance generation driven by music, which requires adherence to specified dance genres (e.g., hip-hop, ballet), physically plausible motion, and precise beat and rhythm synchronization. The authors propose GCDance, a classifier-free guidance diffusion framework that jointly models features from a pre-trained music foundation model, hand-crafted rhythmic features, and CLIP-encoded genre-based text prompts embedded at each time step, enabling fine-grained, style-controllable generation from a single music input. Evaluated on FineDance and AIST++, the method outperforms state-of-the-art approaches in fidelity, synchronization accuracy, and cross-genre generalization, while maintaining efficient and stable inference.
📝 Abstract
Generating high-quality full-body dance sequences from music is a challenging task, as it requires strict adherence to genre-specific choreography. Moreover, the generated sequences must be both physically realistic and precisely synchronized with the beats and rhythm of the music. To overcome these challenges, we propose GCDance, a classifier-free diffusion framework for generating genre-specific dance motions conditioned on both music and textual prompts. Specifically, our approach extracts music features by combining high-level features from a pre-trained music foundation model with hand-crafted features, yielding a multi-granularity feature fusion. To achieve genre controllability, we leverage CLIP to efficiently embed genre-based textual prompt representations at each time step within our dance generation pipeline. Our GCDance framework can generate diverse dance styles from the same piece of music while ensuring coherence with its rhythm and melody. Extensive experimental results on the FineDance dataset demonstrate that GCDance significantly outperforms existing state-of-the-art approaches, while also achieving competitive results on the AIST++ dataset. Our ablation studies and inference-time analysis demonstrate that GCDance provides an effective solution for high-quality music-driven dance generation.
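At the core of a classifier-free diffusion framework like the one described above is a noise predictor trained with the condition (here, fused music and genre-text features) randomly dropped, so that at sampling time the conditional and unconditional predictions can be blended to trade diversity for condition adherence. The sketch below illustrates only that guidance step; `toy_denoiser` and `guidance_scale` are illustrative stand-ins, not the paper's actual model or hyperparameters.

```python
import numpy as np

def toy_denoiser(x, cond):
    # Stand-in for the learned noise predictor eps_theta(x_t, c).
    # `cond` would be a fused music + genre-text embedding; None means the
    # unconditional pass (condition dropped, as in classifier-free training).
    if cond is None:
        return 0.1 * x
    return 0.1 * x + 0.05 * cond  # the condition nudges the prediction

def cfg_prediction(x, cond, guidance_scale=2.5):
    """Classifier-free guidance: eps = eps_u + w * (eps_c - eps_u)."""
    eps_uncond = toy_denoiser(x, None)
    eps_cond = toy_denoiser(x, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage: a "motion" latent at one denoising step and a condition vector.
x_t = np.ones(4)
c = np.full(4, 2.0)
eps = cfg_prediction(x_t, c, guidance_scale=2.5)
```

With `guidance_scale = 0` the sampler ignores the condition entirely; larger values push the generated motion more strongly toward the prompted genre, at the usual cost of sample diversity.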