MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing music-driven dance video generation methods struggle to simultaneously achieve high visual fidelity and physically plausible, artistically expressive motion. To address this, we propose a motion-appearance disentangled cascaded Mixture-of-Experts (MoE) framework: a motion expert employs a BiMamba-Transformer hybrid diffusion model to map music into 3D dance poses adhering to biomechanical constraints and choreographic expressiveness; an appearance expert fuses the pose sequence with a reference image to synthesize identity-consistent, spatiotemporally coherent high-fidelity video. We introduce two key innovations: (1) a guidance-free training (GFT) strategy and (2) disentangled motion-aesthetic fine-tuning. Furthermore, we establish the first joint motion-appearance evaluation protocol for dance videos and release a large-scale dance dataset. Our method achieves state-of-the-art performance on both 3D dance generation and pose-conditioned video synthesis, outperforming prior work under our new protocol—delivering end-to-end dance videos with high fidelity, precise rhythmic alignment, and stable identity preservation.

📝 Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves SOTA performance. Project page: https://macedance.github.io/
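The cascade described in the abstract can be sketched at a high level: the Motion Expert maps music features to a 3D pose sequence, and the Appearance Expert renders that sequence into identity-consistent frames conditioned on a reference image. The sketch below is purely illustrative; all class and method names are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of the MACE-Dance cascade. Names (MotionExpert,
# AppearanceExpert, generate_poses, render_video) are illustrative only.

class MotionExpert:
    """Stand-in for the BiMamba-Transformer hybrid diffusion model:
    maps per-frame music features to a 3D pose sequence."""
    def generate_poses(self, music_features):
        # Placeholder: one 3D pose (24 joint coordinates) per music frame.
        return [[(0.0, 0.0, 0.0)] * 24 for _ in music_features]

class AppearanceExpert:
    """Stand-in for the pose- and reference-conditioned video synthesizer."""
    def render_video(self, poses, reference_image):
        # Placeholder: one frame per pose, carrying the reference identity.
        return [{"pose": p, "identity": reference_image["id"]} for p in poses]

def mace_dance(music_features, reference_image):
    """Cascade: music -> 3D poses -> identity-consistent video frames."""
    poses = MotionExpert().generate_poses(music_features)
    return AppearanceExpert().render_video(poses, reference_image)

frames = mace_dance(music_features=[0.1, 0.2, 0.3],
                    reference_image={"id": "dancer_001"})
```

The key design point this illustrates is the disentanglement: motion quality and visual fidelity are optimized by separate experts, coupled only through the pose sequence.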
Problem

Research questions and friction points this paper is trying to address.

Generating dance videos from music with realistic, rhythm-aligned motion
Preserving visual identity and spatiotemporal coherence during synthesis
Jointly achieving high-quality visual appearance and plausible motion, which existing methods struggle to do
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded Mixture-of-Experts framework for dance video generation
Motion Expert uses diffusion model with BiMamba-Transformer architecture
Appearance Expert employs decoupled kinematic-aesthetic fine-tuning strategy
Kaixing Yang
Renmin University of China
Jiashu Zhu
Alibaba Group
Xulong Tang
Malou Tech Inc
Ziqiao Peng
Renmin University of China
3D Face Animation · Talking Head Generation
Xiangyue Zhang
Wuhan University
Puwei Wang
Renmin University of China
Jiahong Wu
Alibaba Group
Xiangxiang Chu
Alibaba Group
Hongyan Liu
Zhejiang University
Programmable Networks · Network Measurement · P4 Language
Jun He
Renmin University of China