MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing music-driven dance video generation methods struggle to simultaneously achieve high visual fidelity and physically plausible, artistically expressive motion. To address this, we propose a motion-appearance disentangled cascaded Mixture-of-Experts (MoE) framework: a motion expert employs a BiMamba-Transformer hybrid diffusion model to map music into 3D dance poses adhering to biomechanical constraints and choreographic expressiveness; an appearance expert fuses the pose sequence with a reference image to synthesize identity-consistent, spatiotemporally coherent high-fidelity video. We introduce two key innovations: (1) a guidance-free training (GFT) strategy and (2) disentangled motion-aesthetic fine-tuning. Furthermore, we establish the first joint motion-appearance evaluation protocol for dance videos and release a large-scale dance dataset. Our method achieves state-of-the-art performance on both 3D dance generation and pose-conditioned video synthesis, outperforming prior work under our new protocol—delivering end-to-end dance videos with high fidelity, precise rhythmic alignment, and stable identity preservation.

📝 Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves SOTA performance. Project page: https://macedance.github.io/
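The cascade described in the abstract can be sketched at a high level: the Motion Expert maps music features to a 3D pose sequence, and the Appearance Expert renders that sequence into identity-consistent frames conditioned on a reference image. The sketch below is purely illustrative; all class and method names are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of the MACE-Dance cascade. Names (MotionExpert,
# AppearanceExpert, generate_poses, render_video) are illustrative only.

class MotionExpert:
    """Stand-in for the BiMamba-Transformer hybrid diffusion model:
    maps per-frame music features to a 3D pose sequence."""
    def generate_poses(self, music_features):
        # Placeholder: one 3D pose (24 joint coordinates) per music frame.
        return [[(0.0, 0.0, 0.0)] * 24 for _ in music_features]

class AppearanceExpert:
    """Stand-in for the pose- and reference-conditioned video synthesizer."""
    def render_video(self, poses, reference_image):
        # Placeholder: one frame per pose, carrying the reference identity.
        return [{"pose": p, "identity": reference_image["id"]} for p in poses]

def mace_dance(music_features, reference_image):
    """Cascade: music -> 3D poses -> identity-consistent video frames."""
    poses = MotionExpert().generate_poses(music_features)
    return AppearanceExpert().render_video(poses, reference_image)

frames = mace_dance(music_features=[0.1, 0.2, 0.3],
                    reference_image={"id": "dancer_001"})
```

The key design point this illustrates is the disentanglement: motion quality and visual fidelity are optimized by separate experts, coupled only through the pose sequence.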
Problem

Research questions and friction points this paper is trying to address.

Generating dance videos from music with realistic, rhythm-aligned motion
Preserving visual identity and spatiotemporal coherence during synthesis
Jointly achieving high-quality visual appearance and plausible motion, which existing methods struggle to do
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded Mixture-of-Experts framework for dance video generation
Motion Expert uses diffusion model with BiMamba-Transformer architecture
Appearance Expert employs decoupled kinematic-aesthetic fine-tuning strategy
Kaixing Yang
Renmin University of China
Jiashu Zhu
Alibaba Group
Xulong Tang
Malou Tech Inc
Ziqiao Peng
Renmin University of China
3D Face Animation · Talking Head Generation
Xiangyue Zhang
Wuhan University
Puwei Wang
Renmin University of China
Jiahong Wu
Alibaba Group
Xiangxiang Chu
Alibaba Group
Hongyan Liu
Zhejiang University
Programmable Networks · Network Measurement · P4 Language
Jun He
Renmin University of China