DanceMosaic: High-Fidelity Dance Generation with Multimodal Editability

πŸ“… 2025-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing dance generation methods struggle to simultaneously achieve realism, music synchronization, motion diversity, and physical plausibility, while lacking flexible editing capabilities for multimodal conditioning signalsβ€”such as musical cues, pose constraints, action labels, and genre descriptions. To address these limitations, we propose the first multimodal masked motion model tailored for high-fidelity 3D dance generation, integrating a text-to-motion framework with dual adapters for music and pose conditioning. We further introduce multimodal classifier-free guidance and inference-time motion optimization, jointly enhancing cross-modal alignment fidelity and editing flexibility. Our approach achieves state-of-the-art performance across multiple quantitative metrics, significantly improving generation quality, physical plausibility, and real-time editability. It enables rich, user-controllable creative expression through diverse multimodal inputs.
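To make the multimodal classifier-free guidance idea concrete, here is a minimal sketch of how logits from several conditioning signals can be combined at sampling time. The function names, the per-modality guidance scales, and the toy logit function are illustrative assumptions, not the paper's actual API; the core pattern is adding independently scaled conditional-minus-unconditional deltas to the unconditional prediction.

```python
import numpy as np

def multimodal_cfg(logit_fn, tokens, scales):
    """Combine per-modality guidance (hypothetical interface).

    Each modality's delta (conditional logits minus unconditional
    logits) is scaled independently, then added to the unconditional
    logits -- one common way to extend classifier-free guidance to
    several conditioning signals.
    """
    uncond = logit_fn(tokens, cond=None)
    guided = uncond.copy()
    for modality, scale in scales.items():
        cond_logits = logit_fn(tokens, cond=modality)
        guided += scale * (cond_logits - uncond)
    return guided

# Toy stand-in for the masked motion transformer's logits.
def toy_logits(tokens, cond=None):
    base = np.zeros(4)
    if cond == "music":
        base[1] = 1.0   # music conditioning favors token 1
    elif cond == "text":
        base[2] = 1.0   # text conditioning favors token 2
    return base

out = multimodal_cfg(toy_logits, tokens=None,
                     scales={"music": 2.0, "text": 1.5})
# out favors the music- and text-preferred tokens in proportion
# to their guidance scales.
```

In practice a larger scale for one modality (e.g. music) biases sampling more strongly toward motions aligned with that signal.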

πŸ“ Abstract
Recent advances in dance generation have enabled automatic synthesis of 3D dance motions. However, existing methods still struggle to produce high-fidelity dance sequences that simultaneously deliver exceptional realism, precise dance-music synchronization, high motion diversity, and physical plausibility. Moreover, existing methods lack the flexibility to edit dance sequences according to diverse guidance signals, such as musical prompts, pose constraints, action labels, and genre descriptions, significantly restricting their creative utility and adaptability. Unlike existing approaches, DanceMosaic enables fast and high-fidelity dance generation while allowing multimodal motion editing. Specifically, we propose a multimodal masked motion model that fuses a text-to-motion model with music and pose adapters to learn a probabilistic mapping from diverse guidance signals to high-quality dance motion sequences via progressive generative masking training. To further enhance motion generation quality, we propose multimodal classifier-free guidance and an inference-time optimization mechanism that further enforce alignment between the generated motions and the multimodal guidance. Extensive experiments demonstrate that our method establishes a new state-of-the-art in dance generation, significantly advancing the quality and editability achieved by existing approaches.
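The abstract's "progressive generative masking training" can be sketched as follows. This is a minimal, hedged illustration in the spirit of masked motion models (MaskGIT-style): the cosine schedule, mask token id, and function names are assumptions, not the paper's exact recipe. During training, a random fraction of quantized motion tokens is replaced by a mask token, and the model learns to predict the originals from the guidance signals and surviving context.

```python
import numpy as np

def mask_motion_tokens(tokens, rng, mask_id=-1):
    """Mask a random subset of motion tokens (illustrative sketch).

    A cosine schedule over a uniformly sampled timestep (an assumed
    choice) yields mask ratios covering both lightly and heavily
    masked sequences, so the model learns to infill at every
    corruption level.
    """
    u = rng.uniform()                        # random training timestep in [0, 1)
    ratio = np.cos(0.5 * np.pi * u)          # cosine masking schedule in (0, 1]
    n_mask = max(1, int(round(ratio * len(tokens))))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = mask_id                    # model is trained to predict tokens[idx]
    return masked, idx

rng = np.random.default_rng(0)
tokens = np.arange(16)                       # toy quantized motion-token sequence
masked, idx = mask_motion_tokens(tokens, rng)
```

At inference, the same model can start from a fully masked sequence (generation) or from a partially masked one (editing a chosen region while keeping the rest fixed), which is what makes masked models natural editors.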
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity dance sequences with simultaneous realism and music synchronization
Enabling multimodal editing via music, pose, and text guidance
Improving motion diversity and physical plausibility in dance generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal masked motion model fusion
Classifier-free guidance for alignment
Inference-time optimization mechanism
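The inference-time optimization contribution listed above can be illustrated with a small sketch: the generated motion is nudged toward user-supplied pose constraints by gradient descent on a quadratic constraint loss. The loss, update rule, and data layout are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def refine_motion(motion, constraints, lr=0.1, steps=100):
    """Inference-time refinement toward pose constraints (sketch).

    motion:      (T, D) array of generated per-frame joint features
    constraints: dict mapping frame index -> target (D,) pose
    Applies gradient descent on 0.5 * ||motion[t] - target||^2 for
    each constrained frame, leaving unconstrained frames untouched.
    """
    refined = motion.copy()
    for _ in range(steps):
        for t, target in constraints.items():
            # gradient of the quadratic constraint loss at frame t
            refined[t] -= lr * (refined[t] - target)
    return refined

motion = np.zeros((8, 3))            # toy generated motion, 8 frames
target = np.ones(3)                  # user-specified pose for frame 2
refined = refine_motion(motion, {2: target})
```

A real system would likely add smoothness or physical-plausibility terms so constrained frames blend with their neighbors rather than snapping independently.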
πŸ”Ž Similar Papers
No similar papers found.
Foram Niravbhai Shah
University of North Carolina at Charlotte, Charlotte, NC, USA
Parshwa Shah
University of North Carolina at Charlotte, Charlotte, NC, USA
Muhammad Usama Saleem
University of North Carolina at Charlotte, Charlotte, NC, USA
Ekkasit Pinyoanuntapong
University of North Carolina at Charlotte, Charlotte, NC, USA
Pu Wang
University of North Carolina at Charlotte, Charlotte, NC, USA
Hongfei Xue
University of North Carolina at Charlotte, Charlotte, NC, USA
Ahmed Helmy
Assoc. Dean for Research, College of Computing, UNC Charlotte