MixerMDM: Learnable Composition of Human Motion Diffusion Models

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-fidelity human motion conditioned on multiple textual descriptions. We propose MixerMDM, a learnable composition framework that combines several pre-trained text-to-motion diffusion models. An adversarially trained mixing mechanism dynamically adjusts how each model's denoising output is weighted throughout the generation process, depending on the conditioning text, which enables fine-grained control over both individual pose dynamics and inter-person interaction. We also introduce an evaluation protocol, the first for this task, that measures interaction and individual motion quality via the alignment between the mixed motions and their conditions. Experiments show that our approach consistently outperforms existing composition methods in text-motion alignment, motion diversity, and interaction plausibility.

📝 Abstract
Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.
Problem

Research questions and friction points this paper is trying to address.

Dynamic mixing of motion diffusion models for fine control
Learnable composition adapting to model and textual conditions
Evaluating interaction and individual motion alignment quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable dynamic mixing of diffusion models
Adversarial training for condition-dependent composition
Fine-grained control via adaptive denoising process