CoMo: Compositional Motion Customization for Text-to-Video Generation

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video models struggle to precisely control complex multi-agent motion due to entanglement between motion and appearance features and the failure of existing multi-motion fusion mechanisms. To address this, we propose a compositional motion customization framework: first, a static-dynamic disentangled single-motion learning paradigm explicitly separates appearance and motion representations; second, a plug-and-play divide-and-conquer fusion strategy enables zero-shot, spatially isolated composition of multiple motions without retraining. Built upon a diffusion-based two-stage architecture, comprising disentangled fine-tuning and spatially constrained denoising, our approach effectively alleviates feature entanglement and significantly improves fusion fidelity. Evaluated on a newly constructed benchmark with dedicated metrics, our method substantially outperforms state-of-the-art approaches, establishing a scalable, high-fidelity paradigm for controllable video generation and motion editing.

📝 Abstract
While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Enabling precise motion control for complex multi-subject video generation
Solving motion-appearance entanglement in compositional motion scenarios
Developing effective multi-motion blending without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase framework for compositional motion customization
Static-dynamic decoupled tuning for motion disentanglement
Plug-and-play divide-and-merge strategy for multi-motion blending
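The divide-and-merge idea, as described in the abstract, spatially isolates each learned motion module's influence during denoising. A minimal sketch of that masking step, assuming one noise prediction per motion module and binary region masks (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def divide_and_merge(eps_base, eps_motions, masks):
    """Hedged sketch of spatially masked fusion at one denoising step.

    eps_base:    noise prediction from the base model, shape (H, W)
    eps_motions: list of noise predictions, one per learned motion module
    masks:       list of binary spatial masks (same shape as eps_base)
                 assigning each subject's region to one motion module;
                 pixels outside every mask keep the base prediction.
    """
    fused = eps_base.copy()
    for eps_m, mask in zip(eps_motions, masks):
        # Inside a motion's region, use that module's prediction.
        fused = np.where(mask.astype(bool), eps_m, fused)
    return fused
```

Because the masks partition the latent spatially, each motion module only steers its own region, which is one plausible reading of "spatially isolating their influence" without any retraining.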
Youcan Xu — Zhejiang University
Zhen Wang — HKUST
Jiaxin Shi — Xmax.AI Ltd
Kexin Li — Zhejiang University
Feifei Shao — Zhejiang University (machine learning, computer vision, weakly supervised learning, active learning)
Jun Xiao — Zhejiang University
Yi Yang — Zhejiang University
Jun Yu — HIT (SZ)
Long Chen — HKUST