I2VControl: Disentangled and Unified Video Motion Synthesis Control

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
In video synthesis, multi-motion control—such as camera motion, object dragging, and motion brushes—often suffers from signal coupling, leading to logical conflicts, poor controllability, and limited generalization. To address this, we propose I2VControl, the first framework enabling decoupled modeling of motion units with a unified control interface. It decomposes videos into independent motion components and separately encodes multimodal control signals—including text, pose, and optical flow—before fusing them via conditional adapters. Designed as a plug-in module, I2VControl is compatible with diverse pre-trained diffusion models without architectural modification. It supports cross-task and cross-model plug-and-play deployment, allowing flexible composition of arbitrary control conditions and overcoming single-constraint bottlenecks. Experiments demonstrate state-of-the-art performance across multiple motion control tasks, achieving significant improvements in control accuracy, temporal consistency, and generalization to unseen control combinations.
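The core idea in the summary above, partitioning a frame into independent motion units and fusing their control signals into one unified representation, can be illustrated with a minimal sketch. Everything here (the `MotionUnit` class, the dense displacement-field representation, the conflict check) is an illustrative assumption for exposition, not the paper's actual API or trajectory format:

```python
# Hypothetical sketch of disentangled motion units with a unified control
# interface: the frame is partitioned into non-overlapping spatial units,
# each carrying its own control signal (here a single 2D displacement), and
# the per-unit signals are composed into one dense trajectory field.
# Names and representations are illustrative, not from the paper.

from dataclasses import dataclass


@dataclass
class MotionUnit:
    """One spatial region with its own disentangled control signal."""
    name: str
    pixels: set          # set of (x, y) coordinates owned by this unit
    displacement: tuple  # (dx, dy) applied to every pixel of the unit


def compose_controls(units, width, height):
    """Fuse per-unit controls into one dense displacement field.

    Because the units partition space (no pixel belongs to two units),
    the composed field is conflict-free by construction; overlapping
    claims raise an error instead of silently coupling signals.
    """
    claimed = {}
    traj_field = {}
    for unit in units:
        for p in unit.pixels:
            if p in claimed:
                raise ValueError(
                    f"conflict: {p} claimed by {claimed[p]} and {unit.name}")
            claimed[p] = unit.name
            traj_field[p] = unit.displacement
    # pixels owned by no unit default to zero (static) motion
    for x in range(width):
        for y in range(height):
            traj_field.setdefault((x, y), (0.0, 0.0))
    return traj_field


# Example: a camera pan drives the background while an object drag drives
# a small region, without the two signals interfering.
bg_pixels = {(x, y) for x in range(4) for y in range(4)} - {(1, 1), (2, 1)}
bg = MotionUnit("camera_pan", bg_pixels, (1.0, 0.0))
obj = MotionUnit("drag_object", {(1, 1), (2, 1)}, (0.0, -2.0))
traj = compose_controls([bg, obj], 4, 4)
```

The spatial partition is what makes composition conflict-free: each pixel is driven by exactly one control source, so arbitrary combinations of controls (camera, drag, brush) reduce to choosing a partition and a signal per unit.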

📝 Abstract
Video synthesis techniques are undergoing rapid progress, with controllability being a significant aspect of practical usability for end-users. Although text conditioning is an effective way to guide video synthesis, capturing the correct joint distribution between text descriptions and video motion remains a substantial challenge. In this paper, we present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis. Our approach partitions the video into individual motion units and represents each unit with disentangled control signals, which allows various control types to be flexibly combined within our single system. Furthermore, our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. The project page is: https://wanquanf.github.io/I2VControl
Problem

Research questions and friction points this paper is trying to address.

Overcoming logical conflicts in diverse video motion control
Unifying camera, object, and motion controls via point trajectories
Enabling dynamic orchestration of control types without conflicts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled unified framework for video motion control
Spatial partitioning strategy for conflict-free synthesis
Adapter structure for pre-trained model compatibility
👥 Authors
Wanquan Feng (USTC): computer vision
Tianhao Qi (PhD, University of Science and Technology of China): cross-modal generation, object detection
Jiawei Liu (ByteDance China)
Mingzhen Sun (ByteDance China; Institute of Automation, Chinese Academy of Sciences (CASIA))
Pengqi Tu (ByteDance China)
Tianxiang Ma (ByteDance Inc.; NLPR, CASIA): computer vision, deep learning, AIGC
Fei Dai (ByteDance China)
Songtao Zhao (ByteDance China)
Siyu Zhou (ByteDance China)
Qian He (ByteDance)