Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion models (VDMs) suffer from high computational cost and slow inference, hindering practical deployment. To address this, we propose VDMini, the first hierarchical pruning framework that decouples individual content representation from motion dynamics modeling: shallow layers preserve content encoding capacity, while deeper layers specialize in temporal modeling. We introduce two novel consistency losses—Individual Content Distillation (ICD) and Multi-frame Content Adversarial (MCA)—to jointly optimize intra-frame content fidelity and inter-frame motion coherence. Integrating structured hierarchical pruning with knowledge distillation, VDMini achieves 2.5× and 1.4× inference acceleration over SF-V (image-to-video) and T2V-Turbo-v2 (text-to-video), respectively, without compromising video quality on UCF101 and VBench benchmarks.

📝 Abstract
The high computational cost and slow inference time are major obstacles to deploying video diffusion models (VDMs) in practical applications. To overcome this, we introduce a new VDM compression approach using individual content and motion dynamics preserved pruning together with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of **motion dynamics**, e.g., coherence of the entire video, while shallower layers focus more on **individual content**, e.g., individual frames. Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an **Individual Content and Motion Dynamics (ICMD)** consistency loss so that VDMini (the student) attains generation performance comparable to the larger VDM (the teacher). In particular, we first use an Individual Content Distillation (ICD) loss to ensure consistency between the teacher's and student's features for each generated frame. Next, we introduce a Multi-frame Content Adversarial (MCA) loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we achieve an average 2.5× speed-up for the I2V method SF-V and 1.4× for the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, UCF101 and VBench.
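The ICMD objective combines a per-frame distillation term with a multi-frame adversarial term. The following is a minimal PyTorch sketch of that structure; the function names, the use of MSE for feature matching, the non-saturating GAN loss, and the weighting `lam` are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def icd_loss(teacher_feats, student_feats):
    """Individual Content Distillation: match per-frame features between
    teacher and student. MSE is used here as a stand-in distance.
    Shapes assumed: (batch, frames, channels, h, w)."""
    return F.mse_loss(student_feats, teacher_feats)

def mca_loss(discriminator, student_video):
    """Multi-frame Content Adversarial loss (generator side): a
    discriminator scores whole clips, pushing the student toward
    temporally coherent motion across frames."""
    logits = discriminator(student_video)
    # Non-saturating GAN loss: label generated clips as "real"
    return F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))

def icmd_loss(teacher_feats, student_feats, discriminator,
              student_video, lam=0.1):
    """ICMD consistency loss: intra-frame content fidelity (ICD) plus
    inter-frame motion dynamics (MCA). `lam` is a hypothetical weight."""
    return (icd_loss(teacher_feats, student_feats)
            + lam * mca_loss(discriminator, student_video))
```

In training, the student VDMini would minimize this loss while the clip-level discriminator is updated adversarially on teacher (or real) versus student videos.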
Problem

Research questions and friction points this paper is trying to address.

Reduce computational cost and slow inference in Video Diffusion Models
Prune redundant blocks while preserving content and motion quality
Maintain generation performance with lightweight model VDMini
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prune shallow layers, preserve deep layers
Use ICMD Consistency Loss for performance
Apply Individual Content Distillation Loss
Yiming Wu
HKU | ZJU
Computer Vision and Machine Learning
Huan Wang
School of Engineering, Westlake University
Zhenghao Chen
School of Information and Physical Sciences, The University of Newcastle, Australia
Dong Xu
School of Computing and Data Science, The University of Hong Kong