🤖 AI Summary
To address the slow inference and high GPU memory consumption of diffusion Transformer (DiT)-based video generation models, this work proposes PipeDiT, a pipelining framework for accelerating inference. First, it introduces PipeSP, a pipelining algorithm for sequence parallelism (SP) that overlaps latent-generation computation with communication among GPUs. Second, it introduces DeDiVAE, which decouples the diffusion module and the VAE module into two GPU groups whose executions can be pipelined, reducing both memory consumption and inference latency. Third, it proposes attention co-processing (Aco), which makes better use of the GPU resources in the VAE group to further reduce overall generation latency. PipeDiT is integrated into OpenSoraPlan and HunyuanVideo. Experiments on two 8-GPU systems show speedups of 1.06× to 4.02× over these two baselines under many common resolution and timestep configurations.
📝 Abstract
Video generation has been advancing rapidly, and diffusion transformer (DiT)-based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.
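To make the pipelining idea behind DeDiVAE concrete, the following is a minimal timing sketch, not the authors' implementation. It models a stream of generation requests that each pass through a diffusion stage and then a VAE decoding stage, and compares running both stages back to back against running them on two separate groups that overlap in time. The function names and the per-stage durations are hypothetical values chosen for illustration; they are not measurements from the paper, and the model ignores communication cost and memory effects.

```python
def sequential_latency(denoise_times, decode_times):
    """Total time when one GPU group runs diffusion, then VAE decoding,
    for each request in turn (no overlap between stages)."""
    return sum(denoise_times) + sum(decode_times)


def pipelined_latency(denoise_times, decode_times):
    """Total time when diffusion and VAE decoding run on two separate
    groups, so decoding request i can overlap with denoising request i+1.
    This is the standard two-stage pipeline makespan recurrence."""
    denoise_done = 0.0  # time at which the diffusion group finishes the current request
    decode_done = 0.0   # time at which the VAE group finishes the current request
    for t_denoise, t_decode in zip(denoise_times, decode_times):
        denoise_done += t_denoise
        # Decoding starts once the latent is ready AND the VAE group is free.
        decode_done = max(denoise_done, decode_done) + t_decode
    return decode_done


# Hypothetical per-request durations in seconds (illustrative only).
denoise = [10, 10, 10]
decode = [4, 4, 4]

print(sequential_latency(denoise, decode))  # 42
print(pipelined_latency(denoise, decode))   # 34.0
```

In this toy model, every decode except the last is hidden behind the next request's denoising, so the gain grows with the number of queued requests and with the share of time spent in the VAE stage. The sketch also shows why the VAE group sits idle for much of the time when denoising dominates, which is the underutilization that the abstract's attention co-processing (Aco) method is meant to address.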