🤖 AI Summary
To address the high inference latency and substantial inter-GPU communication overhead of generating high-resolution images with Diffusion Transformer (DiT) models, this paper proposes a collaborative pipeline-parallelism scheme that partitions work at both the patch level and the layer level. We introduce the first patch-level pipeline parallelism paradigm, integrated with a mechanism that reuses one-step-stale feature maps across diffusion steps, drastically reducing inter-GPU communication volume. The method further shards model parameters across devices, enabling memory-efficient GPU scheduling. To our knowledge, this is the first work enabling low-latency inference of ultra-large DiT models, such as Flux.1, on an 8×L40 PCIe GPU cluster. It achieves state-of-the-art throughput and latency on PixArt-α, Stable Diffusion 3, and Flux.1, reducing communication volume by up to several orders of magnitude compared with Tensor Parallelism, Sequence Parallelism, and DistriFusion.
📝 Abstract
This paper presents PipeFusion, an innovative parallel methodology that tackles the high latency of generating high-resolution images with diffusion transformer (DiT) models. PipeFusion partitions images into patches and distributes the model's layers across multiple GPUs, employing a patch-level pipeline-parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs of successive diffusion steps, PipeFusion reuses one-step-stale feature maps to provide context for the current pipeline step. This notably reduces communication costs compared to existing DiT inference parallelism approaches, including tensor parallelism, sequence parallelism, and DistriFusion. PipeFusion also improves memory efficiency by distributing parameters across devices, making it well suited to large DiTs such as Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8×L40 PCIe GPUs for the PixArt, Stable Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.
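The core scheduling idea can be sketched in plain Python. This is a minimal, illustrative simulation, not the xDiT implementation: all names (`layer_fn`, `cache`, the loop structure) are hypothetical, and the "layer" is a stand-in function rather than a real transformer block. It shows how layer shards process patches in pipeline order while attention context for patches not yet computed this step falls back to the feature map cached from the previous diffusion step, which is what removes the need for fresh all-to-all communication each step.

```python
# Hypothetical sketch of patch-level pipeline parallelism with
# one-step-stale feature reuse (names are illustrative, not the xDiT API).

NUM_DEVICES = 4   # layer shards, one per "GPU"
NUM_PATCHES = 8   # image patches streamed through the pipeline
NUM_STEPS = 3     # diffusion steps

def layer_fn(device, patch_feat, context):
    # Stand-in for a transformer layer shard; a real layer would
    # attend over `context` (features of all patches).
    return patch_feat + 1

# cache[d][p]: the input feature that shard d saw for patch p
# on the previous diffusion step (the "stale" feature map).
cache = [[0] * NUM_PATCHES for _ in range(NUM_DEVICES)]

for step in range(NUM_STEPS):
    for p in range(NUM_PATCHES):          # patches enter the pipeline in order
        feat = 0                          # stand-in for the patch latent
        for d in range(NUM_DEVICES):      # each shard processes patch p in turn
            # Context mixes fresh features (patches already updated this
            # step, q <= p) with one-step-stale ones (q > p) -- no extra
            # inter-GPU exchange is needed for the stale entries.
            context = [cache[d][q] for q in range(NUM_PATCHES)]
            cache[d][p] = feat            # refresh this shard's cache entry
            feat = layer_fn(d, feat, context)

# After any full step, shard d's cache holds the input it received,
# i.e. the output of the first d layers.
print(cache[-1])
```

The key design point the sketch illustrates: because successive diffusion steps produce highly similar activations, substituting last step's cached features for not-yet-available patches barely affects quality while collapsing most of the communication that tensor or sequence parallelism would require.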