🤖 AI Summary
To address the high inference latency and substantial inter-GPU communication overhead of generating high-resolution images with Diffusion Transformer (DiT) models, this paper proposes a collaborative pipeline-parallelism scheme that partitions work at both the patch level and the layer level. We introduce the first patch-level pipeline parallelism paradigm, integrated with a mechanism that reuses one-step-stale feature maps across diffusion steps, drastically reducing inter-GPU communication volume. The method further shards model parameters across devices, enabling memory-efficient GPU scheduling. To our knowledge, this is the first work enabling low-latency inference of ultra-large DiT models, such as Flux.1, on an 8×L40 PCIe GPU cluster. It achieves state-of-the-art throughput and latency on PixArt-α, Stable Diffusion 3, and Flux.1, reducing communication volume by up to several orders of magnitude compared with Tensor Parallelism, Sequence Parallelism, and DistriFusion.
📝 Abstract
This paper presents PipeFusion, an innovative parallel methodology that tackles the high latency of generating high-resolution images with diffusion transformer (DiT) models. PipeFusion partitions images into patches and distributes the model's layers across multiple GPUs, employing a patch-level pipeline-parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs of successive diffusion steps, PipeFusion reuses one-step-stale feature maps to provide context for the current pipeline step. This notably reduces communication costs compared to existing DiT inference parallelism approaches, including tensor parallelism, sequence parallelism, and DistriFusion. PipeFusion also improves memory efficiency by distributing parameters across devices, making it well suited to large DiTs such as Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8×L40 PCIe GPUs for the PixArt, Stable Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.
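The core scheduling idea can be sketched in plain Python. This is a minimal, illustrative simulation, not the xDiT implementation: all names (`layer_fn`, `cache`, the loop structure) are hypothetical, and the "layer" is a stand-in function rather than a real transformer block. It shows how layer shards process patches in pipeline order while attention context for patches not yet computed this step falls back to the feature map cached from the previous diffusion step, which is what removes the need for fresh all-to-all communication each step.

```python
# Hypothetical sketch of patch-level pipeline parallelism with
# one-step-stale feature reuse (names are illustrative, not the xDiT API).

NUM_DEVICES = 4   # layer shards, one per "GPU"
NUM_PATCHES = 8   # image patches streamed through the pipeline
NUM_STEPS = 3     # diffusion steps

def layer_fn(device, patch_feat, context):
    # Stand-in for a transformer layer shard; a real layer would
    # attend over `context` (features of all patches).
    return patch_feat + 1

# cache[d][p]: the input feature that shard d saw for patch p
# on the previous diffusion step (the "stale" feature map).
cache = [[0] * NUM_PATCHES for _ in range(NUM_DEVICES)]

for step in range(NUM_STEPS):
    for p in range(NUM_PATCHES):          # patches enter the pipeline in order
        feat = 0                          # stand-in for the patch latent
        for d in range(NUM_DEVICES):      # each shard processes patch p in turn
            # Context mixes fresh features (patches already updated this
            # step, q <= p) with one-step-stale ones (q > p) -- no extra
            # inter-GPU exchange is needed for the stale entries.
            context = [cache[d][q] for q in range(NUM_PATCHES)]
            cache[d][p] = feat            # refresh this shard's cache entry
            feat = layer_fn(d, feat, context)

# After any full step, shard d's cache holds the input it received,
# i.e. the output of the first d layers.
print(cache[-1])
```

The key design point the sketch illustrates: because successive diffusion steps produce highly similar activations, substituting last step's cached features for not-yet-available patches barely affects quality while collapsing most of the communication that tensor or sequence parallelism would require.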