GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

📅 2026-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the inefficiency of static parallelism configurations in Diffusion Transformer (DiT) serving, which cannot adapt to heterogeneity across requests, execution stages, and system conditions, resulting in low GPU utilization and degraded service quality. To overcome this limitation, the authors propose GF-DiT, an elastic DiT serving framework that treats GPU parallelism as a schedulable resource and dynamically adjusts the parallel configuration of in-flight requests to match workload characteristics and service objectives. The approach introduces two key innovations: an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks, enabling online GPU reallocation, and group-free collectives, a communication abstraction that supports low-overhead formation and reconfiguration of arbitrary execution groups. Implemented in vLLM-Omni and compared with static parallelism, the system achieves up to 6.01× higher throughput, up to 95% lower mean latency, and up to 90% fewer SLO violations, and reduces communication-group setup overhead from 778 ms to approximately 60 μs.
📝 Abstract
Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $\mu$s.
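The idea of treating parallelism as a schedulable resource can be illustrated with a toy simulation. The sketch below is not GF-DiT's scheduler or API: the `Request` class, the `reallocate` policy (an even split of GPUs among unfinished requests), the tick-based loop, and the assumption of ideal linear scaling with parallel degree are all hypothetical and chosen only to show how a request's parallel degree might change while it is running.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical in-flight request with remaining work in GPU-ticks."""
    name: str
    work_left: int
    degree: int = 0  # GPUs currently assigned (parallel degree)


def reallocate(requests, total_gpus):
    """Toy policy (an assumption): split GPUs evenly among unfinished requests."""
    active = [r for r in requests if r.work_left > 0]
    for r in requests:
        r.degree = 0
    if not active:
        return
    base, extra = divmod(total_gpus, len(active))
    for i, r in enumerate(active):
        r.degree = base + (1 if i < extra else 0)


def simulate(requests, total_gpus):
    """Re-run the policy every tick and return the per-tick degree assignments.

    Assumes ideal linear scaling: a request with degree d completes d units
    of work per tick.
    """
    if total_gpus < 1:
        raise ValueError("need at least one GPU")
    history = []
    while any(r.work_left > 0 for r in requests):
        reallocate(requests, total_gpus)
        history.append({r.name: r.degree for r in requests})
        for r in requests:
            r.work_left = max(0, r.work_left - r.degree)
    return history


if __name__ == "__main__":
    reqs = [Request("short", 4), Request("long", 12)]
    for tick, assignment in enumerate(simulate(reqs, total_gpus=4)):
        print(tick, assignment)
```

In this toy run, the long request starts with two GPUs and is widened to four once the short request finishes, so it completes in four ticks rather than the six it would need if its degree were fixed at two for its whole lifetime.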
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
serving
parallelism
heterogeneity
GPU utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

elastic parallelism
Diffusion Transformer serving
group-free collectives
asynchronous execution
GPU resource scheduling