Fast Video Generation with Sliding Tile Attention

📅 2025-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the prohibitively high computational cost of 3D full attention in video diffusion Transformers, which severely limits inference efficiency, this paper proposes Sliding Tile Attention (STA): a hardware-aware, tile-level sliding-window attention mechanism that supports 2D/3D modeling and plug-and-play deployment without retraining. STA combines spatiotemporal local tiling, hardware-aligned sliding-window scheduling, and careful kernel-level optimization. Evaluated on HunyuanVideo, STA reduces end-to-end latency from 945 seconds to 685 seconds with no quality degradation; with lightweight fine-tuning, latency drops further to 268 seconds while the VBench score degrades by only 0.09%. The attention module achieves a 2.8-17x speedup over FlashAttention-2 and reaches a peak MFU of 58.79%, substantially improving hardware utilization and generation throughput.

📝 Abstract
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
Problem

Research questions and friction points this paper is trying to address.

Prohibitive compute cost of 3D full attention in video generation (800 of 945 seconds of inference time for a 5-second 720P video)
Token-wise sliding window attention (SWA) maps poorly onto GPU hardware
No prior efficient 2D/3D sliding-window-like attention implementation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding tile attention for efficiency
Hardware-aware sliding window design
Kernel-level optimizations for 2D/3D attention
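The core idea behind STA is that the sliding window is defined at tile granularity rather than per token, so every (query tile, key tile) pair is either fully computed or fully skipped, avoiding the ragged, partially masked blocks that make token-wise SWA hardware-inefficient. Below is a minimal 1D numpy sketch of that idea; the function names, the boundary-clamping choice, and the 1D setting are illustrative assumptions only. The actual STA operates over 3D (time, height, width) tiles with custom GPU kernels.

```python
import numpy as np

def sta_tile_mask(n_tiles, window_tiles):
    # Tile-level mask: entry (i, j) is True iff query tile i attends to key tile j.
    # Each query tile gets a window of `window_tiles` contiguous key tiles centered
    # on it, clamped at the sequence boundaries (a hypothetical boundary policy).
    half = window_tiles // 2
    mask = np.zeros((n_tiles, n_tiles), dtype=bool)
    for i in range(n_tiles):
        start = min(max(i - half, 0), n_tiles - window_tiles)
        mask[i, start:start + window_tiles] = True
    return mask

def sta_attention_1d(q, k, v, tile_size, window_tiles):
    # 1D illustration: q, k, v have shape (seq_len, dim), seq_len % tile_size == 0.
    # Because masking happens per tile, each query tile does dense attention over
    # its selected key/value tiles -- no per-token mask checks inside a block.
    n_tiles = q.shape[0] // tile_size
    tmask = sta_tile_mask(n_tiles, window_tiles)
    out = np.zeros_like(q)
    for i in range(n_tiles):
        qi = q[i * tile_size:(i + 1) * tile_size]
        idx = np.flatnonzero(tmask[i])
        kk = np.concatenate([k[j * tile_size:(j + 1) * tile_size] for j in idx])
        vv = np.concatenate([v[j * tile_size:(j + 1) * tile_size] for j in idx])
        scores = qi @ kk.T / np.sqrt(q.shape[1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[i * tile_size:(i + 1) * tile_size] = w @ vv
    return out
```

In a real kernel the per-tile dense blocks map directly onto the block structure that FlashAttention-style implementations already compute, which is why the tile-level mask preserves hardware efficiency while still restricting attention to a local window.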
Peiyuan Zhang
University of California, San Diego
Yongqi Chen
Unknown affiliation
Generative models · Machine Learning System · Robotics
Runlong Su
TikTok
Multimodal Models
Hangliang Ding
Tsinghua University
Ion Stoica
Professor of Computer Science, UC Berkeley
Cloud Computing · Networking · Distributed Systems · Big Data
Zhenghong Liu
Mohamed bin Zayed University of Artificial Intelligence
Hao Zhang
University of California, San Diego