SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from prohibitively high inference latency due to redundant computation of attention and feed-forward layers across diffusion steps, hindering real-time deployment. To address this, we propose SmoothCache, a model-agnostic, cross-timestep feature caching method: leveraging the high similarity of layer-wise representations between adjacent diffusion steps, it models timestep similarity using a small calibration set and adaptively caches and reuses critical intermediate features. The method requires no architectural modification or model retraining. It is compatible with diverse multimodal DiT architectures, including DiT-XL, Open-Sora, and Stable Audio Open, and achieves 8%–71% inference speedup while preserving or even improving generation quality, significantly advancing the practical real-time application of DiTs in image, video, and audio synthesis.

📝 Abstract
Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
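The abstract describes a two-phase scheme: measure layer-output similarity across adjacent timesteps on a small calibration run, then reuse cached features at inference wherever the measured change is small. A minimal sketch of that idea follows; the function names (`build_cache_schedule`, `run_with_cache`), the relative-L1 error criterion, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def build_cache_schedule(calib_outputs, threshold):
    """Calibration phase: given per-timestep layer outputs from a
    calibration run, flag timesteps whose output is close enough to the
    last freshly computed step to be reused from cache."""
    schedule = [False]  # the first step is always computed
    last_computed = calib_outputs[0]
    for out in calib_outputs[1:]:
        # Relative L1 error vs. the most recent non-cached output
        # (an assumed similarity measure for this sketch).
        err = np.abs(out - last_computed).mean() / (np.abs(last_computed).mean() + 1e-8)
        if err < threshold:
            schedule.append(True)       # reuse the cached output here
        else:
            schedule.append(False)      # recompute and refresh the cache
            last_computed = out
    return schedule

def run_with_cache(layer_fn, inputs, schedule):
    """Inference phase: apply layer_fn per timestep, skipping the
    expensive call at steps flagged for reuse."""
    outputs, cached = [], None
    for x, reuse in zip(inputs, schedule):
        if not reuse or cached is None:
            cached = layer_fn(x)        # full attention/FFN evaluation
        outputs.append(cached)          # cached features stand in otherwise
    return outputs

# Usage: outputs that drift slowly across timesteps get cached heavily.
calib = [np.full((2, 2), 1.0 + 0.001 * t) for t in range(5)]
schedule = build_cache_schedule(calib, threshold=0.0025)
calls = []
def layer(x):
    calls.append(1)
    return x * 2
outs = run_with_cache(layer, calib, schedule)
# Only the non-reused steps trigger a real layer evaluation.
```

In the actual method the schedule would be derived per layer type (attention vs. feed-forward) so that the speedup–quality trade-off is controlled by the calibration threshold alone, with no retraining.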
Problem

Research questions and friction points this paper is trying to address.

Accelerates inference for Diffusion Transformers (DiT)
Reduces computational cost of attention modules
Maintains generation quality across diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic acceleration for Diffusion Transformers
Adaptive caching of key features across timesteps
Maintains generation quality while speeding up inference