BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

πŸ“… 2025-09-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Video diffusion transformers (DiTs) suffer from high inference latency due to sequential denoising, hindering practical deployment. To address this, we propose a training-free, block-level feature caching acceleration method. We first observe a U-shaped similarity pattern across diffusion timesteps in intermediate features of DiT layers, a previously unreported property. Leveraging this insight, we design a dynamic similarity-threshold-based cache triggering mechanism that enables fine-grained reuse of intermediate features without altering the model architecture. This preserves generative fidelity while significantly reducing computational redundancy. Evaluated on multiple state-of-the-art video DiT models, our method achieves up to 2.24× inference speedup, with negligible degradation in FID and FVD scores and no perceptible loss in visual quality.

πŸ“ Abstract
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
Problem

Research questions and friction points this paper is trying to address.

Reducing the inference latency of video diffusion transformers
Reducing computational redundancy in sequential denoising process
Maintaining visual quality while reusing intermediate features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-wise caching for DiT acceleration
Similarity indicator triggers feature reuse
Training-free method maintains visual quality
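The core idea above, reusing a DiT block's cached output whenever its input has changed little since the previous timestep, can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the paper's implementation: the function name, the relative-L1 similarity metric, and the threshold value are assumptions.

```python
import numpy as np

def bwcache_forward(block, x, cache, threshold=0.05):
    """Run one DiT block with block-wise caching (illustrative sketch).

    If the block input at this timestep is sufficiently similar to the
    input seen at the previous timestep, reuse the cached output and
    skip the block computation entirely.
    """
    if cache.get("in") is not None:
        # Relative L1 change between block inputs at adjacent timesteps
        # (a stand-in for the paper's similarity indicator).
        diff = np.abs(x - cache["in"]).mean() / (np.abs(cache["in"]).mean() + 1e-8)
        if diff < threshold:
            return cache["out"]  # reuse cached feature, no recomputation
    out = block(x)               # feature changed too much: recompute
    cache["in"], cache["out"] = x.copy(), out.copy()
    return out
```

During the high-similarity intermediate timesteps of the U-shaped pattern, the threshold test succeeds for most blocks, so most block evaluations are skipped; near the start and end of denoising, features change quickly and the blocks are recomputed, preserving fidelity.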
Hanshuai Cui
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Zhiqing Tang
Associate Professor, Beijing Normal University
Edge Computing, Edge AI Systems, Container, Reinforcement Learning
Zhifei Xu
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Zhi Yao
Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, China
Wenyi Zeng
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Weijia Jia
FIEEE, Chair Professor, Beijing Normal University and UIC
Cyber Intelligent Computing, Networking