BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

πŸ“… 2025-09-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Video diffusion transformers (DiTs) suffer from high inference latency due to sequential denoising, hindering practical deployment. To address this, we propose a training-free, block-level feature caching acceleration method. We first observe a U-shaped similarity pattern across diffusion timesteps in intermediate features of DiT layers, a previously unreported property. Leveraging this insight, we design a dynamic similarity-threshold-based cache triggering mechanism that enables fine-grained reuse of intermediate features without altering the model architecture. This preserves generative fidelity while significantly reducing computational redundancy. Evaluated on multiple state-of-the-art video DiT models, our method achieves up to 2.24× inference speedup, with negligible degradation in FID and FVD scores and no perceptible loss in visual quality.

πŸ“ Abstract
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
Problem

Research questions and friction points this paper is trying to address.

Reducing the inference latency of video diffusion transformers
Reducing computational redundancy in sequential denoising process
Maintaining visual quality while reusing intermediate features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-wise caching for DiT acceleration
Similarity indicator triggers feature reuse
Training-free method maintains visual quality
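The core idea above, reusing a DiT block's cached output whenever its input has changed little since the previous timestep, can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the paper's implementation: the function name, the relative-L1 similarity metric, and the threshold value are assumptions.

```python
import numpy as np

def bwcache_forward(block, x, cache, threshold=0.05):
    """Run one DiT block with block-wise caching (illustrative sketch).

    If the block input at this timestep is sufficiently similar to the
    input seen at the previous timestep, reuse the cached output and
    skip the block computation entirely.
    """
    if cache.get("in") is not None:
        # Relative L1 change between block inputs at adjacent timesteps
        # (a stand-in for the paper's similarity indicator).
        diff = np.abs(x - cache["in"]).mean() / (np.abs(cache["in"]).mean() + 1e-8)
        if diff < threshold:
            return cache["out"]  # reuse cached feature, no recomputation
    out = block(x)               # feature changed too much: recompute
    cache["in"], cache["out"] = x.copy(), out.copy()
    return out
```

During the high-similarity intermediate timesteps of the U-shaped pattern, the threshold test succeeds for most blocks, so most block evaluations are skipped; near the start and end of denoising, features change quickly and the blocks are recomputed, preserving fidelity.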
Hanshuai Cui
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Zhiqing Tang
Associate Professor, Beijing Normal University
Edge Computing, Edge AI Systems, Container, Reinforcement Learning
Zhifei Xu
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Zhi Yao
Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, China
Wenyi Zeng
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Weijia Jia
FIEEE, Chair Professor, Beijing Normal University and UIC
Cyber Intelligent Computing, Networking