RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In video diffusion models, 3D attention accounts for over 80% of the DiT's computational cost, severely limiting generation efficiency. To address this, the paper proposes RainFusion, a training-free, plug-and-play sparse attention method that identifies and exploits three distinct visual redundancies in video generation: spatial, temporal, and textural. A lightweight Adaptive Recognition Module (ARM) selects the sparse pattern for each attention head online during inference, at negligible (~0.2%) overhead. Without fine-tuning or calibration, the method achieves over 2× attention speedup on HunyuanVideo, OpenSoraPlan-1.2, and CogVideoX-5B, with only a 0.2% drop in VBench score and negligible perceptual quality degradation. The core contributions are threefold: (1) identification of three orthogonal sparse patterns in 3D attention for video diffusion; (2) a near-zero-cost, self-adaptive redundancy recognition mechanism; and (3) a general-purpose, training-free 3D attention acceleration framework for video diffusion models.

📝 Abstract
Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80% of the total computational resources. In this work, we introduce RainFusion, a novel training-free sparse attention method that exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Specifically, we identify three unique sparse patterns in video generation attention calculations: Spatial Pattern, Temporal Pattern, and Textural Pattern. The sparse pattern for each attention head is determined online with negligible overhead (~0.2%) by our proposed ARM (Adaptive Recognition Module) during inference. RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models without additional training or calibration. We evaluate our method on leading open-source models including HunyuanVideo, OpenSoraPlan-1.2, and CogVideoX-5B, demonstrating its broad applicability and effectiveness. Experimental results show that RainFusion achieves over 2× speedup in attention computation while maintaining video quality, with only a minimal impact on VBench scores (-0.2%).
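The spatial and temporal patterns named in the abstract can be pictured as masks over the flattened video token sequence (frames × tokens per frame): a spatial pattern keeps attention within each frame (block-diagonal), while a temporal pattern keeps attention to the same spatial location across frames (strided). The sketch below is illustrative only, not the paper's implementation; the helper names and mask shapes are assumptions.

```python
import numpy as np

def spatial_mask(num_frames, tokens_per_frame):
    """Spatial pattern (assumed shape): each token attends only to tokens
    in its own frame, i.e. a block-diagonal mask over the flattened sequence."""
    n = num_frames * tokens_per_frame
    frame_id = np.arange(n) // tokens_per_frame
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(num_frames, tokens_per_frame):
    """Temporal pattern (assumed shape): each token attends to the same
    spatial location across all frames, i.e. a strided mask."""
    n = num_frames * tokens_per_frame
    pos_id = np.arange(n) % tokens_per_frame
    return pos_id[:, None] == pos_id[None, :]

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the entries kept by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # masked entries get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With dense kernels this only zeroes weights; the speedup reported in the paper comes from skipping the masked blocks entirely in the attention kernel. Note that the spatial mask keeps a 1/num_frames fraction of entries and the temporal mask a 1/tokens_per_frame fraction, which is where the headroom for >2× acceleration lives.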
Problem

Research questions and friction points this paper is trying to address.

Accelerate video generation by reducing computational intensity
Exploit visual sparsity patterns to maintain video quality
Enable plug-and-play integration without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free sparse attention method
Adaptive Recognition Module (ARM)
Plug-and-play for 3D-attention models
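One way an ARM-style online selector could work, in principle, is to probe a cheap average-pooled attention map and pick, per head, the candidate mask that retains the most attention mass; pooling by a factor p makes the probe cost roughly 1/p² of a full pass, consistent with the ~0.2% overhead the abstract reports. The sketch below is a hypothetical illustration, not the paper's ARM: the pooling scheme, scoring rule, and all names are assumptions.

```python
import numpy as np

def coarse_attention(q, k, pool=4):
    """Probe: attention map from average-pooled queries/keys,
    costing ~1/pool^2 of a full attention pass."""
    qp = q[: (len(q) // pool) * pool].reshape(-1, pool, q.shape[-1]).mean(1)
    kp = k[: (len(k) // pool) * pool].reshape(-1, pool, k.shape[-1]).mean(1)
    s = qp @ kp.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return w / w.sum(-1, keepdims=True)

def select_pattern(q, k, masks, pool=4):
    """Pick the candidate mask (by name) that keeps the most attention
    mass on the pooled probe map for this head."""
    probe = coarse_attention(q, k, pool)
    n = probe.shape[0]
    best_name, best_score = None, -1.0
    for name, mask in masks.items():
        # Downsample the mask to probe resolution: a coarse cell is kept
        # if any fine cell inside it is kept.
        mp = mask[: n * pool, : n * pool].reshape(n, pool, n, pool).any((1, 3))
        score = probe[mp].sum() / n  # average retained mass per query row
        if score > best_score:
            best_name, best_score = name, score
    return best_name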
🔎 Similar Papers
No similar papers found.
A
Aiyue Chen
Huawei Technologies Co., Ltd
B
Bin Dong
Huawei Technologies Co., Ltd
J
Jingru Li
Huawei Technologies Co., Ltd
J
Jing Lin
Huawei Technologies Co., Ltd
Yiwu Yao
Yiwu Yao
Peking University
Artificial Intelligence
G
Gongyi Wang
Huawei Technologies Co., Ltd