ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

šŸ“… 2026-01-07
šŸ›ļø arXiv.org
šŸ“ˆ Citations: 1
✨ Influential: 0
šŸ¤– AI Summary
This work addresses the challenge of efficiently generating long-sequence videos with transformer-based video diffusion models, which are hindered by the quadratic computational complexity of standard attention mechanisms. To overcome this limitation, the authors propose a recurrent hybrid attention mechanism that integrates the high-fidelity modeling capacity of softmax attention with the computational efficiency of linear attention. This design enables chunk-wise recurrent modeling with constant memory consumption and facilitates efficient knowledge distillation from existing pretrained models. The proposed method reduces attention complexity from O(n²) to O(n) and achieves state-of-the-art video generation quality on VBench, VBench-2.0, and human evaluations, while significantly lowering training costs to approximately 160 GPU hours.
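The summary above describes the core idea: exact softmax attention within a chunk for fidelity, plus a recurrently updated linear-attention state that summarizes all previous chunks in constant memory. The paper's exact formulation is not reproduced here; the sketch below is a minimal illustration under assumed choices (a ReLU feature map `phi`, a causal intra-chunk mask, and a simple additive combine of the local and global branches).

```python
import numpy as np

def rehyat_style_attention(q, k, v, chunk: int):
    """Illustrative chunk-wise hybrid attention (not the paper's exact method).

    Within each chunk: exact softmax attention, cost O(chunk^2 * d).
    Across chunks: a running linear-attention state S = sum(phi(k)^T v) and
    normalizer z = sum(phi(k)), carried recurrently, so memory is constant in
    sequence length and total cost is O(n * chunk * d + n * d^2), i.e. O(n).
    """
    n, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # positive feature map (assumption)
    S = np.zeros((d, d))                        # recurrent key-value state
    z = np.zeros(d)                             # recurrent normalizer state
    out = np.empty_like(v)
    for start in range(0, n, chunk):
        qs, ks, vs = (a[start:start + chunk] for a in (q, k, v))
        # intra-chunk softmax attention with a causal mask
        scores = qs @ ks.T / np.sqrt(d)
        mask = np.tril(np.ones((len(qs), len(qs)), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        local = (w / w.sum(axis=-1, keepdims=True)) @ vs
        # inter-chunk linear attention against the recurrent state
        fq = phi(qs)
        global_num = fq @ S
        global_den = (fq @ z)[:, None] + 1e-6
        out[start:start + chunk] = local + global_num / global_den
        # fold this chunk's keys/values into the recurrent state
        fk = phi(ks)
        S += fk.T @ vs
        z += fk.sum(axis=0)
    return out
```

Because the state `(S, z)` has fixed size `(d, d)` and `(d,)`, arbitrarily long sequences can be processed chunk by chunk without the KV cache growing with sequence length.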

šŸ“ Abstract
Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.
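The quadratic-to-linear claim in the abstract can be made concrete with a back-of-envelope FLOP count. The numbers below are illustrative only (head dimension 64 and chunk size 256 are assumptions, not values from the paper): full softmax attention costs O(n²d), while chunk-wise hybrid attention costs O(n·c·d + n·d²), which is linear in sequence length n for fixed chunk size c and head dimension d.

```python
def full_softmax_flops(n: int, d: int) -> int:
    # QK^T plus attention-weights-times-V: both n x n x d matmuls
    return 2 * n * n * d

def hybrid_flops(n: int, d: int, c: int) -> int:
    intra = 2 * n * c * d   # exact softmax restricted to chunks of size c
    inter = 4 * n * d * d   # reads and updates of the d x d recurrent state
    return intra + inter

# Illustrative speedup: grows linearly with sequence length.
for n in (1_000, 10_000, 100_000):
    ratio = full_softmax_flops(n, 64) / hybrid_flops(n, 64, 256)
    print(f"n={n:>7}: ~{ratio:.1f}x fewer attention FLOPs")
```

Under these assumed constants the advantage scales as n/384, so the longer the video token sequence, the larger the saving, which is what makes long-duration and on-device generation tractable.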
Problem

Research questions and friction points this paper is trying to address.

video diffusion
transformer
attention complexity
scalability
long-duration video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Hybrid Attention
Video Diffusion Transformers
Linear Attention
Model Distillation
Scalable Video Generation