ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

šŸ“… 2026-01-07
šŸ›ļø arXiv.org
šŸ“ˆ Citations: 1
✨ Influential: 0
šŸ¤– AI Summary
This work addresses the challenge of efficiently generating long-sequence videos with transformer-based video diffusion models, which are hindered by the quadratic computational complexity of standard attention mechanisms. To overcome this limitation, the authors propose a recurrent hybrid attention mechanism that integrates the high-fidelity modeling capacity of softmax attention with the computational efficiency of linear attention. This design enables chunk-wise recurrent modeling with constant memory consumption and facilitates efficient knowledge distillation from existing pretrained models. The proposed method reduces attention complexity from O(n²) to O(n) and achieves state-of-the-art video generation quality on VBench, VBench-2.0, and human evaluations, while significantly lowering training costs to approximately 160 GPU hours.
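The summary above describes the core idea: exact softmax attention within a chunk for fidelity, plus a recurrently updated linear-attention state that summarizes all previous chunks in constant memory. The paper's exact formulation is not reproduced here; the sketch below is a minimal illustration under assumed choices (a ReLU feature map `phi`, a causal intra-chunk mask, and a simple additive combine of the local and global branches).

```python
import numpy as np

def rehyat_style_attention(q, k, v, chunk: int):
    """Illustrative chunk-wise hybrid attention (not the paper's exact method).

    Within each chunk: exact softmax attention, cost O(chunk^2 * d).
    Across chunks: a running linear-attention state S = sum(phi(k)^T v) and
    normalizer z = sum(phi(k)), carried recurrently, so memory is constant in
    sequence length and total cost is O(n * chunk * d + n * d^2), i.e. O(n).
    """
    n, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # positive feature map (assumption)
    S = np.zeros((d, d))                        # recurrent key-value state
    z = np.zeros(d)                             # recurrent normalizer state
    out = np.empty_like(v)
    for start in range(0, n, chunk):
        qs, ks, vs = (a[start:start + chunk] for a in (q, k, v))
        # intra-chunk softmax attention with a causal mask
        scores = qs @ ks.T / np.sqrt(d)
        mask = np.tril(np.ones((len(qs), len(qs)), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        local = (w / w.sum(axis=-1, keepdims=True)) @ vs
        # inter-chunk linear attention against the recurrent state
        fq = phi(qs)
        global_num = fq @ S
        global_den = (fq @ z)[:, None] + 1e-6
        out[start:start + chunk] = local + global_num / global_den
        # fold this chunk's keys/values into the recurrent state
        fk = phi(ks)
        S += fk.T @ vs
        z += fk.sum(axis=0)
    return out
```

Because the state `(S, z)` has fixed size `(d, d)` and `(d,)`, arbitrarily long sequences can be processed chunk by chunk without the KV cache growing with sequence length.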

šŸ“ Abstract
Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.
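The quadratic-to-linear claim in the abstract can be made concrete with a back-of-envelope FLOP count. The numbers below are illustrative only (head dimension 64 and chunk size 256 are assumptions, not values from the paper): full softmax attention costs O(n²d), while chunk-wise hybrid attention costs O(n·c·d + n·d²), which is linear in sequence length n for fixed chunk size c and head dimension d.

```python
def full_softmax_flops(n: int, d: int) -> int:
    # QK^T plus attention-weights-times-V: both n x n x d matmuls
    return 2 * n * n * d

def hybrid_flops(n: int, d: int, c: int) -> int:
    intra = 2 * n * c * d   # exact softmax restricted to chunks of size c
    inter = 4 * n * d * d   # reads and updates of the d x d recurrent state
    return intra + inter

# Illustrative speedup: grows linearly with sequence length.
for n in (1_000, 10_000, 100_000):
    ratio = full_softmax_flops(n, 64) / hybrid_flops(n, 64, 256)
    print(f"n={n:>7}: ~{ratio:.1f}x fewer attention FLOPs")
```

Under these assumed constants the advantage scales as n/384, so the longer the video token sequence, the larger the saving, which is what makes long-duration and on-device generation tractable.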
Problem

Research questions and friction points this paper is trying to address.

video diffusion
transformer
attention complexity
scalability
long-duration video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Hybrid Attention
Video Diffusion Transformers
Linear Attention
Model Distillation
Scalable Video Generation