🤖 AI Summary
In long-video generation, full attention is computationally prohibitive due to its quadratic complexity in sequence length; existing sparse attention methods rely on coarse block-level approximations, compromising both accuracy and efficiency. To address this, we propose Mixture-of-Groups Attention (MoGA), a semantic-aware, fine-grained grouping mechanism enabled by a lightweight, learnable token router that eliminates the need for block-wise approximations while precisely identifying critical token pairs. MoGA is fully compatible with modern acceleration techniques, including FlashAttention and sequence parallelism. Experiments demonstrate that MoGA enables end-to-end generation of multi-shot videos up to one minute long, at 480p resolution and 24 fps, with context lengths reaching 580k tokens. It consistently outperforms state-of-the-art sparse attention baselines across multiple video generation benchmarks, effectively breaking the traditional accuracy–efficiency trade-off.
📝 Abstract
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-long, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
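The core idea above — a learned router assigns each token to a group, and attention is computed only among tokens sharing a group, with no blockwise score estimation — can be sketched in a few lines. This is a minimal, dependency-free illustration, not the paper's implementation: the router weights, argmax group assignment, and single-head shapes here are simplifying assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moga_attention(q, k, v, router_w, num_groups):
    """Sketch of grouped sparse attention in the spirit of MoGA.

    q, k, v: n x d lists of token vectors (single head, for clarity).
    router_w: d x num_groups weights of a hypothetical lightweight router.
    Each token is routed to one group (argmax over router logits), and
    softmax attention runs only over keys in the query's group, so cost
    scales with group size rather than full sequence length.
    """
    n, d = len(q), len(q[0])

    def group_of(x):
        logits = [sum(x[i] * router_w[i][g] for i in range(d))
                  for g in range(num_groups)]
        return max(range(num_groups), key=lambda g: logits[g])

    groups = [group_of(t) for t in q]
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        # Sparse pattern: only same-group key/value positions participate.
        idx = [j for j in range(n) if groups[j] == groups[i]]
        scores = [sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
                  for j in idx]
        weights = softmax(scores)
        for w, j in zip(weights, idx):
            for c in range(d):
                out[i][c] += w * v[j][c]
    return out, groups
```

In practice, tokens in the same group would be gathered into contiguous buffers so each group's attention can be dispatched to a dense kernel such as FlashAttention, which is what makes the method kernel-free.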