AI Summary
To address the high computational complexity of self-attention, weak temporal coherence, and limited scalability of diffusion transformers (DiTs) for long-video generation, this paper proposes LoViC, a DiT-based framework trained on million-scale open-domain videos. Methodologically, LoViC centers on FlexFormer, a Q-Former-based autoencoder that jointly compresses video and text into a unified latent representation; its single-query-token design supports variable-length inputs with linearly adjustable compression ratios. By encoding temporal context through position-aware mechanisms, the model handles prediction, retrodiction, interpolation, and multi-shot generation within a single paradigm, and a segment-wise generation scheme enables long, coherent videos. Experiments across these tasks demonstrate improvements in generation efficiency and temporal consistency, with strong generalization across diverse video domains.
Abstract
Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single-query-token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
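The core idea behind the variable-length, linearly adjustable compression can be sketched as follows: a Q-Former-style module cross-attends from a set of query tokens to the input sequence, and the number of queries is chosen as the input length divided by the target ratio, so the latent size scales linearly with the input. This is a minimal illustrative sketch in NumPy; all names, dimensions, and the single-layer attention are assumptions, not the paper's implementation.

```python
# Illustrative sketch of Q-Former-style compression with a linearly
# adjustable ratio (hypothetical simplification, not the paper's code).
import numpy as np

def qformer_compress(tokens: np.ndarray, ratio: int, rng=None) -> np.ndarray:
    """Compress a (seq_len, d) token sequence into ceil(seq_len / ratio)
    latents via one layer of cross-attention from query tokens."""
    rng = rng or np.random.default_rng(0)
    seq_len, d = tokens.shape
    n_queries = -(-seq_len // ratio)            # ceil division: latent count is linear in input length
    queries = rng.standard_normal((n_queries, d))  # stand-in for learned query tokens
    scores = queries @ tokens.T / np.sqrt(d)       # (n_queries, seq_len) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over input tokens
    return weights @ tokens                         # (n_queries, d) compressed latents

# Variable-length inputs map to proportionally sized latents:
short = qformer_compress(np.ones((64, 32)), ratio=8)    # shape (8, 32)
longer = qformer_compress(np.ones((256, 32)), ratio=8)  # shape (32, 32)
```

Because the query count, not the input length, fixes the latent size, downstream attention over these latents stays cheap even as context grows.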