🤖 AI Summary
To address the high computational overhead and poor scalability of large-scale attention models in streaming settings, this paper proposes the first importance sampling framework for attention tailored to streaming scenarios. Drawing inspiration from ℓ₂ sampling, it formulates attention as a tensor-product stream and designs an efficient data structure under the turnstile streaming model, enabling sublinear space usage and near-real-time updates. Theoretically, the method achieves O(1/ε²) space complexity and O(log n) per-update time—significantly outperforming full attention. Empirically, the framework demonstrates strong generalization and scalability across diverse architectures—including Transformer and Longformer—and tasks spanning text and time-series domains. This work establishes a novel paradigm for efficient streaming inference in large language models.
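The core idea, sampling a few keys by importance rather than attending to all of them, can be illustrated with a minimal sketch. This is not the paper's data structure: it is a generic importance-sampled attention estimator in which keys are drawn in proportion to their softmax attention weights, giving an unbiased estimate of the exact attention output from only `m` sampled value rows. The function name and interface are illustrative.

```python
import numpy as np

def sampled_attention(q, K, V, m, rng=None):
    """Estimate the attention output for query q by importance sampling.

    Illustrative sketch (not the paper's method): draw m key indices
    with probability equal to their softmax attention weight, then
    average the sampled value rows. Since E[V[idx]] equals the exact
    weighted sum of value rows, the sample mean is unbiased.
    """
    rng = np.random.default_rng(rng)
    scores = K @ q                       # raw attention scores, shape (n,)
    w = np.exp(scores - scores.max())    # numerically stable softmax
    p = w / w.sum()                      # attention distribution over keys
    idx = rng.choice(len(p), size=m, p=p)
    return V[idx].mean(axis=0)           # unbiased estimate of p @ V
```

With `m` much smaller than the number of keys `n`, only `m` value rows are touched per query, which is the kind of saving a streaming attention sampler targets.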
📝 Abstract
This paper addresses the computational challenges of large-scale attention-based models in artificial intelligence by applying importance sampling methods in the streaming setting. Inspired by the classical definition of the $\ell_2$ sampler and by recent progress on attention schemes in Large Language Models (LLMs), we propose an attention sampler. Our approach significantly reduces the computational burden of traditional attention mechanisms. We analyze the effectiveness of the attention sampler from a theoretical perspective, including its space and update-time costs. Additionally, our framework exhibits scalability and broad applicability across various model architectures and domains.
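For reference, the classical $\ell_2$ sampler the abstract builds on can be stated concretely: a turnstile stream delivers updates $(i, \Delta)$ to a vector $x$, and the sampler outputs index $i$ with probability $x_i^2 / \lVert x \rVert_2^2$. The sketch below is a deliberately offline version that materializes $x$ to make the target distribution explicit; actual streaming $\ell_2$ samplers achieve the same distribution in polylogarithmic space using sketching, which the offline code does not attempt.

```python
import numpy as np

def l2_sample(updates, n, m=1, rng=None):
    """Offline reference implementation of an l2 sampler.

    A turnstile stream is a sequence of updates (i, delta), where delta
    may be negative (deletions are allowed). An l2 sampler returns
    index i with probability x_i^2 / ||x||_2^2. This version stores x
    explicitly for clarity; streaming samplers avoid that via sketches.
    """
    rng = np.random.default_rng(rng)
    x = np.zeros(n)
    for i, delta in updates:         # replay the turnstile stream
        x[i] += delta
    p = x**2 / np.sum(x**2)          # l2 importance distribution
    return rng.choice(n, size=m, p=p)
```

Coordinates with larger magnitude are sampled more often, and zero coordinates are never returned, which is why $\ell_2$ sampling concentrates work on the heavy entries of the stream.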