🤖 AI Summary
This work addresses the engineering complexity that slows the exploration of new sparse attention algorithms in large language model (LLM) serving, for both human researchers and AI agents. To overcome this, we present Vortex, a system that expresses a broad range of sparse attention algorithms through a Python-embedded frontend language built on a page-centric tensor abstraction, paired with an efficient backend that is tightly integrated into modern LLM serving stacks. This enables rapid prototyping, deployment, and evaluation. AI agents use Vortex to automatically generate and refine sparse attention algorithms, the best of which reaches up to 3.46× higher throughput than full attention while preserving accuracy. Vortex also extends sparse attention to emerging architectures and very large models, reaching up to 4.7× higher throughput on the MLA-based GLM-4.7-Flash and 1.37× on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
📝 Abstract
Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, which slows both human researchers and AI agents in exploring the sparse attention design space. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language, built on a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best of which reaches up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
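The abstract does not show Vortex's frontend language, so the sketch below is not its API. It is a minimal, pure-Python illustration of the general idea behind page-level sparse attention: the KV cache is split into fixed-size pages, each page gets a cheap relevance score, and attention runs only over the tokens in the highest-scoring pages. The function names (`attend`, `page_sparse_attend`) and the mean-key scoring rule are assumptions chosen for illustration, not taken from the paper.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over the given key/value rows."""
    d = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def page_sparse_attend(q, keys, values, page_size, top_k):
    """Attend only to the tokens in the top_k highest-scoring pages."""
    # Split the KV cache into fixed-size pages of token indices.
    pages = [range(s, min(s + page_size, len(keys)))
             for s in range(0, len(keys), page_size)]

    def page_score(page):
        # Hypothetical selection rule: query dot the page's mean key.
        mean_key = [sum(keys[t][j] for t in page) / len(page)
                    for j in range(len(q))]
        return sum(qi * mj for qi, mj in zip(q, mean_key))

    # Keep the highest-scoring pages, then restore positional order.
    kept = sorted(sorted(pages, key=page_score, reverse=True)[:top_k],
                  key=lambda p: p.start)
    token_ids = [t for page in kept for t in page]
    return attend(q, [keys[t] for t in token_ids],
                  [values[t] for t in token_ids])
```

With `top_k` equal to the number of pages, this reduces to full attention. With a smaller `top_k`, the attention cost scales with `top_k * page_size` rather than with the full sequence length, which is the theoretical saving that a serving system then has to turn into actual throughput.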