TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

📅 2025-07-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the quadratic (O(n²)) cost of attention during the prefilling phase of large language models (LLMs), the accuracy degradation caused by static sparsity, and the overhead introduced by dynamic sparsity, this paper proposes TriangleMix, a training-free, hierarchical static sparse attention mechanism. It retains full attention in shallow layers to preserve representational capacity, while switching to a carefully designed triangle-shaped sparse pattern in deeper layers, achieving efficient computation without accuracy loss. Crucially, TriangleMix requires no runtime sparse-index estimation, adds no inference latency overhead, and is orthogonal to dynamic sparsity methods, so the two can be combined for further speedup. Experiments on long sequences (32K–128K) show that TriangleMix reduces deep-layer attention computation by 3.7×–15.3× and cuts Time-to-First-Token (TTFT) by 12%–32%, while maintaining model accuracy.

๐Ÿ“ Abstract
Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.
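The layered scheme described above (dense attention in shallow layers, a static triangle-shaped sparse pattern in deep layers) can be sketched as a boolean attention mask. The exact triangle pattern is not spelled out in this summary, so the sketch below is an assumption: it keeps a few initial "sink" key columns, a sliding local window, and dense rows for the last `tail` queries, which together retain a roughly triangular region of the causal attention matrix. The `sink`, `window`, and `tail` parameters are hypothetical knobs, not the paper's parameterization.

```python
import numpy as np

def triangle_mask(n, sink=4, window=8, tail=8):
    """Illustrative deep-layer static mask in the spirit of TriangleMix.

    Keeps (per query row): the first `sink` key columns, a local window of
    the most recent `window` keys, and fully dense rows for the last `tail`
    queries. All entries still respect causality.
    """
    q = np.arange(n)[:, None]  # query positions (rows)
    k = np.arange(n)[None, :]  # key positions (columns)
    causal = k <= q            # standard causal constraint
    keep = (k < sink) | (q - k < window) | (q >= n - tail)
    return causal & keep

n = 128
mask = triangle_mask(n)
# Fraction of causal attention entries retained by the static pattern;
# 1 - density is the compute skipped in a deep layer under this sketch.
density = mask.sum() / (n * (n + 1) // 2)
```

A shallow layer would simply use the full causal mask (`k <= q`); the "training-free" property comes from the pattern being fixed ahead of time, so no sparse indices need to be estimated at runtime.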
Problem

Research questions and friction points this paper is trying to address.

Quadratic attention overhead in LLM prefilling
Accuracy-efficiency trade-off of static sparse attention
Runtime overhead of dynamic sparse-index estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

TriangleMix: a training-free static attention pattern
Dense attention in shallow layers, triangle-shaped sparsity in deep layers
Reduces deep-layer attention overhead by 3.7×–15.3×