🤖 AI Summary
Vision-Language Large Models (VLLMs) face dual efficiency bottlenecks when processing high-resolution inputs: the quadratic computational complexity of attention and the unbounded growth of the key-value (KV) cache. Existing KV compression methods rely on attention scores, are incompatible with efficient attention kernels (e.g., FlashAttention), and fail to adapt to the changes that sparse attention induces in the information structure of the KV cache. This paper proposes PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. First, it introduces a cross-layer importance estimation method compatible with FlashAttention and other efficient attention implementations: attention scores from lower layers drive the importance assessment and proactive pruning of higher layers' KV cache. Second, it designs a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored to video, which jointly suppresses spatial noise and temporal redundancy while also accelerating prefilling. Evaluated on VideoLLaMA2 and Qwen2.5-VL, PureKV achieves up to 5.0× KV cache compression and 3.16× prefill speedup with negligible performance degradation.
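The cross-layer idea above can be sketched in a few lines: compute an explicit attention matrix once at a cheap low layer, aggregate it into per-token importance scores, and use those scores to prune the KV cache of higher layers (which can then run FlashAttention-style kernels that never materialize attention matrices). This is a minimal illustrative sketch, not the paper's implementation; the aggregation rule (mean over heads, sum over queries) and the `keep_ratio` value are assumptions.

```python
import numpy as np

def low_layer_importance(q_low, k_low):
    """Per-token importance from an explicitly computed attention
    matrix at ONE low layer. q_low, k_low: (heads, seq, dim)."""
    scale = 1.0 / np.sqrt(q_low.shape[-1])
    logits = np.einsum("hqd,hkd->hqk", q_low, k_low) * scale
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Importance of each key token = attention mass it receives,
    # averaged over heads and summed over query positions (an assumption).
    return attn.mean(axis=0).sum(axis=0)  # shape (seq,)

def prune_kv(k_high, v_high, importance, keep_ratio=0.25):
    """Proactively prune a HIGHER layer's KV cache with low-layer scores."""
    n_keep = max(1, int(k_high.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # top-k, order preserved
    return k_high[:, keep], v_high[:, keep]

rng = np.random.default_rng(0)
h, s, d = 4, 64, 32
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
imp = low_layer_importance(q, k)
k_p, v_p = prune_kv(k, v, imp)
print(k_p.shape)  # (4, 16, 32)
```

With `keep_ratio=0.25` the cache shrinks 4×; the higher layer attends only over the retained keys and values, so no attention matrix is ever needed there.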
📝 Abstract
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention and autoregressive generation, together with the constantly growing key-value (KV) cache, severely hinder both the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and sparse attention, which do not explicitly materialize attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators: it uses lower-layer attention scores to estimate the importance of higher layers' KV cache, enabling active pruning without compromising accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored for video KV cache compression. By combining spatial and temporal attention sparsity, ST-SpAttn purifies spatial noise and temporal redundancy in the KV cache, improving the efficiency of downstream compression algorithms while also accelerating the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0× KV cache compression and 3.16× prefill acceleration with negligible quality degradation.
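To make the spatial-temporal sparsity idea concrete, one plausible mask construction for video tokens indexed by (frame, row, column) lets each token attend within a small spatial window of its own frame (limiting spatial noise) and to the same spatial position in nearby frames (limiting temporal redundancy). The window sizes and the exact mask rule here are illustrative assumptions, not the paper's ST-SpAttn design.

```python
import numpy as np

def st_sparse_mask(T, H, W, spatial_win=1, temporal_win=1):
    """Boolean attention mask over T*H*W video tokens.
    Token i may attend to token j iff:
      - same frame, within a (2*spatial_win+1)^2 spatial window, OR
      - same spatial position, within temporal_win neighboring frames.
    Window sizes are hypothetical, chosen for illustration."""
    t, y, x = np.unravel_index(np.arange(T * H * W), (T, H, W))
    same_frame = t[:, None] == t[None, :]
    near_space = (np.abs(y[:, None] - y[None, :]) <= spatial_win) & \
                 (np.abs(x[:, None] - x[None, :]) <= spatial_win)
    same_pos = (y[:, None] == y[None, :]) & (x[:, None] == x[None, :])
    near_time = np.abs(t[:, None] - t[None, :]) <= temporal_win
    return (same_frame & near_space) | (same_pos & near_time)

mask = st_sparse_mask(T=4, H=6, W=6)
print(mask.shape, round(float(mask.mean()), 3))
```

Because most entries of the mask are False, attention cost and the surviving KV structure both shrink; the mask's density reported above is the fraction of query-key pairs actually computed.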