🤖 AI Summary
Vision-Language Large Models (VLLMs) face dual efficiency bottlenecks when processing high-resolution inputs: the quadratic computational complexity of attention and the unbounded growth of the key-value (KV) cache. Existing KV compression methods rely on attention scores, are incompatible with efficient attention kernels (e.g., FlashAttention), and fail to adapt to the changes that sparse attention induces in the information structure of the KV cache. This paper proposes PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. First, it introduces a cross-layer importance estimation method compatible with FlashAttention and other efficient attention implementations: attention scores from lower layers drive the importance assessment and proactive pruning of higher layers' KV cache. Second, it designs a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored to video, which jointly suppresses spatial noise and temporal redundancy while also accelerating prefilling. Evaluated on VideoLLaMA2 and Qwen2.5-VL, PureKV achieves up to 5.0× KV cache compression and 3.16× prefill speedup with negligible performance degradation.
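The cross-layer idea above can be sketched in a few lines: compute an explicit attention matrix once at a cheap low layer, aggregate it into per-token importance scores, and use those scores to prune the KV cache of higher layers (which can then run FlashAttention-style kernels that never materialize attention matrices). This is a minimal illustrative sketch, not the paper's implementation; the aggregation rule (mean over heads, sum over queries) and the `keep_ratio` value are assumptions.

```python
import numpy as np

def low_layer_importance(q_low, k_low):
    """Per-token importance from an explicitly computed attention
    matrix at ONE low layer. q_low, k_low: (heads, seq, dim)."""
    scale = 1.0 / np.sqrt(q_low.shape[-1])
    logits = np.einsum("hqd,hkd->hqk", q_low, k_low) * scale
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Importance of each key token = attention mass it receives,
    # averaged over heads and summed over query positions (an assumption).
    return attn.mean(axis=0).sum(axis=0)  # shape (seq,)

def prune_kv(k_high, v_high, importance, keep_ratio=0.25):
    """Proactively prune a HIGHER layer's KV cache with low-layer scores."""
    n_keep = max(1, int(k_high.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # top-k, order preserved
    return k_high[:, keep], v_high[:, keep]

rng = np.random.default_rng(0)
h, s, d = 4, 64, 32
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
imp = low_layer_importance(q, k)
k_p, v_p = prune_kv(k, v, imp)
print(k_p.shape)  # (4, 16, 32)
```

With `keep_ratio=0.25` the cache shrinks 4×; the higher layer attends only over the retained keys and values, so no attention matrix is ever needed there.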
📝 Abstract
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention and autoregressive generation, together with the constantly growing key-value (KV) cache, severely hinder both the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and sparse attention, which do not explicitly materialize attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators: it uses lower-layer attention scores to estimate the importance of higher layers' KV cache, enabling active pruning without compromising accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored for video KV cache compression. By combining spatial and temporal attention sparsity, ST-SpAttn purifies spatial noise and temporal redundancy in the KV cache, improving the efficiency of downstream compression algorithms while also accelerating the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0× KV cache compression and 3.16× prefill acceleration with negligible quality degradation.
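To make the spatial-temporal sparsity idea concrete, one plausible mask construction for video tokens indexed by (frame, row, column) lets each token attend within a small spatial window of its own frame (limiting spatial noise) and to the same spatial position in nearby frames (limiting temporal redundancy). The window sizes and the exact mask rule here are illustrative assumptions, not the paper's ST-SpAttn design.

```python
import numpy as np

def st_sparse_mask(T, H, W, spatial_win=1, temporal_win=1):
    """Boolean attention mask over T*H*W video tokens.
    Token i may attend to token j iff:
      - same frame, within a (2*spatial_win+1)^2 spatial window, OR
      - same spatial position, within temporal_win neighboring frames.
    Window sizes are hypothetical, chosen for illustration."""
    t, y, x = np.unravel_index(np.arange(T * H * W), (T, H, W))
    same_frame = t[:, None] == t[None, :]
    near_space = (np.abs(y[:, None] - y[None, :]) <= spatial_win) & \
                 (np.abs(x[:, None] - x[None, :]) <= spatial_win)
    same_pos = (y[:, None] == y[None, :]) & (x[:, None] == x[None, :])
    near_time = np.abs(t[:, None] - t[None, :]) <= temporal_win
    return (same_frame & near_space) | (same_pos & near_time)

mask = st_sparse_mask(T=4, H=6, W=6)
print(mask.shape, round(float(mask.mean()), 3))
```

Because most entries of the mask are False, attention cost and the surviving KV structure both shrink; the mask's density reported above is the fraction of query-key pairs actually computed.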