PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Long sequences in large vision-language models (LVLMs) produce large key-value (KV) caches, incurring substantial computational and memory overhead; existing compression methods ignore inter-layer differences in KV importance and apply uniform truncation to every layer, risking critical context loss and performance degradation. Method: We propose an adaptive prefix KV caching mechanism that (i) reframes per-layer KV cache sizing as a search for the optimal global prefix configuration, solved with binary search to enable layer-aware, differential KV retention; (ii) integrates dynamic prefix scheduling with efficient cache reuse across layers and tokens. Contribution/Results: Evaluated on multiple vision-language generation benchmarks, our method achieves state-of-the-art (SOTA) performance, matching or exceeding baseline generation quality while accelerating inference by up to 2.1×. It effectively balances efficiency and fidelity without compromising contextual integrity.

📝 Abstract
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers as the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with existing approaches. It exhibits a superior trade-off between inference efficiency and generation quality, showing promising potential for practical applications. Code is available at https://github.com/THU-MIG/PrefixKV.
Problem

Research questions and friction points this paper is trying to address.

Reduce KV cache overhead in vision-language models
Optimize layer-wise KV retention for minimal information loss
Improve inference efficiency without sacrificing generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive layer-wise KV retention recipe
Binary search for optimal prefix configuration
Preserves maximum contextual information per layer
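The retention recipe above can be sketched as follows. This is a hypothetical illustration, not the paper's exact algorithm: it assumes each layer has per-token importance scores sorted in descending order, and binary-searches a single global coverage threshold `tau` so that each layer keeps the shortest score prefix reaching `tau` of its total importance while the summed cache size stays within a budget. The function names and scoring scheme are the author's assumptions for this sketch.

```python
# Illustrative sketch of layer-adaptive KV retention via binary search.
# Assumption: layer_scores[l] holds importance scores for layer l's cached
# tokens, sorted in descending order. Not the paper's exact recipe.
from typing import List


def prefix_lengths(layer_scores: List[List[float]], tau: float) -> List[int]:
    """Per layer: length of the shortest prefix whose cumulative score
    reaches tau * (layer's total score)."""
    lengths = []
    for scores in layer_scores:
        total = sum(scores)
        acc, k = 0.0, 0
        for s in scores:
            if acc >= tau * total:
                break
            acc += s
            k += 1
        lengths.append(k)
    return lengths


def search_config(layer_scores: List[List[float]], budget: int,
                  iters: int = 30) -> List[int]:
    """Binary-search tau in [0, 1] for the largest coverage such that the
    total retained KV entries fit within `budget`. Layers with skewed
    importance end up with smaller caches than layers with flat importance."""
    lo, hi = 0.0, 1.0
    best = prefix_lengths(layer_scores, 0.0)
    for _ in range(iters):
        mid = (lo + hi) / 2
        lengths = prefix_lengths(layer_scores, mid)
        if sum(lengths) <= budget:
            best, lo = lengths, mid  # feasible: try to keep more context
        else:
            hi = mid  # over budget: lower the coverage target
    return best
```

With a skewed layer (e.g. scores `[0.5, 0.3, 0.1, 0.05, 0.05]`) and a flat layer (`[0.25, 0.25, 0.25, 0.15, 0.1]`), the search assigns the skewed layer a shorter prefix than the flat one, which is the layer-differential behavior the paper motivates.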