TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the high memory overhead of vision-language model (VLM) inference, where the KV cache grows linearly with context length. Existing KV eviction methods are mostly designed for language models and overlook the gap between the text and vision modalities, so they often degrade VLM performance. To address this modality gap, the paper proposes TGV-KV, a text-grounded KV eviction method with three submodules: Text-Vision Budgeting (TVB), which assigns a budget to each layer based on mutual information; Text-Weighted Ranking (TWR), which ranks the importance of vision tokens by weighted text-image attention; and Text-Prioritised Retention (TPR), which preserves text KV to avoid acute information loss. With LLaVA-NeXT on the VizWiz-VQA task, TGV-KV preserves 99.2% of full-KV accuracy and improves end-to-end throughput by 52.6% at an extreme retention budget of 5%. Experiments across five models of different sizes and architectures show consistent effectiveness.
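To make the linear growth concrete, here is a back-of-the-envelope sizing sketch. The model dimensions below are hypothetical (a generic 32-layer configuration in 16-bit precision) and are not taken from the paper; only the 5% retention budget comes from the text above.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    The result is linear in seq_len.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, for illustration only (not from the paper).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

context_len = 4096
full = kv_cache_bytes(context_len, **config)

# A 5% retention budget keeps roughly 5% of the cached positions.
kept_len = int(context_len * 0.05)
kept = kv_cache_bytes(kept_len, **config)

print(f"full cache: {full / 2**30:.1f} GiB")   # full cache: 2.0 GiB
print(f"5% budget:  {kept / 2**20:.1f} MiB")   # 5% budget:  102.0 MiB
```

Doubling the context length doubles the full-cache figure, which is why image-heavy prompts with thousands of visual tokens are the natural target for eviction.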

📝 Abstract

Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV). TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at https://github.com/Danielement321/TGV-KV.
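The three submodules can be illustrated with a toy sketch in plain Python. This is not the authors' implementation (see the linked repository for that); it is a minimal reading of the abstract under stated assumptions. All function names are hypothetical, the per-layer scores stand in for whatever mutual-information quantity the paper uses, and the text weights are supplied by hand rather than derived as in TWR.

```python
def allocate_layer_budgets(layer_scores, total_budget):
    """Split total_budget across layers in proportion to layer_scores.

    Assumption: layer_scores is a stand-in for the paper's mutual
    information signal. Uses largest-remainder rounding so the
    budgets sum exactly to total_budget.
    """
    total = sum(layer_scores)
    raw = [total_budget * s / total for s in layer_scores]
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


def text_weighted_scores(text_to_vision_attn, text_weights):
    """Score each vision token as a weighted sum of text-to-image attention.

    text_to_vision_attn[i][j] is the attention from text query i to vision token j.
    text_weights[i] is the (assumed, hand-set) priority of text query i.
    """
    scores = [0.0] * len(text_to_vision_attn[0])
    for weight, row in zip(text_weights, text_to_vision_attn):
        for j, attn in enumerate(row):
            scores[j] += weight * attn
    return scores


def select_kept_positions(modalities, vision_scores, vision_budget):
    """Keep every text position plus the top-scoring vision positions."""
    vision_positions = [p for p, m in enumerate(modalities) if m == "vision"]
    ranked = sorted(range(len(vision_positions)), key=lambda j: vision_scores[j], reverse=True)
    kept_vision = {vision_positions[j] for j in ranked[:vision_budget]}
    return [p for p, m in enumerate(modalities) if m == "text" or p in kept_vision]


# Toy cache layout: one text token, four image tokens, two question tokens.
modalities = ["text", "vision", "vision", "vision", "vision", "text", "text"]

# Attention from the two question tokens (positions 5 and 6) to the four image tokens.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.1, 0.3],
]
weights = [0.75, 0.25]  # hypothetical text priorities

scores = text_weighted_scores(attn, weights)
kept = select_kept_positions(modalities, scores, vision_budget=2)
print(kept)  # [0, 1, 2, 5, 6]
```

In this toy run, the two image tokens with the highest text-weighted scores survive and all text positions are retained, which mirrors the text-prioritised policy described above. A real implementation would apply the selection per layer, with `allocate_layer_budgets` deciding how many entries each layer may keep.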

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

KV cache eviction

modality gap

memory efficiency

visual redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache eviction

vision-language models

text-grounded attention