Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck in long video understanding caused by the linear growth of KV cache with sequence length in vision-language models, as well as the computational inefficiency of existing methods that discard redundant tokens only after computing full attention. To overcome these limitations, the authors propose Sali-Cache, a novel framework that integrates spatiotemporal dual-signal priors—optical flow analysis and visual saliency detection—into KV cache management. This enables adaptive cache compression and pre-selection of critical tokens prior to attention computation. Implemented within the LLaVA-1.6 architecture, Sali-Cache achieves a 2.20× effective memory compression while preserving performance on BLEU, ROUGE-L, and Exact Match metrics, thereby enabling longer video inputs and significantly improving inference efficiency on consumer-grade hardware.
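The core idea above, pre-selecting tokens with two cheap signals before any attention is computed, can be illustrated with a toy sketch. This is purely schematic: the function name, score inputs, and thresholds are assumptions for illustration, not details from the paper.

```python
def select_tokens(flow_mags, saliency, flow_thresh=0.1, sal_thresh=0.5):
    """Keep a visual token if its region changed between frames
    (temporal signal, optical-flow magnitude) or is visually salient
    (spatial signal); dropped tokens never enter the KV cache."""
    return [
        i
        for i, (f, s) in enumerate(zip(flow_mags, saliency))
        if f >= flow_thresh or s >= sal_thresh
    ]

# Toy scores for 8 visual tokens; static background tokens score low on both.
flow = [0.0, 0.02, 0.3, 0.0, 0.25, 0.01, 0.0, 0.4]
sal  = [0.1, 0.7,  0.2, 0.1, 0.6,  0.1,  0.2, 0.9]
kept = select_tokens(flow, sal)          # [1, 2, 4, 7]
ratio = len(flow) / len(kept)            # 2.0x compression on this toy input
```

Because selection happens before attention, the dropped tokens cost neither attention FLOPs nor KV memory, which is what distinguishes this from reactive post-attention eviction.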

📝 Abstract
Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
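To see why a 2.20× reduction in effective KV memory matters on consumer-grade hardware, the cache footprint can be estimated with standard transformer bookkeeping. The model dimensions below are illustrative 7B-class values, not the paper's configuration:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_el=2):
    # K and V each store n_layers * n_heads * head_dim values per token;
    # bytes_per_el=2 assumes fp16 storage.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_el

# Illustrative settings: 32 layers, 32 heads, head dim 128, 32k-token video.
full = kv_cache_bytes(seq_len=32_000, n_layers=32, n_heads=32, head_dim=128)
compressed = full / 2.20  # the paper's reported compression ratio

print(f"full: {full / 2**30:.1f} GiB, compressed: {compressed / 2**30:.1f} GiB")
# full: 15.6 GiB, compressed: 7.1 GiB
```

Equivalently, under a fixed memory budget the same arithmetic lets roughly 2.20× more video tokens stay resident, which is the "longer temporal durations" claim in the abstract.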
Problem

Research questions and friction points this paper is trying to address.

KV-Cache
memory bottleneck
long-form video
Vision-Language Models
sequence length
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-Cache Optimization
Dual-Signal Adaptive Caching
Optical Flow Analysis
Saliency Detection
Long-Form Video Understanding
Vishnu Sai
International Institute of Information Technology, Hyderabad, India
Dheeraj Sai
International Institute of Information Technology, Hyderabad, India
Srinath B
International Institute of Information Technology, Hyderabad, India
Girish Varma
IIIT, Hyderabad
Computer Vision · Machine Learning · Complexity Theory · Algorithms · Combinatorics
Priyesh Shukla
International Institute of Information Technology (IIIT), Hyderabad
Sustainable Computing System · Artificial Intelligence · Robotics · Healthcare · Quantum Systems