Linear Scaling Video VLMs for Long Video Understanding

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the quadratic computational complexity of existing video vision-language models caused by spatiotemporal self-attention, which hinders efficient long-video processing. The authors propose StateKV, a method that introduces a fixed-capacity, importance-aware recurrent state during inference to propagate cross-frame context, combined with per-frame full caching for decoding. This enables linear-time video prefilling without requiring model fine-tuning or architectural modifications. StateKV achieves linear scaling in long-video understanding while preserving accuracy close to that of full self-attention, substantially outperforming streaming approximations such as sliding windows. Experiments across three long-video benchmarks and seven model variants demonstrate that StateKV significantly reduces FLOPs, allowing deployment of larger models under the same computational budget and yielding higher accuracy.
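To make the quadratic-versus-linear contrast concrete, the following rough sketch counts query-key pairs scored during video prefill. It is an illustrative cost model, not the paper's FLOP accounting: the function names, the assumption of a constant number of tokens per frame, and the assumption that each frame attends only to itself plus a state of `state_capacity` tokens are all hypothetical simplifications.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every video token attends to every video token."""
    n = num_frames * tokens_per_frame
    return n * n


def state_attention_pairs(num_frames: int, tokens_per_frame: int,
                          state_capacity: int) -> int:
    """Query-key pairs when each frame attends to itself plus a fixed-size state.

    Assumption (illustrative only): the state holds at most `state_capacity`
    tokens, so the per-frame cost does not depend on how many frames came before.
    """
    per_frame = tokens_per_frame * (tokens_per_frame + state_capacity)
    return num_frames * per_frame


if __name__ == "__main__":
    for frames in (64, 128, 256):
        full = full_attention_pairs(frames, tokens_per_frame=196)
        stateful = state_attention_pairs(frames, tokens_per_frame=196,
                                         state_capacity=1024)
        print(f"{frames:4d} frames: full={full:.3e}  fixed-state={stateful:.3e}")
```

Under these assumptions, doubling the number of frames quadruples the full-attention count but only doubles the fixed-state count.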

📝 Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
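The general idea of a fixed-capacity, importance-based recurrent state paired with a separate full per-frame cache can be sketched as a toy single-head example. This is not the authors' implementation: the abstract does not specify how importance is scored, so the sketch assumes (hypothetically) that a token's importance is the total attention mass it receives from the current frame's queries, and it reuses raw token features as queries, keys, and values. The function name `prefill_with_state` and all parameters are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prefill_with_state(frames, capacity):
    """Process frames one at a time with a bounded cross-frame state.

    frames:   list of (tokens_per_frame, dim) arrays (toy token features).
    capacity: maximum number of tokens carried between frames.
    Returns per-frame outputs, the final state, and the full per-frame cache.
    """
    dim = frames[0].shape[1]
    state = np.zeros((0, dim))  # recurrent state, starts empty
    full_cache = []             # second cache: every frame kept for decoding
    outputs = []
    for x in frames:
        # Each frame attends to the carried state plus its own tokens.
        context = np.concatenate([state, x], axis=0)
        attn = softmax(x @ context.T / np.sqrt(dim), axis=-1)
        outputs.append(attn @ context)
        full_cache.append(x)
        # Assumed importance score: attention mass received by each context token.
        importance = attn.sum(axis=0)
        keep = np.sort(np.argsort(-importance)[:capacity])  # top tokens, in order
        state = context[keep]   # state size never exceeds `capacity`
    return outputs, state, full_cache


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = [rng.normal(size=(4, 8)) for _ in range(5)]
    outs, final_state, cache = prefill_with_state(video, capacity=6)
    print(final_state.shape, len(cache), outs[0].shape)
```

Because the state is capped at `capacity` tokens, the work per frame stays constant as the video grows, while the full cache still retains every frame for the decoding stage.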

Problem

Research questions and friction points this paper is trying to address.

video vision-language models

long video understanding

spatiotemporal self-attention

computational scalability

accuracy-efficiency trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

linear-time inference

video vision-language models

long-video understanding