Linear Scaling Video VLMs for Long Video Understanding

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the quadratic computational complexity of existing video vision-language models caused by spatiotemporal self-attention, which hinders efficient long-video processing. The authors propose StateKV, a method that introduces a fixed-capacity, importance-aware recurrent state during inference to propagate cross-frame context, combined with per-frame full caching for decoding. This enables linear-time video prefilling without requiring model fine-tuning or architectural modifications. StateKV achieves linear scaling in long-video understanding while preserving accuracy close to that of full self-attention, substantially outperforming streaming approximations such as sliding windows. Experiments across three long-video benchmarks and seven model variants demonstrate that StateKV significantly reduces FLOPs, allowing deployment of larger models under the same computational budget and yielding higher accuracy.
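As a rough illustration of the quadratic-versus-linear claim, the sketch below counts query-key pairs during video prefill. It is a back-of-the-envelope model, not the paper's FLOPs accounting: the function names, the `state_capacity` parameter, and the assumption that each frame attends only to its own tokens plus a fixed-size state are all hypothetical simplifications.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every video token attends to every video token."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def fixed_state_pairs(num_frames: int, tokens_per_frame: int, state_capacity: int) -> int:
    """Query-key pairs when each frame attends to itself plus a fixed-size state.

    Assumption: the per-frame context is bounded by
    tokens_per_frame + state_capacity, regardless of how many frames came before.
    """
    per_frame = tokens_per_frame * (tokens_per_frame + state_capacity)
    return num_frames * per_frame


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the growth rates.
    for frames in (64, 128, 256):
        full = full_attention_pairs(frames, tokens_per_frame=196)
        fixed = fixed_state_pairs(frames, tokens_per_frame=196, state_capacity=1024)
        print(frames, full, fixed)
```

Under this simplified model, doubling the number of frames quadruples the full-attention count but only doubles the fixed-state count.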
📝 Abstract
Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
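The two-cache idea in the abstract can be sketched as a toy bookkeeping loop. This is not the authors' implementation: the abstract does not say how importance is computed or how the state is updated, so the sketch assumes each token arrives with a precomputed importance score and that the state simply keeps the highest-scoring tokens seen so far. The names `Token`, `update_state`, and `prefill` are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    frame: int
    index: int
    importance: float  # assumed to be given; the source does not define it


def update_state(state, frame_tokens, capacity):
    """Keep the `capacity` most important tokens among the old state and the new frame."""
    return heapq.nlargest(capacity, state + frame_tokens, key=lambda t: t.importance)


def prefill(frames, capacity):
    """Process frames one at a time with a bounded recurrent state.

    Returns the final state, the full per-frame cache (kept for decoding),
    and the context size each frame saw during prefill.
    """
    state = []          # fixed-capacity cross-frame context
    full_cache = []     # second cache: every token of every frame
    context_sizes = []
    for frame_tokens in frames:
        # A frame sees the carried state plus its own tokens, never the whole past.
        context_sizes.append(len(state) + len(frame_tokens))
        full_cache.append(list(frame_tokens))
        state = update_state(state, frame_tokens, capacity)
    return state, full_cache, context_sizes


if __name__ == "__main__":
    scores = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.3], [0.95, 0.05, 0.4]]
    frames = [[Token(f, i, s) for i, s in enumerate(row)] for f, row in enumerate(scores)]
    state, full_cache, context_sizes = prefill(frames, capacity=2)
    print(context_sizes)
    print([t.importance for t in state])
```

Because the state never exceeds `capacity`, the per-frame context stays bounded as the video grows, while the full per-frame cache keeps every token available for the decoding stage.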
Problem

Research questions and friction points this paper is trying to address.

video vision-language models
long-video understanding
spatiotemporal self-attention
computational scalability
accuracy-efficiency trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear-time inference
video vision-language models
long-video understanding
StateKV
efficient attention