🤖 AI Summary
Existing vision-language models excel at offline video understanding but struggle in real-world streaming scenarios, which demand real-time processing, long-term memory retention, and proactive interaction. This work presents an end-to-end streaming framework that addresses these three limitations. It introduces Streaming-Train-248K, a streaming training dataset paired with a novel training objective; Streaming Harness, a plug-and-play runtime system that can wrap any VLM; and Streaming-Eval, an evaluation benchmark covering diverse in-the-wild streaming scenarios. The framework makes per-second response decisions, retains context for up to 12 hours, and runs at sub-second latency, and experiments show consistent gains across the core streaming capabilities. The authors plan to open-source the data, code, and benchmark.
📝 Abstract
Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, although existing VLMs excel at offline video understanding, they fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct \textbf{Streaming-Train-248K}, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce \textbf{Streaming Harness}, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design \textbf{Streaming-Eval}, a benchmark that measures models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.
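To make the three abilities attributed to Streaming Harness more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes one text summary arrives per second of video, and it uses a hypothetical `StreamingLoop` class whose `should_respond` and `respond` callables stand in for the VLM's decision and generation steps. The 12-hour horizon is modeled as a fixed-size buffer of per-second entries; a real system would likely compress or summarize older context rather than keep raw entries.

```python
from collections import deque
from typing import Callable, List, Optional


class StreamingLoop:
    """Hypothetical wrapper: one respond/stay-silent decision per second
    over a bounded memory. All names here are illustrative assumptions."""

    def __init__(
        self,
        should_respond: Callable[[List[str], str], bool],  # stand-in for the VLM's decision step
        respond: Callable[[List[str], str], str],          # stand-in for the VLM's generation step
        memory_seconds: int = 12 * 3600,                   # 12 hours of per-second entries
    ) -> None:
        self.should_respond = should_respond
        self.respond = respond
        # A deque with maxlen silently evicts the oldest entry once full.
        self.memory: deque = deque(maxlen=memory_seconds)

    def step(self, frame_summary: str) -> Optional[str]:
        """Process one second of input; return a reply or None (stay silent)."""
        context = list(self.memory)
        reply = None
        if self.should_respond(context, frame_summary):
            reply = self.respond(context, frame_summary)
        self.memory.append(frame_summary)
        return reply


# Toy usage: stay silent until something notable happens.
loop = StreamingLoop(
    should_respond=lambda ctx, cur: "door opens" in cur,
    respond=lambda ctx, cur: f"Noticed after {len(ctx)} s: {cur}",
)
outputs = [loop.step(s) for s in ["empty room", "empty room", "door opens"]]
```

In this toy run the loop stays silent for the first two seconds and replies on the third. The sub-second latency requirement is not modeled here; it would constrain how long each `step` call is allowed to take.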