Harnessing Streaming Video in the Wild

📅 2026-06-07
🤖 AI Summary
Existing vision-language models excel at offline video understanding but struggle in real-world streaming scenarios, where they must process video in real time, retain long-term memory, and interact proactively. This work presents an end-to-end streaming framework that addresses these three limitations. It introduces Streaming-Train-248K, a streaming training dataset paired with a new training objective; Streaming Harness, a plug-and-play runtime system; and Streaming-Eval, a benchmark for in-the-wild streaming scenarios. Streaming Harness supports per-second response decisions, context retention of up to 12 hours, and sub-second latency, and the experiments report consistent gains across the core streaming capabilities. The authors state that they will open-source the data, code, and benchmark.
📝 Abstract
Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.
Problem

Research questions and friction points this paper is trying to address.

streaming video
vision-language models
real-time processing
long-horizon memory
proactive interaction
Dingyu Yao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Shuhuan Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Qingyi Si
JD.COM
Junhao Zhou
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP, Dialogue Generation
Chuanyu Qin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, CAS
NLP
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network, Network Security
Nan Duan
JD.Com (now) | StepFun | Microsoft Research
NLP, Artificial General Intelligence
Jiaqi Wang
Unknown affiliation