Streaming Video Instruction Tuning

📅 2025-12-24
🤖 AI Summary
Existing online video models are typically confined to a single task (e.g., question answering or captioning) and lack general-purpose real-time interactive capabilities. To address this, the paper introduces Streamo, described as the first end-to-end unified streaming video large language model, which supports five temporally sensitive tasks: real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. Its key contributions are: (1) a general multi-task architecture for streaming video understanding; (2) a novel instruction-tuning paradigm explicitly designed for continuous temporal sequences; (3) integration of a streaming video encoder, a temporally enhanced LLM, and a dynamic frame-sampling mechanism; and (4) end-to-end alignment training on Streamo-Instruct-465K, a newly curated dataset of 465K instructions. Streamo achieves significant improvements over state-of-the-art methods across multiple streaming video benchmarks, with sub-300ms response latency and strong cross-task generalization.

📝 Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
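To make the idea of unified multi-task supervision concrete, the following is a minimal sketch of how one instruction record and a task-mixed batch iterator might look. Only the five task names come from the abstract. The `StreamingSample` fields, the timestamp convention, and the `mixed_batches` helper are hypothetical illustrations, not the actual Streamo-Instruct-465K schema or the authors' training pipeline.

```python
import random
from dataclasses import dataclass

# Task names follow the abstract; everything else in this sketch is assumed.
TASKS = (
    "real_time_narration",
    "action_understanding",
    "event_captioning",
    "temporal_event_grounding",
    "time_sensitive_qa",
)


@dataclass(frozen=True)
class StreamingSample:
    """Hypothetical record: one instruction tied to a point in a video stream."""
    video_id: str
    task: str
    query_time_s: float      # when the instruction arrives in the stream
    instruction: str
    response: str
    response_time_s: float   # when the response should be emitted

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if self.response_time_s < self.query_time_s:
            raise ValueError("response cannot precede the instruction")


def mixed_batches(samples, batch_size, seed=0):
    """Shuffle samples from all tasks together and yield fixed-size batches."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]
```

Sharing one record type across tasks is what lets a single shuffled stream of batches mix heterogeneous supervision. A real pipeline would also have to attach the video frames seen up to each timestamp, which this sketch leaves out.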

Problem

Research questions and friction points this paper is trying to address.

Develops a real-time streaming video LLM for interactive assistance
Addresses diverse tasks like narration, action understanding, and event captioning
Bridges offline video perception and real-time multimodal assistant capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time streaming video LLM for interactive assistance
Large-scale instruction dataset for multi-task video understanding
End-to-end training enabling temporal reasoning and generalization