StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

📅 2025-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Offline video large language models (Video-LLMs) adapt poorly to online streaming scenarios: they have limited capability for multi-turn real-time understanding and no mechanism for proactive responses. To address this, we propose StreamBridge, a framework that turns offline Video-LLMs into streaming-capable models. It has three parts: (1) a memory buffer with round-decayed compression, which supports long-context, multi-turn interaction; (2) a decoupled, lightweight activation model that plugs into existing Video-LLMs and lets them respond proactively and continuously to streaming input; and (3) Stream-IT, a large-scale dataset for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Experiments show that the approach outperforms GPT-4o and Gemini 1.5 Pro on streaming understanding tasks while achieving competitive or superior performance on standard video understanding benchmarks.

📝 Abstract

We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. At the same time, it achieves competitive or superior performance on standard video understanding benchmarks.
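The abstract does not spell out how round-decayed compression works, so the following is only a minimal sketch of one plausible reading: each interaction round contributes a list of token embeddings to the buffer, and when the total exceeds a budget, the oldest rounds are compressed first. The class name `MemoryBuffer`, the `max_tokens` budget, and the pairwise mean pooling are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def mean_pool_pairs(tokens: List[Vector]) -> List[Vector]:
    """Roughly halve the token count by averaging adjacent pairs.

    An odd trailing token is kept unchanged.
    """
    pooled = []
    for i in range(0, len(tokens) - 1, 2):
        a, b = tokens[i], tokens[i + 1]
        pooled.append([(x + y) / 2.0 for x, y in zip(a, b)])
    if len(tokens) % 2 == 1:
        pooled.append(tokens[-1])
    return pooled


@dataclass
class MemoryBuffer:
    """Hypothetical buffer: one list of token embeddings per round."""

    max_tokens: int  # assumed overall token budget
    rounds: List[List[Vector]] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(len(r) for r in self.rounds)

    def append_round(self, tokens: List[Vector]) -> None:
        self.rounds.append(list(tokens))
        self.compress()

    def compress(self) -> None:
        # Oldest rounds are pooled first, so recent rounds keep more detail.
        # If every round is already down to one token, the buffer may stay
        # over budget; this sketch does not drop rounds.
        for idx in range(len(self.rounds)):
            while (self.total_tokens() > self.max_tokens
                   and len(self.rounds[idx]) > 1):
                self.rounds[idx] = mean_pool_pairs(self.rounds[idx])
            if self.total_tokens() <= self.max_tokens:
                return


buf = MemoryBuffer(max_tokens=6)
buf.append_round([[0.0], [2.0], [4.0], [6.0]])
buf.append_round([[1.0], [1.0], [1.0], [1.0]])
print([len(r) for r in buf.rounds])
```

In this toy run the first round is pooled from four tokens to two once the second round arrives, while the newest round is left at full resolution, which is the decay-by-age behavior the sketch is meant to show.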

Problem

Research questions and friction points this paper is trying to address.

Adapting offline Video-LLMs for real-time streaming scenarios

Enhancing multi-turn understanding in online video interactions

Adding proactive response mechanisms to existing video models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory buffer with round-decayed compression strategy

Decoupled lightweight activation model integration

Large-scale Stream-IT dataset for streaming video
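To make the decoupled activation model listed above concrete, here is a minimal sketch of the control flow it implies: a small scorer watches the stream frame by frame, and the full Video-LLM is only called when the scorer's output crosses a threshold. The function name `stream_with_activation`, the `score_fn` and `respond_fn` callables, and the `threshold` value are hypothetical stand-ins; the source does not describe the actual model interfaces.

```python
from typing import Any, Callable, Iterable, List, Tuple


def stream_with_activation(
    frames: Iterable[Any],
    score_fn: Callable[[List[Any]], float],
    respond_fn: Callable[[List[Any]], str],
    threshold: float = 0.5,
) -> List[Tuple[int, str]]:
    """Run a lightweight scorer on every frame; call the main model only
    when the score reaches the threshold.

    score_fn stands in for the activation model and respond_fn for the
    Video-LLM. Both are assumptions made for this sketch.
    """
    context: List[Any] = []
    responses: List[Tuple[int, str]] = []
    for t, frame in enumerate(frames):
        context.append(frame)
        if score_fn(context) >= threshold:
            # The main model sees the whole context accumulated so far.
            responses.append((t, respond_fn(context)))
    return responses


# Toy usage: each "frame" is already a score, and the reply reports
# how many frames had been seen when the response was triggered.
replies = stream_with_activation(
    frames=[0.1, 0.9, 0.2, 0.8],
    score_fn=lambda ctx: ctx[-1],
    respond_fn=lambda ctx: f"seen {len(ctx)} frames",
)
print(replies)
```

Keeping the scorer separate from the responder means the expensive model stays idle on uneventful frames, which is the practical appeal of a decoupled design.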

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs