Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing online video large language models: they suspend visual perception while generating a response, which breaks real-time video-language synchrony. To overcome this, the authors propose the Streaming Video-Language Synchrony (SVLS) paradigm and a hierarchical control framework that interleaves video frame processing with language generation at a fine granularity, so that output never blocks perception. The two key components are a training-free Frame-Driven Transition Controller (FDTC), which makes the high-level decision to continue speaking, start a new response, or stay silent, and a plug-and-play Streaming Token Pacer (SToP), a lightweight predictive module that adapts the generation rate to the pace of the visual input. Combined with per-frame incremental, sub-budget decoding, the system reaches 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS while preserving the backbone's original comprehension ability, enabling continuous reasoning alongside incoming visual data.
📝 Abstract
Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
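The per-frame, sub-budget decoding loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `StreamLoop` and the callbacks `should_start`, `should_continue`, `start_response`, and `budget` are hypothetical stand-ins for the controller's decisions, the backbone's decoder, and the pacer's per-frame token budget. The sketch only shows the control flow in which each frame yields at most a small chunk of tokens, so a long response never blocks the next frame.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class StreamLoop:
    """Toy interleaving loop; all callbacks are hypothetical placeholders."""
    should_start: Callable[[int], bool]            # controller: begin a new response at this frame?
    should_continue: Callable[[int], bool]         # controller: is the ongoing response still valid?
    start_response: Callable[[int], Iterator[str]] # decoder: token stream for a new response
    budget: Callable[[int], int]                   # pacer: max tokens to emit for this frame
    _tokens: Optional[Iterator[str]] = field(default=None, init=False)

    @property
    def speaking(self) -> bool:
        return self._tokens is not None

    def step(self, frame_idx: int) -> List[str]:
        """Process one frame and return the tokens emitted during its interval."""
        # High-level decision: drop a stale response, then maybe start a new one.
        if self.speaking and not self.should_continue(frame_idx):
            self._tokens = None
        if not self.speaking and self.should_start(frame_idx):
            self._tokens = self.start_response(frame_idx)
        if not self.speaking:
            return []  # stay silent for this frame

        # Sub-budget decoding: emit at most `budget` tokens, then yield to the next frame.
        chunk: List[str] = []
        for _ in range(max(0, self.budget(frame_idx))):
            token = next(self._tokens, None)
            if token is None:  # response finished
                self._tokens = None
                break
            chunk.append(token)
        return chunk


def demo() -> List[List[str]]:
    loop = StreamLoop(
        should_start=lambda i: i == 1,   # assumed trigger: something happens at frame 1
        should_continue=lambda i: True,
        start_response=lambda i: iter(["a", "person", "enters", "the", "room"]),
        budget=lambda i: 2,              # assumed fixed budget; a real pacer would predict it
    )
    return [loop.step(i) for i in range(5)]


if __name__ == "__main__":
    for frame, chunk in enumerate(demo()):
        print(frame, chunk)
```

In this toy run the five-token response is spread over frames 1 to 3 in chunks of at most two tokens, with frames 0 and 4 silent. A fixed budget is used purely for illustration; the abstract describes the pacer as a predictive module that adapts the rate to the visual content.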
Problem

Research questions and friction points this paper is trying to address.

Streaming Video-Language Synchrony
Online Video Understanding
Real-time Synchrony
Video-LLMs
Live Streaming Assistant
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Video-Language Synchrony
Frame-Driven Transition Controller
Streaming Token Pacer
Online Video Understanding
Incremental Decoding
🔎 Similar Papers
2024-06-09, Annual Meeting of the Association for Computational Linguistics, Citations: 13
2024-02-20, International Conference on Machine Learning, Citations: 30