🤖 AI Summary
This work addresses a limitation of existing online video large language models (Video-LLMs): they suspend visual perception while generating a response, which breaks real-time video-language synchrony. The authors propose the Streaming Video-Language Synchrony (SVLS) paradigm and LyraV, a live streaming assistant built on a hierarchical control framework that interleaves video frame processing with language generation at fine granularity, so that perception is not blocked while a response is being generated. The two key components are the training-free Frame-Driven Transition Controller (FDTC), a verification-based finite-state machine that decides when to continue speaking, start a new response, or stay silent, and the plug-and-play Streaming Token Pacer (SToP), a lightweight predictive module that adapts the generation rate to the pace of the visual content. Using per-frame incremental, sub-budget decoding, LyraV achieves 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS while preserving the backbone's general understanding ability, and it supports continuous reasoning alongside incoming visual input.
📝 Abstract
Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, which breaks real-time video-language synchrony and causes stutters. To address this, we introduce a novel paradigm for online video understanding, Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free, verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, achieving 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we empirically observe an additional capability in LyraV: dynamic reasoning over streaming tokens, which enables continuous interpretation and "thinking" alongside visual input.
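The interleaving described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not LyraV's implementation: the names `Action`, `stream`, `toy_decide`, `toy_respond`, and `toy_pace` are hypothetical, the rule-based `toy_decide` merely stands in for the role the FDTC plays, the constant `toy_pace` stands in for the role of SToP, and a pre-split word list stands in for incremental decoding by the backbone.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Action(Enum):
    """Per-frame decisions named in the abstract (hypothetical encoding)."""
    SILENT = "silent"      # emit nothing for this frame
    START = "start"        # begin a new response
    CONTINUE = "continue"  # keep emitting the current response


def stream(
    frames: Sequence[str],
    decide: Callable[[str, bool], Action],
    respond: Callable[[str], List[str]],
    pace: Callable[[str], int],
) -> List[Tuple[str, List[str]]]:
    """Interleave frames with token chunks; returns (frame, chunk) pairs."""
    pending: List[str] = []  # tokens of the current response not yet emitted
    timeline: List[Tuple[str, List[str]]] = []
    for frame in frames:
        action = decide(frame, bool(pending))
        if action is Action.START:
            pending = respond(frame)  # assumption: a new response replaces the old one
        elif action is Action.SILENT:
            pending = []              # assumption: silence discards leftover tokens
        # Sub-budget decoding: emit at most `budget` tokens before the next frame.
        budget = 0 if action is Action.SILENT else pace(frame)
        chunk, pending = pending[:budget], pending[budget:]
        timeline.append((frame, chunk))
    return timeline


def toy_decide(frame: str, speaking: bool) -> Action:
    # Stand-in controller: speak on "event" frames, finish what was started.
    if frame.startswith("event"):
        return Action.START
    return Action.CONTINUE if speaking else Action.SILENT


def toy_respond(frame: str) -> List[str]:
    # Stand-in for the language model: a fixed sentence split into word tokens.
    return f"a new {frame} just appeared on screen".split()


def toy_pace(frame: str) -> int:
    # Stand-in pacer: a constant budget of three tokens per frame.
    return 3


timeline = stream(
    ["idle", "event:goal", "idle", "idle", "idle"],
    toy_decide, toy_respond, toy_pace,
)
for frame, chunk in timeline:
    print(frame, chunk)
```

In this sketch every frame is consumed before the next chunk is emitted, so a long response is spread over several frames instead of stalling perception until the sentence is complete.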