🤖 AI Summary
This work addresses three challenges in the shift from offline to real-time interactive video understanding: continuous perception, dynamic answer revision, and timely silence. It proposes a two-channel, non-blocking architecture built on a cross-attention backbone, in which visual features enter through a side channel instead of joining the autoregressive sequence, so perception and generation can proceed in parallel. Combined with real-time QA data synthesized from dense captions and the specialization of an offline model on that data, the model achieves about a 5× speedup in time to first token and 2.7× higher decoding throughput on a single H200 with 256 frames per video, with negligible degradation in offline ability. It trails the Qwen2.5-VL-7B baseline overall but remains robust on spatial and fine-grained temporal reasoning.
📝 Abstract
Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.
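To make the side-channel idea concrete, here is a minimal sketch in plain Python. It is not the MOSS-Video-Preview implementation: the names `SideChannel`, `cross_attend`, and `decode_step` are hypothetical, the attention is single-head with no learned projections, and keys and values are simply the raw frame features. It only illustrates that visual features can grow in a separate buffer and be read through cross-attention, while the text state keeps its own shape.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, visual_memory):
    """Single-head cross-attention of one text query over visual features.

    Assumption: keys and values are the same vectors (no learned projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in visual_memory]
    weights = softmax(scores)
    dim = len(visual_memory[0])
    return [sum(w * v[i] for w, v in zip(weights, visual_memory))
            for i in range(dim)]


class SideChannel:
    """Hypothetical visual buffer, filled by perception independently of decoding."""

    def __init__(self):
        self.features = []

    def perceive(self, frame_feature):
        # New frames are appended here; the text sequence is never modified.
        self.features.append(frame_feature)


def decode_step(text_state, channel):
    """One illustrative decode step: read whatever has been perceived so far."""
    if not channel.features:
        return list(text_state)  # nothing perceived yet
    context = cross_attend(text_state, channel.features)
    # Residual add of the visual context onto the text state.
    return [t + c for t, c in zip(text_state, context)]


channel = SideChannel()
state = [1.0, 0.0]
before = decode_step(state, channel)      # no frames yet
channel.perceive([0.0, 1.0])              # a frame arrives between steps
after_one = decode_step(state, channel)
channel.perceive([1.0, 0.0])              # another frame arrives
after_two = decode_step(state, channel)
```

In this toy setup the output of `decode_step` changes as frames arrive, yet the text state stays a two-dimensional vector throughout, which is the property the abstract attributes to the cross-attention design: visual tokens never lengthen the autoregressive sequence.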