🤖 AI Summary
Existing video benchmarks inadequately evaluate multimodal large language models' (MLLMs') real-time interactive capabilities and proactive reasoning over streaming video. This paper introduces OmniMMI, presented as the first multimodal interactive evaluation benchmark tailored for streaming video, comprising over 1,121 videos and 2,290 questions and explicitly targeting two core challenges: streaming understanding and proactive reasoning. The authors propose a proactive-reasoning evaluation paradigm and the Multi-modal Multiplexing Modeling (M4) framework, which jointly integrates visual, auditory, and linguistic modalities via streaming chunked encoding, cross-modal temporal alignment, incremental response generation, and dynamic attention gating. They systematically evaluate 12 state-of-the-art OmniLLMs across six fine-grained subtasks, uncovering critical bottlenecks in real-world streaming interaction. Empirical results show that M4 achieves a 2.3× speedup and a 37% memory reduction over baselines while maintaining 98.5% interactive accuracy.
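To make the streaming, proactive-reasoning evaluation setting more concrete, the sketch below shows one way a timed query loop over a video stream could be scored. The `StreamingQuery` schema, the `model_step` interface, and the exact-match scoring are illustrative assumptions for this note, not the benchmark's actual harness or metrics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class StreamingQuery:
    """One timed question in a streaming episode (hypothetical schema)."""
    ask_at_frame: int          # frame index at which the question is issued
    answer_due_by_frame: int   # latest frame at which a reply still counts
    question: str
    reference_answer: str

def evaluate_streaming(
    frames: List[object],
    queries: List[StreamingQuery],
    model_step: Callable[[object, Optional[str]], Optional[str]],
) -> float:
    """Feed frames one at a time; the model may answer now or stay silent.

    `model_step(frame, question)` is an assumed interface: it ingests one frame
    (plus an optional newly issued question) and returns an answer string once
    the model decides to speak, or None to keep listening.
    """
    ordered = sorted(queries, key=lambda q: q.ask_at_frame)
    active: Optional[StreamingQuery] = None
    correct, total, qi = 0, len(ordered), 0
    for t, frame in enumerate(frames):
        if active is None and qi < total and ordered[qi].ask_at_frame <= t:
            active, qi = ordered[qi], qi + 1
            reply = model_step(frame, active.question)
        else:
            reply = model_step(frame, None)
        if active is not None:
            if reply is not None:
                # Credit only a timely, correct reply (exact match kept deliberately simple).
                if t <= active.answer_due_by_frame and reply.strip().lower() == active.reference_answer.lower():
                    correct += 1
                active = None
            elif t > active.answer_due_by_frame:
                active = None  # window expired without a reply: counted as a miss
    return correct / max(total, 1)
```

Because the model only sees frames up to the current step, staying silent too long or answering before the evidence has streamed in both cost accuracy, which is the behavior the proactive-reasoning subtasks are meant to probe.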
📝 Abstract
The rapid advancement of multi-modal language models (MLLMs) such as GPT-4o has propelled the development of Omni language models (OmniLLMs), designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
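To illustrate the "see and listen while generating" idea, here is a minimal Python sketch of a multiplexed inference loop. It assumes a hypothetical incremental interface (`StreamingModel` with `prefill` and `decode_token`); this is a schematic of interleaving chunk-wise perception with token-wise generation, not the released M4 implementation.

```python
from typing import Iterator, Optional, Protocol

class StreamingModel(Protocol):
    """Assumed incremental interface; not the actual M4 API."""
    def prefill(self, chunk: object) -> None: ...      # fold a new audio/visual chunk into the running context
    def decode_token(self) -> Optional[str]: ...        # next response token, or None to stay silent

def multiplexed_stream(model: StreamingModel, av_chunks: Iterator[object]) -> Iterator[str]:
    """Alternate perception and generation over a live stream.

    Each incoming audio/visual chunk is encoded into the running context as soon
    as it arrives, and the model may emit response tokens between chunks instead
    of waiting for the whole video to end.
    """
    for chunk in av_chunks:
        model.prefill(chunk)                             # incremental, chunk-wise encoding
        while (token := model.decode_token()) is not None:
            yield token                                  # proactive, mid-stream response
    while (token := model.decode_token()) is not None:
        yield token                                      # flush anything pending once the stream ends
```

The design point this sketch tries to capture is that perception and generation share one loop: the model never blocks on the full video before speaking, which is what makes inference-efficient, proactive responses over streaming input possible.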