🤖 AI Summary
This work addresses a key limitation of current video multimodal large language models (MLLMs): they often miss answer-critical evidence from transient visual events that last only a few frames, because of sparse frame sampling, visual-token compression, or coarse temporal aggregation. To systematically evaluate how well models understand localized, sampling-sensitive momentary events, the authors introduce Moment-Video, a benchmark of 1,000 human-verified video question-answer pairs spanning four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Using fine-grained annotations and both dense and sparse sampling strategies, they benchmark 33 open- and closed-source video MLLMs and find that even the best-performing model, Seed-2.0-Pro, achieves only 39.6% accuracy, while most open-source models score below 25%. These results point to a significant bottleneck in temporal fidelity, with longer videos posing stronger temporal-localization challenges.
📝 Abstract
Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
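The abstract's point that sparse frame sampling can skip a momentary event can be illustrated with a small sketch. This is not the benchmark's evaluation code: the uniform midpoint sampling scheme, the function names, and the example numbers (a 60-second clip at 30 fps, a 6-frame event, a 32-frame budget) are all assumptions chosen for illustration.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices at the midpoints of num_samples equal-length segments.

    Assumption: this mimics a common uniform sampling scheme; real models
    may sample differently.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], event_start: int, event_end: int) -> bool:
    """Return True if any sampled index lies inside [event_start, event_end]."""
    return any(event_start <= i <= event_end for i in indices)


def capture_fraction(total_frames: int, num_samples: int, event_len: int) -> float:
    """Fraction of possible event start positions for which at least one
    frame of an event_len-frame event is sampled."""
    if event_len <= 0 or event_len > total_frames:
        return 0.0
    indices = set(uniform_sample_indices(total_frames, num_samples))
    starts = range(total_frames - event_len + 1)
    hits = sum(
        any((start + offset) in indices for offset in range(event_len))
        for start in starts
    )
    return hits / len(starts)


# Hypothetical example: 60 s at 30 fps = 1800 frames, event on frames 100-105.
sparse = uniform_sample_indices(1800, 32)
dense = uniform_sample_indices(1800, 1800)
print(event_is_sampled(sparse, 100, 105))  # event falls between two samples
print(event_is_sampled(dense, 100, 105))   # every frame is kept
print(round(capture_fraction(1800, 32, 6), 3))
```

Under these assumed numbers, the 32-frame budget misses the event entirely, and a 6-frame event placed at a random position would be hit by at least one sampled frame only about 11% of the time. Seeing a frame is also only a necessary condition: as the abstract notes, denser sampling helps some models but does not remove the bottleneck, since token compression and temporal aggregation can still suppress the evidence.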