Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical limitation of current video multimodal large language models (MLLMs): they often miss key evidence from transient visual events lasting only a few frames, because of sparse frame sampling, visual-token compression, or coarse temporal aggregation. To systematically evaluate how well models understand localized, sampling-sensitive momentary events, the authors introduce Moment-Video, a benchmark of 1,000 human-verified video question-answer pairs across 7 domains and 25 fine-grained subcategories, spanning four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. They evaluate 33 proprietary and open-source video MLLMs and find that even the best-performing model, Seed-2.0-Pro, reaches only 39.6% overall accuracy, while most open-source models score below 25%. Diagnostic analyses indicate that denser frame sampling helps some models without removing the bottleneck, and that temporal localization becomes harder in long videos.

📝 Abstract

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
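The claim that sparse frame sampling can skip a momentary event can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the sampling scheme (one frame from the centre of each equal-length segment), the helper names, and the example numbers (a 1,800-frame clip, 16 sampled frames, a 6-frame event) are all assumptions chosen for the illustration.

```python
def sampled_indices(total_frames: int, num_samples: int) -> list[int]:
    """Uniform sampling: pick the centre frame of each equal-length segment.

    This is one common scheme, assumed here for illustration; real
    video MLLM pipelines may sample differently.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    n = min(num_samples, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def event_is_sampled(total_frames: int, num_samples: int,
                     event_start: int, event_len: int) -> bool:
    """True if at least one sampled frame falls inside the event window."""
    event = range(event_start, event_start + event_len)
    return any(idx in event for idx in sampled_indices(total_frames, num_samples))


def hit_rate(total_frames: int, num_samples: int, event_len: int) -> float:
    """Fraction of possible event start positions that the sampler catches."""
    starts = range(total_frames - event_len + 1)
    hits = sum(
        event_is_sampled(total_frames, num_samples, s, event_len) for s in starts
    )
    return hits / len(starts)


if __name__ == "__main__":
    # Hypothetical clip: 1,800 frames (e.g. 60 s at 30 fps), 6-frame event.
    for n in (16, 64, 1800):
        print(n, round(hit_rate(1800, n, 6), 4))
```

Under these assumed numbers, 16 uniformly sampled frames catch the 6-frame event for only about 5% of its possible positions, while sampling every frame always catches it. This only models whether the evidence reaches the model at all; it says nothing about token compression or temporal aggregation, which the abstract names as separate failure sources.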

Problem

Research questions and friction points this paper is trying to address.

temporal fidelity

momentary visual events

video MLLMs

frame sampling

temporal localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal fidelity

momentary visual events

video MLLMs