🤖 AI Summary
Existing video retrieval datasets suffer from ambiguous queries, limited scale, monolingual bias, and insufficient multimodal coverage, hindering precise event-centric cross-modal retrieval in real-world settings. To address this, MultiVENT 2.0 adopts an event-centric paradigm and introduces a large-scale multilingual news video retrieval benchmark—218K videos and 3,906 event-oriented queries—emphasizing fine-grained event semantics and joint reasoning over visual content, audio, embedded (OCR-extractable) text, and text metadata. Succeeding at the task therefore requires systems to combine multimodal encoding, speech recognition, and cross-lingual alignment. Experiments show that state-of-the-art vision-language models achieve Recall@10 below 12%, confirming the benchmark’s substantial difficulty. This work establishes a rigorous evaluation platform for robust multimodal event retrieval, a crucial step toward multimodal content understanding.
📝 Abstract
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce $\textbf{MultiVENT 2.0}$, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all of these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they remain insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation.
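The retrieval task described above—scoring a large video collection against an event query using several modality signals at once—is often approached with late fusion over per-modality embeddings. The following is a minimal sketch of that idea, not the paper's method: all embeddings are random stand-ins, the modality names, the fusion-by-averaging rule, and the toy relevance labels are illustrative assumptions, and the `recall_at_k` helper mirrors the Recall@K metric commonly reported for such benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 event queries, 100 videos, and four
# modality channels per video (visual, audio/ASR, OCR text, metadata),
# each represented by a 16-dim embedding. All values are random
# placeholders standing in for real encoder outputs.
n_queries, n_videos, dim = 3, 100, 16
query_emb = rng.normal(size=(n_queries, dim))
modality_embs = {
    m: rng.normal(size=(n_videos, dim))
    for m in ("visual", "audio", "ocr", "metadata")
}
relevant = {0: {5}, 1: {17}, 2: {42}}  # toy ground-truth video ids per query

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Late fusion: average the per-modality query-video similarity matrices.
scores = np.mean(
    [cosine(query_emb, emb) for emb in modality_embs.values()], axis=0
)  # shape: (n_queries, n_videos)

def recall_at_k(scores, relevant, k=10):
    """Fraction of queries with at least one relevant video in the top k."""
    hits = 0
    for q in range(scores.shape[0]):
        top_k = set(np.argsort(-scores[q])[:k].tolist())
        hits += bool(relevant[q] & top_k)
    return hits / scores.shape[0]

print(f"Recall@10: {recall_at_k(scores, relevant):.2f}")
```

With random embeddings the recall is near chance; the benchmark's point is that even trained vision-language encoders leave a large gap on event-centric queries, since no single modality carries the full event signal.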