AI Summary
Existing image understanding datasets focus primarily on surface-level visual description and retrieval, with little modeling of event temporality, causality, and contextual dependencies. To address this gap, we introduce OpenEvents V1, the first multimodal benchmark dataset designed for complex real-world events, constructed from 200K news articles and 400K associated images spanning diverse domains and extended time periods. OpenEvents V1 introduces two novel tasks: narrative text-based image retrieval and event-aware image captioning, emphasizing event-driven cross-modal semantic reasoning rather than shallow modality alignment. It provides a standardized evaluation protocol, fine-grained event annotations (including temporal order, participants, and causal relations), and strong baseline models. The dataset is publicly released to support reproducible, scalable research in event-level image understanding, establishing a foundational resource for multimodal event cognition.
Abstract
We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at https://ltnghia.github.io/eventa/openevents-v1
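As a concrete illustration of the two benchmark tasks, the sketch below shows how one might iterate over article-image pairs for event-aware captioning and score a retrieval model with Recall@K. This is a minimal sketch only: the annotation file path and the `article_id`, `image_id`, and `caption` field names are assumptions for illustration, and the released dataset may organize its annotations differently.

```python
import json
from pathlib import Path

# Hypothetical path and field names, for illustration only; the released
# OpenEvents V1 annotations may use a different layout.
ANNOTATIONS = Path("openevents_v1/annotations/train.json")


def load_records(path: Path) -> list[dict]:
    """Load article-image-caption records from a JSON annotation file."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def recall_at_k(ranked: dict[str, list[str]], relevant: dict[str, str], k: int = 10) -> float:
    """Fraction of narrative queries whose ground-truth image appears in the top-k ranking."""
    hits = sum(1 for qid, image_id in relevant.items() if image_id in ranked.get(qid, [])[:k])
    return hits / len(relevant)


if __name__ == "__main__":
    records = load_records(ANNOTATIONS)

    # Task 1: event-aware captioning -- pair each image with its article context.
    for rec in records[:3]:
        print(rec["image_id"], rec["caption"][:80])  # assumed field names

    # Task 2: narrative-query retrieval -- evaluate a model's ranked image lists.
    ground_truth = {rec["article_id"]: rec["image_id"] for rec in records}
    dummy_ranking = {qid: [img] for qid, img in ground_truth.items()}  # perfect ranker
    print(f"Recall@10: {recall_at_k(dummy_ranking, ground_truth):.3f}")
```

In practice the `dummy_ranking` placeholder would be replaced by the ranked output of a cross-modal retrieval model, and caption quality for Task 1 would be scored with the standard captioning metrics used in the baseline evaluation protocol.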