OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding

πŸ“… 2025-06-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing image understanding datasets primarily focus on superficial visual descriptions and retrieval, lacking deep modeling of event temporality, causality, and contextual dependencies. To address this gap, we introduce OpenEvents V1β€”the first multimodal benchmark dataset designed for complex real-world events, constructed from 200K news articles and 400K associated images spanning diverse domains and extended time periods. OpenEvents V1 pioneers two novel tasks: narrative text-based querying and event-aware image captioning, emphasizing event-driven cross-modal semantic reasoning beyond conventional shallow modality alignment. It provides a standardized evaluation protocol, fine-grained event annotations (including temporal order, participants, and causal relations), and strong baseline models. The dataset is publicly released to support reproducible, scalable research in event-level image understanding, establishing a foundational resource for multimodal event cognition.

πŸ“ Abstract
We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at https://ltnghia.github.io/eventa/openevents-v1
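The narrative-query retrieval task can be pictured as ranking candidate images by similarity between a query embedding and pre-computed image embeddings. The sketch below is a minimal, hypothetical illustration using cosine similarity over toy vectors; it is not the paper's baseline, and the embedding values and `retrieve` helper are assumptions for demonstration only.

```python
import numpy as np

def retrieve(query_vec, image_embs, top_k=2):
    """Rank images by cosine similarity to a narrative-query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:top_k]  # highest-scoring images first
    return order, scores[order]

# Toy 3-dim embeddings for 4 candidate images (hypothetical values;
# a real system would use a learned vision-language encoder).
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.9, 0.1, 0.0])  # points mostly toward image 0

order, scores = retrieve(query, image_embs)
print(order)  # image 0 ranks first, then image 2
```

A real baseline would replace the toy vectors with embeddings from a pretrained encoder and evaluate the ranking with standard retrieval metrics such as recall@k.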
Problem

Research questions and friction points this paper is trying to address.

Advancing event-centric vision-language understanding with a large-scale multimodal dataset
Generating event-aware image captions and retrieving event-relevant images from narrative-style queries
Providing a benchmark that supports deep reasoning over complex real-world events
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal event grounding dataset (200K news articles, 400K images)
Two new tasks: event-aware image captioning and narrative-query image retrieval
Diverse news sources and time periods with standardized evaluation protocols and baselines
πŸ”Ž Similar Papers
No similar papers found.