🤖 AI Summary
Existing video understanding methods primarily target coarse-grained or unimodal tasks and struggle to model fine-grained, temporally coherent multimodal events in long videos. Progress in fine-grained multimodal video perception is hindered by the scarcity of large-scale long-video datasets with precise temporal boundaries and cross-modal semantic annotations, owing to prohibitive manual annotation costs. To address this, we introduce LongVALE, the first Vision-Audio-Language-Event (VALE) benchmark for fine-grained temporal understanding, comprising 8.4K long videos and 105K events, each annotated with exact start/end timestamps and a cross-modal, relation-aware description. We propose an automated pipeline for multimodal event boundary detection and descriptive caption generation, and design a comprehensive evaluation framework covering temporal awareness and full-modality collaboration. Experiments demonstrate that LongVALE substantially advances the performance of video large language models (LLMs) on fine-grained, temporally sensitive, multimodal event understanding tasks.
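To make the annotation format concrete, below is a minimal Python sketch of what one LongVALE-style record could look like. The class names, field names, and example values are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OmniModalEvent:
    """One fine-grained event inside a long video (hypothetical schema)."""
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    caption: str   # relation-aware description linking vision, audio, and speech

@dataclass
class VideoAnnotation:
    """Annotation for one long video (field names are illustrative)."""
    video_id: str                 # identifier of the source video
    duration: float               # total video length in seconds
    events: List[OmniModalEvent]  # temporally ordered omni-modal events

# Illustrative record (invented values, not drawn from the dataset):
annotation = VideoAnnotation(
    video_id="example_0001",
    duration=412.0,
    events=[
        OmniModalEvent(12.4, 35.8,
                       "A chef narrates the recipe while chopping onions; "
                       "the chopping sound matches the knife motion."),
        OmniModalEvent(35.8, 60.2,
                       "Sizzling is heard as the onions hit the pan while "
                       "the chef explains the browning step."),
    ],
)
```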
📝 Abstract
Despite impressive advances in video understanding, most efforts remain limited to coarse-grained or visual-only tasks. However, real-world videos carry omni-modal information (vision, audio, and speech), with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modal video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. With this pipeline, we present LongVALE, the first Vision-Audio-Language-Event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions across 8.4K high-quality long videos. Furthermore, we build a baseline that leverages LongVALE to enable video large language models (LLMs) to perform omni-modal, fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
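As a rough illustration of how data could flow through the three pipeline stages named above (filtering, boundary detection, captioning), here is a minimal Python sketch. All function names, signatures, and placeholder return values are hypothetical, since the abstract does not specify implementation details.

```python
from typing import Iterator, List, Tuple

# Hypothetical stage functions sketching the automatic pipeline described above;
# bodies are placeholders because the abstract gives no implementation details.

def filter_high_quality_multimodal(videos: List[str]) -> List[str]:
    """Stage 1: keep videos with rich, well-aligned vision/audio/speech tracks."""
    return videos  # placeholder: a real filter would score per-modality quality

def detect_omni_modal_boundaries(video_path: str) -> List[Tuple[float, float]]:
    """Stage 2: segment the video into semantically coherent omni-modal events."""
    return [(0.0, 30.0), (30.0, 75.5)]  # placeholder (start, end) times in seconds

def caption_cross_modal_event(video_path: str, span: Tuple[float, float]) -> str:
    """Stage 3: generate a correlation-aware caption for one event segment."""
    return f"Event from {span[0]:.1f}s to {span[1]:.1f}s"  # placeholder caption

def build_annotations(raw_videos: List[str]) -> Iterator[Tuple[str, Tuple[float, float], str]]:
    """Chain the three stages: filter -> detect boundaries -> caption each event."""
    for video in filter_high_quality_multimodal(raw_videos):
        for span in detect_omni_modal_boundaries(video):
            yield video, span, caption_cross_modal_event(video, span)

print(list(build_annotations(["example_0001.mp4"])))
```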