🤖 AI Summary
This work addresses the challenges of spatiotemporal complexity and multimodal fusion in event retrieval from large-scale video collections. The authors propose a unified clip-based multimodal event retrieval framework built on three components: a unified video clipping algorithm; DAKE, a lightweight, training-free keyframe extraction method that uses variations in JPEG file size; and ReCap, a temporally coherent captioning model inspired by recurrent neural networks. The framework supports diverse query modalities, and the authors report robust, efficient, and semantically consistent event retrieval with it in the AI Challenge HCMC 2025.
📝 Abstract
Retrieving events from large-scale video datasets is challenging because the relevant information is temporally, spatially, and multimodally complex. This paper presents U-CESE, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources and our solution for the AI Challenge HCMC 2025. Building on CESE, U-CESE integrates CESE's three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges the previously separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method that uses variations in JPEG file size to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by Recurrent Neural Networks that generates detailed, context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
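The abstract does not spell out how DAKE turns JPEG file sizes into keyframes, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that frames have already been saved as individual JPEG files, that a frame becomes a keyframe when its file size differs from the previous keyframe's size by more than a relative threshold, and that the threshold value of 0.15 is reasonable. The function names `jpeg_sizes` and `select_keyframes` and the `rel_threshold` parameter are hypothetical.

```python
from pathlib import Path
from typing import List, Sequence


def jpeg_sizes(frame_dir: str) -> List[int]:
    """Return the byte size of each *.jpg file in frame_dir, sorted by file name."""
    return [p.stat().st_size for p in sorted(Path(frame_dir).glob("*.jpg"))]


def select_keyframes(sizes: Sequence[int], rel_threshold: float = 0.15) -> List[int]:
    """Return indices of frames whose JPEG size changed noticeably.

    Assumption (not from the paper): a frame is a keyframe when its size
    differs from the last selected keyframe by more than rel_threshold,
    measured relative to that keyframe's size.
    """
    if not sizes:
        return []
    keyframes = [0]      # always keep the first frame
    ref = sizes[0]       # size of the most recent keyframe
    for i in range(1, len(sizes)):
        change = abs(sizes[i] - ref) / max(ref, 1)  # guard against zero size
        if change > rel_threshold:
            keyframes.append(i)
            ref = sizes[i]
    return keyframes


# Synthetic sizes in bytes: three visually distinct segments.
sizes = [1000, 1010, 990, 1500, 1520, 1490, 800, 810]
print(select_keyframes(sizes))  # [0, 3, 6]
```

In practice the sizes would come from `jpeg_sizes` applied to a directory of extracted frames. Because the sketch only reads file metadata, it needs no model and no image decoding, which is the property the abstract highlights when it calls the method lightweight and training-free.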