🤖 AI Summary
Advertisers often need video ads of several durations for a single campaign, and cutting them by hand is slow and labor-intensive. This paper proposes an automated video editing framework for advertising that treats ad clipping as a shot selection problem and solves it with video summarization techniques. Unlike general video summarization methods, which rely mainly on visual content, the approach uses a two-stream audio-visual model that gives audio an explicit role in conveying the key advertising message. The model fuses audio and visual features to predict frame-level importance scores, where importance is the likelihood that a frame appears in the firm-produced short ad. To support this work, the authors present AdSum204, a dataset of 102 paired 30-second and 15-second ads from real advertising campaigns. Experiments show that the model outperforms state-of-the-art methods on Average Precision, AUC, Spearman's ρ, and Kendall's τ.
📝 Abstract
Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under the Curve, Spearman's ρ, and Kendall's τ.
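To make the shot selection framing concrete, the sketch below shows one way frame-level importance scores could be turned into a shorter cut. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how shots are chosen from the predicted scores, so the greedy ranking by mean frame score, the `Shot` type, the `select_shots` function, and the toy scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Shot:
    """A shot as a half-open frame range [start, end). Hypothetical helper type."""
    start: int
    end: int


def select_shots(
    frame_scores: Sequence[float],
    shots: List[Shot],
    fps: float,
    target_seconds: float,
) -> List[Shot]:
    """Greedy selection (an assumed strategy, not taken from the paper).

    Rank shots by their mean frame importance, keep each shot that still
    fits in the frame budget, and return the kept shots in temporal order.
    """
    budget = int(round(target_seconds * fps))  # target length in frames

    def mean_score(shot: Shot) -> float:
        return sum(frame_scores[shot.start:shot.end]) / (shot.end - shot.start)

    chosen: List[Shot] = []
    used = 0
    for shot in sorted(shots, key=mean_score, reverse=True):
        length = shot.end - shot.start
        if used + length <= budget:
            chosen.append(shot)
            used += length
    return sorted(chosen, key=lambda s: s.start)


# Toy example: a 12-frame "ad" at 1 frame per second, cut down to 6 seconds.
toy_scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1, 0.6, 0.5, 0.6, 0.7, 0.3, 0.2]
toy_shots = [Shot(0, 3), Shot(3, 6), Shot(6, 10), Shot(10, 12)]
toy_cut = select_shots(toy_scores, toy_shots, fps=1.0, target_seconds=6.0)
print(toy_cut)  # [Shot(start=0, end=3), Shot(start=10, end=12)]
```

In the toy run, the second-ranked shot (frames 6 to 9) is skipped because it would exceed the 6-frame budget, so the lower-scoring but shorter final shot is kept instead. A knapsack-style selection would be a natural alternative when filling the budget exactly matters more than simplicity.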